European Genome-Phenome Archive

File Quality

File Information: EGAF00000850014

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage across the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.
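As a minimal sketch of how a Phred score relates to an error probability, assuming the standard definition Q = -10 log10(P). The function names are illustrative and are not part of any EGA tool:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Each step of 10 in the score divides the error probability by 10.
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10 000 base calls.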

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 26G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
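The calculation can be sketched as below. This is only an illustration of the ratio mapped / (mapped + unmapped); the function name and the example counts are hypothetical and are not taken from this file:

```python
def mapping_rate(mapped: int, unmapped: int) -> float:
    """Fraction of reads that were mapped, out of all reads."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the ratio.
    rate = mapping_rate(mapped=953, unmapped=47)
    print(f"{rate:.1%}")
```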

Mapped: 95.3 % (1 234 221 038 reads). Unmapped: 4.7 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other (compare the Proper Pairs chart).

Both mates mapped: 93.9 % (1 217 013 568 reads). Other: 6.1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
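As a rough sketch, and assuming the reads are stored in SAM/BAM format, the categories above can be told apart using the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The `classify` function and its category labels are illustrative and do not necessarily match how this report computes its figures:

```python
# Flag bits from the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify(flag: int) -> str:
    """Assign a read to a category based on its SAM flag (labels are illustrative)."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


if __name__ == "__main__":
    for flag in (99, 73, 133, 0):
        print(flag, classify(flag))
```

Flag 73 (paired, mate unmapped, first in pair) is a typical singleton, and flag 133 is the unmapped mate that goes with it.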

Singletons: 1.4 % (17 207 470 reads). Other: 98.6 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (647 726 016 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
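The arithmetic behind the ~500 base pair fragment example can be sketched as follows. The function names and the `tolerance` parameter are hypothetical; real aligners typically estimate the expected range from the observed insert size distribution rather than from a fixed tolerance:

```python
def expected_inner_distance(fragment_len: int, read_len: int) -> int:
    """Gap expected between two mates read inwards from both ends of a fragment."""
    return fragment_len - 2 * read_len


def has_expected_separation(observed_gap: int, fragment_len: int,
                            read_len: int, tolerance: int) -> bool:
    """True if the observed gap is within `tolerance` bases of the expectation.

    `tolerance` is a hypothetical parameter used only for illustration.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance


if __name__ == "__main__":
    # 500 bp fragments with 100 bp reads from either end.
    print(expected_inner_distance(500, 100))
    print(has_expected_separation(5000, 500, 100, tolerance=150))
```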

Proper pairs: 92.2 % (1 193 824 834 reads). Other: 7.8 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
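A minimal sketch of the coordinate-based approach described above: pairs that share both the read coordinate and the mate coordinate are flagged, keeping the first one seen. The `Pair` record and `find_duplicates` function are hypothetical names, and real duplicate-marking tools apply additional criteria (such as strand and base quality) that are left out here:

```python
from typing import Iterable, NamedTuple, Set


class Pair(NamedTuple):
    """Hypothetical minimal record for one mapped read pair."""
    name: str
    chrom: str
    pos: int
    mate_chrom: str
    mate_pos: int


def find_duplicates(pairs: Iterable[Pair]) -> Set[str]:
    """Return names of pairs whose coordinates repeat an earlier pair."""
    seen = set()
    duplicates = set()
    for pair in pairs:
        key = (pair.chrom, pair.pos, pair.mate_chrom, pair.mate_pos)
        if key in seen:
            duplicates.add(pair.name)  # same coordinates as an earlier pair
        else:
            seen.add(key)              # first pair seen at these coordinates
    return duplicates


if __name__ == "__main__":
    example = [
        Pair("r1", "1", 1000, "1", 1400),
        Pair("r2", "1", 1000, "1", 1400),  # same coordinates as r1
        Pair("r3", "1", 1000, "1", 1450),  # mate maps elsewhere
    ]
    print(find_duplicates(example))
```

Note that `r3` starts at the same position as `r1` but is not flagged, because its mate maps to a different coordinate.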

Duplicates: 2.7 % (34 455 420 reads). Other: 97.3 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (0.1G to 1.1G).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: stacked columns of mapped and unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome   Mapped    Unmapped
1            99.06%    0.94%
2            98.63%    1.37%
3            99.26%    0.74%
4            98.67%    1.33%
5            99.24%    0.76%
6            99.32%    0.68%
7            98.61%    1.39%
8            98.94%    1.06%
9            98.67%    1.33%
10           97.21%    2.79%
11           98.35%    1.65%
12           98.92%    1.08%
13           99.41%    0.59%
14           99.02%    0.98%
15           99.22%    0.78%
16           98.25%    1.75%
17           98.7%     1.3%
18           97.88%    2.12%
19           97.69%    2.31%
20           98.59%    1.41%
21           96.88%    3.12%
X            98.43%    1.57%
Y            90.6%     9.4%
M            99.44%    0.56%