European Genome-Phenome Archive

File Quality

File Information: EGAF00008413652

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered. The y-axis shows the number of reference positions (bases) with that coverage.

Data is shown on a log scale to compress the wide range of values. The expected shape is a high peak at the beginning (low coverage) followed by a descending curve.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: number of bases (log scale, 1k to 100M).]
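As a minimal sketch of how such a distribution could be tabulated, assume the per-position depths are already available as a list of integers. The function name `coverage_histogram` and the overflow cap of 1000 are illustrative choices mirroring the ">1000" bin on the chart, not the archive's implementation.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`
    (an assumption that mimics the ">1000" bin on the chart).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)


# Hypothetical depths for eight reference positions.
example_depths = [0, 0, 1, 1, 1, 30, 30, 2500]
print(coverage_histogram(example_depths))
```

Plotting the resulting counts with a logarithmic y-axis gives a chart of the kind described above.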

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly called, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0G to 45G).]
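To make the scale concrete, the sketch below converts between Phred scores and error probabilities using the standard Phred definition Q = -10 log10(P). The function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10,000 base calls.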

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, means that every sequenced sample differs genetically from the reference genome it is aligned to. Small differences from the reference are tolerated, but with more substantial variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.6 % (441 324 286 reads); unmapped: 0.4 %.
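The rate calculation can be sketched as follows. The function name and the example counts are hypothetical; the report does not state the unmapped read count, so the numbers below are chosen only to illustrate the formula mapped / (mapped + unmapped).

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts, not taken from this file.
rate = mapped_rate(mapped=996, unmapped=4)
print(f"{rate:.1%}")  # prints 99.6%
```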

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance, which the Proper Pairs chart excludes.

Both mates mapped: 99.5 % (440 694 072 reads); remainder: 0.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls within the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (630 214 reads); remainder: 99.9 %.
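In SAM/BAM files these categories can be read off the FLAG field. The sketch below uses the bit values defined in the SAM specification; the function name and category labels are illustrative, and whether the archive derives its charts this way is an assumption.

```python
# FLAG bits as defined in the SAM format specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify_read(flag):
    """Label a read by the mapping status of itself and its mate."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "unpaired_mapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"


for flag in (99, 73, 133):
    print(flag, classify_read(flag))
```

In this report the both-mates-mapped count (440 694 072) and the singleton count (630 214) add up to the mapped read count (441 324 286).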

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them should map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (221 547 577 reads); remainder: 50 %.
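A rough way to compute this fraction from SAM FLAG values is sketched below. It relies on two bits from the SAM specification (0x4 for an unmapped read, 0x10 for a read aligned to the reverse strand). Using mapped reads as the denominator is an assumption made for this example; the report does not state which denominator it uses.

```python
UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
REVERSE = 0x10   # SAM FLAG bit: read aligned to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Hypothetical FLAG values: two forward, two reverse, one unmapped.
print(forward_fraction([0, 16, 99, 147, 4]))  # prints 0.5
```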

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from each end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large deletion, mates spanning the variant map with a much larger separation than expected; this is a signal for the variant, and such pairs are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98 % (434 247 450 reads); remainder: 2 %.
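The distance arithmetic in the example above can be written out as a small sketch. Both function names and the fixed tolerance are hypothetical; real aligners typically estimate the fragment size distribution from the data and also check read orientation, which this sketch does not do.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates when both ends of a fragment are read."""
    return fragment_len - 2 * read_len


def within_expected_distance(observed_fragment_len, expected_fragment_len, tolerance):
    """True if the observed fragment length is close to the expected one."""
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance


print(expected_inner_distance(500, 100))         # prints 300
print(within_expected_distance(520, 500, 100))   # prints True
print(within_expected_distance(5000, 500, 100))  # prints False
```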

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. Analyses of sequencing data assume that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of depth, and depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that distinct reads map to the same location and are marked as duplicates even though they are not. As a result, once the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 7.6 % (33 844 706 reads); remainder: 92.4 %.
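The coordinate-based idea described above can be illustrated with a deliberately simplified sketch. The tuple layout and function name are hypothetical, and production duplicate-marking tools apply more refined rules than this exact-match comparison.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates match a pair seen earlier.

    Each pair is a tuple (chrom, pos, strand, mate_chrom, mate_pos).
    Returns a list of booleans, True where the pair is a duplicate.
    """
    seen = set()
    is_duplicate = []
    for key in pairs:
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


# Hypothetical pairs: the second repeats the first, the third differs
# only in the mate position and so is not flagged.
pairs = [
    ("1", 100, "+", "1", 400),
    ("1", 100, "+", "1", 400),
    ("1", 100, "+", "1", 450),
]
print(mark_duplicates(pairs))  # prints [False, True, False]
```

Because the key is based on position alone, two independent fragments that happen to share coordinates are flagged as well, which is the depth-related effect noted above.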

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect the distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (up to 400M).]
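One simple way to summarise such a distribution is the fraction of reads at or above a chosen mapping quality. The sketch below assumes the histogram is available as a dictionary from score to read count; the function name and the example counts are hypothetical and are not taken from this file.

```python
def fraction_at_least(hist, threshold):
    """Fraction of reads whose mapping quality is >= threshold."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    kept = sum(n for q, n in hist.items() if q >= threshold)
    return kept / total


# Hypothetical histogram: mapping quality -> number of reads.
example_hist = {0: 5, 30: 15, 60: 80}
print(fraction_at_least(example_hist, 60))  # prints 0.8
```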

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mapped mate. Unmapped reads with coordinates can also arise when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%). Across the 24 plotted columns (x-axis labels as extracted: 1 to 21, X, Y, M), the mapped fraction ranges from 99.72% to 99.89% and the unmapped fraction from 0.11% to 0.28%.]
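A per-chromosome tally of this kind can be sketched as follows, assuming each record is a (chromosome, FLAG) pair in which unmapped reads carry their mate's chromosome, as described above. The function name and the example records are hypothetical; the 0x4 bit is the SAM specification's "read unmapped" flag.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: fraction of its reads that are mapped}."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Hypothetical records: chromosome 1 has three mapped reads and one
# unmapped read placed at its mate's position; X has one mapped read.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
print(per_chromosome_mapped_rate(records))
```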