European Genome-Phenome Archive

File Quality

File Information: EGAF00001586640

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Data is shown on a log scale so that the wide range of values remains readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.
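As a rough illustration of how such a distribution could be built, the sketch below counts how many positions share each coverage value and then takes log10 of the counts. The input list `depths`, the function names, and the cap of 1000 (chosen to mirror the ">1000" bin on the chart axis) are assumptions made for the example, not the archive's actual pipeline.

```python
from collections import Counter
import math


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    `depths` is a hypothetical list with one coverage value per reference
    position. The overflow bin is stored under the key `cap + 1`.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


def log10_counts(hist):
    """Return log10 of each bin count, as plotted on a log-scaled y-axis."""
    return {cov: math.log10(n) for cov, n in hist.items()}


# Toy input: three uncovered positions, two covered once, one very deep.
depths = [0, 0, 0, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
```

The overflow bin keeps a handful of extremely deep positions from stretching the x-axis.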

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
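The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P into a Phred score Q = -10 log10(P)."""
    return -10 * math.log10(p)
```

Under this definition, a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.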

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0 to 3G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation the read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
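The calculation can be written out as a minimal sketch. The function name and the example counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Hypothetical counts: 998 mapped reads and 2 unmapped reads.
rate = mapped_rate(998, 2)
```

Note that the denominator is the sum of both counts, so the parentheses in the formula matter.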

Mapped reads: 99.8 % (51 821 790 reads). Unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart.

Both mates mapped: 99.7 % (51 743 238 reads). Remainder: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
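If the alignments are stored in SAM/BAM format (an assumption; the page does not state the file format), the categories described above can be told apart from the standard FLAG bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The sketch below is illustrative; the function name and category labels are made up for the example.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate is unmapped
    return "unmapped"
```

For instance, FLAG 73 (paired, mate unmapped, first in pair) is a singleton, while FLAG 99 has both mates mapped.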

Singletons: 0.2 % (78 552 reads). Remainder: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
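Assuming SAM/BAM input (not stated on the page), strand is recorded in FLAG bit 0x10, which is set when a read maps to the reverse strand. A minimal sketch of the forward-strand fraction over mapped reads follows; the function name and the toy flag values are hypothetical.

```python
UNMAPPED = 0x4  # read is unmapped
REVERSE = 0x10  # read maps to the reverse strand


def forward_fraction(flags):
    """Return the fraction of mapped reads that lie on the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Toy input: two forward reads, two reverse reads, one unmapped read.
fraction = forward_fraction([99, 147, 83, 163, 77])
```

Unmapped reads are excluded first because the strand bit carries no meaning for them.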

Forward strand: 50 % (25 955 674 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the reads map with an unexpected separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.
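The arithmetic in the example above (500 - 2 × 100 = 300) can be sketched as follows. The `tolerance` parameter and both function names are assumptions for illustration; real aligners derive the accepted range from the observed insert size distribution rather than from a fixed tolerance.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths.

    The result is negative when the reads are long enough to overlap.
    """
    return fragment_len - 2 * read_len


def is_within_expected(observed, expected, tolerance):
    """Check whether an observed separation is close to the expected one."""
    return abs(observed - expected) <= tolerance


gap = expected_inner_distance(500, 100)  # 300 base pairs
```

A pair whose mates sit 5 000 base pairs apart would fail this check with any small tolerance, which is the kind of signal the text associates with structural variants.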

Proper pairs: 84 % (43 584 628 reads). Remainder: 16 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
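The coordinate-based approach described above can be sketched in a few lines: a pair is flagged when an earlier pair has the same read and mate coordinates. The tuple layout and function names are hypothetical, and the sketch ignores strand, read orientation and other criteria that real duplicate-marking tools take into account.

```python
def mark_duplicates(pairs):
    """Return names of pairs whose coordinates repeat an earlier pair.

    `pairs` is a list of (name, chrom, pos, mate_chrom, mate_pos) tuples.
    The first pair seen at a given set of coordinates is kept as original.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates."""
    return len(mark_duplicates(pairs)) / len(pairs)


# Toy input: "b" repeats the coordinates of "a"; "c" differs in mate position.
pairs = [
    ("a", "1", 100, "1", 400),
    ("b", "1", 100, "1", 400),
    ("c", "1", 100, "1", 450),
    ("d", "2", 100, "2", 400),
]
```

Because the key is based on coordinates alone, two independent fragments that happen to start and end at the same positions are also flagged, which is the false-positive effect the text describes at high depth.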

Duplicates: 12.1 % (6 300 236 reads). Non-duplicates: 87.9 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
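One way to summarise such a distribution is to report the share of reads at score 0 and the share at or above a chosen score. The sketch below assumes a dictionary of read counts per mapping quality; the function name, the threshold of 30 and the toy counts are hypothetical and are not read from the chart.

```python
def mapq_summary(counts, threshold=30):
    """Return (fraction of reads at MAPQ 0, fraction at MAPQ >= threshold).

    `counts` maps each mapping quality score to a number of reads.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to summarise")
    zero = counts.get(0, 0) / total
    high = sum(n for q, n in counts.items() if q >= threshold) / total
    return zero, high


# Toy histogram: most reads at 60, a few ambiguous reads at 0.
zero_frac, high_frac = mapq_summary({0: 5, 10: 5, 60: 90})
```

A large share at score 0 would point to many reads from repetitive sequence, in line with the explanation above.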

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (5M to 40M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is similar to the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

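[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosomes 1 to 21, X, Y, M. Y-axis: percentage of reads (0% to 100%).]

Mapped (in x-axis order): 99.86%, 99.86%, 99.86%, 99.86%, 99.82%, 99.84%, 99.86%, 99.86%, 99.87%, 99.85%, 99.85%, 99.85%, 99.87%, 99.86%, 99.85%, 99.85%, 99.84%, 99.85%, 99.84%, 99.85%, 99.83%, 99.86%, 99.86%, 99.9%

Unmapped (in x-axis order): 0.14%, 0.14%, 0.14%, 0.14%, 0.18%, 0.16%, 0.14%, 0.14%, 0.13%, 0.15%, 0.15%, 0.15%, 0.13%, 0.14%, 0.15%, 0.15%, 0.16%, 0.15%, 0.16%, 0.15%, 0.17%, 0.14%, 0.14%, 0.1%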