European Genome-Phenome Archive

File Quality

File Information: EGAF00003247006

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the large range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. x-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 1k to 100M).]
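As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value and pools everything above a cap into one overflow bin, as the chart does with its ">1000" bin. The function name `coverage_histogram`, the `cap` parameter, and the toy depths are assumptions made for illustration; they are not the archive's actual pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Toy per-position depths (hypothetical values, not taken from this file).
depths = [0, 0, 3, 3, 3, 30, 1200, 5000]
hist = coverage_histogram(depths)
```

Plotting the counts in `hist` on a logarithmic y-axis would give a curve of the kind described above.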

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. x-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 18G).]
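To make the scale concrete, here is a minimal sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10 log10(P). The function names are hypothetical and chosen only for this example.

```python
import math

def phred_to_error_prob(q):
    """Error probability P implied by a Phred score Q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)    # one expected error in 10 000 base calls
q30 = error_prob_to_phred(0.001)
```

Every step of 10 in the score corresponds to a tenfold drop in the error probability, which is why a distribution concentrated near 40 indicates confident base calls.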

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation the read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 98.9 % (176 933 948 reads). Unmapped: 1.1 %.
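The calculation described above can be sketched in a few lines. The function name `mapping_rate` and the example counts are hypothetical; the report does not state the unmapped read count, so toy numbers are used.

```python
def mapping_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Toy counts (hypothetical): 989 mapped reads and 11 unmapped reads.
rate = mapping_rate(989, 11)
```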

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; the Proper Pairs chart covers only the pairs that do.

Both mates mapped: 98.8 % (176 721 912 reads). Remainder: 1.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (212 036 reads). Remainder: 99.9 %.
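A minimal sketch of how reads could be sorted into these categories, assuming the alignments are stored in SAM/BAM format and using the FLAG bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function name and category labels are hypothetical, and this is not necessarily how the archive computes its statistics.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify_read(flag):
    """Assign a read to a mapping category from its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "unpaired_mapped"

# Typical FLAG values: 99 (mapped, mate mapped), 73 (mapped, mate unmapped),
# 133 (unmapped mate of a mapped read), 0 (mapped, unpaired).
examples = {flag: classify_read(flag) for flag in (99, 73, 133, 0)}

# Counts reported in the charts of this report.
mapped, both_mates_mapped = 176_933_948, 176_721_912
singletons = mapped - both_mates_mapped
```

The reported counts are consistent with this split: 176 933 948 mapped reads minus 176 721 912 reads with both mates mapped leaves the 212 036 singletons shown above.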

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (89 478 909 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, mates that map with a separation far from the expected one are a signal for this variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.3 % (174 081 506 reads). Remainder: 2.7 %.
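The arithmetic behind the expected separation, together with a simple distance check, can be sketched as follows. The function names and the `tolerance` parameter are hypothetical: real aligners typically derive the accepted range from the observed fragment-size distribution and also check read orientation, which this sketch does not do.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced bases between the two mates of one fragment."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_fragment_len, expected_fragment_len, tolerance):
    """True if the mapped fragment length is close to the library's expected length."""
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance

# Example from the text: ~500 bp fragments with 100 bp reads from either end.
gap = expected_gap(500, 100)

# Hypothetical tolerance of 100 bp around the expected fragment length.
typical = within_expected_distance(520, 500, tolerance=100)
suspicious = within_expected_distance(5000, 500, tolerance=100)
```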

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, so it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As a result, once the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 10.7 % (19 160 321 reads). Remainder: 89.3 %.
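The coordinate-based idea described above can be sketched as follows: a pair is flagged as a duplicate when an earlier pair has the same coordinates for both the read and its mate. This is a simplified illustration with hypothetical names and toy data; real duplicate-marking tools apply more refined criteria.

```python
def mark_duplicates(pairs):
    """Flag each pair whose read and mate coordinates were already seen."""
    seen = set()
    flags = []
    for chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        flags.append(key in seen)
        seen.add(key)
    return flags

def duplicate_rate(flags):
    """Fraction of pairs flagged as duplicates."""
    return sum(flags) / len(flags) if flags else 0.0

# Toy pairs (hypothetical): (chromosome, position, mate chromosome, mate position).
pairs = [
    ("1", 1000, "1", 1400),
    ("1", 1000, "1", 1400),  # same coordinates as the first pair
    ("1", 1000, "1", 1450),  # same start, different mate position
]
flags = mark_duplicates(pairs)
rate = duplicate_rate(flags)
```

At very high depth, independent fragments increasingly share coordinates by chance, so a scheme like this would flag some of them incorrectly, which is the caveat raised in the text.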

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 60); y-axis: # Reads (20M to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can map to the Y chromosome because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. x-axis: chromosome (autosomes, X, Y, M); y-axis: 0% to 100%. Mapped fractions range from 99.82% to 99.92%, unmapped fractions from 0.08% to 0.18%.]
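A rough sketch of the per-chromosome tally described above, assuming SAM-style records in which an unmapped read with a mapped mate carries its mate's reference name. The function name, the `(rname, flag)` record layout, and the toy records are hypothetical and are not the archive's implementation.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def mapped_fraction_per_chromosome(records):
    """Return the fraction of mapped reads for each reference name."""
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue  # read has no chromosome assigned at all
        status = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][status] += 1
    return {
        chrom: c["mapped"] / (c["mapped"] + c["unmapped"])
        for chrom, c in counts.items()
    }

# Toy records (hypothetical): FLAG 133 is an unmapped read placed with its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
fractions = mapped_fraction_per_chromosome(records)
```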