European Genome-Phenome Archive

File Quality

File Information

EGAF00004921570

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference is covered. The y-axis shows the number of bases (positions) that have that coverage.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: # bases, log scale (1k to 100M).]
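
As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value. The function name, the `cap` parameter and the example depths are hypothetical and are not taken from the EGA pipeline; the overflow bin only mimics the ">1000" bin on the chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into a single overflow bin, keyed as
    cap + 1, mirroring the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(hist)


# Hypothetical per-position depths for a tiny reference.
example_depths = [1, 1, 2, 30, 30, 30, 1500, 2000]
histogram = coverage_histogram(example_depths)
print(histogram)
```

Plotting the resulting counts on a log-scaled y-axis would give the kind of curve described above.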

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 14G).]
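
The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 × log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q into the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) into Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly one wrong call in 10,000 bases.
print(phred_to_error_prob(40))
print(error_prob_to_phred(0.001))
```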

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped reads: 145 477 725 (99.2 %). Unmapped: 0.8 %.
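
The rate calculation described above, mapped / (mapped + unmapped), can be written as a small helper. The counts in the example are hypothetical and were chosen only to give a round result.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts, not taken from this file.
rate = mapped_rate(mapped=992, unmapped=8)
print(f"{rate:.1%}")
```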

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; mate distance is what the Proper Pairs chart accounts for.

Both mates mapped: 145 196 754 reads (99 %). Remainder: 1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 280 971 reads (0.2 %). Remainder: 99.8 %.
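
One way to separate singletons from reads with both mates mapped is to inspect the SAM FLAG field of each alignment record. This is a sketch under the assumption that the input is SAM/BAM; the page does not say how EGA computes these numbers. The bit values are the ones defined in the SAM specification, while the function name and category labels are made up for illustration.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag):
    """Classify one alignment record by its mapping status."""
    if flag & READ_UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "mapped_unpaired"
    if flag & MATE_UNMAPPED:
        return "singleton"  # this read mapped, its mate did not
    return "both_mates_mapped"


print(classify_read(99))   # paired, proper pair, mate reverse, first in pair
print(classify_read(73))   # paired, mate unmapped, first in pair
print(classify_read(133))  # paired, read unmapped, second in pair

# The counts reported on this page add up: reads with both mates mapped
# plus singletons equals the number of mapped reads.
both_mates_mapped = 145_196_754
singletons = 280_971
mapped_total = both_mates_mapped + singletons
print(mapped_total)
```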

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them should map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 73 304 457 reads (50 %). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation are a signal for this variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 142 582 276 reads (97.3 %). Remainder: 2.7 %.
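
The distance arithmetic in the example above (a ~500 bp fragment with 100 bp reads at either end leaves ~300 bp between the reads) can be sketched as below. The `tolerance` parameter is a hypothetical stand-in for whatever window an aligner actually uses to decide that a pair sits at the expected distance.

```python
def expected_separation(fragment_length, read_length):
    """Gap between the two mates once both reads are taken off the fragment."""
    return fragment_length - 2 * read_length


def at_expected_distance(observed_separation, expected, tolerance):
    """True if the observed gap is within +/- tolerance of the expected gap."""
    return abs(observed_separation - expected) <= tolerance


expected = expected_separation(fragment_length=500, read_length=100)
print(expected)

# Hypothetical observations with a hypothetical tolerance of 100 bp.
print(at_expected_distance(320, expected, tolerance=100))
print(at_expected_distance(5000, expected, tolerance=100))
```

Real aligners also check read orientation before flagging a pair as proper; that part is left out of this sketch.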

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 13 478 901 reads (9.2 %). Remainder: 90.8 %.
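
The coordinate-based detection described above can be illustrated with a minimal sketch: pairs are keyed by the positions of both mates, and every pair after the first with the same key is flagged. This is a simplification under stated assumptions, not how any particular tool works. Production duplicate markers typically also consider strand and keep the highest-quality pair rather than the first one seen. All names and example coordinates are hypothetical.

```python
def find_duplicates(pairs):
    """Return indices of pairs whose coordinates repeat an earlier pair.

    Each pair is a tuple (chrom, pos, mate_chrom, mate_pos).
    """
    seen = set()
    duplicate_indices = []
    for index, key in enumerate(pairs):
        if key in seen:
            duplicate_indices.append(index)
        else:
            seen.add(key)
    return duplicate_indices


# Hypothetical read pairs; the second and fourth repeat the first.
example_pairs = [
    ("1", 1000, "1", 1300),
    ("1", 1000, "1", 1300),
    ("1", 1000, "1", 1350),
    ("1", 1000, "1", 1300),
    ("2", 500, "2", 820),
]
duplicates = find_duplicates(example_pairs)
duplicate_rate = len(duplicates) / len(example_pairs)
print(duplicates, duplicate_rate)
```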

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores, which describe the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (10M to 130M).]
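
To judge how strongly a mapping quality distribution is skewed towards high scores, one option is to compute the share of reads at or above a chosen score. The sketch below assumes the distribution is available as a mapping from score to read count; the function name, the threshold and the example counts are hypothetical.

```python
def share_at_or_above(mapq_counts, threshold):
    """Fraction of reads whose mapping quality is >= threshold."""
    total = sum(mapq_counts.values())
    if total == 0:
        raise ValueError("empty distribution")
    kept = sum(count for score, count in mapq_counts.items() if score >= threshold)
    return kept / total


# Hypothetical distribution: a few ambiguous reads at 0, most reads at 60.
example_counts = {0: 5, 30: 15, 60: 80}
print(share_at_or_above(example_counts, 30))
print(share_at_or_above(example_counts, 60))
```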

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (x-axis labels: 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Mapped: 99.76% to 99.85% per chromosome. Unmapped: 0.15% to 0.24% per chromosome.]
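
The per-chromosome breakdown can be sketched as a simple tally. The example assumes each read has already been reduced to a (chromosome, is_mapped) pair, with unmapped reads carrying the chromosome inherited from their mate as described above. The function name and input data are hypothetical.

```python
from collections import defaultdict


def per_chromosome_percentages(reads):
    """Return {chrom: (mapped_pct, unmapped_pct)} from (chrom, is_mapped) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, is_mapped in reads:
        counts[chrom][0 if is_mapped else 1] += 1
    percentages = {}
    for chrom, (mapped, unmapped) in counts.items():
        total = mapped + unmapped
        percentages[chrom] = (100 * mapped / total, 100 * unmapped / total)
    return percentages


# Hypothetical reads: chromosome 1 has one unmapped read placed by its mate.
example_reads = [("1", True)] * 3 + [("1", False)] + [("X", True)] * 2
print(per_chromosome_percentages(example_reads))
```

Each tuple corresponds to one stacked column in the chart: the mapped share at the bottom and the unmapped share on top.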