European Genome-Phenome Archive

File Quality

File Information

EGAF00002339445

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage across the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered. The y-axis shows the number of bases (positions) that have that coverage.

Counts are plotted on a log scale so that the wide range of values remains readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.
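As a minimal sketch of how such a distribution could be tabulated, assume a list of per-position depths is already available (for example from a depth-reporting tool). The function names and the `cap` parameter are hypothetical and are not taken from the archive's own pipeline.

```python
from collections import Counter
import math

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one bin keyed by cap + 1,
    mirroring the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist

def log10_counts(hist):
    """Log-transform the counts, as a log-scaled y-axis would."""
    return {cov: math.log10(n) for cov, n in hist.items()}

# Illustrative depths only, not data from this file.
depths = [0, 0, 0, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
```

Here `hist[0]` counts the uncovered positions and `hist[1001]` holds everything above the cap.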

[Chart: base coverage distribution. X-axis: Coverage value (100 to >1000). Y-axis: # Bases, log scale (100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly called, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. For example, Q = 40 corresponds to an error probability of 1 in 10,000. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
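The relationship between a Phred score and its error probability can be sketched as follows, using the standard Phred definition Q = -10·log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)    # about 1 error in 10,000 calls
q30 = error_prob_to_phred(0.001) # about 30
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability.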

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 3G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
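The calculation can be sketched in a few lines. The counts below are illustrative only, since the report does not list the unmapped read count, and the function name is hypothetical.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total

# Illustrative counts, not taken from this file.
rate = mapped_rate(mapped=997, unmapped=3)
```

The parentheses matter: the denominator is the total number of reads, not the unmapped count alone.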

Mapped: 99.7 % (33 785 504 reads). Unmapped: 0.3 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; the expected-distance criterion is the subject of the Proper Pairs chart.

Both mates mapped: 99.5 % (33 715 364 reads). Remaining reads: 0.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
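In SAM/BAM files these categories can be read from the FLAG field. The sketch below uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function and category names are hypothetical, and the source does not state how the archive itself derives these counts.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "single_end_mapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"

labels = [classify(f) for f in (99, 73, 133, 0)]
```

Flag 99 describes a mapped read whose mate is also mapped, 73 a mapped read whose mate is unmapped, and 133 an unmapped read in a pair.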

Singletons: 0.2 % (70 140 reads). Remaining reads: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (16 942 683 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs (500 - 2 × 100). If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates will map at an unexpected separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
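The distance arithmetic can be sketched as below. The `tolerance` parameter and both function names are hypothetical; real aligners typically estimate the acceptable range from the observed fragment-size distribution and also check read orientation, which this sketch ignores.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the inner ends of the two mates."""
    return fragment_len - 2 * read_len

def distance_is_proper(observed_inner, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_inner - expected) <= tolerance

gap = expected_inner_distance(500, 100)              # 500 - 2 * 100
near = distance_is_proper(320, 500, 100, tolerance=100)
far = distance_is_proper(5000, 500, 100, tolerance=100)
```

A pair separated by several kilobases fails the check, which is the kind of signal a large structural variant would produce.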

The rate of proper pairs is expected to be well over 90%; even if the mapping rate itself is low as a result of bacterial contamination, for example.

Proper pairs: 98.5 % (33 364 786 reads). Remaining reads: 1.5 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and both should be considered together. Additionally, as the sequencing depth increases, it is increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As a result, once the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
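The coordinate-matching idea described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: the tuple layout and function name are assumptions, and production duplicate markers apply more refined rules when deciding which copy to keep.

```python
def mark_duplicates(pairs):
    """Return the names of pairs flagged as duplicates.

    pairs: iterable of tuples
        (name, chrom, pos, strand, mate_chrom, mate_pos, mate_strand)
    The first pair seen for each coordinate key is kept; later pairs
    with an identical key are flagged.
    """
    seen = set()
    duplicates = set()
    for name, *coords in pairs:
        key = tuple(coords)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Illustrative pairs only.
pairs = [
    ("a", "1", 100, "+", "1", 400, "-"),
    ("b", "1", 100, "+", "1", 400, "-"),  # same coordinates as "a"
    ("c", "1", 100, "+", "1", 450, "-"),  # mate maps elsewhere
]
dups = mark_duplicates(pairs)
dup_rate = len(dups) / len(pairs)
```

Pair "c" shares its own position with "a" but not its mate's position, so it is not flagged.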

Duplicates: 8.1 % (2 731 978 reads). Non-duplicates: 91.9 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (2M to 30M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
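A per-chromosome tally along these lines could look like the sketch below. It assumes each record has already been reduced to its reference name and SAM FLAG, relies on the SAM flag bit 0x4 (read unmapped), and treats a reference name of "*" as a read with no assigned chromosome. The function name and record layout are hypothetical.

```python
from collections import Counter

READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def per_chrom_fractions(records):
    """records: iterable of (rname, flag) tuples.

    Returns {rname: (mapped_fraction, unmapped_fraction)}.
    Reads with rname "*" have no chromosome and are skipped.
    """
    mapped, unmapped = Counter(), Counter()
    for rname, flag in records:
        if rname == "*":
            continue
        if flag & READ_UNMAPPED:
            unmapped[rname] += 1
        else:
            mapped[rname] += 1
    fractions = {}
    for chrom in set(mapped) | set(unmapped):
        total = mapped[chrom] + unmapped[chrom]
        fractions[chrom] = (mapped[chrom] / total, unmapped[chrom] / total)
    return fractions

# Illustrative records only: three mapped reads and one unmapped read
# placed on chromosome 1, plus one unmapped read with no chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("*", 77)]
fractions = per_chrom_fractions(records)
```

The read with flag 133 is unmapped but inherits chromosome 1 from its mate, so it contributes to that chromosome's unmapped share.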

[Chart: stacked columns of mapped vs unmapped reads per chromosome. X-axis labels as extracted: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Across the 24 columns, the mapped fraction ranges from 99.74% to 99.84% and the unmapped fraction from 0.16% to 0.26%.]