European Genome-Phenome Archive

File Quality

File Information

EGAF00003663099

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
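As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions have each coverage value. It assumes per-position depths are already available as a list of integers; the function name, the `cap` parameter and the example depths are hypothetical and are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`,
    similar to the '>1000' bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(hist)

# Made-up depths for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2000]
hist = coverage_histogram(example_depths)
```

Plotting the resulting counts on a log-scaled y-axis gives a chart of the kind described above.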

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
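The relation between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 × log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)     # roughly 1 error in 10,000 base calls
q30 = error_prob_to_phred(0.001)  # roughly Q30
```

Under this definition, each increase of 10 in the score corresponds to a tenfold drop in the error probability.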

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 10G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
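The calculation can be written as a small helper. This is a minimal sketch; the function name and the example counts are made up for illustration and are not the values of this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Made-up counts: 992 mapped reads and 8 unmapped reads.
rate = mapped_rate(992, 8)
```

Multiplying the returned fraction by 100 gives the percentage shown in the chart.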

Mapped reads: 98 309 323 (99.2 %); remainder: 0.8 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also counted here; the distance criterion is applied in the Proper Pairs chart.

Both mates mapped: 98 164 674 reads (99.1 %); remainder: 0.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
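In alignments stored in SAM/BAM format, these categories can be derived from the FLAG field. The sketch below uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function and category names are hypothetical, and whether the charts on this page are computed exactly this way is an assumption.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify_read(flag):
    """Classify a read from its SAM FLAG value (illustrative categories)."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "mapped_unpaired"
    if flag & MATE_UNMAPPED:
        return "singleton"      # this mate mapped, the other did not
    return "both_mapped"        # both mates mapped

categories = [classify_read(f) for f in (99, 73, 133, 0)]
```

A fuller implementation would probably also skip secondary and supplementary alignments so that each read is counted once.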

Singletons: 144 649 reads (0.1 %); remainder: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 49 530 981 reads (50 %); remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map at an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.
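The arithmetic in the example above can be sketched as follows. The function name is hypothetical; it only reproduces the calculation of the gap between two mates sequenced inwards from the ends of a fragment.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length

# ~500 bp fragments with 100 bp reads from either end leave a ~300 bp gap.
gap = expected_inner_distance(500, 100)
```

Aligners typically tolerate a range around the expected value rather than requiring an exact match, since fragment lengths vary within a library.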

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).

Proper pairs: 95 841 218 reads (96.7 %); remainder: 3.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
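The coordinate-based approach described above can be sketched in a few lines. This is a simplified illustration, not the method used by any particular tool: the read representation (dictionaries with `chrom`, `pos`, `strand`, `mate_chrom` and `mate_pos` keys) and the function names are assumptions, and real duplicate-marking tools apply more refined rules.

```python
def mark_duplicates(reads):
    """Flag a read as duplicate if an earlier read shares its coordinates.

    Two reads are treated as duplicates when they and their mates map to
    identical coordinates (same chromosome, position and strand).
    """
    seen = set()
    flags = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"],
               read["mate_chrom"], read["mate_pos"])
        flags.append(key in seen)  # the first read with a given key is kept
        seen.add(key)
    return flags

def duplicate_rate(flags):
    """Fraction of reads flagged as duplicates."""
    return sum(flags) / len(flags) if flags else 0.0

# Made-up reads for illustration only.
reads = [
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 450},
    {"chrom": "2", "pos": 100, "strand": "-", "mate_chrom": "2", "mate_pos": 50},
]
flags = mark_duplicates(reads)
rate = duplicate_rate(flags)
```

Because the key depends only on coordinates, independent fragments that happen to share start positions are also flagged, which is why the rate becomes less informative at high depth.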

Duplicates: 9 305 411 reads (9.4 %); remainder: 90.6 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (10M to 80M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: stacked columns of mapped vs unmapped reads per chromosome (axis labels 1 to 21, X, Y, M). Y-axis: 0% to 100%. Mapped fractions range from 99.79% to 99.9%; unmapped fractions range from 0.1% to 0.21%.]