European Genome-Phenome Archive

File Quality

File Information: EGAF00002340499

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale so that both very common and very rare coverage values remain visible. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 18G).]
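As a rough illustration of the Phred scale, the sketch below converts between a score and an error probability using the standard definition Q = -10 log10(P). The function names are hypothetical and are not part of any EGA tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Each step of 10 in Q divides the error probability by 10.
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):g}")
```

On this scale a score of 40 corresponds to one expected error in 10 000 base calls. The same conversion applies to mapping quality scores.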

Mapped Reads

Number of reads successfully mapped to the reference genome in the sample (singletons and both mates). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but a read with more significant variation can fail to be placed. Therefore, the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 173 246 092 reads (99.6 %); unmapped: 0.4 %
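The rate calculation can be sketched as follows. The function name and the toy counts in the usage lines are hypothetical. The only figures taken from this page are the both-mates-mapped and singleton counts, which add up to the mapped read count reported here.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total <= 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Counts reported on this page: mapped reads are the reads whose mate is
# also mapped plus the singletons.
both_mates_mapped = 172_579_944
singletons = 666_148
mapped_total = both_mates_mapped + singletons  # 173 246 092

if __name__ == "__main__":
    print(mapped_total)
    # Toy counts, for illustration only: 996 mapped and 4 unmapped reads.
    print(f"{mapped_rate(996, 4):.1%}")
```

The parentheses around the denominator matter: the rate is the share of mapped reads among all reads, not mapped / mapped + unmapped evaluated left to right.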

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; such pairs are excluded from the Proper Pairs chart.

Both mates mapped: 172 579 944 reads (99.2 %); other: 0.8 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome, but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 666 148 reads (0.4 %); other: 99.6 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 87 005 487 reads (50 %); other: 50 %
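If the file is an alignment in SAM, BAM or CRAM format, the categories above (mapped, both mates mapped, singleton, forward strand) can be read from the FLAG field of each record. The sketch below uses the bit values defined in the SAM specification. How this report actually derives its counts is not stated on the page, so the function name and the label strings are assumptions, and secondary or supplementary alignments are ignored for simplicity.

```python
from typing import Set

# FLAG bits defined by the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped
REVERSE = 0x10       # read maps to the reverse strand


def categories(flag: int) -> Set[str]:
    """Return the report-style categories implied by a SAM FLAG value."""
    if flag & UNMAPPED:
        return {"unmapped"}
    labels = {"mapped"}
    labels.add("reverse" if flag & REVERSE else "forward")
    if flag & PAIRED:
        # A mapped read whose mate is unmapped is a singleton.
        labels.add("singleton" if flag & MATE_UNMAPPED else "both_mates_mapped")
    return labels


if __name__ == "__main__":
    for flag in (99, 147, 73, 4):
        print(flag, sorted(categories(flag)))
```

Counting the records in each category and dividing by the total number of reads would give rates comparable to the ones shown in these charts.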

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. In particular, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with an unexpected separation; this is a signal for the variant, and the reads would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 170 441 350 reads (97.9 %); other: 2.1 %
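The expected separation in the example above is simple arithmetic: the fragment length minus the two read lengths. A minimal sketch, with a hypothetical function name:

```python
def expected_inner_distance(fragment_length: int, read_length: int) -> int:
    """Return the gap between two mates sequenced from either end of a fragment.

    A negative result means the two reads overlap.
    """
    if fragment_length <= 0 or read_length <= 0:
        raise ValueError("lengths must be positive")
    return fragment_length - 2 * read_length


if __name__ == "__main__":
    # ~500 bp fragments with 100 bp reads from each end leave a ~300 bp gap.
    print(expected_inner_distance(500, 100))
```

Real libraries have a distribution of fragment lengths, so aligners accept a range of separations rather than a single value.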

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate, and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 23 179 403 reads (13.3 %); non-duplicates: 86.7 %
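The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the method used to produce this report: the `Pair` type and function names are hypothetical, and real duplicate-marking tools also take strand, clipping and base quality into account.

```python
from typing import List, NamedTuple, Set, Tuple


class Pair(NamedTuple):
    name: str
    chrom: str
    pos: int         # mapped position of the read
    mate_chrom: str
    mate_pos: int    # mapped position of its mate


def mark_duplicates(pairs: List[Pair]) -> Set[str]:
    """Return the names of pairs flagged as duplicates.

    The first pair seen with a given (read, mate) coordinate signature is
    kept; every later pair with the same signature is flagged.
    """
    seen: Set[Tuple[str, int, str, int]] = set()
    duplicates: Set[str] = set()
    for p in pairs:
        key = (p.chrom, p.pos, p.mate_chrom, p.mate_pos)
        if key in seen:
            duplicates.add(p.name)
        else:
            seen.add(key)
    return duplicates


def duplicate_rate(pairs: List[Pair]) -> float:
    """Fraction of pairs flagged as duplicates."""
    if not pairs:
        raise ValueError("no pairs given")
    return len(mark_duplicates(pairs)) / len(pairs)


if __name__ == "__main__":
    example = [
        Pair("a", "1", 100, "1", 400),
        Pair("b", "1", 100, "1", 400),  # same coordinates as "a"
        Pair("c", "1", 100, "1", 450),  # mate maps elsewhere, so it is kept
        Pair("d", "2", 100, "2", 400),
    ]
    print(sorted(mark_duplicates(example)), duplicate_rate(example))
```

Because the signature depends only on coordinates, two independent fragments that happen to share both positions are flagged as well, which is why the rate becomes less informative at very high depth.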

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location that it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (20M to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, because it concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.

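[Chart: mapped vs unmapped reads per chromosome. Y-axis: 0% to 100%. Values listed in the order of the chart's axis labels.]

Chromosome  Mapped   Unmapped
1           99.6%    0.4%
2           99.6%    0.4%
3           99.61%   0.39%
4           99.63%   0.37%
5           99.61%   0.39%
6           99.61%   0.39%
7           99.61%   0.39%
8           99.61%   0.39%
9           99.59%   0.41%
10          99.59%   0.41%
11          99.59%   0.41%
12          99.6%    0.4%
13          99.62%   0.38%
14          99.6%    0.4%
15          99.58%   0.42%
16          99.63%   0.37%
17          99.59%   0.41%
18          99.6%    0.4%
19          99.61%   0.39%
20          99.56%   0.44%
21          99.61%   0.39%
X           99.62%   0.38%
Y           99.75%   0.25%
M           99.45%   0.55%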