European Genome-Phenome Archive

File Quality

File Information: EGAF00002492868

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

Counts are plotted on a log scale to compress their wide range. A high peak at the start (low coverage) followed by a descending curve is expected.
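
As a minimal sketch of how such a distribution could be tabulated, the Python below counts how many positions have each coverage value and pools everything above a cap into one overflow bin. The function name, the example depths, and the cap of 1000 (chosen to mirror the ">1000" label on the chart) are assumptions for illustration, not the archive's actual implementation.

```python
import math
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed by cap + 1
    (an assumed convention for the ">1000" bucket).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
example_depths = [0, 1, 1, 2, 2, 2, 30, 1500, 2000]
hist = coverage_histogram(example_depths)

# Log-scaled counts, as would be used for the y-axis of the chart.
log_counts = {cov: math.log10(n) for cov, n in hist.items()}
```

Every bin in the histogram has a count of at least one, so taking the logarithm is safe here.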

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
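
The relationship between score and error probability can be sketched in Python using the standard Phred definition Q = -10 log10(P). The function names are hypothetical helpers chosen for illustration.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly 1 error in 10,000 calls.
p40 = phred_to_error_prob(40)
# An error probability of 0.001 corresponds to a score of about 30.
q_for_001 = error_prob_to_phred(0.001)
```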

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 14G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
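
The calculation can be written as a short Python function. The counts in the example are hypothetical round numbers, not this file's read counts.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts for illustration: 992 mapped, 8 unmapped.
rate = mapped_rate(992, 8)
```

The explicit parentheses matter: the denominator is the sum of mapped and unmapped reads.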

Mapped: 99.2 % (141 525 794 reads). Unmapped: 0.8 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart excludes them.

Both mates mapped: 99 % (141 192 574 reads). Other reads: 1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
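
Assuming the reads are stored in SAM/BAM format (an assumption; the report does not state the file format), the singleton definition above can be sketched in Python with the standard SAM flag bits. The helper names are hypothetical.

```python
# Standard SAM flag bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped


def is_singleton(flag):
    """True if the read is mapped but its mate is not."""
    return bool(flag & PAIRED) and not (flag & UNMAPPED) and bool(flag & MATE_UNMAPPED)


def both_mates_mapped(flag):
    """True if the read and its mate are both mapped."""
    return bool(flag & PAIRED) and not (flag & UNMAPPED) and not (flag & MATE_UNMAPPED)


# Flag 73 = paired + mate unmapped + first in pair: a singleton.
# Flag 99 = paired + proper pair + mate reverse + first in pair: both mapped.
# Flag 69 = paired + unmapped + first in pair: the unmapped mate itself.
```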

Singletons: 0.2 % (333 220 reads). Other reads: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (71 305 338 reads). Other reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads spanning it map at an unexpected separation; this is a signal for the variant, and those reads are not considered proper pairs. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.
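
The separation arithmetic in the example above (a ~500 base pair fragment with 100 base pair reads at either end) can be sketched in Python. The function names and the tolerance parameter are hypothetical; real aligners estimate the acceptable range from the observed insert size distribution.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two reads of a pair.

    A negative result means the two reads overlap.
    """
    return fragment_length - 2 * read_length


def separation_as_expected(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    return abs(observed - expected) <= tolerance


gap = expected_inner_distance(500, 100)
```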

Proper pairs: 96.1 % (136 982 820 reads). Other reads: 3.9 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, so it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
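
A simplified Python sketch of the coordinate-based approach described above follows. It flags a pair as a duplicate when an earlier pair has the same read and mate coordinates. The tuple layout and function name are assumptions for illustration; production tools also consider strand and orientation and choose which copy to keep based on quality.

```python
def find_duplicates(pairs):
    """Return names of pairs whose coordinates repeat an earlier pair.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos).
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical pairs, for illustration only.
example_pairs = [
    ("r1", "1", 100, "1", 400),
    ("r2", "1", 100, "1", 400),  # same coordinates as r1
    ("r3", "1", 100, "1", 450),  # mate position differs
    ("r4", "2", 100, "2", 400),  # different chromosome
]
dups = find_duplicates(example_pairs)
dup_rate = len(dups) / len(example_pairs)
```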

Duplicates: 5 % (7 082 214 reads). Other reads: 95 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores, which describe the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
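
One way to summarise such a distribution is the fraction of reads at or above a chosen mapping quality. The Python sketch below assumes the distribution is available as a mapping from score to read count; the function name and the example counts are hypothetical.

```python
def fraction_at_or_above(mapq_counts, threshold):
    """Fraction of reads whose mapping quality is >= threshold.

    `mapq_counts` maps a mapping quality score to a read count.
    """
    total = sum(mapq_counts.values())
    if total == 0:
        raise ValueError("no reads in the distribution")
    kept = sum(n for q, n in mapq_counts.items() if q >= threshold)
    return kept / total


# Hypothetical distribution, for illustration only.
example_counts = {0: 5, 30: 5, 60: 90}
well_mapped = fraction_at_or_above(example_counts, 60)
not_ambiguous = fraction_at_or_above(example_counts, 1)
```

With a threshold of 1, the result excludes only the reads with a score of 0, i.e. those that could map equally well to multiple places.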

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference onto another, it is given the right chromosome and position, but it is also classified as unmapped.
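
A minimal Python sketch of the per-chromosome tally follows, assuming SAM-style records in which an unmapped read with a mapped mate carries its mate's reference name. The record layout (reference name, flag) and the function name are assumptions for illustration.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM flag bit: this read is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped / (mapped + unmapped)}.

    `records` is an iterable of (rname, flag). Reads with no reference
    name ("*") cannot be attributed to a chromosome and are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# Hypothetical records, for illustration only.
example_records = [
    ("1", 99),    # mapped, mate mapped
    ("1", 147),   # mapped, mate mapped
    ("1", 73),    # mapped, mate unmapped
    ("1", 133),   # unmapped, placed at its mate's chromosome
    ("*", 77),    # both mates unmapped, no chromosome
]
rates = per_chromosome_mapped_rate(example_records)
```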

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (labelled 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Series: mapped, unmapped. Mapped values range from 99.48% to 99.78%; unmapped values range from 0.22% to 0.52%.]