European Genome-Phenome Archive

File Quality

File Information: EGAF00004040131

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Data is shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
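As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value. The input list of per-position depths, the function name `coverage_histogram`, and the `cap` parameter are assumptions made for this example; they are not part of the EGA report pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single bin keyed as cap + 1,
    mirroring the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500]
hist = coverage_histogram(example_depths)
for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis would then give a curve of the kind described above.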

[Chart: base coverage distribution. X-axis: Coverage value (up to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
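To make the scale concrete, here is a minimal sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10·log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Return the error probability P for a Phred score Q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Return the Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

# Print the error probability for a few common scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to one expected error in 10 000 base calls, and each step of 10 reduces the error probability tenfold.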

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 20G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated but, with more significant variation, a read can fail to be placed. Therefore, the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
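The calculation described above can be sketched as follows. The function name and the example counts are hypothetical and only illustrate the formula mapped / (mapped + unmapped).

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=996, unmapped=4)
print(f"{rate:.1%}")
```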

Mapped: 99.6 % (211 034 698 reads). Unmapped: 0.4 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; the Proper Pairs chart covers the pairs that do.

Both mates mapped: 99.4 % (210 553 896 reads). Remaining reads: 0.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (480 802 reads). Remaining reads: 99.8 %.
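Because mapped reads are defined on this page as singletons plus reads with both mates mapped, the three reported counts can be cross-checked. The sketch below assumes that the figures shown in the Mapped Reads, Both Mates Mapped and Singletons sections are read counts; the function name is hypothetical.

```python
# Read counts as reported on this page (assumed to be numbers of reads).
mapped_reads = 211_034_698
both_mates_mapped = 210_553_896
singletons = 480_802

def counts_are_consistent(mapped, both_mates, single):
    """Check that mapped reads equal both-mates-mapped reads plus singletons."""
    return both_mates + single == mapped

print(counts_are_consistent(mapped_reads, both_mates_mapped, singletons))
```

For the figures on this page the check holds: 210 553 896 + 480 802 = 211 034 698.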

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (105 899 398 reads). Remaining reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates map with a separation that differs markedly from the expected one; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
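The arithmetic behind the expected separation can be sketched as below: the gap between the two reads is the fragment length minus both read lengths. The function names and the fixed `tolerance` are assumptions for illustration; real aligners typically derive the accepted range from the observed insert-size distribution and also check read orientation.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length

def separation_is_expected(observed_gap, fragment_length, read_length, tolerance=100):
    """Return True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is a hypothetical fixed window, used here only to illustrate
    the idea of an "expected distance".
    """
    expected = expected_inner_distance(fragment_length, read_length)
    return abs(observed_gap - expected) <= tolerance

# The example from the text: ~500 bp fragments, 100 bp reads.
print(expected_inner_distance(500, 100))
print(separation_is_expected(320, 500, 100))
print(separation_is_expected(5000, 500, 100))
```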

Proper pairs: 97.9 % (207 285 098 reads). Remaining reads: 2.1 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
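The coordinate-based approach described above can be illustrated with a simplified sketch: a read is flagged as a duplicate when an earlier read has the same mapping coordinate and the same mate coordinate. The record layout and function name are hypothetical; production tools also take strand and orientation into account and usually keep the highest-quality read of each group.

```python
def find_duplicates(reads):
    """Return the names of reads whose coordinates repeat an earlier read.

    `reads` is an iterable of (name, chrom, pos, mate_chrom, mate_pos) tuples.
    The first read seen at a given coordinate pair is kept; later ones are
    reported as duplicates.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in reads:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical reads, for illustration only.
example_reads = [
    ("r1", "1", 1000, "1", 1400),
    ("r2", "1", 1000, "1", 1400),  # same coordinates as r1
    ("r3", "1", 1000, "1", 1450),  # same start, different mate position
    ("r4", "2", 1000, "2", 1400),
]
print(find_duplicates(example_reads))
```

The third read shares its start with the first two but has a different mate position, so this sketch does not flag it.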

Duplicates: 6.2 % (13 063 918 reads). Non-duplicates: 93.8 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (20M to 180M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
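A per-chromosome tally of this kind could be sketched as follows, assuming SAM-style records in which the FLAG bit 0x4 marks a read as unmapped and the reference name is "*" when no chromosome is assigned. The record layout (chromosome, flag) and the function name are simplifications for this example.

```python
from collections import defaultdict

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"

def per_chromosome_counts(records):
    """Tally mapped and unmapped reads per chromosome.

    `records` is an iterable of (chrom, flag) pairs. Reads without an
    assigned chromosome ("*") are skipped, since they cannot be attributed
    to any column of the chart.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for chrom, flag in records:
        if chrom == "*":
            continue
        key = "unmapped" if flag & UNMAPPED_FLAG else "mapped"
        counts[chrom][key] += 1
    return dict(counts)

# Hypothetical records: two properly paired mates, a singleton and its
# unmapped mate (placed on its mate's chromosome), and an unplaced read.
example_records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("*", 77)]
print(per_chromosome_counts(example_records))
```

In the example, the read with flag 133 is unmapped but carries its mate's chromosome, so it is counted in that chromosome's unmapped total.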

[Chart: mapped vs unmapped reads per chromosome (x-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%). Mapped: 99.73% to 99.84% per chromosome. Unmapped: 0.16% to 0.27% per chromosome.]