European Genome-Phenome Archive

File Quality

File Information: EGAF00004840762

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Histogram: x-axis, coverage value (ticks from 100 to >1000); y-axis, # bases (log scale, 1k to 100M).]
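As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions share each coverage value and pools everything above a cap into a single overflow bin, mirroring the ">1000" label on the chart. The function name `coverage_histogram`, the `cap` parameter, and the sample depths are assumptions made for this example, not part of the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    The overflow bin is stored under the key cap + 1 (hypothetical convention).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))

# Illustrative per-position depths, not taken from the file above.
depths = [0, 1, 1, 2, 2, 2, 5, 1500, 3000]
print(coverage_histogram(depths))
```

Plotting the resulting counts on a logarithmic y-axis would give the shape described above: a tall peak at low coverage and a long descending tail.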

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Histogram: x-axis, Phred quality score (0 to 40); y-axis, # bases (0G to 14G).]
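To make the scale concrete, the following sketch converts between Phred scores and error probabilities using the standard Phred definition, Q = -10 × log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40, the typical value mentioned above, corresponds to roughly one wrong base call in 10 000.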

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped reads: 133 547 285 (99.1 %). Unmapped reads: 0.9 %.
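The rate calculation described above can be sketched as follows. The function name and the counts in the example call are illustrative assumptions; the page reports the number of mapped reads but not the number of unmapped reads.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total

# Hypothetical counts chosen only to illustrate the formula.
rate = mapped_rate(mapped=991, unmapped=9)
print(f"{rate:.1%}")
```

The parentheses matter: the denominator is the total number of reads, not the unmapped count alone.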

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other (see the Proper Pairs chart).

Both mates mapped: 133 431 706 reads (99 %). Remaining reads: 1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 115 579 reads (0.1 %). Remaining reads: 99.9 %.
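One way to separate these categories is to inspect the FLAG field of each alignment record. The sketch below uses the bit values defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). Whether the EGA report derives its numbers in exactly this way is an assumption, and the function and category names are made up for this example.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify(flag):
    """Assign a read to one mapping category based on its SAM FLAG."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"          # this read maps, its mate does not
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

for flag in (99, 73, 133):
    print(flag, classify(flag))
```

Under this scheme every mapped paired read is either a singleton or part of a pair with both mates mapped. The figures on this page are consistent with that: 133 431 706 + 115 579 = 133 547 285, the reported number of mapped reads.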

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 67 400 709 reads (50 %). Remaining reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the two reads map at an unexpected separation (for a deletion, much further apart than expected); this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 131 125 018 reads (97.3 %). Remaining reads: 2.7 %.
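The arithmetic behind the expected separation in the example above can be sketched as follows. The helper names and the tolerance value are hypothetical; real aligners typically estimate the acceptable range from the observed insert-size distribution rather than using a fixed tolerance.

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced distance between two mates read from both fragment ends."""
    return fragment_length - 2 * read_length

def separation_is_plausible(observed_gap, fragment_length, read_length, tolerance=100):
    """Check an observed gap against the expectation (tolerance is an assumption)."""
    return abs(observed_gap - expected_gap(fragment_length, read_length)) <= tolerance

print(expected_gap(500, 100))                   # the ~300 bp from the example above
print(separation_is_plausible(320, 500, 100))   # close to the expectation
print(separation_is_plausible(5000, 500, 100))  # far larger than expected
```

A gap far outside the expected range, as in the last call, is the kind of observation that would keep a pair from being counted as proper.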

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 17 791 681 reads (13.2 %). Remaining reads: 86.8 %.
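The coordinate-based detection described above can be sketched in a few lines. This is a deliberate simplification: the tuple layout (chromosome, start, mate chromosome, mate start), the function name, and the sample pairs are assumptions for illustration, and production tools also take strand and orientation into account and choose which copy to keep.

```python
def duplicate_rate(pairs):
    """Fraction of read pairs whose coordinates repeat an earlier pair."""
    seen = set()
    duplicates = 0
    for key in pairs:
        if key in seen:
            duplicates += 1   # same read and mate coordinates already observed
        else:
            seen.add(key)
    return duplicates / len(pairs) if pairs else 0.0

# Illustrative pairs: (chrom, start, mate_chrom, mate_start)
example_pairs = [
    ("1", 1000, "1", 1300),
    ("1", 1000, "1", 1300),   # identical coordinates: flagged as a duplicate
    ("1", 1000, "1", 1350),   # same start, different mate position: kept
    ("2", 500, "2", 820),
]
print(duplicate_rate(example_pairs))
```

The sketch also shows why depth matters: the more pairs are sampled, the more likely two independent fragments share the same key by chance and are flagged even though they are not true PCR duplicates.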

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Histogram: x-axis, Phred quality score (0 to 60); y-axis, # reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mapped mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Stacked column chart: x-axis, chromosome (labels 1 to 21, X, Y, M); y-axis, percentage of reads (0% to 100%). Mapped reads range from 99.89% to 99.94% per chromosome; unmapped reads range from 0.06% to 0.11%.]