European Genome-Phenome Archive

File Quality

File Information

EGAF00003211987

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage value.

Data is represented on a log scale to reduce the visual spread between very large and very small counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: number of bases (y-axis, log scale, 1k to 100M) by coverage value (x-axis, 0 to >1000)]
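
The histogram behind a chart like this could be computed along the following lines. This is a minimal Python sketch, not the archive's actual pipeline: the function name `coverage_histogram`, the `cap` parameter, and the example depths are assumptions made for illustration.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many bases have each coverage value.

    Depths above `cap` are pooled into a single overflow bin keyed
    `cap + 1`, mirroring the '>1000' bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(hist)


# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 5, 1200, 3000]
histogram = coverage_histogram(example_depths)
```

Plotting the resulting counts on a logarithmic y-axis keeps the long, low tail visible next to the large counts at low coverage.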

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: number of bases (y-axis, 0G to 11G) by Phred quality score (x-axis, 0 to 40)]
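
To make the scale concrete, the conversion between a Phred score and an error probability can be sketched as below, using the standard Phred definition Q = -10 log10(P). The function names are chosen here for illustration and do not come from the report.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p, where 0 < p <= 1."""
    return -10 * math.log10(p)


# Error probabilities for the scores shown on the chart's x-axis.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

A score of 40 therefore corresponds to roughly one expected error in 10 000 base calls.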

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated but, where the variation is more significant, a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.9 % (103 141 766 reads); unmapped: 0.1 %
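
The proportion mapped / (mapped + unmapped) can be sketched from alignment flags as follows. The sketch assumes the reads come from a SAM/BAM file, where flag bit 0x4 marks an unmapped read; the function name and the example flags are illustrative only.

```python
UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def mapping_rate(flags):
    """Return mapped / (mapped + unmapped) for a collection of SAM flags."""
    flags = list(flags)
    if not flags:
        raise ValueError("no reads supplied")
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / len(flags)


# Hypothetical flags: three mapped reads (99, 147, 73), three unmapped.
example_flags = [99, 147, 73, 133, 77, 141]
rate = mapping_rate(example_flags)
```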

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also counted here; the expected distance is the criterion assessed in the Proper Pairs chart.

Both mates mapped: 99.9 % (103 080 366 reads); other: 0.1 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (61 400 reads); other: 99.9 %
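
The distinction between singletons and pairs with both mates mapped can be sketched from SAM flag bits. This assumes SAM/BAM input, where 0x1 marks a paired read, 0x4 an unmapped read, and 0x8 an unmapped mate; the function name and category labels are chosen here for illustration.

```python
PAIRED = 0x1         # read comes from a paired-end fragment
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify_read(flag):
    """Classify one read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"
```

For example, `classify_read(73)` gives `"singleton"`, because flag 73 sets the paired and mate-unmapped bits but not the unmapped bit.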

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (51 616 032 reads); other: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with an unexpectedly large separation can signal the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 98.7 % (101 929 840 reads); other: 1.3 %
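
The arithmetic behind the ~300 base pair example can be written out as a short sketch. The `tolerance` parameter is an assumption made for illustration; aligners typically derive the acceptable range from the observed fragment-size distribution rather than from a fixed value.

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced distance between two mates read from either end."""
    return fragment_length - 2 * read_length


def at_expected_distance(observed_gap, expected, tolerance):
    """True if the observed gap lies within `tolerance` of the expected gap.

    `tolerance` is a hypothetical fixed window used only in this sketch.
    """
    return abs(observed_gap - expected) <= tolerance


gap = expected_gap(500, 100)  # 500 - 2 * 100 = 300
```

With a hypothetical tolerance of 100 base pairs, a pair separated by 320 base pairs would pass this distance check, while a pair separated by 5 000 base pairs would not.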

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 17.8 % (18 417 168 reads); other: 82.2 %
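
The coordinate-matching idea described above can be sketched as follows. This is a deliberate simplification and not the method used to produce this report: it compares only the read and mate coordinates, whereas production tools typically apply further criteria. The tuple layout and the function name are assumptions.

```python
def find_duplicates(pairs):
    """Return names of pairs whose coordinates repeat an earlier pair.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos)
    tuples (a hypothetical layout used only for this sketch). The first
    pair seen at each set of coordinates is kept; later ones are reported.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical pairs: "b" repeats "a"; "c" differs in its mate position.
example_pairs = [
    ("a", "1", 100, "1", 400),
    ("b", "1", 100, "1", 400),
    ("c", "1", 100, "1", 450),
]
duplicate_names = find_duplicates(example_pairs)
```

The sketch also shows why deep sequencing inflates the rate: two independent fragments that happen to share both coordinates are indistinguishable from a true PCR duplicate under this test.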

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: number of reads (y-axis, 10M to 90M) by Phred mapping quality score (x-axis, 0 to 60)]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. An unmapped read can also carry a chromosome when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it will be given the right chromosome and position, but it will also be classified as unmapped.

[Stacked column chart: percentage of mapped and unmapped reads (y-axis, 0% to 100%) per chromosome (x-axis labels: 1 to 21, X, Y, M). Mapped values in chart order: 99.94% for the first 22 columns, then 99.81% and 99.95%. Unmapped values in chart order: 0.06% for the first 22 columns, then 0.19% and 0.05%.]
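
A per-chromosome breakdown like this one could be tallied as sketched below. The sketch assumes each record carries the chromosome written in the alignment file (which, for an unmapped read with a mapped mate, is the mate's chromosome) together with its SAM flag, where bit 0x4 marks an unmapped read. The record layout and the function name are illustrative assumptions.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def mapped_fraction_by_chrom(records):
    """Return {chrom: (mapped_fraction, unmapped_fraction)}.

    `records` is an iterable of (chrom, flag) tuples (a hypothetical
    layout). Unmapped reads are counted under the chromosome recorded
    for them, which is normally that of their mapped mate.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    result = {}
    for chrom, (mapped, unmapped) in counts.items():
        total = mapped + unmapped
        result[chrom] = (mapped / total, unmapped / total)
    return result


# Hypothetical records: chromosome "1" has one unmapped read out of four.
example_records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
fractions = mapped_fraction_by_chrom(example_records)
```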