European Genome-Phenome Archive

File Quality

File Information: EGAF00003247158

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]
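As a minimal sketch of how such a distribution could be tabulated, the following counts how many positions have each coverage value. The function name, the `cap` parameter, and the input depths are illustrative assumptions and are not taken from the EGA pipeline; the overflow bin is only meant to mirror the ">1000" label on the x-axis.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Coverage values above `cap` are pooled into one overflow bin,
    stored under the key cap + 1 (assumed to correspond to ">1000").
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(sorted(hist.items()))


# Illustrative per-position depths (hypothetical, not from the report).
depths = [0, 0, 1, 1, 1, 2, 5, 5, 1500]
hist = coverage_histogram(depths)
for value, n_bases in hist.items():
    label = ">1000" if value > 1000 else str(value)
    print(label, n_bases)
```

Plotting `hist` with a logarithmic y-axis would give a chart of the same shape as the one described here.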

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 26G).]
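To make the scale concrete, here is a small sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10 log10(P). The function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for a given error probability (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and each additional 10 points divides the error probability by 10.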

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 98.5 % (239 609 182 reads). Unmapped: 1.5 %.
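The rate calculation described above can be sketched as follows. The function name and the read counts in the example are illustrative assumptions; the report gives the number of mapped reads but not the number of unmapped reads.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads")
    return mapped / total


# Hypothetical counts chosen to give a rate of 98.5 %.
rate = mapped_rate(mapped=985, unmapped=15)
print(f"{rate:.1%}")
```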

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; such pairs are not counted as proper pairs in the Proper Pairs chart.

Both mates mapped: 98.4 % (239 336 762 reads). Remainder: 1.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (272 420 reads). Remainder: 99.9 %.
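One way to tell these categories apart is through the FLAG field of a SAM/BAM record. The sketch below uses the standard SAM flag bits; whether the EGA report is computed this way is an assumption, and the `classify` function and its category names are illustrative. The final check uses the three read counts shown in this report.

```python
# Standard SAM flag bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify(flag):
    """Assign a read to one of the categories used in this report."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"


print(classify(73))   # paired, mate unmapped, first in pair
print(classify(99))   # paired, proper pair, mate reverse, first in pair
print(classify(133))  # paired, unmapped, second in pair

# Counts from this report: both mates mapped + singletons = mapped reads.
both_mates_mapped = 239_336_762
singletons = 272_420
mapped_reads = 239_609_182
print(both_mates_mapped + singletons == mapped_reads)
```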

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (121 667 912 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion or deletion, the two reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 96.8 % (235 643 030 reads). Remainder: 3.2 %.
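The arithmetic behind the expected separation can be sketched as below. The function names and the `tolerance` parameter are hypothetical; real aligners decide what counts as a proper pair from the observed fragment size distribution and read orientation, which this sketch does not model.

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced distance between two mates read from either end."""
    return fragment_length - 2 * read_length


def gap_is_expected(observed_gap, fragment_length, read_length, tolerance):
    """True if the observed gap lies within `tolerance` bp of the expected gap.

    `tolerance` is a hypothetical cut-off chosen by the caller.
    """
    gap = expected_gap(fragment_length, read_length)
    return abs(observed_gap - gap) <= tolerance


print(expected_gap(500, 100))
print(gap_is_expected(320, 500, 100, tolerance=100))
print(gap_is_expected(5000, 500, 100, tolerance=100))
```

With 500 bp fragments and 100 bp reads the expected gap is 500 - 2 × 100 = 300 bp, matching the example in the text.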

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads are sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads map to the same location and are marked as duplicates even though they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.

Duplicates: 33.8 % (82 329 584 reads). Remainder: 66.2 %.
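The coordinate-based idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm of any particular tool: the tuple layout and function name are assumptions, and real duplicate markers also take strand, orientation, and base quality into account.

```python
def mark_duplicates(pairs):
    """Flag read pairs whose coordinates repeat an earlier pair.

    Each pair is a tuple (chrom, read_start, mate_chrom, mate_start).
    The first pair seen at a given set of coordinates is kept as the
    original; later pairs with identical coordinates are flagged.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


# Hypothetical read pairs (not from the report).
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates as the first pair
    ("1", 100, "1", 450),  # same read start, different mate start
    ("2", 100, "2", 400),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)
```

The sketch also shows why the rate rises with depth: the more pairs are drawn, the more likely it is that two independent fragments share identical coordinates and the later one is flagged.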

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (20M to 200M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (stacked columns). X-axis: chromosome (autosomes, X, Y, M). Y-axis: percentage of reads (0% to 100%). Mapped values range from 99.85% to 99.93%; unmapped values range from 0.07% to 0.15%.]