European Genome-Phenome Archive

File Quality

File Information: EGAF00008414043

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered, and the y-axis shows the number of bases covered at each value.

The number of bases is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
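As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions fall on each coverage value and pools everything above 1000 into one overflow bin, as the chart's x-axis does. The function name, the `CAP` constant and the input depths are hypothetical; the page does not say how the archive computes the chart.

```python
from collections import Counter

CAP = 1000  # coverage values above this share one overflow bin, like the ">1000" tick

def coverage_histogram(depths, cap=CAP):
    """Return {coverage value: number of positions with that coverage}."""
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1  # cap + 1 is the overflow bin
    return hist

# Hypothetical per-position depths for a nine-base reference.
depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{CAP}" if value > CAP else str(value)
    print(label, hist[value])
```

Plotting these counts with a logarithmic y-axis gives the shape described above.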

[Chart: number of bases (log scale, 1k to 100M) against coverage value (0 to >1000).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
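For reference, the standard Phred definition is Q = -10 × log10(P). The following minimal sketch converts in both directions; the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

On this scale a score of 40 corresponds to one expected error in 10 000 calls.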

[Chart: number of bases (0G to 24G) against Phred quality score (0 to 40).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
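The calculation can be sketched as follows. The two counts are the "Both Mates Mapped" and "Singletons" figures reported on this page, and their sum matches the mapped-read count shown for this chart. The unmapped count is not listed on the page, so the rate itself is demonstrated with round, hypothetical numbers.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Counts reported on this page.
both_mates_mapped = 220_884_354
singletons = 843_485
mapped = both_mates_mapped + singletons
print(mapped)  # 221727839, the mapped-read count shown for this chart

# Hypothetical counts chosen only to illustrate the formula.
print(f"{mapped_rate(994, 6):.1%}")
```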

Mapped: 99.4 % (221 727 839 reads); unmapped: 0.6 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the stricter distance criterion applies only to the Proper Pairs chart.

Both mates mapped: 99 % (220 884 354 reads); remainder: 1 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
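The distinction between pairs with both mates mapped, singletons and unmapped reads can be sketched with the FLAG bits defined by the SAM format. This assumes the alignments are stored as SAM/BAM with standard flags; the page does not state how the archive derives its counts, and `pair_status` is a hypothetical helper.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def pair_status(flag):
    """Classify one read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # mapped, but its mate is not
    return "unmapped"

for flag in (99, 73, 133):
    print(flag, pair_status(flag))
```

Flag 99 is a typical mapped first mate with a mapped partner, 73 is a mapped read whose mate is unmapped, and 133 is that unmapped mate.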

Singletons: 0.4 % (843 485 reads); remainder: 99.6 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (111 573 990 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from each end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion or deletion, read pairs spanning the variant map at an unexpected separation. Such pairs are a signal for the variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
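The distance arithmetic in the example above (500 bp fragments, 100 bp reads) can be written out as a short sketch. The `tolerance` parameter is a hypothetical stand-in: aligners typically estimate the acceptable range from the observed insert-size distribution rather than using a fixed value.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: the fragment minus one read at each end."""
    return fragment_length - 2 * read_length

def within_expected_distance(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` bases of the expected gap."""
    return abs(observed - expected) <= tolerance

expected = expected_inner_distance(500, 100)
print(expected)  # 300

# A pair spanning a large structural variant would show a very different gap.
print(within_expected_distance(310, expected, tolerance=50))
print(within_expected_distance(5000, expected, tolerance=50))
```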

Proper pairs: 96.3 % (214 896 772 reads); remainder: 3.7 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
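The coordinate-based approach described above can be illustrated with a simplified sketch. The record layout and the function name are hypothetical, and the rule of keeping whichever copy is seen first is an assumption made for brevity; production duplicate-marking tools apply more refined rules.

```python
def find_duplicates(pairs):
    """Return the names of pairs whose coordinates repeat an earlier pair.

    Each pair is (name, chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    The first pair seen at a given set of coordinates is kept as the original.
    """
    seen = set()
    duplicates = set()
    for name, *coords in pairs:
        key = tuple(coords)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical read pairs: "b" repeats the coordinates of "a"; "c" differs at the mate.
pairs = [
    ("a", "1", 100, "+", "1", 400, "-"),
    ("b", "1", 100, "+", "1", 400, "-"),
    ("c", "1", 100, "+", "1", 450, "-"),
]
dups = find_duplicates(pairs)
print(sorted(dups), f"{len(dups) / len(pairs):.1%}")
```

Because the key includes the mate's coordinates, pair "c" is not flagged even though its first read starts at the same position as "a".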

Duplicates: 10.2 % (22 690 026 reads); non-duplicates: 89.8 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: number of reads (20M to 180M) against Phred quality score (0 to 60).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
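A per-chromosome tally of this kind might look like the sketch below, which assumes SAM-style records reduced to a reference name and a FLAG value. The helper name and the input records are hypothetical. Reads with no reference name (`*` in SAM) have no chromosome to be counted under and are skipped.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_rates(records):
    """Return {chromosome: (mapped %, unmapped %)} from (rname, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # unplaced read: no chromosome assigned
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (100 * m / (m + u), 100 * u / (m + u))
        for chrom, (m, u) in counts.items()
    }

# Hypothetical records: an unmapped mate (flag 133) placed at its partner's position on X.
records = [("X", 73), ("X", 133), ("X", 99), ("X", 147),
           ("1", 99), ("1", 147), ("*", 77)]
rates = per_chromosome_rates(records)
for chrom, (mapped_pct, unmapped_pct) in rates.items():
    print(chrom, f"{mapped_pct:.2f}%", f"{unmapped_pct:.2f}%")
```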

Chromosome  Mapped   Unmapped
1           99.62%   0.38%
2           99.6%    0.4%
3           99.61%   0.39%
4           99.6%    0.4%
5           99.61%   0.39%
6           99.61%   0.39%
7           99.62%   0.38%
8           99.61%   0.39%
9           99.61%   0.39%
10          99.62%   0.38%
11          99.61%   0.39%
12          99.61%   0.39%
13          99.6%    0.4%
14          99.61%   0.39%
15          99.62%   0.38%
16          99.65%   0.35%
17          99.64%   0.36%
18          99.6%    0.4%
19          99.68%   0.32%
20          99.63%   0.37%
21          99.62%   0.38%
X           99.61%   0.39%
Y           99.69%   0.31%
M           99.64%   0.36%