European Genome-Phenome Archive

File Quality

File Information

EGAF00001999512

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The data are plotted on a logarithmic scale to compress the large range of values. A high peak at the start (low coverage) followed by a descending curve is expected.
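
As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions share each coverage value and pools everything above a cap into a single overflow bin, similar to the chart's ">1000" bucket. The function name `coverage_histogram` and the toy depth list are assumptions made for this example; the page does not say how the archive computes the chart.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Toy per-position depths (hypothetical, not taken from this file).
depths = [0, 0, 30, 31, 30, 1500, 2000]
hist = coverage_histogram(depths)
print(hist[0], hist[30], hist[">1000"])
```

Plotting the counts on a logarithmic y-axis then keeps both the common and the rare coverage values visible.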

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: # bases, log scale from 1k to 100M.]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
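
To make the scale concrete, here is a minimal sketch that converts between Phred scores and error probabilities using the standard definition Q = -10·log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and each step of 10 divides the error probability by ten.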

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 100G).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated but, where the variation is more significant, a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
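
The calculation described above can be sketched as follows. The function name and the counts are hypothetical; the page reports the mapped count but not the unmapped count.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total

# Hypothetical counts chosen only to illustrate the formula.
rate = mapped_rate(mapped=995, unmapped=5)
print(f"{rate:.1%}")
```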

Mapped: 947 661 062 reads (99.5 %). Unmapped: 0.5 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates map at an unexpected distance are also counted here; only the Proper Pairs chart restricts the count to mates mapped at the expected distance.

Both mates mapped: 944 894 230 reads (99.2 %). Remainder: 0.8 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
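
If the alignments are stored in SAM/BAM format, the definition above can be expressed with the standard FLAG bits from the SAM specification. This is only a sketch under that assumption; the page does not state how the archive identifies singletons, and `is_singleton` is a name chosen for this example.

```python
# FLAG bits as defined in the SAM format specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def is_singleton(flag):
    """True if the read is paired and mapped while its mate is unmapped."""
    return bool(flag & PAIRED) and not (flag & UNMAPPED) and bool(flag & MATE_UNMAPPED)

print(is_singleton(73))   # paired, mapped, mate unmapped
print(is_singleton(99))   # paired, both mates mapped
print(is_singleton(133))  # paired, read itself unmapped
```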

Singletons: 2 766 832 reads (0.3 %). Remainder: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 476 103 560 reads (50 %). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion or deletion, read pairs that map with an unexpected separation are a signal for the variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
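
The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. A minimal sketch, with a function name chosen for this example:

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced bases between two mates read inward from each fragment end."""
    return max(fragment_len - 2 * read_len, 0)

print(expected_gap(500, 100))  # 300
```

The `max(..., 0)` guard covers short fragments whose two reads would overlap, in which case there is no gap between the mates.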

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 927 281 632 reads (97.4 %). Remainder: 2.6 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
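
The coordinate-based approach described above can be sketched in a few lines. This is a simplified illustration, not the method used to produce the figure on this page: the `Read` record and `mark_duplicates` function are hypothetical, and real duplicate-marking tools take more into account (for example strand orientation and which copy to keep).

```python
from collections import namedtuple

# Hypothetical minimal read record for this example.
Read = namedtuple("Read", "name chrom pos mate_chrom mate_pos")

def mark_duplicates(reads):
    """Return names of reads whose own and mate coordinates were already seen."""
    seen = set()
    duplicate_names = []
    for read in reads:
        key = (read.chrom, read.pos, read.mate_chrom, read.mate_pos)
        if key in seen:
            duplicate_names.append(read.name)
        else:
            seen.add(key)
    return duplicate_names

reads = [
    Read("a", "1", 100, "1", 400),
    Read("b", "1", 100, "1", 400),  # same coordinates as "a"
    Read("c", "1", 100, "1", 450),  # same start, different mate position
]
print(mark_duplicates(reads))  # ['b']
```

The example also shows why high depth inflates the rate: two independent fragments that happen to share both coordinates would be flagged in exactly the same way as a true PCR copy.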

Duplicates: 93 505 153 reads (9.8 %). Remainder: 90.2 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 70). Y-axis: # reads (100M to 800M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same kind of representation as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample comes from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
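
A per-chromosome tally of this kind could be sketched as below, assuming each alignment record exposes its reference name and SAM FLAG, and that unmapped mates carry the chromosome of their mapped partner as described above. The function name and the toy records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped

def per_chromosome_rates(records):
    """records: iterable of (chromosome, flag) pairs.

    Returns {chromosome: (mapped_fraction, unmapped_fraction)}.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: (m / (m + u), u / (m + u)) for c, (m, u) in counts.items()}

# Toy records: flag 133 is an unmapped read placed on its mate's chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
rates = per_chromosome_rates(records)
print(rates["1"], rates["X"])
```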

Chromosome  Mapped   Unmapped
1           99.71%   0.29%
2           99.7%    0.3%
3           99.71%   0.29%
4           99.72%   0.28%
5           99.72%   0.28%
6           99.71%   0.29%
7           99.72%   0.28%
8           99.72%   0.28%
9           99.73%   0.27%
10          99.72%   0.28%
11          99.71%   0.29%
12          99.7%    0.3%
13          99.72%   0.28%
14          99.71%   0.29%
15          99.72%   0.28%
16          99.71%   0.29%
17          99.71%   0.29%
18          99.73%   0.27%
19          99.67%   0.33%
20          99.7%    0.3%
21          99.68%   0.32%
X           99.72%   0.28%
Y           99.38%   0.62%
M           99.69%   0.31%