European Genome-Phenome Archive

File Quality

File Information: EGAF00001767573

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale to compress its wide range. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. x-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 1 to 100M).]
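A minimal sketch of how such a distribution could be tallied, assuming per-position depths are already available as a list of integers. The function name `coverage_histogram` and the `cap` parameter are illustrative and are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single bin (key cap + 1),
    mirroring the '>1000' bin on the chart (an assumption about binning).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)

# Hypothetical depths for eight reference positions.
example = coverage_histogram([0, 0, 0, 1, 1, 2, 5, 2500])
```

Plotting the counts on a log-scaled y-axis would then give a curve of the kind described above.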

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. x-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 16G).]
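To make the scale concrete, the standard Phred definition is Q = -10 log10(P), so a score of 40 corresponds to an error probability of 1 in 10 000. The short sketch below converts in both directions; the function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

p_q40 = phred_to_error_prob(40)    # about 0.0001
q_p001 = error_prob_to_phred(0.001)  # about 30
```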

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted but, for more significant variation, the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 100 % (144 750 755 reads); unmapped: 0 %
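The rate calculation described above can be sketched as follows. The read counts in the example are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts: 950 mapped reads and 50 unmapped reads.
rate = mapped_rate(950, 50)
```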

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; those are the pairs that the Proper Pairs chart excludes.

Both mates mapped: 99.9 % (144 712 692 reads); remainder: 0.1 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0 % (38 063 reads); remainder: 100 %
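In SAM/BAM files these categories can be read from the FLAG field, where bit 0x1 marks a paired read, 0x4 an unmapped read and 0x8 an unmapped mate. The sketch below classifies a single FLAG value; the function name and the category labels are illustrative, and whether EGA computes its charts exactly this way is an assumption.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def mate_status(flag):
    """Classify a SAM FLAG value into a pairing category."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"

# FLAG 99: paired, mapped, mate mapped. FLAG 73: paired, mapped, mate unmapped.
statuses = [mate_status(f) for f in (99, 73, 133, 0)]
```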

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (72 400 205 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with an unexpected separation are a signal for this variant and are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.4 % (141 049 042 reads); remainder: 2.6 %
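The arithmetic behind the ~300 base pair example is fragment length minus two read lengths. The sketch below encodes that, plus a simple distance check; the `tolerance` parameter is hypothetical, and real aligners typically estimate the fragment-size distribution from the data and also check read orientation, which this sketch ignores.

```python
def expected_gap(fragment_len, read_len):
    """Expected separation between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_gap, fragment_len, read_len, tolerance=100):
    """True if the observed separation is close to the expected one.

    `tolerance` is an illustrative threshold, not a value used by any aligner.
    """
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance

gap = expected_gap(500, 100)  # 500 - 2 * 100 = 300
```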

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data are analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 13.8 % (19 956 537 reads); remainder: 86.2 %
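The coordinate-based approach described above can be sketched as follows, assuming each read pair has been reduced to a hypothetical `(chromosome, read_start, mate_start)` tuple. Production duplicate-marking tools consider more than this (for example strand and orientation), so this is an illustration of the idea rather than a replacement for them.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates were already seen earlier in the input.

    `pairs` is an iterable of (chromosome, read_start, mate_start) tuples.
    Returns a list of booleans, one per pair, in input order.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags

def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates (0.0 for empty input)."""
    flags = mark_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0

# Hypothetical pairs: the second is a coordinate-identical copy of the first.
pairs = [("1", 100, 400), ("1", 100, 400), ("1", 100, 450), ("2", 100, 400)]
flags = mark_duplicates(pairs)
rate = duplicate_rate(pairs)
```

Note that two genuinely independent fragments with identical coordinates would also be flagged here, which is the false-positive effect at high depth described in the text.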

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 60); y-axis: # Reads (10M to 130M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, broken down by chromosome. Although the sequenced sample may be female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. x-axis: chromosomes 1 to 21, X, Y, M; y-axis: percentage of reads (0% to 100%); series: mapped, unmapped. Mapped fractions range from 99.73% to 99.98%.]
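Because an unmapped read with a mapped mate carries its mate's chromosome, a per-chromosome breakdown can be computed by grouping on the reference name and testing the unmapped FLAG bit (0x4). The sketch below assumes records have already been reduced to hypothetical `(reference_name, flag)` tuples; the function name is illustrative.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (reference_name, flag) tuples."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}

# Hypothetical records: FLAG 133 is an unmapped read placed on its mate's chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_mapped_rate(records)
```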