European Genome-Phenome Archive

File Quality

File Information: EGAF00000791873

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data is plotted on a log scale to reduce the visual effect of the large variation between counts. A high peak at the start (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value, 0 to >1000. Y-axis: number of bases, log scale from 1k to 100M.]
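A distribution like the one charted here could be tabulated as in the following minimal sketch. It assumes per-position depths are already available as a list of integers; the function name `coverage_histogram` and the `cap` parameter are hypothetical and are not taken from the EGA pipeline. Depths above the cap share a single overflow bin, mirroring the ">1000" bucket on the x-axis.

```python
import math
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths greater than `cap` are pooled into one overflow bin keyed by cap + 1
    (an assumption made here to mimic a ">1000" bucket).
    """
    return Counter(min(depth, cap + 1) for depth in depths)


# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 5, 5, 5, 2000]
hist = coverage_histogram(depths)

for coverage in sorted(hist):
    label = f">{1000}" if coverage == 1001 else str(coverage)
    # log10 of the count is what a log-scaled y-axis would display.
    print(label, hist[coverage], round(math.log10(hist[coverage]), 3))
```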

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score, 0 to 40. Y-axis: number of bases, 0G to 80G.]
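To make the relationship between score and error probability concrete, here is a small sketch using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score Q for an error probability P, where 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4f}")
```

Under this definition a score of 20 corresponds to a 1 in 100 chance of error, and a score of 40 to a 1 in 10 000 chance.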

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped reads: 96 % (923 482 203 reads). Remaining reads: 4 %.
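The rate calculation described above, mapped / (mapped + unmapped), can be sketched as follows. The counts in the example are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from this file.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=960, unmapped=40)
print(f"mapped rate: {rate:.1%}")
```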

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates map successfully to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected-distance criterion is covered by the Proper Pairs chart.

Both mates mapped: 95 % (913 419 318 reads). Remaining reads: 5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair maps successfully to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 1.1 % (10 062 885 reads). Remaining reads: 98.9 %.
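The report does not say how these categories are computed. One common way to distinguish singletons from fully mapped pairs is to inspect the FLAG field of each alignment record, as in this sketch. The bit values are those defined in the SAM specification; the `classify` function and its category names are hypothetical.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify(flag: int) -> str:
    """Assign a read to a (hypothetical) category based on its SAM FLAG."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # this mate mapped, the other did not
    return "both_mates_mapped"


for flag in (73, 99, 69):
    print(flag, classify(flag))
```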

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (480 787 557 reads). Remaining reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it map at an unexpected separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 93.1 % (895 538 642 reads). Remaining reads: 6.9 %.
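The arithmetic behind the ~300 base pair example can be written out as a short sketch. The `tolerance` parameter and both function names are hypothetical. Aligners typically derive the acceptable range from the observed fragment size distribution, and this sketch ignores read orientation entirely.

```python
def expected_gap(fragment_len: int, read_len: int) -> int:
    """Unsequenced distance between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len


def gap_is_plausible(observed_gap: int, expected: int, tolerance: int) -> bool:
    """True if the observed gap is within `tolerance` bases of the expected gap.

    `tolerance` is an illustrative assumption, not a value from the report.
    """
    return abs(observed_gap - expected) <= tolerance


gap = expected_gap(fragment_len=500, read_len=100)
print(gap)
print(gap_is_plausible(320, gap, tolerance=100))
print(gap_is_plausible(5000, gap, tolerance=100))
```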

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 1.9 % (18 469 338 reads). Remaining reads: 98.1 %.
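The coordinate-based approach described above can be illustrated with a deliberately simplified sketch: pairs that share chromosome, position, mate position and strand are grouped, and every member of a group beyond the first is counted as a duplicate. The `Pair` type and `duplicate_rate` function are hypothetical. Production duplicate-marking tools take further details into account that are omitted here.

```python
from collections import Counter
from typing import NamedTuple, Sequence


class Pair(NamedTuple):
    """Minimal (hypothetical) description of a mapped read pair."""
    chrom: str
    pos: int
    mate_pos: int
    strand: str


def duplicate_rate(pairs: Sequence[Pair]) -> float:
    """Fraction of pairs whose coordinates repeat those of an earlier pair."""
    if not pairs:
        return 0.0
    counts = Counter(pairs)
    # In each group of identical coordinates, one pair is kept as the original.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(pairs)


# Hypothetical pairs, for illustration only.
pairs = [
    Pair("1", 100, 400, "+"),
    Pair("1", 100, 400, "+"),
    Pair("1", 100, 400, "+"),
    Pair("1", 200, 500, "+"),
]
print(duplicate_rate(pairs))
```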

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score, 0 to 60. Y-axis: number of reads, up to 800M.]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads may still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not. In that case the unmapped read should have the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

Chromosome   Mapped    Unmapped
1            99.32%    0.68%
2            98.95%    1.05%
3            99.43%    0.57%
4            98.87%    1.13%
5            99.4%     0.6%
6            99.48%    0.52%
7            98.87%    1.13%
8            99.11%    0.89%
9            98.96%    1.04%
10           97.67%    2.33%
11           98.82%    1.18%
12           99.14%    0.86%
13           99.54%    0.46%
14           99.26%    0.74%
15           99.39%    0.61%
16           98.6%     1.4%
17           99.03%    0.97%
18           98.24%    1.76%
19           98.19%    1.81%
20           98.85%    1.15%
21           97.54%    2.46%
X            98.77%    1.23%
Y            91.92%    8.08%
M            99.55%    0.45%