European Genome-Phenome Archive

File Quality

File Information: EGAF00004199692

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered; the y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
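
As an illustration of how such a distribution can be built, the following minimal sketch counts how many positions have each coverage value and pools everything above a cap into a single overflow bin, similar to the ">1000" bin on the chart. The function name, the input list of per-position depths and the cap are assumptions made for this example, not part of the archive's pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value.

    depths: iterable of per-position coverage values (hypothetical input).
    Values above `cap` are pooled into one overflow bin keyed as cap + 1.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    # Return bins sorted by coverage value for easy plotting.
    return dict(sorted(hist.items()))


if __name__ == "__main__":
    print(coverage_histogram([0, 0, 1, 3, 3, 3, 2000]))
```

Plotting the resulting counts on a logarithmic y-axis gives the same kind of view as the chart described above.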

Chart: number of bases (y-axis, log scale, 1k to 100M) per coverage value (x-axis, ticks at 100 to 900, then >1000).

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.
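
The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10,000 base calls, and a score of 20 to one in 100.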

Chart: number of bases (y-axis, 0G to 140G) per Phred quality score (x-axis, 0 to 40). Non-zero bins, in order of increasing score: 3 975 597; 4 610 764 151; 7 288 458 082; 141 659 716 412 bases.

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
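
The calculation can be written as a small helper. This is a sketch only; the function name and the counts in the usage lines are hypothetical and are not taken from this file's statistics.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    rate = mapped_rate(mapped=999, unmapped=1)
    print(f"{rate:.1%}")
```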

Mapped: 99.9 % (1 015 522 749 reads). Unmapped: 0.1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; that criterion is the subject of the Proper Pairs chart.

Both mates mapped: 99.8 % (1 014 548 920 reads). Remaining reads: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
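
In alignment files that follow the SAM format, the distinction between these categories is carried by the FLAG field: bit 0x4 marks an unmapped read and bit 0x8 marks a read whose mate is unmapped. The sketch below classifies a paired read from its FLAG value. It assumes the input is a paired-end read, the function name is hypothetical, and it is not necessarily how the figures on this page were computed.

```python
READ_UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM FLAG bit: the mate of this read is unmapped


def classify_pair_status(flag):
    """Classify a paired-end read by its SAM FLAG value."""
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        # The read itself is mapped but its mate is not.
        return "singleton"
    return "both_mapped"


if __name__ == "__main__":
    for flag in (99, 73, 133):
        print(flag, classify_pair_status(flag))
```

For instance, a FLAG of 73 (paired, mate unmapped, first in pair) describes a singleton, while 99 describes a read whose mate is also mapped.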

Singletons: 0.1 % (973 829 reads). Remaining reads: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
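
In SAM-format alignments, bit 0x10 of the FLAG field indicates that a read is mapped to the reverse strand. A minimal sketch of the forward-strand fraction, assuming a list of FLAG values as input and skipping unmapped reads (both the function name and the input shape are assumptions for this example):

```python
READ_UNMAPPED = 0x4   # SAM FLAG bit: this read is unmapped
READ_REVERSE = 0x10   # SAM FLAG bit: this read maps to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & READ_UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & READ_REVERSE)
    return forward / len(mapped)


if __name__ == "__main__":
    print(forward_fraction([0, 16, 99, 147, 4]))
```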

Forward strand: 50 % (508 486 471 reads). Remaining reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with a separation that differs markedly from the expected one; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
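
The expected separation in the example above follows from subtracting both read lengths from the fragment length. A short sketch of that arithmetic, with a hypothetical function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when each end of the fragment is read once."""
    return fragment_length - 2 * read_length


if __name__ == "__main__":
    # Values from the example in the text: ~500 bp fragments, 100 bp reads.
    print(expected_inner_distance(500, 100))  # prints 300
```

Aligners typically allow a tolerance around this expected value rather than requiring an exact match, since real libraries contain a range of fragment lengths.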

Proper pairs: 97.9 % (996 039 918 reads). Remaining reads: 2.1 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate therefore depends on the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.
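
The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration: read pairs are represented as hypothetical tuples of (name, chromosome, start, mate chromosome, mate start), the first pair seen at each position is kept, and later pairs with identical coordinates are reported as duplicates. Production duplicate-marking tools also take strand, clipping and base qualities into account, none of which is modelled here.

```python
def find_duplicates(pairs):
    """Return names of pairs whose coordinates match an earlier pair.

    pairs: iterable of (name, chrom, start, mate_chrom, mate_start) tuples
    (a hypothetical representation chosen for this example).
    """
    seen = set()
    duplicates = []
    for name, chrom, start, mate_chrom, mate_start in pairs:
        key = (chrom, start, mate_chrom, mate_start)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


if __name__ == "__main__":
    example = [
        ("r1", "1", 100, "1", 350),
        ("r2", "1", 100, "1", 350),  # same coordinates as r1
        ("r3", "1", 100, "1", 360),  # mate maps elsewhere, so not a duplicate
    ]
    print(find_duplicates(example))
```

The sketch also shows why depth matters: the more pairs are sampled, the more likely two independent fragments share the same key by chance.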

Duplicates: 11.7 % (118 983 290 reads). Remaining reads: 88.3 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

Chart: number of reads (y-axis, 100M to 900M) per Phred mapping quality score (x-axis, 0 to 60).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads can also arise when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
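
A per-chromosome tally of this kind can be sketched from two SAM fields: the reference name (RNAME) and the FLAG, whose 0x4 bit marks an unmapped read. The function below is an illustrative assumption about how such a breakdown could be computed, not a description of the archive's own implementation; reads with no reference name ("*") are skipped because they cannot be attributed to a chromosome.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) tuples."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # unplaced read: no chromosome to assign it to
        index = 1 if flag & READ_UNMAPPED else 0
        counts[rname][index] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


if __name__ == "__main__":
    example = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
    print(per_chromosome_mapped_rate(example))
```

In the usage example, the read with FLAG 133 is unmapped but carries chromosome "1" from its mapped mate, so it counts towards that chromosome's unmapped share.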

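Chart: stacked columns of mapped and unmapped reads per chromosome (y-axis, 0% to 100%). X-axis labels as extracted: 1 to 21, X, Y, M.

Mapped, in axis order: 99.9%, 99.89%, 99.91%, 99.91%, 99.91%, 99.91%, 99.9%, 99.91%, 99.9%, 99.9%, 99.9%, 99.91%, 99.91%, 99.91%, 99.9%, 99.9%, 99.89%, 99.91%, 99.88%, 99.89%, 99.9%, 99.9%, 99.93%, 99.72%.

Unmapped, in axis order: 0.1%, 0.11%, 0.09%, 0.09%, 0.09%, 0.09%, 0.1%, 0.09%, 0.1%, 0.1%, 0.1%, 0.09%, 0.09%, 0.09%, 0.1%, 0.1%, 0.11%, 0.09%, 0.12%, 0.11%, 0.1%, 0.1%, 0.07%, 0.28%.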