European Genome-Phenome Archive

File Quality

File Information

EGAF00002395288

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The number of bases is plotted on a log scale to compress its wide range. A high peak at the beginning (low coverage) followed by a descending curve is expected.
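As a minimal sketch of how such a distribution could be tallied, assume the per-position depths are already available as a list of integers. The function name and the cap of 1000 are illustrative (the cap mirrors the chart's final ">1000" bin); neither is taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many reference positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`
    (an assumption made here to imitate a ">1000" bin).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)

# Hypothetical per-position depths for a tiny reference.
example_depths = [0, 0, 1, 30, 30, 31, 1500]
print(coverage_histogram(example_depths))
```

Plotting the resulting counts against the coverage value, with the counts on a log scale, gives a chart of the kind described above.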

[Chart: base coverage distribution. x-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly called, i.e. that a sequencing error occurred at that base. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
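To make the scale concrete, here is a small sketch that converts between Phred scores and error probabilities using the standard Phred definition Q = -10·log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

Under this definition a score of 40 corresponds to one expected error in 10 000 base calls, and a score of 20 to one in 100.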

[Chart: base quality distribution. x-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 90G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
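The calculation can be sketched from SAM flags, where bit 0x4 marks a read as unmapped. This is an illustration under stated assumptions, not the EGA implementation: it counts every record it is given and does not treat secondary or supplementary alignments specially, and the example flags are hypothetical.

```python
UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM flags."""
    flags = list(flags)
    if not flags:
        return 0.0
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / len(flags)

# Hypothetical flags: 99 and 147 form a mapped pair, 73 is a mapped read
# whose mate is unmapped, and 133 is that unmapped mate.
example_flags = [99, 147, 73, 133]
print(f"{mapped_rate(example_flags):.1%}")
```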

Mapped: 99.4 % (886 814 507 reads); unmapped: 0.6 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included here; compare with the Proper Pairs chart.

Both mates mapped: 99 % (883 511 526 reads); remainder: 1 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
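The distinction between reads with both mates mapped, singletons and unmapped reads can be sketched from the SAM flag bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The category names below are illustrative and are not taken from the EGA report.

```python
PAIRED = 0x1         # read comes from a paired-end template
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def classify(flag):
    """Classify one read by the mapping status of the read and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"

# Hypothetical flags: 99 is a read in a fully mapped pair, 73 is a
# singleton, and 133 is the singleton's unmapped mate.
for flag in (99, 73, 133):
    print(flag, classify(flag))
```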

Singletons: 0.4 % (3 302 981 reads); remainder: 99.6 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (446 156 084 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it will map with a separation that differs markedly from the expected one. Such reads are a signal for the variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
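The expected separation in the 500 base pair example is simple arithmetic: the fragment length minus the two read lengths. The sketch below works that through and also shows how the proper-pair status set by an aligner can be read from SAM flag bit 0x2. The function names are illustrative.

```python
PROPER_PAIR = 0x2  # SAM flag bit: pair aligned as a proper pair

def inner_distance(fragment_len, read_len):
    """Expected gap between the two mates: fragment minus both reads."""
    return fragment_len - 2 * read_len

def is_proper_pair(flag):
    """True if the aligner marked this read as part of a proper pair."""
    return bool(flag & PROPER_PAIR)

print(inner_distance(500, 100))  # 500 - 2 * 100
print(is_proper_pair(99), is_proper_pair(65))
```

What counts as "the expected distance" is decided by the aligner, typically from the insert size distribution it observes, so the 0x2 bit reflects that aligner's own criteria.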

Proper pairs: 95.1 % (848 370 238 reads); remainder: 4.9 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
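The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any particular tool: it keeps the first read seen at each coordinate key and flags the rest, whereas production duplicate markers also account for clipping and choose which copy to keep by quality. The tuple layout is an assumption made for the example.

```python
def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    `reads` is an iterable of tuples
    (name, chrom, pos, strand, mate_chrom, mate_pos).
    Reads sharing the same coordinates and strand for both mates are
    treated as copies of one fragment; all but the first are flagged.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, strand, mate_chrom, mate_pos in reads:
        key = (chrom, pos, strand, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical reads: r2 matches r1 at both ends, r3 differs at its mate.
example_reads = [
    ("r1", "1", 100, "+", "1", 400),
    ("r2", "1", 100, "+", "1", 400),
    ("r3", "1", 100, "+", "1", 450),
]
print(mark_duplicates(example_reads))
```

The example also shows why very deep data inflates the rate: two independent fragments that happen to share both end coordinates are indistinguishable from PCR copies under this rule.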

Duplicates: 12.9 % (115 221 574 reads); remainder: 87.1 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 70); y-axis: # Reads (100M to 700M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when its mate is aligned and it is not. In that case the unmapped read is given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
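A per-chromosome breakdown of this kind can be sketched by grouping reads on the chromosome recorded in each alignment record and checking SAM flag bit 0x4. The input layout, a sequence of (chromosome, flag) pairs, is an assumption made for this example; reads with no chromosome assigned ("*" in SAM) are skipped.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue  # unmapped read with no chromosome assigned
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: mapped / (mapped + unmapped)
            for c, (mapped, unmapped) in counts.items()}

# Hypothetical records: the read with flag 133 is unmapped but carries
# chromosome "1" because its mate (flag 73) is aligned there.
example_records = [("1", 99), ("1", 147), ("1", 73), ("1", 133),
                   ("X", 99), ("*", 77)]
print(per_chromosome_mapped_rate(example_records))
```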

[Chart: mapped vs unmapped reads per chromosome. x-axis: chromosomes 1 to 21, X, Y, M (as labelled); y-axis: 0% to 100%. Mapped: 99.57% to 99.65% per chromosome; unmapped: 0.35% to 0.43%.]