European Genome-Phenome Archive

File Quality

File Information: EGAF00003611287

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The data are plotted on a logarithmic scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
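As a minimal sketch of how such a distribution can be tallied, the snippet below counts how many positions have each coverage value and pools everything above 1000 into a single bin, mirroring the ">1000" bin on the chart. The function name and the sample depths are illustrative assumptions, not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # Everything above the cap is pooled under the key cap + 1 (">cap").
        hist[min(depth, cap + 1)] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
depths = [1, 1, 1, 2, 2, 5, 1200, 3000]
hist = coverage_histogram(depths)
```

Here `hist[1]` is the number of positions covered exactly once, and `hist[1001]` is the pooled count for positions covered more than 1000 times.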

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: # bases, log scale (1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly called, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
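The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to error probability p (p > 0)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)   # about 1 error in 10 000 calls
q30 = error_prob_to_phred(0.001)
```

A score of 40 therefore corresponds to an error probability of about 0.0001, and a score of 30 to about 0.001.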

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 1.3G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and pairs with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
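The rate calculation described above can be sketched in a few lines. The counts used here are hypothetical round numbers, not the values from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=998, unmapped=2)
```

With these counts the rate is 0.998, i.e. 99.8% of reads mapped.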

Mapped reads: 99.8 % (14 682 956 reads); unmapped: 0.2 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart.

Both mates mapped: 99.6 % (14 652 956 reads); remainder: 0.4 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
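In SAM/BAM files the mapping state of a read and its mate is recorded in the FLAG field, so the categories above can be sketched with bit tests. The bit values (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) come from the SAM specification; the function name and category labels are illustrative assumptions, and this is not necessarily how the EGA report is computed.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"          # mapped, but the mate is not
    if flag & PAIRED:
        return "both_mates_mapped"
    return "single_end_mapped"

labels = [classify(f) for f in (99, 73, 133)]
```

FLAG 99 is a mapped read whose mate is also mapped, 73 is a mapped read with an unmapped mate (a singleton), and 133 is an unmapped read.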

Singletons: 0.2 % (30 000 reads); remainder: 99.8 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so that after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
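As a rough sketch, the forward-strand fraction can be computed from SAM FLAG values: bit 0x10 marks a read aligned to the reverse strand and bit 0x4 marks an unmapped read (both from the SAM specification). The function name and the sample flags are illustrative.

```python
UNMAPPED = 0x4   # read is unmapped
REVERSE = 0x10   # read is aligned to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads aligned to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Two typical read pairs: flags 99/147 and 83/163.
fraction = forward_fraction([99, 147, 83, 163])
```

In each of these pairs one mate is on the forward strand and the other on the reverse strand, so the fraction is 0.5.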

Forward strand: 50 % (7 357 144 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant and would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
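The fragment arithmetic in the example above (500 - 2 x 100 = 300) can be written out as a small sketch. The tolerance parameter is a hypothetical stand-in for whatever spread an aligner allows around the expected distance; real aligners typically estimate this from the data.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len

def is_expected_separation(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    return abs(observed - expected) <= tolerance

expected = expected_inner_distance(fragment_len=500, read_len=100)
close_enough = is_expected_separation(320, expected, tolerance=50)
too_far = is_expected_separation(5000, expected, tolerance=50)
```

With a 500 bp fragment and 100 bp reads the expected gap is 300 bp; a pair separated by 320 bp passes the check, while one separated by 5000 bp does not.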

Proper pairs: 98.8 % (14 544 636 reads); remainder: 1.2 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
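The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration: the record fields are hypothetical, the first read seen at a position is kept, and production duplicate-marking tools apply more refined criteria than this.

```python
def mark_duplicates(reads):
    """Return one boolean per read: True if an earlier read had the same
    position, strand, and mate position (a putative duplicate)."""
    seen = set()
    is_duplicate = []
    for r in reads:
        key = (r["chrom"], r["pos"], r["strand"], r["mate_chrom"], r["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

# Hypothetical reads: the first two share all coordinates.
reads = [
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 450},
]
dups = mark_duplicates(reads)
duplicate_rate = sum(dups) / len(dups)
```

The third read starts at the same position as the first two but its mate maps elsewhere, so it is not marked. The sketch also shows why very deep data inflates the rate: independent fragments that happen to share coordinates are indistinguishable from true PCR duplicates under this criterion.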

Duplicates: 63.3 % (9 310 344 reads); remainder: 36.7 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (1M to 13M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is a similar representation to the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads may map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
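Because an unmapped read carries its mate's chromosome, a per-chromosome tally only needs each read's chromosome and its mapped/unmapped status. The sketch below assumes records have already been reduced to such pairs; the record layout and function name are illustrative assumptions.

```python
from collections import defaultdict

def per_chromosome_mapped_rate(records):
    """records: iterable of (chromosome, is_unmapped) tuples.
    Returns {chromosome: fraction of its reads that are mapped}."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, is_unmapped in records:
        counts[chrom][1 if is_unmapped else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records: the unmapped read inherits chromosome "1" from its mate.
records = [("1", False), ("1", False), ("1", False), ("1", True), ("X", False)]
rates = per_chromosome_mapped_rate(records)
```

Here chromosome 1 has three mapped reads and one unmapped read placed with its mate, giving a mapped fraction of 0.75.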

[Chart: mapped vs unmapped reads per chromosome. Y-axis: percentage of reads (0% to 100%).]

Chromosome  Mapped   Unmapped
1           99.8%    0.2%
2           99.79%   0.21%
3           99.8%    0.2%
4           99.8%    0.2%
5           99.82%   0.18%
6           99.8%    0.2%
7           99.8%    0.2%
8           99.79%   0.21%
9           99.79%   0.21%
10          99.79%   0.21%
11          99.8%    0.2%
12          99.79%   0.21%
13          99.78%   0.22%
14          99.8%    0.2%
15          99.77%   0.23%
16          99.8%    0.2%
17          99.8%    0.2%
18          99.81%   0.19%
19          99.8%    0.2%
20          99.8%    0.2%
21          99.79%   0.21%
X           99.8%    0.2%
Y           99.01%   0.99%
M           99.87%   0.13%