European Genome-Phenome Archive

File Quality

File Information: EGAF00003612375

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of bases (positions in the reference file) covered that many times.

The data are plotted on a log scale to compress the wide range of values. A high peak at the start (low coverage) followed by a descending curve is expected.
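As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value. The function name, the input list, and the overflow cap of 1000 (chosen to mirror the ">1000" bin on the chart) are assumptions for this example, not the archive's actual implementation.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    depths: iterable of per-position depths (non-negative integers).
    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`
    (an assumed convention, standing in for a '>1000' bin).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)

# Hypothetical per-position depths, for illustration only.
example_depths = [0, 1, 1, 2, 2, 2, 1500, 3000]
hist = coverage_histogram(example_depths)
```

Plotting the counts on a logarithmic y-axis then keeps both the tall low-coverage peak and the long tail visible.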

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 2k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
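The relationship between score and error probability can be made concrete with a small conversion sketch. It uses the standard Phred definition, Q = -10·log10(P); the function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)    # about 1 error in 10 000 calls
q30 = error_prob_to_phred(0.001) # about 30
```

On this scale, every increase of 10 in the score corresponds to a tenfold drop in the error probability.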

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 2G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
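A minimal sketch of the rate calculation described above follows. The function name and the read counts are hypothetical and only illustrate the arithmetic.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=997, unmapped=3)
print(f"{rate:.1%}")
```

Note that the denominator is the total number of reads, not the number of unmapped reads.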

Mapped reads: 23 270 068 (99.7 %). Unmapped reads: 0.3 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; the expected-distance criterion is covered by the Proper Pairs chart.

Both mates mapped: 23 197 642 (99.4 %). Other reads: 0.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
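The distinction between a singleton and a pair with both mates mapped can be sketched from SAM FLAG bits. The bit values below are the ones defined in the SAM specification; the function name and category labels are assumptions for this example, and the report itself does not state how its counts are derived.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify_read(flag):
    """Classify one alignment record by the mapping state of the pair."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"

labels = [classify_read(f) for f in (99, 73, 0x1 | 0x4, 0)]
```

Here flag 99 describes a mapped read whose mate is also mapped, while flag 73 describes a mapped read whose mate is unmapped, i.e. a singleton.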

Singletons: 72 426 (0.3 %). Other reads: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 11 671 764 (50 %). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
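The arithmetic behind the expected separation can be sketched as follows. The function names and the tolerance are hypothetical; real aligners typically judge pairs against an estimated insert-size distribution rather than a fixed window.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: the fragment minus both reads."""
    return fragment_length - 2 * read_length

def within_expected_distance(observed_gap, expected_gap, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is a hypothetical fixed window used only for illustration.
    """
    return abs(observed_gap - expected_gap) <= tolerance

gap = expected_inner_distance(fragment_length=500, read_length=100)  # 300
ok = within_expected_distance(observed_gap=320, expected_gap=gap, tolerance=100)
far = within_expected_distance(observed_gap=5000, expected_gap=gap, tolerance=100)
```

A pair whose mates map thousands of base pairs apart would fail such a check and could point to a structural variant.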

Proper pairs: 23 029 298 (98.7 %). Other reads: 1.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
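The coordinate-based approach described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: each pair is reduced to a hypothetical tuple of (chromosome, position, mate chromosome, mate position), and the first pair seen at a set of coordinates is kept as the original. Production tools also take strand and orientation into account.

```python
def flag_duplicates(pairs):
    """Flag every pair whose coordinates were already seen.

    pairs: list of (chrom, pos, mate_chrom, mate_pos) tuples.
    Returns a list of booleans, True where the pair repeats earlier coordinates.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags

# Hypothetical read pairs, for illustration only.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same read and mate coordinates: duplicate
    ("1", 100, "1", 450),  # same start, different mate position: kept
    ("2", 50, "2", 300),
]
flags = flag_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
```

Because only coordinates are compared, independent fragments that happen to share both coordinates are also flagged, which is why the rate rises with depth.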

Duplicates: 15 341 220 (65.7 %). Other reads: 34.3 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (2M to 22M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a position can also arise when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.
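A per-chromosome tally of this kind can be sketched as follows, assuming each record has been reduced to a hypothetical (chromosome, FLAG) tuple in which unmapped reads already carry the chromosome of their mate. The 0x4 bit is the "read unmapped" flag from the SAM specification; everything else is an assumption for this example.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def mapped_fraction_by_chromosome(records):
    """Return {chromosome: mapped / (mapped + unmapped)}.

    records: iterable of (chrom, flag) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records, for illustration only.
records = [("1", 0x1), ("1", 0x1), ("1", 0x1 | 0x4), ("X", 0x1)]
fractions = mapped_fraction_by_chromosome(records)
```

Each column of the stacked chart corresponds to one entry of the returned dictionary, with the unmapped share being one minus the mapped fraction.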

[Chart: mapped vs unmapped reads per chromosome, stacked columns. X-axis: chromosomes 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped fractions range from 99.05% to 99.71%; unmapped fractions range from 0.29% to 0.95%.]