European Genome-Phenome Archive

File Quality

File Information

EGAF00008087508

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the range of coverage values, and the y-axis shows the number of positions (bases) in the reference file covered at each value.

The data is shown on a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: Base Coverage Distribution. X-axis: Coverage value (up to >1000). Y-axis: # Bases (log scale, 1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
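As a minimal sketch of the relationship between score and error probability, the following converts in both directions using the standard Phred definition Q = -10 × log10(P). The function names are illustrative and do not come from the archive's tooling.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Return the error probability P = 10^(-Q/10) for a Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Return the Phred score Q = -10 * log10(P) for an error probability P."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


# Print the error probability for a few common scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition, a score of 40 corresponds to roughly one wrong base call in 10 000.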

[Chart: Base Quality. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0M to 400M).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
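The calculation can be sketched as below. The counts in the usage line are hypothetical round numbers chosen for illustration, not values taken from this file.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 9 990 mapped reads and 10 unmapped reads.
rate = mapped_rate(mapped=9_990, unmapped=10)
print(f"{rate:.1%}")  # 99.9%
```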

Mapped: 99.9 % (28 983 652 reads); unmapped: 0.1 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart reports the pairs that map at the expected distance.

Both mates mapped: 99.8 % (28 954 442 reads); remainder: 0.2 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
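As a rough sketch, and assuming the alignments are stored in SAM/BAM format, a read can be classified from its FLAG field alone. The bit values below are the standard SAM flag bits; the function name and category labels are illustrative.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def classify(flag: int) -> str:
    """Classify one read by the mapping status of the read and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"   # mapped read whose mate is unmapped
    return "both_mapped"


for flag in (99, 73, 133):
    print(flag, classify(flag))
```

A flag of 73 (paired, mate unmapped, first in pair) is a singleton, while its mate with flag 133 is the unmapped read.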

Singletons: 0.1 % (29 210 reads); remainder: 99.9 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
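Assuming SAM/BAM records, where FLAG bit 0x10 marks a read aligned to the reverse strand and bit 0x4 marks an unmapped read, the forward fraction could be computed roughly as follows. Counting only mapped reads in the denominator is an assumption of this sketch.

```python
UNMAPPED = 0x4   # standard SAM FLAG bit: read is unmapped
REVERSE = 0x10   # standard SAM FLAG bit: read maps to the reverse strand


def forward_fraction(flags) -> float:
    """Return the fraction of mapped reads aligned to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Two forward reads (99, 163), two reverse reads (147, 83), one unmapped read (77).
print(forward_fraction([99, 147, 83, 163, 77]))  # 0.5
```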

Forward strand: 50 % (14 509 720 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, mates mapping with an unexpected separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
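The arithmetic behind the ~300 base pair figure can be written out as a small sketch. The function name is illustrative, and real aligners decide what counts as "expected" from the observed insert size distribution rather than from a single fixed number.

```python
def expected_inner_distance(fragment_length: int, read_length: int) -> int:
    """Return the gap between two mates read from either end of a fragment."""
    return fragment_length - 2 * read_length


# A ~500 bp fragment with a 100 bp read from each end leaves ~300 bp between the mates.
print(expected_inner_distance(fragment_length=500, read_length=100))  # 300
```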

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).

Proper pairs: 98.9 % (28 697 504 reads); remainder: 1.1 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
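The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the method used to produce this report: the `Read` type and its fields are hypothetical, and the first read seen at each position is kept, whereas production tools typically choose which copy to keep using base qualities.

```python
from typing import Iterable, NamedTuple, Set


class Read(NamedTuple):
    """Hypothetical minimal alignment record used only for this sketch."""
    name: str
    chrom: str
    pos: int
    reverse: bool
    mate_chrom: str
    mate_pos: int


def mark_duplicates(reads: Iterable[Read]) -> Set[str]:
    """Return names of reads sharing read and mate coordinates with an earlier read."""
    seen = set()
    duplicates = set()
    for r in reads:
        key = (r.chrom, r.pos, r.reverse, r.mate_chrom, r.mate_pos)
        if key in seen:
            duplicates.add(r.name)   # same coordinates as an earlier read
        else:
            seen.add(key)            # first read seen at these coordinates
    return duplicates


reads = [
    Read("a", "1", 100, False, "1", 300),
    Read("b", "1", 100, False, "1", 300),   # same coordinates as "a"
    Read("c", "1", 100, True, "1", 300),    # different strand, not a duplicate
    Read("d", "2", 100, False, "2", 300),
]
dups = mark_duplicates(reads)
print(sorted(dups), len(dups) / len(reads))
```

Because the key depends only on coordinates, two independent fragments that happen to start at the same positions are also flagged, which is the over-counting effect at high depth that the text describes.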

Duplicates: 69.6 % (20 197 075 reads); remainder: 30.4 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (up to 26M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
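A per-chromosome tally along these lines might look like the sketch below. It assumes SAM-style records reduced to `(RNAME, FLAG)` tuples, where FLAG bit 0x4 marks an unmapped read and an RNAME of `*` means no chromosome was assigned. The function name and input layout are illustrative.

```python
from collections import Counter

UNMAPPED = 0x4  # standard SAM FLAG bit: read is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) tuples."""
    mapped = Counter()
    unmapped = Counter()
    for rname, flag in records:
        if rname == "*":
            continue  # read has no chromosome assigned, so it cannot be attributed
        if flag & UNMAPPED:
            unmapped[rname] += 1   # unmapped read placed at its mate's position
        else:
            mapped[rname] += 1
    rates = {}
    for chrom in sorted(set(mapped) | set(unmapped)):
        total = mapped[chrom] + unmapped[chrom]
        rates[chrom] = mapped[chrom] / total
    return rates


records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("*", 77)]
print(per_chromosome_mapped_rate(records))  # {'1': 0.75}
```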

Chromosome: mapped / unmapped
1: 99.91% / 0.09%
2: 99.89% / 0.11%
3: 99.91% / 0.09%
4: 99.91% / 0.09%
5: 99.9% / 0.1%
6: 99.91% / 0.09%
7: 99.92% / 0.08%
8: 99.91% / 0.09%
9: 99.89% / 0.11%
10: 99.9% / 0.1%
11: 99.91% / 0.09%
12: 99.9% / 0.1%
13: 99.91% / 0.09%
14: 99.91% / 0.09%
15: 99.89% / 0.11%
16: 99.89% / 0.11%
17: 99.92% / 0.08%
18: 99.9% / 0.1%
19: 99.81% / 0.19%
20: 99.88% / 0.12%
21: 99.88% / 0.12%
X: 99.91% / 0.09%
Y: 99.59% / 0.41%
M: 99.9% / 0.1%