European Genome-Phenome Archive

File Quality

File Information

EGAF00002397389

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
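As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions share each coverage value and pools everything above a cap into a single ">1000" style bin. The function names, the cap, and the sample depths are assumptions made for this example; the report does not describe how EGA computes the chart.

```python
from collections import Counter
import math


def coverage_histogram(depths, cap=1000):
    """Count how many reference positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed `cap + 1`.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


def log10_counts(hist):
    """Return log10 of each bin count, as a log-scaled y-axis would show it."""
    return {cov: math.log10(n) for cov, n in hist.items() if n > 0}


# Made-up per-position depths, for illustration only.
depths = [0, 1, 1, 2, 2, 2, 5, 1500]
hist = coverage_histogram(depths)
scaled = log10_counts(hist)
```

Here `hist[2]` counts the positions covered exactly twice, and `hist[1001]` holds every position whose coverage exceeds the cap.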

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 2k to 10M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
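The relationship between a Phred score and an error probability can be checked with a short sketch. It uses the standard Phred definition, Q = -10 × log10(P); the function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q into the probability P that the call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) into a Phred score Q."""
    return -10 * math.log10(p)


p_q40 = phred_to_error_prob(40)    # one error in 10,000 calls
q_p001 = error_prob_to_phred(0.001)  # one error in 1,000 calls
```

A score of 40 therefore corresponds to an error probability of 0.0001, and each additional 10 points divides the error probability by ten.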

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 3G).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
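The calculation above amounts to a single ratio. The following is a minimal sketch with made-up counts; the function name and the input numbers are illustrative and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=996, unmapped=4)
percent = round(rate * 100, 1)
```

Note that the denominator is the sum of both counts, which is why the parentheses in the formula matter.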

Mapped: 99.6 % (56 256 126 reads). Unmapped: 0.4 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 99.3 % (56 063 120 reads). Other: 0.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
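In SAM/BAM files, the mapping state of a read and of its mate is recorded in the FLAG field, so the singleton definition above can be sketched as a bit test. This assumes the input is SAM/BAM and uses the standard flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are made up for this example, and the report does not state that EGA computes the statistic this way.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag):
    """Classify a read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # this read maps, its mate does not
    return "both_mates_mapped"


labels = [classify_read(f) for f in (99, 73, 133, 0)]
```

Flag 73 (paired, mate unmapped, first in pair) is the singleton case, and flag 133 is the unmapped mate that goes with it.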

Singletons: 0.3 % (193 006 reads). Other: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
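For SAM/BAM input, strand is encoded in the FLAG field: bit 0x10 is set when a read aligns to the reverse strand. The sketch below computes the forward fraction over mapped reads only. The function name and the sample flags are assumptions for illustration, and whether the report excludes unmapped reads from the denominator is also an assumption.

```python
# Standard SAM FLAG bits.
UNMAPPED = 0x4
REVERSE = 0x10


def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Made-up flags: two forward reads, two reverse reads, one unmapped read.
fraction = forward_fraction([0, 16, 99, 147, 4])
```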

Forward strand: 50 % (28 232 399 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads will map with an unexpectedly large separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).
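The arithmetic behind the fragment-length example can be written out as a small sketch. The function names and the tolerance value are hypothetical; real aligners estimate the expected distance from the data and apply their own criteria for flagging proper pairs.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read from either end of a fragment.

    A negative result means the two reads overlap.
    """
    return fragment_len - 2 * read_len


def within_expected_distance(observed_gap, fragment_len, read_len, tolerance=150):
    """Check an observed gap against the expected one (tolerance is hypothetical)."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance


gap = expected_gap(fragment_len=500, read_len=100)
near = within_expected_distance(320, 500, 100)
far = within_expected_distance(5000, 500, 100)
```

With 500 bp fragments and 100 bp reads the expected gap is 300 bp, so an observed gap of 320 bp passes the check while a gap of 5000 bp does not.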

Proper pairs: 98.5 % (55 594 460 reads). Other: 1.5 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
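The coordinate-matching idea described above can be sketched as follows: a pair is counted as a duplicate when an earlier pair has the same read and mate coordinates. This is a simplified illustration with made-up data and a hypothetical tuple layout; real duplicate-marking tools account for further details (such as strand and clipped bases) that are omitted here.

```python
def duplicate_rate(pairs):
    """Fraction of pairs whose coordinates match an earlier pair.

    Each pair is a tuple (chrom, start, mate_chrom, mate_start).
    The first pair seen at a given key is kept as the original.
    """
    if not pairs:
        raise ValueError("no pairs given")
    seen = set()
    duplicates = 0
    for key in pairs:
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(pairs)


# Made-up pairs: the second entry repeats the first.
pairs = [
    ("1", 1000, "1", 1300),
    ("1", 1000, "1", 1300),
    ("1", 1000, "1", 1450),
    ("2", 5000, "2", 5310),
]
rate = duplicate_rate(pairs)
```

The third pair shares its start with the first but has a different mate position, so it is not counted as a duplicate.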

Chart values: 95.1 % (53 689 589 reads) and 4.9 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect to see this distribution heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (5M to 50M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read should then have the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
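A per-chromosome tally like the one charted here could be sketched as below, assuming SAM-style records in which an unmapped read with a placed mate carries its mate's chromosome and reads with no placement have "*" as the chromosome. The function name, the record layout, and the sample data are assumptions for illustration only.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: read unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) records.

    Reads with no chromosome assigned ("*") are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Made-up records: (chromosome, flag).
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("Y", 0), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```

In this example the read with flag 133 is unmapped but is counted under chromosome 1 because that is where its mate was placed.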

[Chart: mapped vs unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome  Mapped   Unmapped
1           99.66%   0.34%
2           99.66%   0.34%
3           99.67%   0.33%
4           99.64%   0.36%
5           99.66%   0.34%
6           99.68%   0.32%
7           99.65%   0.35%
8           99.64%   0.36%
9           99.64%   0.36%
10          99.64%   0.36%
11          99.67%   0.33%
12          99.67%   0.33%
13          99.63%   0.37%
14          99.68%   0.32%
15          99.64%   0.36%
16          99.65%   0.35%
17          99.65%   0.35%
18          99.65%   0.35%
19          99.69%   0.31%
20          99.67%   0.33%
21          99.65%   0.35%
X           99.65%   0.35%
Y           97.9%    2.1%
M           99.7%    0.3%