European Genome-Phenome Archive

File Quality

File Information: EGAF00003611470

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage value.

The data are plotted on a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.
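
As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions have each coverage value and groups everything above 1000 into a single bin, mirroring the ">1000" bin on the chart. The function names and the input depths are hypothetical and are not taken from the EGA pipeline.

```python
from collections import Counter
import math


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # cap + 1 is used as the key of the ">cap" overflow bin
        hist[min(depth, cap + 1)] += 1
    return hist


def log10_counts(hist):
    """Log-scale the counts, as the chart does for its y-axis."""
    return {cov: math.log10(n) for cov, n in hist.items()}


# Hypothetical per-position depths, for illustration only
depths = [0, 0, 0, 0, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
```

With these made-up depths, four positions fall in the coverage-0 bin and one in the overflow bin.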

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
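
The standard Phred definition is Q = -10 log10(P), so a score of 40 corresponds to an error probability of 1 in 10 000. A minimal sketch of the conversion in both directions follows; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to the Phred score Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)
```

The same conversion applies to mapping quality scores, which use the same scale.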

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 2.4G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
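
The calculation can be sketched as below. The function name is illustrative, and the counts passed to it are hypothetical, since the page does not report the number of unmapped reads. The read counts used in the consistency check are the ones reported in the charts on this page.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads")
    return mapped / total


# Counts reported on this page: mapped reads are the reads whose mate
# also mapped plus the singletons.
both_mates_mapped = 43_253_226
singletons = 91_434
mapped = both_mates_mapped + singletons

# Hypothetical counts, for illustration only
example_rate = mapped_rate(998, 2)
```

The sum of the two reported counts equals the 43 344 660 mapped reads shown in the Mapped Reads chart.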

Mapped: 99.8 % (43 344 660 reads); unmapped: 0.2 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs in which both mates map, but not at the expected distance, are also included here; the expected distance is the criterion applied in the Proper Pairs chart.

Both mates mapped: 99.6 % (43 253 226 reads); remainder: 0.4 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
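
In SAM/BAM files these categories can be read from the FLAG field: bit 0x1 marks a paired read, 0x4 an unmapped read and 0x8 a read whose mate is unmapped. The sketch below classifies a single FLAG value on that basis. The category names are illustrative, and this is an assumption about how such a tally could be made, not a description of the EGA pipeline.

```python
# FLAG bits defined by the SAM specification
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify(flag):
    """Classify a read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        # This read maps, but its mate does not
        return "singleton"
    return "both_mates_mapped"
```

For example, FLAG 73 (paired, mate unmapped, first in pair) is classified as a singleton, while FLAG 99 is a read whose mate also maps.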

Singletons: 0.2 % (91 434 reads); remainder: 99.8 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
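
One way to compute this fraction from SAM FLAG values is sketched below: bit 0x10 marks a read mapped to the reverse strand, so a mapped read without that bit is on the forward strand. The function name and the example flags are hypothetical.

```python
UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
REVERSE = 0x10   # SAM FLAG bit: read maps to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads that map to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Hypothetical FLAG values: two forward, two reverse, one unmapped
example = forward_fraction([0, 16, 99, 147, 4])
```

The unmapped read is excluded from the denominator, so the example evaluates to 0.5.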

Forward strand: 50 % (21 723 582 reads); reverse strand: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with an unexpected separation are a signal for this variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
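
The arithmetic in the example above is the fragment length minus both read lengths: 500 - 2 × 100 = 300. A minimal sketch of that check follows. The `tolerance` parameter is a hypothetical stand-in; real aligners typically estimate the acceptable range from the observed insert size distribution.

```python
def expected_inner_distance(fragment_len, read_len):
    """Expected gap between the two mates: fragment minus both reads."""
    return fragment_len - 2 * read_len


def is_expected_distance(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance
```

With a 500 bp fragment and 100 bp reads, a gap of 320 bp passes with a tolerance of 50 bp, while a gap of 5 000 bp does not.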

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 99 % (43 024 440 reads); remainder: 1 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location by chance and be marked as duplicates even when they are not. Consequently, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
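
The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration: the record layout (dictionaries with `chrom`, `pos`, `mate_chrom`, `mate_pos`) is hypothetical, and real duplicate-marking tools also consider details such as read orientation and which copy to keep.

```python
def mark_duplicates(reads):
    """Flag every read whose (position, mate position) key was seen before."""
    seen = set()
    flags = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["mate_chrom"], read["mate_pos"])
        flags.append(key in seen)  # the first occurrence is not a duplicate
        seen.add(key)
    return flags


def duplicate_rate(reads):
    """Fraction of reads flagged as duplicates."""
    flags = mark_duplicates(reads)
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical reads: the second and fourth repeat the first one's coordinates
a = {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400}
b = {"chrom": "1", "pos": 250, "mate_chrom": "1", "mate_pos": 560}
example_flags = mark_duplicates([a, dict(a), b, dict(a)])
```

Because the key depends only on coordinates, independent fragments that happen to share both coordinates are also flagged, which is the depth effect the paragraph above describes.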

Chart values as extracted: 89.7 % (38 987 103 reads) and 10.3 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (5M to 40M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not. In that case the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.
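
A per-chromosome tally of this kind can be sketched as below, assuming each read has already been reduced to a pair of (chromosome, unmapped flag). Unmapped reads carry the chromosome of their mapped mate, as described above. The input format and function name are hypothetical.

```python
from collections import defaultdict


def per_chromosome_rates(records):
    """Return {chromosome: (mapped_fraction, unmapped_fraction)}."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for chrom, is_unmapped in records:
        counts[chrom][1 if is_unmapped else 0] += 1
    return {
        chrom: (m / (m + u), u / (m + u))
        for chrom, (m, u) in counts.items()
    }


# Hypothetical records, for illustration only
records = [("1", False), ("1", False), ("1", False), ("1", True), ("X", False)]
rates = per_chromosome_rates(records)
```

Each chromosome's two fractions sum to 1, which is what the stacked columns display.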

[Chart: mapped vs unmapped reads per chromosome, stacked columns. X-axis labels as extracted: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped fractions range from 99.74% to 99.83%; unmapped fractions range from 0.17% to 0.26%.]