European Genome-Phenome Archive

File Quality

File Information

EGAF00002528144

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.
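As a rough illustration of what such a distribution contains, the sketch below counts how many positions have each coverage value. The function name `coverage_histogram`, the input format (a list of per-position depths) and the pooling of values above 1000 into one bin are assumptions made for this example, not a description of how the archive computes the chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value (hypothetical helper).

    depths: iterable of per-position coverage values (non-negative ints).
    Values above `cap` are pooled into a single bin keyed as cap + 1,
    mirroring the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    # Sort by coverage value so the result reads like the chart's x-axis.
    return dict(sorted(hist.items()))
```

Plotting the resulting counts on a logarithmic y-axis would give a curve of the shape described above.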

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 2k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.
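The relationship between a Phred score and an error probability can be sketched as follows. The function names are chosen for this example only. The formula used is the standard Phred definition, Q = -10 log10(P).

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)
```

Under this definition, a score of 40 corresponds to an error probability of 1 in 10,000, and a score of 20 to 1 in 100.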

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 9G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and pairs with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
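The calculation mapped / (mapped + unmapped) can be written as a small helper. The function name and the read counts in the usage note are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return the percentage of mapped reads among all reads.

    Returns 0.0 when there are no reads at all, to avoid dividing by zero.
    """
    total = mapped + unmapped
    if total == 0:
        return 0.0
    return 100 * mapped / total
```

For instance, with an illustrative 999 mapped reads and 1 unmapped read, `mapped_rate(999, 1)` gives a rate of 99.9%.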

[Chart: 99.9 % (83 453 311 reads) vs 0.1 %]

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; that criterion is assessed in the Proper Pairs chart.

[Chart: 99.8 % (83 392 778 reads) vs 0.2 %]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
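The categories described so far (both mates mapped, singleton, unmapped) can be told apart from the FLAG field of an alignment record. The sketch below assumes SAM/BAM-style flags, where bit 0x1 means the read is paired, 0x4 that the read is unmapped and 0x8 that its mate is unmapped. The function name and category labels are made up for this example.

```python
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def classify_read(flag):
    """Classify one read from its SAM FLAG value (illustrative labels)."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        # Mapped read whose mate failed to map.
        return "singleton"
    return "unmapped"
```

A read with flag 73 (paired, mate unmapped, first in pair) would be classified as a singleton, while flag 99 (paired, proper pair, mate on reverse strand, first in pair) falls under both mates mapped.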

[Chart: 0.1 % (60 533 reads) vs 99.9 %]

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
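A minimal sketch of this fraction, assuming SAM/BAM-style FLAG values in which bit 0x10 marks a read aligned to the reverse strand and bit 0x4 marks an unmapped read. The function name is hypothetical, and counting only mapped reads in the denominator is an assumption of this example.

```python
REVERSE_STRAND = 0x10  # read aligned to the reverse strand
READ_UNMAPPED = 0x4    # read is unmapped (strand is not meaningful)


def forward_fraction(flags):
    """Return the fraction of mapped reads aligned to the forward strand."""
    mapped = [f for f in flags if not f & READ_UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE_STRAND)
    return forward / len(mapped)
```

A value far from 0.5 on a large sample would be the kind of deviation the paragraph above describes.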

[Chart: 50 % (41 773 720 reads) vs 50 %]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion or deletion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.
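The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. The helper below is a sketch with a made-up name. It assumes both reads have the same length and that they do not overlap.

```python
def expected_inner_distance(fragment_length, read_length):
    """Expected gap between the two mates of a pair, in base pairs.

    Assumes both mates have the same length; a negative result would mean
    the reads overlap within the fragment.
    """
    return fragment_length - 2 * read_length
```

With the numbers from the example above, `expected_inner_distance(500, 100)` gives 300.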

[Chart: 97.9 % (81 795 924 reads) vs 2.1 %]

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
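The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any particular tool: the function names and the tuple layout (chromosome, position, mate chromosome, mate position) are assumptions, and real duplicate markers also consider strand, clipping and base qualities.

```python
def mark_duplicates(pairs):
    """Flag read pairs whose coordinates were already seen.

    pairs: list of (chrom, pos, mate_chrom, mate_pos) tuples.
    Returns a list of booleans; the first pair at a given set of
    coordinates is kept (False), later ones are flagged (True).
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


def duplicate_rate(pairs):
    """Percentage of pairs flagged as duplicates by mark_duplicates."""
    if not pairs:
        return 0.0
    return 100 * sum(mark_duplicates(pairs)) / len(pairs)
```

Because the check relies on coordinates alone, independent fragments that happen to start at the same positions are also flagged, which is why the rate becomes less informative at very high depth.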

[Chart: 97.3 % (81 300 738 reads) vs 2.7 %]

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (10M to 80M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position, but it is also classified as unmapped.
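A per-chromosome breakdown of this kind could be computed as in the sketch below. It assumes each record is a (chromosome, FLAG) pair taken from SAM/BAM-style alignments, where an unmapped read carries the chromosome of its mapped mate and bit 0x4 marks it as unmapped. The function name and input format are hypothetical.

```python
from collections import defaultdict


def mapped_percent_by_chrom(records):
    """Return {chromosome: percentage of mapped reads} for each chromosome.

    records: iterable of (chrom, flag) tuples. An unmapped read is counted
    under whatever chromosome its record carries (normally its mate's).
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & 0x4 else 0] += 1
    return {
        chrom: 100 * mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }
```

The unmapped percentage for each chromosome is then simply 100 minus the mapped percentage.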

[Chart: mapped vs unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome  Mapped   Unmapped
1           99.93%   0.07%
2           99.92%   0.08%
3           99.93%   0.07%
4           99.93%   0.07%
5           99.93%   0.07%
6           99.93%   0.07%
7           99.93%   0.07%
8           99.93%   0.07%
9           99.93%   0.07%
10          99.92%   0.08%
11          99.92%   0.08%
12          99.93%   0.07%
13          99.93%   0.07%
14          99.94%   0.06%
15          99.93%   0.07%
16          99.93%   0.07%
17          99.92%   0.08%
18          99.92%   0.08%
19          99.95%   0.05%
20          99.9%    0.1%
21          99.92%   0.08%
X           99.93%   0.07%
Y           99.95%   0.05%
M           99.95%   0.05%