European Genome-Phenome Archive

File Quality

File Information

EGAF00003611552

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value; the y-axis shows the number of reference positions (bases) covered that many times.

The data are plotted on a log scale to minimise the variability. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 10M).]
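
The distribution described above can be sketched as a simple count of positions per coverage value. This is only an illustration, not the archive's own pipeline: the function name `coverage_histogram`, the `cap` parameter, and the input depths are hypothetical. The overflow bin mirrors the ">1000" label on the chart's x-axis.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many reference positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed by cap + 1,
    which plays the role of the ">1000" bin on the chart.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 5, 1200, 3000]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a logarithmic y-axis keeps both the tall low-coverage peak and the long tail visible in one chart.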

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 1.4G).]
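
As a worked illustration of the Phred scale, the sketch below converts between a score and an error probability using the standard definition Q = -10 log10(P). The function names are hypothetical and chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4%}")
```

A score of 40, the typical value mentioned above, corresponds to roughly one expected error in 10 000 base calls.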

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated but, with more significant variation, a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (15 917 750 reads). Unmapped: 0.2 %.
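
The rate calculation can be sketched as follows. The mapped count is the one reported for this file; the unmapped count is a hypothetical value chosen only to be consistent with the reported 0.2 %, since the page does not state it. The function name `mapped_rate` is also an assumption.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total


mapped = 15_917_750   # reported on this page
unmapped = 30_000     # hypothetical, not stated on this page

rate = mapped_rate(mapped, unmapped)
print(f"{rate:.1%}")
```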

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads that do not map at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 99.7 % (15 892 200 reads). Remainder: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (25 550 reads). Remainder: 99.8 %.
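
One way to tell these categories apart is from the FLAG field of each alignment record. The sketch below uses the bit values defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name `classify` and the category labels are hypothetical, and this page does not say how the archive itself derives its counts. Secondary and supplementary alignments are ignored for simplicity.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify(flag):
    """Assign a read to a mapping category based on its SAM FLAG."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "mapped_unpaired"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"


for flag in (99, 73, 133, 0):
    print(flag, classify(flag))
```

In this report the two mapped categories add up to the mapped total: 15 892 200 reads with both mates mapped plus 25 550 singletons gives 15 917 750 mapped reads.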

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (7 972 153 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads that map with an unexpected separation would be a signal for the variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.8 % (15 752 452 reads). Remainder: 1.2 %.
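
The expected separation in the example above is just the fragment length minus the two read lengths. The sketch below works through that arithmetic and adds a naive distance check; the function names and the fixed `tolerance` are hypothetical, and real aligners decide what counts as a proper pair using their own criteria, including read orientation.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len


def looks_proper(observed_gap, fragment_len, read_len, tolerance=100):
    """Naive check: is the observed gap within `tolerance` of the expected one?"""
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance


print(expected_inner_distance(500, 100))  # 500 - 2 * 100
print(looks_proper(320, 500, 100))
print(looks_proper(5000, 500, 100))
```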

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 77.5 % (12 360 687 reads). Remainder: 22.5 %.
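
The coordinate-based idea described above can be sketched as follows: pairs sharing the same read and mate coordinates are grouped, and every pair after the first in a group is flagged. This is a simplified illustration with a hypothetical record layout and function name; real duplicate-marking tools apply further criteria, and this page does not say which tool produced the reported figure.

```python
def mark_duplicates(pairs):
    """Return the names of pairs flagged as duplicates.

    Each pair is (name, chrom, pos, mate_chrom, mate_pos). The first pair
    seen at a given set of coordinates is kept; later ones are flagged.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical read pairs, for illustration only.
pairs = [
    ("r1", "chr1", 100, "chr1", 400),
    ("r2", "chr1", 100, "chr1", 400),  # same coordinates as r1
    ("r3", "chr1", 100, "chr1", 450),  # mate maps elsewhere
    ("r4", "chr2", 100, "chr2", 400),
]
dups = mark_duplicates(pairs)
dup_rate = len(dups) / len(pairs)
print(sorted(dups), dup_rate)
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is the over-counting at high depth that the paragraph above warns about.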

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (up to 14M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read should have the same chromosome and POS as its mate. Unmapped reads with a position can also arise when alignment is performed with bwa, as it concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it will be given the right chromosome and position, but it will also be classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (1 to 21, X, Y, M). Mapped: 99.81% to 99.85% per chromosome. Unmapped: 0.15% to 0.19% per chromosome. Y-axis: 0% to 100%.]