European Genome-Phenome Archive

File Quality

File Information: EGAF00003611276

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: x-axis, coverage value (0 to >1000); y-axis, number of bases (log scale, 1k to 20M).]
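The distribution described above can be sketched as a histogram over per-position depths. This is a minimal illustration, not the archive's own pipeline: it assumes the depths are already available as a list of integers, and the function name `coverage_histogram` and the `cap` parameter are hypothetical. The overflow bin mirrors the ">1000" bucket on the chart's x-axis.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Illustrative depths only; these are not values from this file.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2300]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{1000}" if value == 1001 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis would give the shape the text describes: many positions at low coverage and a long descending tail.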

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: x-axis, Phred quality score (0 to 40); y-axis, number of bases (0G to 1.8G).]
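To make the relationship between score and error probability concrete, the following sketch converts in both directions using the standard Phred definition Q = -10·log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

A score of 40 therefore corresponds to roughly one wrong call in 10 000, and each additional 10 points divides the error probability by ten.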

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and pairs with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation the read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

99.8 % of reads (18 718 852) are mapped; the remaining 0.2 % are unmapped.
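The calculation described above, mapped / (mapped + unmapped), can be written as a small helper. The function name and the example counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts chosen for easy arithmetic.
rate = mapped_rate(mapped=998, unmapped=2)
print(f"{rate:.1%}")
```

Dividing by the sum of both counts, rather than by the unmapped count alone, is what keeps the result between 0 and 1.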

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart is the one that takes mapping distance into account.

99.6 % of reads (18 678 998) are in pairs with both mates mapped; the remaining 0.4 % are not.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

0.2 % of reads (39 854) are singletons; the remaining 99.8 % are not.
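The singleton definition above can be expressed with the FLAG bits of the SAM format. This sketch assumes the alignments are stored as SAM/BAM records; the report does not state how its count is computed, so this is one plausible reading, not the archive's method. The bit values (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) come from the SAM specification; the function name is illustrative.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def is_singleton(flag):
    """True if the read is paired and mapped, but its mate is not mapped."""
    return (
        bool(flag & PAIRED)
        and not (flag & UNMAPPED)
        and bool(flag & MATE_UNMAPPED)
    )

# 73 = paired + mate unmapped + first in pair
# 99 = paired + proper pair + mate on reverse strand + first in pair
# 69 = paired + read unmapped + first in pair
for flag in (73, 99, 69):
    print(flag, is_singleton(flag))
```

Dividing the number of reads for which this returns true by the total number of reads gives a singleton rate comparable to the one shown here.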

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them should map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

50 % of reads (9 379 742) map to the forward strand; the remaining 50 % do not.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion or deletion, the mates map with an unexpected separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

98.9 % of reads (18 556 718) are in proper pairs; the remaining 1.1 % are not.
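The arithmetic behind the ~300 base pair figure can be checked with a short sketch: the gap between the mates is the fragment length minus the two read lengths. The function names and the `tolerance` parameter are hypothetical, and real aligners decide on proper pairs using their own estimated insert-size distribution, not a fixed cut-off like this one.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both are read from the fragment ends."""
    return fragment_length - 2 * read_length

def separation_is_expected(observed, expected, tolerance):
    """True if the observed gap lies within `tolerance` bases of the expected gap."""
    return abs(observed - expected) <= tolerance

expected = expected_inner_distance(fragment_length=500, read_length=100)
print(expected)
print(separation_is_expected(observed=320, expected=expected, tolerance=100))
print(separation_is_expected(observed=5000, expected=expected, tolerance=100))
```

A pair with a gap of several thousand bases fails the check, which is the kind of signal a large structural variant would produce.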

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.

69.7 % of reads (13 071 522) are marked as duplicates; the remaining 30.3 % are not.
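The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any particular tool: it assumes each read is a dictionary with hypothetical keys `chrom`, `pos`, `mate_chrom` and `mate_pos`, and it treats every read after the first with the same coordinates as a duplicate.

```python
def mark_duplicates(reads):
    """Return one boolean per read: True if an earlier read shares its
    own coordinates and its mate's coordinates."""
    seen = set()
    flags = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["mate_chrom"], read["mate_pos"])
        flags.append(key in seen)
        seen.add(key)
    return flags

# Illustrative reads only.
reads = [
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},  # same as the first
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 450},  # mate differs
    {"chrom": "2", "pos": 100, "mate_chrom": "2", "mate_pos": 400},
]
flags = mark_duplicates(reads)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)
```

Because the key depends only on coordinates, two independent fragments that happen to start and end at the same positions are also flagged, which is why the rate becomes less informative at high depth.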

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.

[Chart: x-axis, Phred quality score (0 to 60); y-axis, number of reads (2M to 16M).]

Mapped vs Unmapped

Stacked column chart showing mapped and unmapped reads for every chromosome in the reference file. It presents the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: x-axis, chromosome (1 to 21, X, Y, M); y-axis, percentage of reads (0% to 100%); series: mapped, unmapped. The mapped fraction ranges from 98.9% to 99.82% per chromosome.]
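The per-chromosome breakdown described above can be sketched as a simple tally. This is a minimal illustration assuming each read has already been reduced to a `(chromosome, is_unmapped)` pair, with unmapped reads carrying the chromosome of their mate as explained in the text. The function name and input layout are hypothetical.

```python
from collections import defaultdict

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, is_unmapped) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, is_unmapped in records:
        counts[chrom][1 if is_unmapped else 0] += 1
    return {
        chrom: mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }

# Illustrative records only.
records = [("1", False), ("1", False), ("1", False), ("1", True), ("Y", False)]
rates = per_chromosome_mapped_rate(records)
print(rates)
```

Each column in the chart corresponds to one entry of the resulting dictionary, with the unmapped share being one minus the mapped fraction.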