European Genome-Phenome Archive

File Quality

File Information: EGAF00006164902

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis represents the coverage value, and the y-axis represents the number of bases (positions in the reference file) covered that many times.

The data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

Chart: number of bases (y-axis, "# Bases", log scale from 1k to 20M) by coverage value (x-axis, "Coverage value", up to >1000).
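As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions have each coverage value and pools everything above a cap into a single overflow bin, similar to the chart's ">1000" bin. The function name `coverage_histogram`, the `cap` parameter, and the toy depths are assumptions made for this example; they are not taken from the EGA pipeline or from this file.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin stored under
    the key cap + 1, standing in for a ">1000" bin.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Toy per-position depths (made-up values, not taken from this file)
depths = [0, 1, 1, 2, 2, 2, 1500, 3000]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis keeps both the very large and the very small bins visible on the same chart.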

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.

Chart: number of bases (y-axis, "# Bases", 0G to 130G) by Phred quality score (x-axis, 0 to 40). Four score bins are populated; in ascending order of score they contain 6 453 055, 4 583 719 040, 7 224 219 774 and 131 443 965 273 bases.
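The relationship between a Phred score and an error probability can be checked with a few lines of code. This is a minimal sketch using the standard Phred definition Q = -10·log10(P); the function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability, so a score of 40 means roughly one wrong call in 10 000.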

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (947 063 561 reads). Unmapped: 0.2 %.
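The rate described above divides the mapped count by the sum of mapped and unmapped reads. A minimal sketch, with a hypothetical function name and toy counts rather than this file's totals:

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total

# Toy counts for illustration only
print(f"{mapped_rate(998, 2):.1%}")
```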

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; the expected distance is what the Proper Pairs chart assesses.

Both mates mapped: 99.7 % (946 122 172 reads). Remainder: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (941 389 reads). Remainder: 99.9 %.
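The distinction between unmapped reads, singletons, and reads with both mates mapped can be read from the FLAG field of each alignment record. The sketch below uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the `classify` function and its category labels are assumptions for this example, and how EGA computes its own figures is not stated on this page.

```python
# Flag bits from the SAM specification
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the other mate is unmapped

def classify(flag):
    """Return 'unmapped', 'mapped_unpaired', 'singleton' or 'both_mapped'."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "mapped_unpaired"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"

# Example FLAG values: 99 (paired, both mapped), 73 (mate unmapped),
# 133 (this read unmapped)
for flag in (99, 73, 133):
    print(flag, classify(flag))
```

Under this classification, mapped reads are the singletons plus the reads with both mates mapped, which matches the description of the Mapped Reads chart.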

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (474 365 421 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map at an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98 % (929 335 820 reads). Remainder: 2 %.
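The ~300 base pair figure in the example above is the fragment length minus the two read lengths. The sketch below reproduces that arithmetic and adds a simple separation check. The `tolerance` parameter is purely hypothetical: aligners apply their own criteria, including read orientation, when deciding whether a pair is proper.

```python
def expected_separation(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len

def separation_as_expected(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap.

    `tolerance` is an illustrative parameter, not an aligner setting.
    """
    return abs(observed - expected) <= tolerance

gap = expected_separation(500, 100)
print(gap)
print(separation_as_expected(320, gap, tolerance=100))
print(separation_as_expected(5000, gap, tolerance=100))
```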

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 47.1 % (447 063 039 reads). Remainder: 52.9 %.
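The coordinate-based approach described above can be sketched as follows: reads whose own position and mate position both match an earlier read are flagged as duplicates. This is a deliberately simplified illustration with a made-up record layout; production duplicate-marking tools apply further rules (for example, how clipped bases are handled and which read of a group is kept) that are not modelled here.

```python
def mark_duplicates(reads):
    """Flag reads that repeat an earlier (position, mate position) key.

    Each read is a tuple: (chrom, pos, strand, mate_chrom, mate_pos).
    Returns a list of booleans, True where the read is a duplicate.
    """
    seen = set()
    flags = []
    for read in reads:
        flags.append(read in seen)  # duplicate if the key was seen before
        seen.add(read)
    return flags

# Toy records for illustration only
reads = [
    ("1", 1000, "+", "1", 1300),
    ("1", 1000, "+", "1", 1300),  # same coordinates as the first read
    ("1", 1000, "+", "1", 1350),  # same start, different mate position
]
flags = mark_duplicates(reads)
print(flags, sum(flags) / len(flags))
```

Because the key is only a set of coordinates, independent fragments that happen to share coordinates are also flagged, which is the depth effect the paragraph above warns about.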

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

Chart: number of reads (y-axis, "# Reads", 100M to 800M) by Phred mapping quality score (x-axis, 0 to 60).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

Chart: percentage of mapped and unmapped reads (y-axis, 0% to 100%) per chromosome (x-axis, labelled 1 to 21, X, Y, M). Across the 24 columns, the mapped fraction ranges from 99.79% to 99.92% and the unmapped fraction from 0.08% to 0.21%.
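A per-chromosome breakdown like this one can be tallied from the reference name and FLAG of each record, since an unmapped read with a mapped mate carries its mate's chromosome. The sketch below is an illustration under those assumptions: the `(rname, flag)` record layout and the function name are invented for the example, and only the 0x4 "read unmapped" bit comes from the SAM specification.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def per_chromosome_rates(records):
    """Return {chromosome: (mapped_fraction, unmapped_fraction)}.

    `records` is an iterable of (rname, flag) pairs. Reads with no
    chromosome assigned (rname "*") are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (mapped / (mapped + unmapped), unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }

# Toy records for illustration only
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
rates = per_chromosome_rates(records)
print(rates)
```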