European Genome-Phenome Archive

File Quality

File Information

EGAF00004921808

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage across the reference file. The x-axis shows the coverage value (the number of times a position in the reference file is covered), and the y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases (log scale, 1k to 20M).]
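As an illustration of how a distribution like this can be built, the following minimal sketch counts how many positions fall at each coverage value and pools everything above a cap into a single ">1000" bin, as the chart's x-axis does. The function name `coverage_histogram` and the input list of per-position depths are hypothetical; the report does not say how EGA computes the chart.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; depths above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # Integer keys for 0..cap, a string key such as ">1000" for the rest.
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return dict(hist)

# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 5, 37, 37, 2000]
print(coverage_histogram(example_depths))
```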

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect a distribution skewed towards larger (more confident) scores, typically around 40.
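To make the scale concrete, the sketch below converts between Phred scores and error probabilities using the standard Phred definition, Q = -10 log10(P). The function names are illustrative and not part of any EGA tool.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)

# Print the error probability for a few common scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition, a score of 40 corresponds to an error probability of 1 in 10 000.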

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 140G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation the read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.9 % (1 127 702 450 reads). Unmapped: 0.1 %.
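The rate calculation described in this section, mapped / (mapped + unmapped), can be sketched as follows. The function name and the example counts are hypothetical, since the report gives only the mapped count and the rounded percentages.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=999, unmapped=1)
print(f"{rate:.1%}")
```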

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected distance is taken into account in the Proper Pairs chart.

Both mates mapped: 99.8 % (1 127 053 118 reads). Remainder: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (649 332 reads). Remainder: 99.9 %.
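If the file is in SAM/BAM format (an assumption; the report does not state how these counts are derived), the singleton and both-mates-mapped categories can be read from each record's FLAG field. The bit values below come from the SAM specification; the function names are illustrative.

```python
# SAM FLAG bits, as defined in the SAM format specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def is_singleton(flag: int) -> bool:
    """True for a mapped read in a pair whose mate is unmapped."""
    paired = bool(flag & PAIRED)
    mapped = not (flag & UNMAPPED)
    mate_mapped = not (flag & MATE_UNMAPPED)
    return paired and mapped and not mate_mapped

def both_mates_mapped(flag: int) -> bool:
    """True for a mapped read in a pair whose mate is also mapped."""
    paired = bool(flag & PAIRED)
    mapped = not (flag & UNMAPPED)
    mate_mapped = not (flag & MATE_UNMAPPED)
    return paired and mapped and mate_mapped

# Flag 73 = paired + mate unmapped + first in pair.
print(is_singleton(73), both_mates_mapped(73))
```

In this sketch, the mapped count would be the number of singletons plus the number of reads with both mates mapped, which matches the figures in this report (649 332 + 1 127 053 118 = 1 127 702 450).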

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (564 475 527 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, mates that map with an unexpected separation can signal the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.1 % (1 095 648 782 reads). Remainder: 2.9 %.
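The fragment-length example in this section can be checked with a small calculation: the expected gap between the two mates is the fragment length minus the two read lengths. The function name is illustrative, and real aligners estimate the expected distance from the data rather than from a fixed formula.

```python
def expected_inner_distance(fragment_len: int, read_len: int) -> int:
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

# The example from the text: ~500 bp fragments with 100 bp reads.
print(expected_inner_distance(500, 100))
```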

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and both should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 41.1 % (464 258 895 reads). Remainder: 58.9 %.
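The coordinate-based approach described in this section can be sketched as below: pairs that share the same chromosome, read position, and mate position are grouped, and every pair after the first in a group is flagged. This is a simplified illustration with hypothetical names (`Pair`, `mark_duplicates`), not the method used to produce this report; production tools also take further details into account, such as strand orientation and base qualities.

```python
from typing import NamedTuple

class Pair(NamedTuple):
    name: str      # read pair identifier
    chrom: str     # chromosome both mates map to
    pos: int       # mapped position of the read
    mate_pos: int  # mapped position of its mate

def mark_duplicates(pairs):
    """Return names of pairs whose coordinates repeat an earlier pair."""
    seen = set()
    duplicates = []
    for pair in pairs:
        key = (pair.chrom, pair.pos, pair.mate_pos)
        if key in seen:
            duplicates.append(pair.name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical pairs: r2 shares all coordinates with r1.
pairs = [
    Pair("r1", "1", 1000, 1300),
    Pair("r2", "1", 1000, 1300),
    Pair("r3", "1", 1000, 1350),
]
print(mark_duplicates(pairs))
```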

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (0.1G to 1G).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (axis labels 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Mapped: 99.85% to 99.95% per chromosome. Unmapped: 0.05% to 0.15% per chromosome.]