European Genome-Phenome Archive

File Quality

File Information: EGAF00008012502

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Counts are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
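
As an illustration of what the chart tabulates, the sketch below counts how many positions have each coverage value. The function name, the example depths and the rule for pooling values above 1000 into one overflow bin are assumptions made for this example, not details of the EGA pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`,
    similar in spirit to the ">1000" bin on the chart (assumed rule).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths, not taken from this file.
example_depths = [0, 0, 1, 30, 30, 31, 2500]
hist = coverage_histogram(example_depths)
```

Plotting `hist` with a logarithmic y-axis would give a chart of the same kind as the one described here.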

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error has occurred. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
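
A minimal sketch of the conversion between Phred scores and error probabilities, using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_probability(q):
    """Return the error probability P for a Phred score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Return the Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.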

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 180G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
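
The calculation can be sketched as follows. The function name and the example counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 998 mapped reads and 2 unmapped reads.
rate = mapped_rate(998, 2)
```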

Mapped: 99.8 % (1 336 903 460 reads). Unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also counted here, as in the proper pairs chart.

Both mates mapped: 99.7 % (1 334 745 048 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
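
In SAM/BAM files, the mapping state of a read and of its mate is recorded in the FLAG field, so the categories above can be derived from flag bits. The sketch below uses the standard SAM flag values; the function name and category labels are assumptions made for this example, and the source does not state how the EGA pipeline itself computes these figures.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify_read(flag):
    """Classify one read by the mapping state of the read and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"
```

For example, `classify_read(73)` returns `"singleton"` (paired, mate unmapped, first in pair), while `classify_read(99)` returns `"both_mates_mapped"`.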

Singletons: 0.2 % (2 158 412 reads). Other: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (669 626 631 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation are a signal of that variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
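
The separation in the example follows from subtracting both read lengths from the fragment length. A minimal sketch of that arithmetic, with a function name chosen for this example:

```python
def expected_inner_distance(fragment_length, read_length):
    """Return the expected gap between two mates read from either end of a fragment."""
    return fragment_length - 2 * read_length


# The example from the text: ~500 bp fragments with 100 bp reads.
gap = expected_inner_distance(500, 100)
```

Here `gap` is 300, matching the ~300 base pairs quoted above.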

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.9 % (1 311 092 174 reads). Other: 2.1 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
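
The coordinate-based approach described above can be sketched as follows. This is a simplified illustration under stated assumptions: each pair is reduced to a hypothetical tuple of (chromosome, position, mate chromosome, mate position), and the first pair seen at a given key is kept. Real duplicate-marking tools also consider details such as read orientation and which copy to keep.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates match an earlier pair.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples.
    Returns a list of booleans, True where the pair is a duplicate.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


# Hypothetical pairs: the second repeats the first, the third differs in mate position.
example_pairs = [("1", 100, "1", 400), ("1", 100, "1", 400), ("1", 100, "1", 450)]
duplicate_flags = mark_duplicates(example_pairs)
duplicate_rate = sum(duplicate_flags) / len(duplicate_flags)
```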

Duplicates: 7.5 % (100 665 664 reads). Other: 92.5 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
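
One way to summarise such a distribution is the fraction of reads at or above a chosen mapping quality. The sketch below assumes the histogram is available as a mapping from score to read count; the function name, the threshold and the example counts are hypothetical and are not taken from this file.

```python
def fraction_at_or_above(histogram, threshold):
    """Return the fraction of reads whose mapping quality is >= threshold.

    `histogram` maps a mapping quality score to a read count.
    """
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    kept = sum(count for score, count in histogram.items() if score >= threshold)
    return kept / total


# Hypothetical histogram: most reads at 60, a few at 0 and 30.
example_histogram = {0: 5, 30: 5, 60: 90}
well_mapped = fraction_at_or_above(example_histogram, 30)
```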

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (0.1G to 1.2G).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
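
A per-chromosome tally of this kind can be sketched from the reference name and FLAG of each record. The record layout, function name and example records below are assumptions made for this example; only the unmapped flag bit (0x4) comes from the SAM format.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: read itself is unmapped


def per_chromosome_mapped_rate(records):
    """Return the mapped fraction per chromosome.

    `records` is an iterable of (chromosome, flag) tuples. An unmapped read
    that carries its mate's chromosome is counted under that chromosome.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chromosome, flag in records:
        counts[chromosome][1 if flag & UNMAPPED else 0] += 1
    return {c: mapped / (mapped + unmapped) for c, (mapped, unmapped) in counts.items()}


# Hypothetical records: three mapped reads and one unmapped read on chromosome 1.
example_records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("Y", 0)]
rates = per_chromosome_mapped_rate(example_records)
```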

Chromosome  Mapped  Unmapped
1           99.84%  0.16%
2           99.83%  0.17%
3           99.85%  0.15%
4           99.86%  0.14%
5           99.85%  0.15%
6           99.85%  0.15%
7           99.85%  0.15%
8           99.84%  0.16%
9           99.84%  0.16%
10          99.84%  0.16%
11          99.84%  0.16%
12          99.84%  0.16%
13          99.86%  0.14%
14          99.84%  0.16%
15          99.84%  0.16%
16          99.83%  0.17%
17          99.82%  0.18%
18          99.85%  0.15%
19          99.81%  0.19%
20          99.82%  0.18%
21          99.8%   0.2%
X           99.84%  0.16%
Y           99.67%  0.33%
M           99.55%  0.45%