European Genome-Phenome Archive

File Quality

File Information: EGAF00002395293

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

Data is shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (up to >1000). Y-axis: # Bases (log scale).]
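As an illustration of what the chart counts, the following is a minimal sketch of building a coverage histogram from per-position depths. The function name `coverage_histogram`, the `cap` parameter, and the sample depths are assumptions made for this example; they are not taken from the archive's pipeline or from this file.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single bin keyed `cap + 1`,
    mirroring the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths for a tiny reference (illustrative only).
depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500]
hist = coverage_histogram(depths)

for coverage in sorted(hist):
    label = f">{1000}" if coverage > 1000 else str(coverage)
    print(label, hist[coverage])
```

Each printed row corresponds to one point on the chart: a coverage value and the number of bases observed at that coverage.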

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 90G).]
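To make the score scale concrete, here is a small sketch that converts between Phred scores and error probabilities, assuming the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly 1 error in 10,000 calls,
# and an error probability of 0.001 corresponds to a score of 30.
p40 = phred_to_error_prob(40)
q_for_1_in_1000 = error_prob_to_phred(0.001)
```

Under this definition every 10 points of Phred score reduce the error probability by a factor of ten.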

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped: 99.2 % (962 548 860 reads). Unmapped: 0.8 %.
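The ratio mapped / (mapped + unmapped) can be sketched as follows. This example assumes the reads come from a SAM/BAM file and that a read counts as unmapped when FLAG bit 0x4 is set; whether the archive computes the figure exactly this way (for example, how secondary or supplementary alignments are treated) is not stated on this page. The flag values in the sample list are hypothetical.

```python
UNMAPPED = 0x4  # SAM FLAG bit: this segment is unmapped


def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM FLAG values."""
    flags = list(flags)
    if not flags:
        raise ValueError("no reads supplied")
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / len(flags)


# Hypothetical flags: three mapped reads (99, 147, 73) and one unmapped read (133).
rate = mapped_rate([99, 147, 73, 133])
print(f"{rate:.1%}")
```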

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here, in contrast to the Proper Pairs chart.

Both mates mapped: 98.6 % (956 897 614 reads). Remaining reads: 1.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.6 % (5 651 246 reads). Remaining reads: 99.4 %.
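The distinction between pairs with both mates mapped, singletons, and unmapped reads can be sketched with SAM FLAG bits. This is an illustrative classification that assumes the standard SAM bits 0x1 (paired), 0x4 (segment unmapped) and 0x8 (mate unmapped); the function name and category labels are made up for this example. The final check uses the read counts reported on this page.

```python
PAIRED = 0x1         # template has multiple segments
UNMAPPED = 0x4       # this segment is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped


def pair_status(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"


# Counts reported on this page.
both_mates_mapped = 956_897_614
singletons = 5_651_246
mapped_reads = 962_548_860

# Mapped reads are the sum of reads in fully mapped pairs and singletons.
print(both_mates_mapped + singletons == mapped_reads)
```

The reported numbers are consistent with the description of mapped reads as "singletons and both mates".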

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (485 016 822 reads). Remaining reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant (e.g. a large insertion or deletion), reads that map with an unexpected separation are a signal for this variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).

Proper pairs: 94.8 % (919 601 320 reads). Remaining reads: 5.2 %.
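The arithmetic in the example above (500 bp fragments, 100 bp reads, ~300 bp separation) can be written out as a short sketch. The fixed `tolerance` threshold is a hypothetical simplification introduced here; real aligners typically decide what counts as a proper pair from the observed insert-size distribution and the read orientation, which this sketch does not model.

```python
def expected_gap(fragment_len, read_len):
    """Expected distance between the two mates: fragment minus both reads."""
    return fragment_len - 2 * read_len


def looks_proper(observed_gap, fragment_len, read_len, tolerance=150):
    """Rough check: is the observed gap within `tolerance` bp of the expected gap?

    `tolerance` is an assumed value for illustration only.
    """
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance


gap = expected_gap(500, 100)               # 500 - 2 * 100
near = looks_proper(320, 500, 100)         # close to the expected separation
far = looks_proper(5000, 500, 100)         # far larger than expected
```

A pair like the last one, mapping several kilobases apart, is the kind of signal that can point to a structural variant.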

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 13 % (126 567 757 reads). Remaining reads: 87 %.
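The coordinate-based idea described above can be sketched as follows: a pair is flagged as a duplicate if an earlier pair had the same read and mate coordinates. This is a deliberately simplified illustration with hypothetical data; production duplicate-marking tools also take orientation into account and choose which copy to keep (for example by base quality), none of which is modelled here.

```python
def mark_duplicates(pairs):
    """Flag pairs whose (chrom, pos, mate_chrom, mate_pos) was already seen.

    Returns a list of booleans, one per input pair, in input order.
    """
    seen = set()
    is_duplicate = []
    for key in pairs:
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


# Hypothetical read pairs: (chromosome, position, mate chromosome, mate position).
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates as the first pair
    ("1", 100, "1", 450),  # same start, different mate position
    ("2", 100, "2", 400),
]
dups = mark_duplicates(pairs)
duplicate_rate = sum(dups) / len(dups)
```

Because the key is only a set of coordinates, two independent fragments that happen to share coordinates are also flagged, which is the effect the text describes at high sequencing depth.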

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 70). Y-axis: # Reads (up to 800M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Although the sequenced sample may be from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it will be given the right chromosome and position, but it will also be classified as unmapped.

Chromosome  Mapped   Unmapped
1           99.42%   0.58%
2           99.42%   0.58%
3           99.42%   0.58%
4           99.45%   0.55%
5           99.42%   0.58%
6           99.42%   0.58%
7           99.43%   0.57%
8           99.44%   0.56%
9           99.45%   0.55%
10          99.44%   0.56%
11          99.41%   0.59%
12          99.42%   0.58%
13          99.43%   0.57%
14          99.42%   0.58%
15          99.42%   0.58%
16          99.41%   0.59%
17          99.41%   0.59%
18          99.44%   0.56%
19          99.4%    0.6%
20          99.4%    0.6%
21          99.34%   0.66%
X           99.39%   0.61%
Y           99.45%   0.55%
M           99.37%   0.63%