European Genome-Phenome Archive

File Quality

File Information: EGAF00008060638

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
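As an illustration of how such a distribution can be built, the sketch below counts how many positions fall at each coverage value and then takes log10 of the counts, which is what a log-scaled y-axis effectively displays. The function name, the `cap` parameter, and the sample depths are hypothetical and not taken from the EGA pipeline.

```python
from collections import Counter
import math

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`,
    mirroring a ">1000" bin at the end of the x-axis (an assumption).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 0, 1, 1, 2, 30, 30, 31, 1500]
hist = coverage_histogram(depths)

# log10 of the base counts, as shown on a log-scaled y-axis.
log_counts = {cov: math.log10(n) for cov, n in sorted(hist.items())}
```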

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: # bases, log scale (1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.
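The relationship between a Phred score and its error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math

def error_prob_to_phred(p):
    """Phred score for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)

def phred_to_error_prob(q):
    """Error probability implied by a Phred score q."""
    return 10 ** (-q / 10)
```

Under this definition, a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.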

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # bases (0G to 160G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
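A minimal sketch of that calculation is shown below. The function name and the example counts are hypothetical; only the formula mapped / (mapped + unmapped) comes from the description above.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts: 998 mapped and 2 unmapped reads.
rate = mapped_rate(998, 2)
```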

Mapped: 99.8 % (1 279 642 394 reads); unmapped: 0.2 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart.

Both mates mapped: 99.6 % (1 277 032 964 reads); remaining reads: 0.4 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
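Assuming the alignments are stored in SAM/BAM format (an assumption; the page does not state the format for this metric), a singleton can be recognised from the bitwise FLAG field defined in the SAM specification. The sketch below uses the specification's bit values; the function name is illustrative.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped

def is_singleton(flag):
    """True if the read is paired and mapped while its mate is unmapped."""
    return (
        bool(flag & PAIRED)
        and not flag & UNMAPPED
        and bool(flag & MATE_UNMAPPED)
    )
```

For example, a FLAG of 73 (paired, mate unmapped, first in pair) is a singleton, whereas 99 (paired, proper pair, mate on the reverse strand, first in pair) is not.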

Singletons: 0.2 % (2 609 430 reads); remaining reads: 99.8 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (641 236 684 reads); remaining reads: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with a separation that differs from the expected one; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
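The expected separation in that example follows from subtracting both read lengths from the fragment length. A minimal sketch, with an illustrative function name:

```python
def expected_separation(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length

# The example from the text: ~500 bp fragments with 100 bp reads.
gap = expected_separation(500, 100)  # 300
```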

Proper pairs: 97.9 % (1 255 586 166 reads); remaining reads: 2.1 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
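The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular duplicate-marking tool: each read pair is reduced to a hypothetical key of (chromosome, position, strand, mate chromosome, mate position), and every pair whose key has already been seen is counted as a duplicate.

```python
def count_duplicates(pair_keys):
    """Count pairs whose coordinate key repeats an earlier pair's key."""
    seen = set()
    duplicates = 0
    for key in pair_keys:
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates

# Hypothetical keys: (chrom, pos, strand, mate_chrom, mate_pos).
pairs = [
    ("1", 1000, "+", "1", 1400),
    ("1", 1000, "+", "1", 1400),  # same coordinates as the first pair
    ("1", 1000, "+", "1", 1450),  # mate differs, so not a duplicate
    ("2", 5000, "-", "2", 4600),
]
n_dup = count_duplicates(pairs)
dup_rate = n_dup / len(pairs)
```

Because the key only uses coordinates, two independent fragments that happen to share both endpoints are also counted, which is the over-counting at high depth mentioned above.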

Duplicates: 5.4 % (68 886 101 reads); remaining reads: 94.6 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # reads (0.1G to 1.1G).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
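Because an unmapped read can still carry a chromosome name, the per-chromosome breakdown can be tallied directly from each read's chromosome and mapped status. The sketch below assumes SAM-style records, where bit 0x4 of the FLAG field marks an unmapped read (per the SAM specification); the function name and sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records; flag 133 is an unmapped read placed at its mate's position.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 83)]
rates = per_chromosome_mapped_rate(records)
```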

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (autosomes, X, Y, M); y-axis: percentage of reads (0% to 100%). Mapped fractions range from 99.35% to 99.82%; unmapped fractions range from 0.18% to 0.65%.]