European Genome-Phenome Archive

File Quality

File Information: EGAF00002144877

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The y-axis uses a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: number of bases, log scale (1k to 100M).]
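As a rough illustration of what the chart tallies, the sketch below counts how many positions have each coverage value. The function name, the `cap` parameter and the toy depths are hypothetical and are not taken from this file; values above the cap are pooled into one overflow bin, similar to the ">1000" bin on the x-axis.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single overflow bin keyed
    `cap + 1` (an assumption made for this sketch).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)


# Toy per-position depths (hypothetical, not from this file).
depths = [0, 0, 1, 30, 30, 31, 1500]
hist = coverage_histogram(depths)
```

Here `hist[30]` is the number of positions covered exactly 30 times, which corresponds to the height of the curve at x = 30.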

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 80G).]
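To make the scale concrete, here is a minimal sketch of the conversion between Phred scores and error probabilities, using the standard Phred definition Q = -10 × log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)


p40 = phred_to_error_prob(40)   # about 1 error in 10,000 calls
q = error_prob_to_phred(0.001)  # about 30
```

Each step of 10 on the Phred scale corresponds to a tenfold drop in the error probability.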

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.7 % (770 778 444 reads). Unmapped: 0.3 %.
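The rate described above can be sketched as follows. The function name and the example counts are hypothetical and chosen only to show the arithmetic.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 997 mapped reads and 3 unmapped reads.
rate = mapped_rate(997, 3)
```

Note that the denominator is the total number of reads, so the parentheses matter: mapped / mapped + unmapped would be a different quantity.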

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates map at an unexpected distance from each other are still counted here; whether a pair maps at the expected distance is what the Proper Pairs chart measures.

Both mates mapped: 99.6 % (769 954 658 reads). Other: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (823 786 reads). Other: 99.9 %.
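If the file is in SAM/BAM format (an assumption; this page does not state the format), the categories above can be read from the FLAG field of each record. The sketch below uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are illustrative.

```python
# Flag bits from the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify(flag):
    """Classify one alignment record by its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


labels = [classify(f) for f in (99, 73, 133)]
```

Flag 99 is a mapped read whose mate is also mapped, 73 is a mapped read whose mate is unmapped, and 133 is an unmapped read.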

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (386 359 095 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it will map with an unexpected separation. This is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.6 % (754 231 480 reads). Other: 2.4 %.
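The arithmetic behind the ~300 base pair figure can be sketched as below. The function names and the `tolerance` parameter are hypothetical; real aligners decide what counts as a proper pair from the observed insert size distribution rather than from a fixed window.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two reads: fragment length minus both reads."""
    return fragment_len - 2 * read_len


def is_expected_separation(observed, fragment_len, read_len, tolerance=100):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is an arbitrary illustrative value, not a standard.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed - expected) <= tolerance


gap = expected_inner_distance(500, 100)  # 500 - 2 * 100 = 300
```

A pair observed with a gap of several thousand base pairs would fail this check, which is the kind of signal the text describes for large structural variants.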

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 14.7 % (113 602 009 reads). Other: 85.3 %.
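A minimal sketch of the coordinate-based idea described above: a read is flagged as a duplicate if an earlier read has the same mapping coordinate and the same mate coordinate. The record layout and function name are hypothetical, and production duplicate-marking tools apply more refined criteria than this.

```python
def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    `reads` is an iterable of (name, chrom, pos, mate_chrom, mate_pos)
    tuples. The first read seen at a given coordinate pair is kept;
    later reads with the same key are flagged.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, mate_chrom, mate_pos in reads:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates


# Toy records (hypothetical, not from this file).
reads = [
    ("r1", "1", 100, "1", 400),
    ("r2", "1", 100, "1", 400),  # same coordinates as r1
    ("r3", "1", 100, "1", 450),  # mate maps elsewhere, so not a duplicate
]
dups = mark_duplicates(reads)
dup_rate = len(dups) / len(reads)
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is why the rate becomes less informative at very high depth.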

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (up to 650M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

Chromosome  Mapped   Unmapped
1           99.9%    0.1%
2           99.89%   0.11%
3           99.91%   0.09%
4           99.9%    0.1%
5           99.91%   0.09%
6           99.9%    0.1%
7           99.9%    0.1%
8           99.9%    0.1%
9           99.91%   0.09%
10          99.9%    0.1%
11          99.9%    0.1%
12          99.89%   0.11%
13          99.91%   0.09%
14          99.9%    0.1%
15          99.9%    0.1%
16          99.87%   0.13%
17          99.87%   0.13%
18          99.91%   0.09%
19          99.83%   0.17%
20          99.89%   0.11%
21          99.86%   0.14%
X           99.9%    0.1%
Y           99.58%   0.42%
M           99.88%   0.12%