European Genome-Phenome Archive

File Quality

File Information: EGAF00008041454

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage value.

The number of bases is plotted on a log scale to compress its wide range. A high peak at the beginning (low coverage) followed by a descending curve is expected.
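As a rough illustration of how such a distribution could be tallied, the sketch below counts positions per coverage value and pools everything above a cap into one bucket, like the ">1000" bin on the chart. The function name, the `cap` parameter and the sample depths are hypothetical; the report does not say how it computes the chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single ">cap" bucket.
    """
    histogram = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        histogram[key] += 1
    return histogram


# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 0, 1, 1, 2, 30, 30, 31, 1500]
hist = coverage_histogram(depths)
print(hist[0], hist[30], hist[">1000"])  # 3 2 1
```

Plotting these counts on a logarithmic y-axis keeps both the very common and the very rare coverage values visible on the same chart.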

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
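To make the scale concrete, here is a small sketch of the conversion in both directions, using the standard Phred definition Q = -10 log10(P). The function names are illustrative and not part of any EGA tool.

```python
import math


def phred_to_error_probability(q):
    """Probability that the base call is wrong, given Phred score q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4%}")
```

Each step of 10 in the score divides the error probability by ten, so a score of 40 corresponds to one expected error in 10 000 base calls.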

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 140G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
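The calculation can be sketched as follows. The read counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def mapping_rate(mapped, unmapped):
    """Proportion of mapped reads among all reads."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Hypothetical counts: 997 mapped reads and 3 unmapped reads.
rate = mapping_rate(mapped=997, unmapped=3)
print(f"{rate:.1%}")  # 99.7%
```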

Mapped: 99.7 % (1 063 329 974 reads); unmapped: 0.3 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads that do not map at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 99.5 % (1 061 097 778 reads); other: 0.5 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
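For alignments stored in SAM/BAM format, the categories described above can be read from the FLAG field. The sketch below assumes that representation (the report does not state how it classifies reads) and uses the bit values defined by the SAM specification; the function name and category labels are illustrative.

```python
# Bit values defined by the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the other mate is unmapped


def classify_read(flag):
    """Classify one alignment record by its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"
    if flag & PAIRED:
        return "both mates mapped"
    return "mapped, unpaired"


print(classify_read(99))   # both mates mapped
print(classify_read(73))   # singleton
print(classify_read(133))  # unmapped
```

Under this scheme a singleton is a mapped read whose mate carries the unmapped bit, so each singleton is counted once and its unmapped mate is counted among the unmapped reads.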

Singletons: 0.2 % (2 232 196 reads); other: 99.8 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (533 307 459 reads); reverse strand: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads will map with a separation that differs markedly from the expected one. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
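The expected separation in the example above is simply the fragment length minus the two read lengths. A minimal sketch, with an illustrative function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both reads have the same length."""
    return fragment_length - 2 * read_length


# Values from the example: ~500 bp fragments, 100 bp reads from each end.
print(expected_inner_distance(500, 100))  # 300
```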

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.2 % (1 036 374 374 reads); other: 2.8 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
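The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the method used to produce this report: the record layout and field names are hypothetical, and production tools apply more refined rules when choosing which copy to keep.

```python
def find_duplicates(reads):
    """Return the names of reads whose coordinates repeat an earlier read.

    Each read is a dict with the hypothetical keys: name, chrom, pos,
    strand, mate_chrom and mate_pos. The first read seen at a given
    combination of coordinates is kept; later ones are flagged.
    """
    seen = set()
    duplicates = set()
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"],
               read["mate_chrom"], read["mate_pos"])
        if key in seen:
            duplicates.add(read["name"])
        else:
            seen.add(key)
    return duplicates


# Hypothetical reads: r2 shares every coordinate with r1, r3 does not.
reads = [
    {"name": "r1", "chrom": "1", "pos": 100, "strand": "+",
     "mate_chrom": "1", "mate_pos": 400},
    {"name": "r2", "chrom": "1", "pos": 100, "strand": "+",
     "mate_chrom": "1", "mate_pos": 400},
    {"name": "r3", "chrom": "1", "pos": 100, "strand": "+",
     "mate_chrom": "1", "mate_pos": 450},
]
duplicates = find_duplicates(reads)
print(sorted(duplicates))  # ['r2']
```

Because the key is built from coordinates alone, two independent fragments that happen to start and end at the same positions are flagged too, which is why the rate becomes less informative at high depth.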

Duplicates: 10.8 % (114 900 401 reads); non-duplicates: 89.2 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (100M to 900M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. They can also arise when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
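A per-chromosome tally of this kind could be computed as sketched below, assuming each record supplies the reference name and SAM FLAG of one read. The function name and the sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def mapped_fraction_by_chromosome(records):
    """Fraction of mapped reads per chromosome.

    `records` is an iterable of (chromosome, flag) pairs. An unmapped
    read that was placed at its mate's position is counted under that
    chromosome.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }


# Hypothetical records: chromosome 1 holds a singleton (flag 73) and
# its unmapped mate (flag 133), which inherits the mate's chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133),
           ("X", 99), ("X", 147)]
fractions = mapped_fraction_by_chromosome(records)
print(fractions)  # {'1': 0.75, 'X': 1.0}
```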

[Chart: mapped vs unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped reads range from 99.77% to 99.82% in every column except the last (99.38%); unmapped reads range from 0.18% to 0.23%, with 0.62% in the last column.]