European Genome-Phenome Archive

File Quality

File Information: EGAF00008208205

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.
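
As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions fall at each coverage value and pools everything above a cap into one overflow bin, like the ">1000" bin on the chart. The function name, the `cap` parameter and the sample depths are assumptions made for this example, not part of the EGA pipeline.

```python
import math
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many reference positions have each coverage value.

    Depths above `cap` are pooled into a single overflow bin keyed by cap + 1.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 30, 30, 31, 1500]
hist = coverage_histogram(depths)

# Log-scaled counts, as a log-scale y-axis would display them.
log_counts = {cov: math.log10(n) for cov, n in hist.items()}
```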

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: # bases (log scale, 2k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
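
The relationship between score and error probability can be checked with a short sketch, assuming the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that the call is wrong, given Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to about one error in 10,000 calls.
p40 = phred_to_error_prob(40)
q_for_1_in_1000 = error_prob_to_phred(0.001)
```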

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # bases (0G to 140G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads over the total number of reads: mapped / (mapped + unmapped).
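
The calculation is simple enough to sketch directly. The counts used here are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=998, unmapped=2)
print(f"{rate:.1%}")
```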

Mapped: 99.8 % (1 169 530 535 reads). Unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; the Proper Pairs chart, by contrast, counts only pairs that map at the expected distance.

Both mates mapped: 99.7 % (1 167 940 672 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
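
In SAM/BAM files, the distinction between these categories can be read from the FLAG field. The sketch below uses the standard SAM flag bits 0x1 (read paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function name and category labels are made up for this example.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify a read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"

# Example FLAG values: 99 (paired, both mapped), 73 (mapped, mate unmapped),
# 133 (unmapped second mate of a pair).
labels = [classify(f) for f in (99, 73, 133)]
```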

Singletons: 0.1 % (1 589 863 reads). Other: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them should map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
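
A minimal sketch of this fraction, assuming reads are described by their SAM FLAG values: bit 0x10 marks a read mapped to the reverse strand, and bit 0x4 marks an unmapped read, which is excluded. The function name and sample flags are illustrative.

```python
UNMAPPED = 0x4  # read is unmapped
REVERSE = 0x10  # read maps to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Two typical proper pairs: 99/147 and 83/163 (one forward, one reverse each).
fraction = forward_fraction([99, 147, 83, 163])
```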

Forward strand: 50 % (585 820 754 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
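
The ~300 base pair figure in the example follows from subtracting both read lengths from the fragment length. The sketch below reproduces that arithmetic and also shows how the proper-pair status set by the aligner can be read from SAM flag bit 0x2. The function names are illustrative.

```python
PROPER_PAIR = 0x2  # aligner marked the pair as properly aligned

def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates of a fragment sequenced from both ends.

    A negative result means the two reads overlap.
    """
    return fragment_length - 2 * read_length

def is_proper_pair(flag):
    """True if the SAM FLAG has the proper-pair bit set."""
    return bool(flag & PROPER_PAIR)

gap = expected_inner_distance(fragment_length=500, read_length=100)
print(gap)  # 300
```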

Proper pairs: 98.1 % (1 148 874 726 reads). Other: 1.9 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.
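
The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: the dictionary keys (`chrom`, `pos`, `strand`, `mate_chrom`, `mate_pos`) are hypothetical, and production duplicate-marking tools apply more refined rules.

```python
def mark_duplicates(reads):
    """Flag a read as duplicate if an earlier read shares its coordinates
    and its mate's coordinates. Returns one boolean per read."""
    seen = set()
    is_duplicate = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"],
               read["mate_chrom"], read["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

# Hypothetical reads: the second repeats the first, the third is distinct.
reads = [
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 450},
]
dups = mark_duplicates(reads)
duplicate_rate = sum(dups) / len(dups)
```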

Duplicates: 4 % (47 347 883 reads). Other: 96 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
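
One way to summarise such a distribution is to report the share of reads at a mapping quality of 0 and the share at or above a high threshold. The sketch below assumes a plain list of mapping quality values; the function name, the `high` threshold default and the sample values are assumptions for illustration.

```python
from collections import Counter

def mapq_summary(mapqs, high=60):
    """Fraction of reads with mapping quality 0 and with quality >= high."""
    if not mapqs:
        raise ValueError("no mapping qualities supplied")
    hist = Counter(mapqs)
    total = len(mapqs)
    high_count = sum(n for q, n in hist.items() if q >= high)
    return {
        "zero_fraction": hist[0] / total,
        "high_fraction": high_count / total,
    }

# Hypothetical mapping qualities, for illustration only.
summary = mapq_summary([0, 60, 60, 60, 30])
```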

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # reads (0.1G to 1G).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
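
A per-chromosome tally of this kind can be sketched from pairs of reference name and SAM FLAG, relying on the convention described above that an unmapped read carries its mate's chromosome. The function name and sample records are hypothetical. Reads with no chromosome at all (reference name `*`) are skipped.

```python
from collections import defaultdict

UNMAPPED = 0x4  # read is unmapped

def per_chromosome_mapped_rate(records):
    """Mapped fraction per chromosome from (chromosome, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue  # unplaced read, belongs to no chromosome
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {chrom: mapped / (mapped + unmapped)
            for chrom, (mapped, unmapped) in counts.items()}

# Hypothetical records: flag 133 is an unmapped mate placed on chromosome 1.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```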

[Chart: mapped vs unmapped reads per chromosome, stacked columns from 0% to 100%.]

Chromosome  Mapped   Unmapped
1           99.87%   0.13%
2           99.85%   0.15%
3           99.87%   0.13%
4           99.88%   0.12%
5           99.87%   0.13%
6           99.87%   0.13%
7           99.87%   0.13%
8           99.87%   0.13%
9           99.86%   0.14%
10          99.86%   0.14%
11          99.87%   0.13%
12          99.87%   0.13%
13          99.88%   0.12%
14          99.86%   0.14%
15          99.86%   0.14%
16          99.87%   0.13%
17          99.86%   0.14%
18          99.88%   0.12%
19          99.84%   0.16%
20          99.86%   0.14%
21          99.83%   0.17%
X           99.85%   0.15%
Y           99.84%   0.16%
M           99.6%    0.4%