European Genome-Phenome Archive

File Quality

File Information

EGAF00008236629

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage value.

Data is plotted on a log scale to minimise the variability. A high peak at the beginning (low coverage) followed by a descending curve is expected.
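As an illustration of how such a distribution can be tabulated, here is a minimal sketch that counts positions per coverage value and pools everything above 1000 into a single overflow bin, as the chart axis does. The function name `coverage_histogram` and the toy depths are assumptions made for this example; the report does not say how the archive computes the chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed ">cap".
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist


# Toy per-position depths, made up for illustration only.
depths = [0, 0, 1, 30, 30, 31, 1500]
hist = coverage_histogram(depths)
print(hist[30], hist[">1000"])  # prints: 2 1
```

Plotting the counts on a log-scaled y-axis would then give the shape described above.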

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
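The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 × log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly one error in 10 000 base calls.
print(phred_to_error_prob(40))
# An error probability of 1 in 1 000 corresponds to a score of about 30.
print(error_prob_to_phred(0.001))
```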

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 160G). Non-zero bin counts, in order of increasing score: 28 915 085; 6 795 056 662; 9 909 533 237; 172 721 301 710.]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
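The calculation described above is a simple proportion. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Made-up counts for illustration: 999 mapped reads and 1 unmapped read.
rate = mapped_rate(999, 1)
print(f"{rate:.1%}")  # prints: 99.9%
```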

Mapped: 99.9 % (1 252 870 847 reads). Unmapped: 0.1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here, unlike in the Proper Pairs chart.

Both mates mapped: 99.8 % (1 251 621 494 reads). Other reads: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
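In SAM/BAM files, the pair status described above is encoded in the FLAG field. The sketch below classifies a single read from its FLAG using the standard SAM bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped). That the report's statistics are derived this way is an assumption, and `classify_pair_status` is a name chosen for this example.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify_pair_status(flag):
    """Classify one read from its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # this read maps, its mate does not
    return "unmapped"


for flag in (99, 73, 133):
    print(flag, classify_pair_status(flag))
```

Here 99 is a mapped read whose mate is also mapped, 73 is a mapped read with an unmapped mate, and 133 is the unmapped mate itself.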

Singletons: 0.1 % (1 249 353 reads). Other reads: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
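A strand balance of this kind could be computed from SAM FLAG values, where bit 0x10 marks a read aligned to the reverse strand and bit 0x4 marks an unmapped read. This is a sketch under that assumption, not the archive's documented method; `forward_fraction` and the example flags are illustrative.

```python
UNMAPPED = 0x4  # read is unmapped
REVERSE = 0x10  # read aligned to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads on the forward strand, or None if none are mapped."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return None
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Made-up flags: two forward reads (0, 99), two reverse reads (16, 147), one unmapped (4).
fraction = forward_fraction([0, 16, 99, 147, 4])
print(fraction)  # prints: 0.5
```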

Forward strand: 50 % (627 333 797 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with an unusually large separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
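The fragment arithmetic in the example above (a ~500 base pair fragment with 100 base pair reads from either end leaves ~300 base pairs between the reads) can be written out as a short sketch. The function name is hypothetical, and real aligners decide what counts as a proper pair from the observed insert-size distribution rather than from a single fixed value.

```python
def expected_inner_distance(fragment_length, read_length):
    """Unsequenced gap between two mates read inward from both fragment ends."""
    return fragment_length - 2 * read_length


gap = expected_inner_distance(fragment_length=500, read_length=100)
print(gap)  # prints: 300
```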

Proper pairs: 97.8 % (1 227 492 800 reads). Not proper pairs: 2.2 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
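The coordinate-based approach described above can be sketched as follows. This is a deliberate simplification: each pair is reduced to a hypothetical tuple of (chromosome, position, mate chromosome, mate position), and every pair after the first with the same tuple is flagged. Production duplicate-marking tools also take strand and orientation into account and choose which copy to keep by quality, which this sketch does not do.

```python
def mark_duplicates(pairs):
    """Flag each pair as a duplicate if an identical coordinate tuple was seen before."""
    seen = set()
    flags = []
    for pair in pairs:
        flags.append(pair in seen)
        seen.add(pair)
    return flags


# Made-up pairs: (chromosome, position, mate chromosome, mate position).
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates as the first pair
    ("1", 100, "1", 450),  # same start, different mate position
    ("2", 100, "2", 400),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)  # prints: [False, True, False, False] 0.25
```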

Duplicates: 3.6 % (45 779 664 reads). Not duplicates: 96.4 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
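As a worked example of the Phred relationship for mapping quality, using the standard definition Q = -10 × log10(P) and inverting it for P:

```latex
P = 10^{-Q/10}, \qquad
Q = 60 \;\Rightarrow\; P = 10^{-6}, \qquad
Q = 0 \;\Rightarrow\; P = 10^{0} = 1
```

A score of 60 therefore corresponds to a one in a million chance that the read is misplaced, while a score of 0 carries no confidence in the reported location, which is consistent with its use for reads that map equally well to several places.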

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (0.1G to 1.1G).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
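Because an unmapped read carries its mate's chromosome, a per-chromosome tally can group reads by reference name and split them on the SAM unmapped bit (0x4). The sketch below assumes each record has been reduced to a hypothetical (chromosome, FLAG) tuple; it illustrates the idea and is not the archive's implementation.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped] per chromosome
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }


# Made-up records: flag 133 is an unmapped mate placed on its partner's chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
rates = per_chromosome_mapped_rate(records)
print(rates)  # prints: {'1': 0.75, 'X': 1.0}
```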

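[Chart: mapped vs unmapped reads per chromosome. Y-axis: 0% to 100%. Values listed in the order of the chart's chromosome labels.]

Chromosome  Mapped   Unmapped
1           99.91%   0.09%
2           99.89%   0.11%
3           99.91%   0.09%
4           99.91%   0.09%
5           99.91%   0.09%
6           99.91%   0.09%
7           99.91%   0.09%
8           99.91%   0.09%
9           99.9%    0.1%
10          99.9%    0.1%
11          99.9%    0.1%
12          99.91%   0.09%
13          99.91%   0.09%
14          99.9%    0.1%
15          99.9%    0.1%
16          99.9%    0.1%
17          99.89%   0.11%
18          99.91%   0.09%
19          99.89%   0.11%
20          99.89%   0.11%
21          99.87%   0.13%
X           99.9%    0.1%
Y           99.64%   0.36%
M           99.72%   0.28%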