European Genome-Phenome Archive

File Quality

File Information: EGAF00008208163

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. The expected shape is a high peak at the beginning (low coverage) followed by a descending curve.
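
As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value and pools the tail into a single ">1000" bin, as the chart's x-axis does. The function name `coverage_histogram` and the toy depth list are hypothetical; this is not the archive's own implementation.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Toy per-position depths (hypothetical, not taken from this file).
depths = [0, 0, 1, 30, 30, 31, 1500]
hist = coverage_histogram(depths)
```

Each key of `hist` corresponds to one x-axis value and each count to the height of that bar before log scaling.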

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect a distribution skewed towards larger (more confident) scores, typically around 40.
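
A minimal sketch of the conversion, using the standard Phred definition Q = -10 × log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)     # about 1 error in 10 000 calls
q30 = error_prob_to_phred(0.001)  # about 30
```

Under this definition, every 10 points of Phred score correspond to a tenfold drop in error probability.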

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 100G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
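
The calculation can be sketched as follows. The function name and the toy counts are hypothetical and are not this file's read counts.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total

# Toy counts (hypothetical): 998 mapped reads, 2 unmapped reads.
rate = mapped_rate(998, 2)
```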

[Chart: mapped reads. Mapped: 99.8 % (765 503 318 reads). Unmapped: 0.2 %.]

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here, whereas the Proper Pairs chart counts only pairs that map at the expected distance.

[Chart: both mates mapped. Both mates mapped: 99.7 % (764 597 888 reads). Remainder: 0.3 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
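
Assuming the alignments are stored in SAM/BAM format (an assumption, since the report does not state the file type), a singleton can be recognised from the standard SAM FLAG bits: the read is paired, is itself mapped, and its mate is unmapped. The helper name `is_singleton` is illustrative.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped

def is_singleton(flag):
    """True if the read is paired and mapped while its mate is unmapped."""
    return (
        bool(flag & PAIRED)
        and not (flag & UNMAPPED)
        and bool(flag & MATE_UNMAPPED)
    )

singleton = is_singleton(73)      # 73 = paired + mate unmapped + first in pair
both_mapped = is_singleton(99)    # 99 = paired + proper pair + mate reverse + first in pair
unmapped_mate = is_singleton(133) # 133 = paired + unmapped + second in pair
```

Here only flag 73 describes a singleton; flag 133 is the unmapped partner of such a read.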

[Chart: singletons. Singletons: 0.1 % (905 430 reads). Remainder: 99.9 %.]

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

[Chart: forward strand. Forward strand: 50 % (383 353 741 reads). Remainder: 50 %.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion or deletion, the two reads can map at a separation far from the expected one; this is a signal for the variant, and the reads would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).
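
The separation in the example above follows from simple arithmetic: the gap between the two reads is the fragment length minus both read lengths. A minimal sketch, with a hypothetical function name:

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates when reads come from both ends of a fragment."""
    return fragment_len - 2 * read_len

gap = expected_inner_distance(500, 100)  # 500 - 2 * 100
```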

[Chart: proper pairs. Proper pairs: 97.8 % (750 063 292 reads). Remainder: 2.2 %.]

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
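
The coordinate-based idea described above can be sketched as follows. This is a simplified illustration, not the method of any particular tool: the record layout (`chrom`, `pos`, `mate_chrom`, `mate_pos`) and the function name are hypothetical, and the first read seen at a position is kept while later ones are marked. Real duplicate-marking tools apply further criteria.

```python
def mark_duplicates(reads):
    """Return one bool per read: True if an earlier read shares its
    coordinates and its mate's coordinates."""
    seen = set()
    flags = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["mate_chrom"], read["mate_pos"])
        flags.append(key in seen)
        seen.add(key)
    return flags

# Toy reads (hypothetical): the second read repeats the first one's coordinates.
reads = [
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 450},
]
dup_flags = mark_duplicates(reads)
dup_rate = sum(dup_flags) / len(reads)
```

The third read starts at the same position as the first but its mate does not, so it is not marked. This shows why both mates' coordinates are compared.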

[Chart: duplicates. Duplicates: 4.7 % (35 899 781 reads). Remainder: 95.3 %.]

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (50M to 650M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
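
A per-chromosome tally along these lines can be sketched as below, assuming SAM-style records in which each read carries a reference name and a FLAG whose 0x4 bit marks it as unmapped. The input layout (a list of `(rname, flag)` tuples) and the function name are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: this read is unmapped

def per_chromosome_mapped_rate(records):
    """records: iterable of (reference name, SAM flag) tuples.
    Returns {reference name: fraction of its reads that are mapped}."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for rname, flag in records:
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {name: m / (m + u) for name, (m, u) in counts.items()}

# Toy records (hypothetical): the unmapped read (flag 133) is counted
# under chromosome "1" because it carries its mate's reference name.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_mapped_rate(records)
```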

[Chart: mapped vs unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%.]
Mapped (%), in x-axis order: 99.89, 99.87, 99.89, 99.89, 99.89, 99.89, 99.89, 99.89, 99.88, 99.88, 99.88, 99.89, 99.9, 99.88, 99.88, 99.88, 99.87, 99.89, 99.86, 99.87, 99.86, 99.88, 99.85, 99.52
Unmapped (%), in x-axis order: 0.11, 0.13, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.12, 0.12, 0.12, 0.11, 0.1, 0.12, 0.12, 0.12, 0.13, 0.11, 0.14, 0.13, 0.14, 0.12, 0.15, 0.48