European Genome-Phenome Archive

File Quality

File Information: EGAF00007836734

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
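As an illustration of how such a distribution could be tallied, the sketch below counts positions per coverage value and pools everything above a cap into a single overflow bin, similar to the ">1000" bin on the chart. The function name, the `cap` parameter and the toy depths are assumptions made for this example; the report does not state how the chart is actually computed.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed by cap + 1
    (an assumption of this sketch, mimicking a '>1000' bin).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Toy per-position depths (hypothetical, not taken from this file).
depths = [0, 1, 1, 2, 30, 30, 30, 1500, 2500]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{1000}" if value == 1001 else str(value)
    print(label, hist[value])
```

Plotting these counts with a logarithmic y-axis would give a chart of the same kind as the one described above.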

[Chart: base coverage distribution. x-axis: coverage value (0 to >1000); y-axis: # Bases (log scale, 1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. because of a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
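To make the score concrete, here is a minimal sketch of the conversion in both directions, using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly 1 error in 10,000 base calls,
# and a score of 20 to roughly 1 error in 100.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```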

[Chart: base quality distribution. x-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 120G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation the read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
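The calculation amounts to a single ratio. The sketch below spells it out with made-up counts; the function name and the numbers are assumptions for illustration and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 999 mapped reads and 1 unmapped read.
rate = mapped_rate(999, 1)
print(f"{rate:.1%}")
```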

Mapped reads: 99.9 % (888 284 215 reads); unmapped: 0.1 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included here; that criterion is covered by the Proper Pairs chart.

Both mates mapped: 99.8 % (887 560 700 reads); other reads: 0.2 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
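In SAM/BAM files these categories can be read off the FLAG field. The sketch below uses the standard SAM flag bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped); the function and category names are assumptions for this example, and the report does not state that it derives its figures this way.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag):
    """Classify a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # this read mapped, its mate did not
    return "both_mates_mapped"


# Example FLAG values: 99 and 147 are a typical mapped pair,
# 73 is a mapped read with an unmapped mate, 133 is that unmapped mate.
for flag in (99, 147, 73, 133):
    print(flag, classify_read(flag))
```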

Singletons: 0.1 % (723 515 reads); other reads: 99.9 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (444 563 014 reads); reverse strand: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion or deletion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
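The arithmetic behind the 500/100/300 base pair example in this section can be sketched as follows. The tolerance check is a simplification added for illustration: the function names and the tolerance value are assumptions, and real aligners decide what counts as "expected" from the observed insert size distribution.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: the fragment minus both reads."""
    return fragment_len - 2 * read_len


def is_expected_separation(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    return abs(observed - expected) <= tolerance


expected = expected_inner_distance(fragment_len=500, read_len=100)
print(expected)  # 300

# Hypothetical tolerance of 100 bp, chosen only for this example.
print(is_expected_separation(320, expected, tolerance=100))
print(is_expected_separation(5000, expected, tolerance=100))
```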

Proper pairs: 97.2 % (864 013 440 reads); not proper pairs: 2.8 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
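The coordinate-based idea described above can be sketched in a few lines. This is a deliberately simplified illustration: the tuple layout, the function names and the toy data are assumptions, and real duplicate-marking tools typically take more into account (for example strand and library).

```python
def mark_duplicates(pairs):
    """Flag pairs whose coordinates repeat an earlier pair.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples.
    Returns a list of booleans, True where the pair is a duplicate.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # first occurrence is kept as the original
        seen.add(key)
    return flags


def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates."""
    flags = mark_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0


# Toy data (hypothetical): the first key occurs three times.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),
    ("1", 100, "1", 450),
    ("2", 100, "2", 400),
    ("1", 100, "1", 400),
]
print(mark_duplicates(pairs))
print(duplicate_rate(pairs))
```

Because the key is only a pair of coordinates, two independent fragments that happen to share both positions are also flagged, which is the false-positive effect at high depth that the paragraph above describes.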

Duplicates: 51.2 % (455 226 556 reads); non-duplicates: 48.8 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero: reads originating from repetitive elements in the genome will plausibly map to multiple locations.
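One way to summarise such a distribution is to report the share of reads at score 0 and the share at or above a chosen threshold. The sketch below does this for a toy histogram; the function name, the threshold and the counts are assumptions for illustration, not values from this file.

```python
def mapq_summary(hist, threshold=60):
    """Return (fraction at MAPQ 0, fraction at MAPQ >= threshold).

    `hist` maps a mapping quality score to the number of reads with it.
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    at_zero = hist.get(0, 0) / total
    at_or_above = sum(n for q, n in hist.items() if q >= threshold) / total
    return at_zero, at_or_above


# Toy histogram (hypothetical counts).
hist = {0: 5, 20: 5, 60: 90}
at_zero, high = mapq_summary(hist)
print(at_zero, high)
```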

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 60); y-axis: # Reads (100M to 800M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
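A per-chromosome tally of this kind could be sketched as below, using the standard SAM flag bit 0x4 (read unmapped) and assuming, as described above, that an unmapped read carries the chromosome of its mate. The record layout, the function name and the toy records are assumptions for this example.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: read unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: fraction of its reads that are mapped}.

    `records` is an iterable of (chromosome, flag) tuples; an unmapped
    read is assumed to carry the chromosome of its mapped mate.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Toy records (hypothetical): three mapped and one unmapped read on
# chromosome 1, two mapped reads on X.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("X", 16)]
rates = per_chromosome_mapped_rate(records)
print(rates)
```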

Chromosome   Mapped   Unmapped
1            99.92%   0.08%
2            99.9%    0.1%
3            99.93%   0.07%
4            99.93%   0.07%
5            99.93%   0.07%
6            99.93%   0.07%
7            99.92%   0.08%
8            99.92%   0.08%
9            99.92%   0.08%
10           99.92%   0.08%
11           99.92%   0.08%
12           99.92%   0.08%
13           99.93%   0.07%
14           99.92%   0.08%
15           99.92%   0.08%
16           99.92%   0.08%
17           99.91%   0.09%
18           99.92%   0.08%
19           99.9%    0.1%
20           99.91%   0.09%
21           99.92%   0.08%
X            99.92%   0.08%
Y            99.94%   0.06%
M            99.76%   0.24%