European Genome-Phenome Archive

File Quality

File Information: EGAF00001558989

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: number of bases (log scale, 1k to 20M).]
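As a rough illustration of how a distribution like this can be derived, the sketch below counts how many positions have each coverage value, given a list of per-position depths. The function name `coverage_histogram`, the `cap` parameter and the toy depths are assumptions made for illustration and are not taken from the EGA pipeline; depths above the cap are pooled into one overflow bin, similar to the chart's `>1000` bin.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; depths above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # Pool very deep positions into a single overflow bin keyed as cap + 1.
        hist[min(depth, cap + 1)] += 1
    return dict(sorted(hist.items()))

# Toy per-position depths (hypothetical data, not from this file).
depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500, 2000]
print(coverage_histogram(depths))
```

Plotting the resulting counts against the coverage values, with a log-scaled y-axis, would give a chart of the kind described above.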

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. through a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0G to 80G).]
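To make the scale concrete, the following minimal sketch converts between Phred scores and error probabilities using the standard Phred definition, Q = -10 × log10(P). The function names are hypothetical and chosen only for this example.

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# Each step of 10 in the score divides the error probability by 10.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition, a score of 40 corresponds to roughly one wrong base call in 10 000.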

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapping rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.6 % (948 373 263 reads). Unmapped: 0.4 %.
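The calculation can be sketched as follows. The function name `mapping_rate` and the counts in the usage lines are hypothetical; in particular, the unmapped count is not reported on this page, so the numbers below are illustrative only.

```python
def mapping_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads")
    return mapped / total

# Hypothetical counts, not taken from this file.
rate = mapping_rate(mapped=996_000, unmapped=4_000)
print(f"{rate:.1%}")
```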

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart, by contrast, counts only pairs that map at the expected distance.

Both mates mapped: 99.3 % (945 346 742 reads). Remainder: 0.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, when the sample has a large insertion relative to the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls within the inserted sequence and cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (3 026 521 reads). Remainder: 99.7 %.
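One way to tell these categories apart, assuming the alignments are stored in SAM/BAM format (which this page does not state explicitly), is to inspect the bitwise FLAG field of each record. The sketch below uses the SAM flag bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped); the function name `classify_read` and its labels are hypothetical.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify_read(flag):
    """Label a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"

# 99  = paired, proper pair, mate on reverse strand, first in pair
# 73  = paired, mate unmapped, first in pair
# 133 = paired, read unmapped, second in pair
for flag in (99, 73, 133):
    print(flag, classify_read(flag))
```

Counting the `singleton` labels over all records and dividing by the total number of reads would give a singleton rate comparable to the one reported above, under these assumptions.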

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (476 154 138 reads). Remainder: 50 %.
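As an illustrative sketch, again assuming SAM-style FLAG values, the forward-strand fraction can be computed from the reverse-strand bit (0x10). Whether the reported percentage is taken over mapped reads or over all reads is not stated on this page; the hypothetical `forward_fraction` below assumes mapped reads only.

```python
UNMAPPED = 0x4  # this read is unmapped
REVERSE = 0x10  # read maps to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads whose reverse-strand bit (0x10) is unset."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: 99 and 163 are forward, 147 and 83 are reverse,
# 133 is unmapped and therefore ignored.
flags = [99, 147, 83, 163, 133]
print(forward_fraction(flags))
```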

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates may map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.2 % (925 601 880 reads). Remainder: 2.8 %.
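The arithmetic behind the ~300 base pair figure, together with a simplified distance check, can be sketched as below. Both function names and the `tolerance` parameter are hypothetical; real aligners typically estimate the expected fragment length distribution from the data and also check read orientation, which this sketch does not do.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_len, expected_len, tolerance):
    """True if the observed fragment length is within `tolerance` of the expected one."""
    return abs(observed_len - expected_len) <= tolerance

print(expected_gap(500, 100))                             # 500 - 2 * 100
print(within_expected_distance(520, 500, tolerance=100))  # close to expected
print(within_expected_distance(5000, 500, tolerance=100)) # far from expected
```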

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.

Duplicates: 6.3 % (59 769 073 reads). Non-duplicates: 93.7 %.
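The coordinate-based detection described above can be sketched in a few lines. This is a deliberately simplified illustration: the tuple layout and the name `mark_duplicates` are assumptions, and the sketch keeps the first read seen at each position, whereas production tools usually apply further criteria (for example, keeping the read with the best base qualities).

```python
def mark_duplicates(reads):
    """Return the indices of reads flagged as duplicates.

    Each read is a tuple (chrom, pos, strand, mate_chrom, mate_pos).
    The first read seen with a given tuple is kept; later reads with
    the same tuple are flagged.
    """
    seen = set()
    duplicates = []
    for index, read in enumerate(reads):
        if read in seen:
            duplicates.append(index)
        else:
            seen.add(read)
    return duplicates

# Hypothetical reads: the second is a duplicate of the first; the third
# shares the read position but its mate maps elsewhere, so it is kept.
reads = [
    ("1", 100, "+", "1", 400),
    ("1", 100, "+", "1", 400),
    ("1", 100, "+", "1", 450),
    ("2", 100, "+", "2", 400),
]
dups = mark_duplicates(reads)
print(dups, len(dups) / len(reads))
```

The example also shows why the rate rises with depth: the more reads are sampled, the more likely it is that two independent fragments share both coordinates by chance.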

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 70); y-axis: number of reads (100M to 800M).]
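A distribution like this is often summarised by the share of reads at a score of 0 and the share above some confidence threshold. The sketch below does this for a histogram keyed by mapping quality; the function name `mapq_summary`, the threshold of 30 and the toy histogram are all assumptions for illustration, not values taken from this file.

```python
def mapq_summary(hist, threshold=30):
    """Fractions of reads at mapping quality 0 and at or above `threshold`."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    ambiguous = hist.get(0, 0) / total
    confident = sum(n for q, n in hist.items() if q >= threshold) / total
    return ambiguous, confident

# Hypothetical histogram: mapping quality -> number of reads.
hist = {0: 40, 10: 10, 30: 50, 60: 900}
ambiguous, confident = mapq_summary(hist)
print(f"MAPQ 0: {ambiguous:.1%}, MAPQ >= 30: {confident:.1%}")
```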

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is similar to the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can map to the Y chromosome because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mapped mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (1 to 21, X, Y, M); y-axis: percentage of reads (0% to 100%); series: mapped, unmapped. Mapped percentages range from 99.4% to 99.7%, unmapped from 0.3% to 0.6%.]
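A per-chromosome breakdown of this kind can be sketched by tallying records by chromosome and by the SAM unmapped bit (0x4). The sketch assumes, as described above, that an unmapped read carries the chromosome of its mapped mate; the function name `per_chromosome_rates` and the toy records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # this read is unmapped

def per_chromosome_rates(records):
    """Percentage of mapped reads per chromosome.

    `records` holds (chrom, flag) pairs; an unmapped read is assumed to
    carry the chromosome of its mapped mate.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: 100 * m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records: flag 133 marks an unmapped read placed on
# chromosome 1 alongside its mapped mate (flag 73).
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
print(per_chromosome_rates(records))
```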