European Genome-Phenome Archive

File Quality

File Information

EGAF00001767437

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis gives the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis gives the number of bases with that coverage.

The data are plotted on a log scale so that the wide range of values stays readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value, 0 to >1000. Y-axis: number of bases, log scale from 1k to 100M.]
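As an illustration of what the chart above counts, the following is a minimal sketch of building a coverage histogram from per-position depths. The function name `coverage_histogram`, the `cap` parameter, and the sample depths are hypothetical; the pipeline that produced this report is not described on the page.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`,
    similar in spirit to the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Hypothetical per-position depths, not taken from this file.
depths = [0, 0, 0, 1, 1, 2, 5, 1500, 2300]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting these counts on a logarithmic y-axis keeps both the large low-coverage bins and the small high-coverage bins visible.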

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. because of a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score, 0 to 40. Y-axis: number of bases, 0G to 16G.]
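The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and each additional 10 points divides the error probability by 10.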

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.9 % (145 232 964 reads). Unmapped: 0.1 %.
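The rate calculation described above, mapped / (mapped + unmapped), can be written as a short sketch. The function name and the example counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 999 mapped reads and 1 unmapped read.
rate = mapped_rate(999, 1)
print(f"{rate:.1%}")
```

The parentheses in the denominator matter: without them the expression would read as mapped / mapped + unmapped, which is not a rate.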

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this chart also includes pairs whose mates do not map at the expected distance from each other (compare the Proper Pairs chart).

Both mates mapped: 99.9 % (145 165 672 reads). Remainder: 0.1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0 % (67 292 reads). Remainder: 100 %.
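One common way to derive these categories is from the FLAG field of each SAM/BAM record. The sketch below uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are hypothetical, and whether this report was computed this way is an assumption.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify(flag):
    """Assign a read to a mapping category based on its SAM flag."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


# Typical flag values: 99 (mapped, mate mapped), 73 (mapped, mate unmapped),
# 133 (unmapped, mate mapped).
for flag in (99, 73, 133):
    print(flag, classify(flag))
```

In this report the two mapped categories add up to the mapped total: 145 165 672 reads with both mates mapped plus 67 292 singletons gives 145 232 964 mapped reads.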

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (72 654 457 reads). Remainder: 50 %.
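A minimal sketch of how a forward-strand fraction could be computed from SAM flags, assuming the standard bits 0x10 (read mapped to the reverse strand) and 0x4 (read unmapped). The function name and the sample flags are hypothetical.

```python
UNMAPPED = 0x4  # read itself is unmapped
REVERSE = 0x10  # read is mapped to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads whose flag does not have the reverse bit set."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Hypothetical flags: two forward reads (99, 163), two reverse reads (147, 83)
# and one unmapped read (133) that is ignored.
print(forward_fraction([99, 147, 83, 163, 133]))
```

Unmapped reads are excluded because they have no strand on the reference.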

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates may map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 96.3 % (139 988 012 reads). Remainder: 3.7 %.
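The arithmetic in the example above (a ~500 base pair fragment with 100 base pair reads leaves ~300 base pairs between the mates) can be sketched as follows. The function names and the `tolerance` parameter are hypothetical; real aligners typically estimate the acceptable range from the observed insert size distribution rather than from a fixed tolerance.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len


def is_plausible_gap(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance


print(expected_gap(500, 100))                # 300, as in the text
print(is_plausible_gap(320, 500, 100, 50))   # close to the expected gap
print(is_plausible_gap(5000, 500, 100, 50))  # far from the expected gap
```

This sketch only covers the distance criterion; the orientation of the mates would need a separate check.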

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be considered when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 14 % (20 323 816 reads). Remainder: 86 %.
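The coordinate-based idea described above can be sketched in a few lines: a pair is flagged as a duplicate if an earlier pair had the same coordinates for both mates. This is a deliberately simplified illustration with hypothetical names and data; production duplicate-marking tools are more careful, for example about clipping and about which copy to keep.

```python
def mark_duplicates(pairs):
    """Flag each pair as a duplicate if an identical key was seen before.

    Each pair is a tuple (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    The first pair seen with a given key is kept; later ones are flagged.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


# Hypothetical pairs: the third has the same coordinates as the first.
pairs = [
    ("1", 1000, "+", "1", 1300, "-"),
    ("1", 1000, "+", "1", 1350, "-"),
    ("1", 1000, "+", "1", 1300, "-"),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)
```

The sketch also shows why deep sequencing inflates the rate: two independent fragments that happen to share both coordinates are indistinguishable from a true PCR duplicate under this rule.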

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score, 0 to 60. Y-axis: number of reads, 10M to 130M.]
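A distribution like this is often summarised as the share of reads at or above a chosen mapping quality. The sketch below assumes the histogram is available as a mapping from score to read count; the function name, the threshold, and the counts are hypothetical and are not read from the chart.

```python
def fraction_at_least(mapq_counts, threshold):
    """Share of reads whose mapping quality is >= threshold."""
    total = sum(mapq_counts.values())
    if total == 0:
        raise ValueError("empty histogram")
    good = sum(n for q, n in mapq_counts.items() if q >= threshold)
    return good / total


# Hypothetical histogram: most reads at 60, a few ambiguous reads at 0.
mapq_counts = {0: 5, 30: 5, 60: 90}
print(fraction_at_least(mapq_counts, 30))
print(fraction_at_least(mapq_counts, 60))
```

With the standard Phred scaling, a mapping quality of 60 corresponds to a probability of about one in a million that the read is placed incorrectly.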

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is then given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped vs. unmapped reads per chromosome (x-axis labels 1 to 21, X, Y, M). Y-axis: 0% to 100%. The mapped fraction ranges from 99.81% to 100% across the columns.]
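The per-chromosome breakdown can be sketched by tallying reads by their reference name and unmapped flag. The sketch assumes records are available as (RNAME, FLAG) pairs, that an unmapped read with a mapped mate carries its mate's RNAME, and that reads with no assigned chromosome have RNAME "*"; the function name and sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # read itself is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned, cannot be attributed
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Hypothetical records: three mapped reads and one unmapped read placed on
# chromosome 1 via its mate, one mapped read on X, one fully unplaced read.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
print(per_chromosome_mapped_rate(records))
```

Reads whose mate is also unmapped have no chromosome to inherit, so they do not appear in any column of a per-chromosome view.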