European Genome-Phenome Archive

File Quality

File Information: EGAF00003610270

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale to compress its wide range. The expected shape is a high peak at the beginning (low coverage) followed by a descending curve.

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 100 to 100M).]
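As a rough illustration of how such a distribution can be built, the sketch below counts how many positions share each coverage value. It assumes the per-position depths are already available as a list of integers. The function name, the `cap` parameter and the sample depths are hypothetical and are not taken from the report or from EGA's pipeline.

```python
from collections import Counter


def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    Coverage values above `cap` are pooled into one overflow bin,
    mirroring the ">1000" label on the chart's x-axis.
    """
    overflow_key = f">{cap}"
    histogram = Counter()
    for depth in depths:
        key = depth if depth <= cap else overflow_key
        histogram[key] += 1
    return histogram


# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 5, 5, 1500]
hist = coverage_distribution(example_depths)
```

Plotting the counts on a log scale, as the chart does, keeps the long tail of high-coverage positions visible next to the much larger low-coverage bins.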

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 14G). Non-zero bins, in order of increasing score: 398 568; 457 588 440; 726 783 554; 14 777 389 306.]
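The relationship between a Phred score and an error probability can be sketched in a few lines, using the standard Phred definition Q = -10 log10(P). The function names are illustrative and do not belong to any particular library.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability P into a Phred score Q = -10*log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


p_q40 = phred_to_error_probability(40)
q_from_p = error_probability_to_phred(0.001)
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.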

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore the mapped reads rate is not expected to reach 100%, although it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

[Chart: mapped reads. Mapped: 105 606 917 reads (99.9 %). Unmapped: 0.1 %.]
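The calculation described above is a simple ratio. The sketch below makes the parenthesisation explicit. The function name and the example counts are hypothetical, and the report does not give the unmapped count as an absolute number.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total


# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=999, unmapped=1)
```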

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also counted here; the expected-distance criterion is covered by the Proper Pairs chart.

[Chart: both mates mapped. Both mates mapped: 105 534 470 reads (99.8 %). Remainder: 0.2 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

[Chart: singletons. Singletons: 72 447 reads (0.1 %). Remainder: 99.9 %.]
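One way to tell singletons apart from reads with both mates mapped is to inspect the bitwise FLAG field of SAM/BAM records. The sketch below assumes the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The report does not say how EGA computes these categories, so the function and its labels are illustrative only.

```python
# Flag bits as defined in the SAM specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag):
    """Classify a read from its SAM FLAG value (labels are hypothetical)."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        # This read is mapped but its mate is not.
        return "singleton"
    return "both_mates_mapped"


labels = [classify_read(f) for f in (99, 73, 77, 0)]
```

In this report the singleton count (72 447) and the both-mates-mapped count (105 534 470) add up to the mapped read count (105 606 917), which is consistent with the two categories partitioning the mapped reads.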

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

[Chart: forward strand. Forward strand: 52 854 834 reads (50 %). Remainder: 50 %.]
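A minimal sketch of the forward-strand fraction, assuming SAM FLAG values as input and the flag bits from the SAM specification (0x4 read unmapped, 0x10 read mapped to the reverse strand). The function name and the sample flags are hypothetical.

```python
# Flag bits as defined in the SAM specification.
READ_UNMAPPED = 0x4
READ_REVERSE = 0x10


def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & READ_UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & READ_REVERSE)
    return forward / len(mapped)


# Hypothetical flags: two forward reads, two reverse reads, one unmapped read.
fraction = forward_fraction([99, 147, 83, 163, 77])
```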

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

[Chart: proper pairs. Proper pairs: 104 599 474 reads (98.9 %). Remainder: 1.1 %.]
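The arithmetic behind the ~300 base pair figure can be written out as follows. The tolerance check is an assumption added for illustration: the report does not state what range around the expected distance still counts as proper, and aligners usually estimate it from the data.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length


def within_expected(observed, expected, tolerance):
    """Hypothetical check: is the observed gap close to the expected one?"""
    return abs(observed - expected) <= tolerance


gap = expected_inner_distance(fragment_length=500, read_length=100)
looks_proper = within_expected(observed=320, expected=gap, tolerance=100)
looks_improper = within_expected(observed=5000, expected=gap, tolerance=100)
```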

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

[Chart: duplicates. Duplicates: 50 877 578 reads (48.1 %). Remainder: 51.9 %.]
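The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration, not the method used for this report: the tuple layout and function name are hypothetical, and production tools typically also take strand and orientation into account and apply a rule for choosing which copy to keep.

```python
def find_duplicates(pairs):
    """Return names of pairs whose coordinates repeat an earlier pair.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos).
    The first pair seen at a given set of coordinates is kept; later
    pairs with identical coordinates are reported as duplicates.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical read pairs, for illustration only.
example_pairs = [
    ("r1", "1", 1000, "1", 1400),
    ("r2", "1", 1000, "1", 1400),  # same coordinates as r1
    ("r3", "1", 1000, "1", 1450),  # same start, different mate position
]
dups = find_duplicates(example_pairs)
```

The sketch also shows why very deep sequencing inflates the rate: two independent fragments that happen to share both coordinates are indistinguishable from a PCR duplicate under this rule.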

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (10M to 100M). The series starts at 2 098 027 reads and ends at 101 037 407 reads.]
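To check how strongly a mapping quality distribution is skewed towards high scores, one can compute the fraction of reads at or above a chosen threshold. The sketch below assumes the distribution is available as a dictionary from score to read count. The function name, the threshold and the sample histogram are hypothetical.

```python
def fraction_at_or_above(histogram, threshold):
    """Fraction of reads whose mapping quality is >= threshold."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    passing = sum(n for score, n in histogram.items() if score >= threshold)
    return passing / total


# Hypothetical histogram {score: read count}, for illustration only.
example_hist = {0: 2, 30: 3, 60: 95}
well_mapped = fraction_at_or_above(example_hist, threshold=60)
```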

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

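[Chart: mapped vs unmapped reads per chromosome (stacked columns). X-axis labels as extracted: 1 to 21, X, Y, M. Y-axis: 0% to 100%.]
Mapped, in x-axis order: 99.94%, 99.93%, 99.94%, 99.94%, 99.94%, 99.94%, 99.93%, 99.93%, 99.93%, 99.93%, 99.93%, 99.94%, 99.94%, 99.94%, 99.93%, 99.92%, 99.93%, 99.94%, 99.92%, 99.92%, 99.93%, 99.93%, 99.95%, 99.83%
Unmapped, in x-axis order: 0.06%, 0.07%, 0.06%, 0.06%, 0.06%, 0.06%, 0.07%, 0.07%, 0.07%, 0.07%, 0.07%, 0.06%, 0.06%, 0.06%, 0.07%, 0.08%, 0.07%, 0.06%, 0.08%, 0.08%, 0.07%, 0.07%, 0.05%, 0.17%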
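The per-chromosome percentages behind a stacked chart like this one can be derived from mapped and unmapped counts. The sketch below assumes those counts are available per chromosome. The function name, the input layout and the sample counts are hypothetical and are not taken from the report.

```python
def mapped_percentages(counts):
    """Turn {chrom: (mapped, unmapped)} into {chrom: (mapped %, unmapped %)}."""
    result = {}
    for chrom, (mapped, unmapped) in counts.items():
        total = mapped + unmapped
        if total == 0:
            # No reads assigned to this chromosome.
            result[chrom] = (0.0, 0.0)
            continue
        mapped_pct = 100 * mapped / total
        result[chrom] = (round(mapped_pct, 2), round(100 - mapped_pct, 2))
    return result


# Hypothetical counts, for illustration only.
example_counts = {"1": (9994, 6), "X": (9995, 5), "M": (0, 0)}
percentages = mapped_percentages(example_counts)
```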