European Genome-Phenome Archive

File Quality

File Information: EGAF00005283036

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is shown on a log scale so that both very large and very small counts remain readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: number of bases (log scale, 1k to 100M).]
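As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions share each coverage value and pools everything above a cap into one bin, mirroring the ">1000" bin on the chart. The function name `coverage_distribution`, the `cap` parameter, and the toy depths are assumptions made for this example; they are not taken from the report or from any EGA tool.

```python
from collections import Counter

def coverage_distribution(depths, cap=1000):
    """Return {coverage value: number of positions}, pooling values above cap."""
    # cap + 1 stands in for the ">cap" bin (hypothetical convention for this sketch)
    hist = Counter(min(d, cap + 1) for d in depths)
    return dict(sorted(hist.items()))

# Toy per-position depths (illustrative only, not values from this file)
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2000]
dist = coverage_distribution(depths)
print(dist)
```

Plotting the resulting counts with a logarithmic y-axis would give a curve of the kind described above.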

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution should be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 10G).]
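The relationship between a Phred score and an error probability can be made concrete with a small conversion sketch. It uses the standard Phred definition, Q = -10 × log10(P); the function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (P must be > 0)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)     # about 1 error in 10,000 calls
q_milli = error_prob_to_phred(0.001)  # about 30
```

Under this definition, a score of 40 corresponds to roughly one wrong call in 10,000, and each additional 10 points divides the error probability by 10.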

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated but, for more significant variation, the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped reads: 115 806 318 (100 %). Remainder: 0 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included here; that distinction is covered by the Proper Pairs chart.

Both mates mapped: 115 802 670 (100 %). Remainder: 0 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 3 648 (0 %). Remainder: 100 %.
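The categories described so far (mapped, both mates mapped, singleton) can be derived from the bitwise FLAG field of SAM/BAM records. The sketch below uses the standard SAM flag bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function names, the toy flag values, and the choice of mapped reads as the denominator for the pair-level rates are assumptions for illustration; the report does not state how EGA computes these figures.

```python
from collections import Counter

# Standard SAM FLAG bits
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify(flag):
    """Assign a read to one mapping category based on its SAM FLAG."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

def mapping_summary(flags):
    """Compute rates; assumes at least one mapped read in the input."""
    counts = Counter(classify(f) for f in flags)
    total = sum(counts.values())
    mapped = total - counts["unmapped"]
    return {
        "mapped_rate": mapped / total,  # mapped / (mapped + unmapped)
        "both_mates_rate": counts["both_mates_mapped"] / mapped,
        "singleton_rate": counts["singleton"] / mapped,
    }

# Toy flags: 99 and 147 are a mapped pair, 73 is a singleton, 133 is its unmapped mate
summary = mapping_summary([99, 147, 73, 133])
```

The counts in this report are consistent with that decomposition: 115 802 670 reads with both mates mapped plus 3 648 singletons gives the 115 806 318 mapped reads.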

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 57 904 987 (50 %). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with an unexpected separation are a signal for that variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 114 719 304 (99.1 %). Remainder: 0.9 %.
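The separation arithmetic in the example above is simply the fragment length minus the two read lengths. The short sketch below works it through and also recomputes the proper pair percentage from the counts shown in this report. The function name is hypothetical, and using the mapped read count as the denominator is an assumption that happens to reproduce the displayed 99.1 %.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when each end of the fragment is read once."""
    return fragment_length - 2 * read_length

# ~500 bp fragment with 100 bp reads from either end
gap = expected_inner_distance(500, 100)

# Counts from this report; denominator (mapped reads) is an assumption
proper_pairs = 114_719_304
mapped_reads = 115_806_318
proper_pair_pct = round(100 * proper_pairs / mapped_reads, 1)
```

Here `gap` is 300, matching the ~300 base pairs quoted in the text, and `proper_pair_pct` comes out at 99.1.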

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 9 127 506 (7.9 %). Remainder: 92.1 %.
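A minimal sketch of the coordinate-based idea described above: a pair is flagged as a duplicate if an earlier pair had the same coordinates for both the read and its mate. The tuple layout, the function name, and the toy data are assumptions for this example. Production duplicate-marking tools apply further rules (for instance, choosing which copy to keep), so this is not how any particular tool works.

```python
def mark_duplicates(pairs):
    """Flag each pair as a duplicate if an identical coordinate key was seen before.

    pairs: iterable of (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Returns a list of booleans in input order.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # first occurrence is kept as the original
        seen.add(key)
    return flags

# Toy pairs (illustrative): the second repeats the first; the third differs in mate position
pairs = [
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 420, "-"),
]
dup_flags = mark_duplicates(pairs)
dup_rate = sum(dup_flags) / len(dup_flags)
```

The third pair shares a start coordinate with the first but is not flagged, because its mate maps elsewhere. At high depth, independent fragments increasingly collide on both coordinates, which is why the rate becomes harder to interpret.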

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 110M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (axis labels 1 to 21, X, Y, M). Mapped: 100% for every chromosome except the last (99.99%). Unmapped: 0% for every chromosome except the last (0.01%). Y-axis: 0% to 100%.]
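The per-chromosome tally can be sketched as follows, assuming each record is reduced to its reference name (RNAME) and SAM FLAG. An unmapped read that inherited its mate's chromosome still carries the 0x4 (unmapped) bit, so it is counted as unmapped for that chromosome; reads with no chromosome assigned (RNAME `*` in SAM) are skipped. The function name and toy records are assumptions for illustration only.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: read unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned, cannot be placed in any column
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}

# Toy records: a mapped pair on chr 1; a singleton on M with its unmapped mate
# placed at the same chromosome; one fully unplaced read
records = [("1", 99), ("1", 147), ("M", 73), ("M", 133), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```

In this toy input, chromosome 1 is fully mapped, while M shows a 50% mapped fraction because the unmapped mate is attributed to its partner's chromosome.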