European Genome-Phenome Archive

File Quality

File Information: EGAF00000644570

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of bases (positions in the reference file) covered at that value.

The data are plotted on a log scale to compress the wide range of values. The expected shape is a high peak at the start (low coverage) followed by a descending curve.
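
As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value. The per-position depth list, the function name coverage_distribution, and the cap parameter (mirroring the chart's ">1000" bin) are assumptions made for illustration; they are not taken from the archive's own pipeline.

```python
from collections import Counter
from typing import Dict, Iterable, Union


def coverage_distribution(
    depths: Iterable[int], cap: int = 1000
) -> Dict[Union[int, str], int]:
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into a single ">cap" bin.
    """
    overflow_label = f">{cap}"
    hist: Counter = Counter()
    for depth in depths:
        key = depth if depth <= cap else overflow_label
        hist[key] += 1
    return dict(hist)


# Hypothetical per-position depths, not taken from the report.
example_depths = [0, 1, 1, 5, 2000, 3000]
print(coverage_distribution(example_depths))
```

Plotting the resulting counts on a log-scaled y-axis would give a chart of the kind described above.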

[Chart: base coverage distribution. X-axis: Coverage value (up to >1000). Y-axis: # Bases (log scale, 100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
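
The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative choices and do not come from the report.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print the error probability implied by a few common scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4g}")
```

Under this definition, a score of 40 corresponds to roughly one expected error in 10 000 base calls.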

[Chart: base quality distribution. X-axis: Phred quality score (0 to 45). Y-axis: # Bases (0M to 800M).]

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
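
A minimal sketch of this calculation is shown below. The function name and the example counts are hypothetical and chosen only to illustrate the formula; they are not the counts behind this report.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 991 mapped reads and 9 unmapped reads.
print(f"{mapped_rate(991, 9):.1%}")
```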

Mapped reads: 55 118 788 (99.1 %); unmapped: 0.9 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads that are not mapped at the expected distance are also included, as in the proper pairs chart.

Both mates mapped: 54 890 756 (98.7 %); remainder: 1.3 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
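
The categories above can be illustrated with the FLAG field of SAM/BAM alignment records. The sketch below assumes the reads come from such a file and that the rate is taken over all reads; the report does not state how its figures are actually computed. The bit values are those defined in the SAM specification, while the function names and example flags are hypothetical.

```python
from collections import Counter
from typing import Iterable

# FLAG bits as defined in the SAM format specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag: int) -> str:
    """Assign a read to one mapping category based on its SAM FLAG."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


def singleton_rate(flags: Iterable[int]) -> float:
    """Fraction of all reads that are singletons (assumed denominator)."""
    counts = Counter(classify_read(f) for f in flags)
    total = sum(counts.values())
    return counts["singleton"] / total if total else 0.0


# Hypothetical flags: 99 and 147 form a fully mapped pair;
# 73 is a singleton and 133 is its unmapped mate.
print(singleton_rate([99, 147, 73, 133]))
```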

Singletons: 228 032 (0.4 %); remainder: 99.6 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 27 819 622 (50 %); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads may map with a separation that differs markedly from the expected one; this is a signal for the variant, and the reads would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
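
The arithmetic behind the ~300 base pair figure can be sketched as below. The function names and the tolerance parameter are hypothetical; aligners decide what counts as a proper pair using their own criteria, which are not reproduced here.

```python
def expected_gap(fragment_len: int, read_len: int) -> int:
    """Unsequenced distance between two mates read from both fragment ends."""
    return fragment_len - 2 * read_len


def gap_is_expected(observed_gap: int, fragment_len: int,
                    read_len: int, tolerance: int) -> bool:
    """Check whether an observed gap lies within a chosen tolerance
    of the expected gap (the tolerance is an illustrative parameter)."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance


# Values from the text: ~500 bp fragments with 100 bp reads from either end.
print(expected_gap(500, 100))
```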

Proper pairs: 54 759 454 (98.4 %); remainder: 1.6 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
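
The coordinate-based idea described above can be sketched as follows. This is a simplified illustration under stated assumptions: each pair is reduced to a hypothetical (chromosome, position, mate position, strand) record, and every pair after the first with identical coordinates is counted as a duplicate. Real duplicate-marking tools apply additional rules that are not modelled here.

```python
from typing import Iterable, NamedTuple


class PairKey(NamedTuple):
    """Hypothetical minimal description of a mapped read pair."""
    chrom: str
    pos: int
    mate_pos: int
    strand: str


def duplicate_rate(pairs: Iterable[PairKey]) -> float:
    """Fraction of pairs whose coordinates were already seen earlier."""
    seen = set()
    duplicates = 0
    total = 0
    for pair in pairs:
        total += 1
        if pair in seen:
            duplicates += 1  # same coordinates as an earlier pair
        else:
            seen.add(pair)
    return duplicates / total if total else 0.0


# Hypothetical pairs: the first record appears three times in total.
example = [
    PairKey("1", 100, 400, "+"),
    PairKey("1", 100, 400, "+"),
    PairKey("1", 150, 450, "+"),
    PairKey("1", 100, 400, "+"),
]
print(duplicate_rate(example))
```

Because the check relies only on coordinates, independent fragments that happen to share the same start positions are also counted, which is the depth-dependent effect noted in the text.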

Duplicates: 3 330 299 (6 %); remainder: 94 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (5M to 40M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (labels 1 to 21, X, Y, M). Y-axis: 0% to 100%. Every column shows 100% mapped and 0% unmapped.]