European Genome-Phenome Archive

File Quality

File Information

EGAF00000643852

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of bases (positions in the reference file) covered that many times.

The y-axis uses a log scale so that very large and very small counts can be read on the same chart. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: # bases (log scale, 100 to 100M).]
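As an illustrative sketch (not the archive's own pipeline), a distribution like the one above can be built from a list of per-position depths by counting how many positions share each coverage value. The `depths` input, the function name, and the `cap` of 1000 (chosen to mirror the chart's ">1000" bin) are assumptions made for this example.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into a single bin keyed as cap + 1,
    mirroring the chart's ">1000" bucket (an assumption for this sketch).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)


# Hypothetical per-position depths, not taken from the file above.
example = coverage_histogram([0, 1, 1, 2, 1500, 3000], cap=1000)
```

Plotting the counts on a log-scaled y-axis then gives the shape described above.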

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 50). Y-axis: # bases (0M to 600M).]
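To make the scale concrete, here is a minimal sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10·log10(P). The function names are hypothetical and chosen only for this example.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q into the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) into Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 20 to 1 in 100.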

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.2 % (50 708 055 reads). Unmapped: 0.8 %.
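The calculation described above, mapped / (mapped + unmapped), can be sketched as follows. The function name and the counts in the example are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return the percentage of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return 100.0 * mapped / total


# Hypothetical counts for illustration only.
rate = mapped_rate(mapped=992, unmapped=8)
```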

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 98.9 % (50 549 608 reads). Remainder: 1.1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (158 447 reads). Remainder: 99.7 %.
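In alignment files that follow the SAM format, the distinction between singletons and pairs with both mates mapped can be read from the FLAG field. The sketch below uses the standard SAM flag bits 0x1 (read is paired), 0x4 (read unmapped), and 0x8 (mate unmapped). The function name and category labels are hypothetical, and this is not necessarily how the figures above were produced.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify_pair_status(flag):
    """Classify a read by its SAM FLAG value (labels are illustrative)."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"
```

For example, a FLAG of 73 (0x1 + 0x8 + 0x40) describes a mapped first mate whose partner is unmapped, so it is classified as a singleton.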

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (25 557 647 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for such a variant, and those reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.8 % (50 485 716 reads). Remainder: 1.2 %.
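The arithmetic in the example above (a ~500 base pair fragment with a 100 base pair read from each end leaves ~300 base pairs between the reads) can be written out as a small sketch. The function names and the `tolerance` parameter are hypothetical; real aligners decide what counts as a proper pair using their own estimate of the fragment size distribution.

```python
def expected_inner_distance(fragment_length, read_length):
    """Distance between the two reads: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length


def is_within_expected_distance(observed, expected, tolerance):
    """Check whether an observed separation is within `tolerance` of the expected one."""
    return abs(observed - expected) <= tolerance


# Values from the example in the text: ~500 bp fragments, 100 bp reads.
inner = expected_inner_distance(fragment_length=500, read_length=100)
```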

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 6.4 % (3 289 699 reads). Remainder: 93.6 %.
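The coordinate-based idea described above can be sketched in a few lines: pairs whose read and mate coordinates are identical are grouped, and every pair beyond the first in a group is counted as a duplicate. This is a simplified illustration with a hypothetical tuple layout. Production tools such as Picard MarkDuplicates or samtools markdup apply more elaborate criteria.

```python
from collections import Counter


def count_duplicate_pairs(pairs):
    """Count pairs that share coordinates with an earlier pair.

    `pairs` is an iterable of hypothetical tuples:
    (chrom, start, strand, mate_chrom, mate_start).
    """
    groups = Counter(pairs)
    return sum(n - 1 for n in groups.values())


# Hypothetical pairs: the first two share identical coordinates.
pairs = [
    ("1", 1000, "+", "1", 1300),
    ("1", 1000, "+", "1", 1300),
    ("1", 1000, "+", "1", 1350),
    ("2", 500, "-", "2", 200),
]
duplicates = count_duplicate_pairs(pairs)
duplicate_rate = 100.0 * duplicates / len(pairs)
```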

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (up to 40M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position, but it is also classified as unmapped.

Chromosome   Mapped    Unmapped
1            99.71%    0.29%
2            99.69%    0.31%
3            99.81%    0.19%
4            99.4%     0.6%
5            99.75%    0.25%
6            99.82%    0.18%
7            99.66%    0.34%
8            99.74%    0.26%
9            99.5%     0.5%
10           99.61%    0.39%
11           99.79%    0.21%
12           99.84%    0.16%
13           99.78%    0.22%
14           99.8%     0.2%
15           99.6%     0.4%
16           99.61%    0.39%
17           99.69%    0.31%
18           99.72%    0.28%
19           99.79%    0.21%
20           99.77%    0.23%
21           99.62%    0.38%
X            99.64%    0.36%
Y            99.09%    0.91%
M            99.58%    0.42%