European Genome-Phenome Archive

File Quality

File Information: EGAF00003616293

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The data are shown on a logarithmic scale to compress the wide range of values. The expected shape is a high peak at the beginning (low coverage) followed by a descending curve.
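
As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions have each coverage value and pools everything above a cap into one overflow bin, similar to a ">1000" category. The function name, the cap and the input depths are hypothetical; this is not the archive's own pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed by cap + 1,
    mirroring a ">1000" category on the x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths for a tiny reference.
depths = [0, 3, 3, 4, 4, 4, 5, 1200, 5000]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a logarithmic y-axis keeps both the tall low-coverage peak and the long sparse tail visible.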

[Chart: number of bases (log scale, 1k to 100M) by coverage value (0 to >1000)]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
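
To make the scale concrete, here is a minimal sketch of the standard Phred conversion, Q = -10 log10(P), in both directions. The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability, so a score of 40 means roughly one expected error in 10 000 calls.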

[Chart: number of bases (0G to 18G) by Phred quality score (0 to 40)]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
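
The calculation is a simple proportion of mapped reads over all reads. The sketch below uses made-up counts chosen only to illustrate the arithmetic; they are not the counts for this file.

```python
def mapping_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Hypothetical counts, for illustration only.
rate = mapping_rate(mapped=989, unmapped=11)
print(f"{rate:.1%}")
```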

Mapped: 175 797 760 reads (98.9 %); unmapped: 1.1 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance (see the Proper Pairs chart).

Both mates mapped: 175 621 576 reads (98.8 %); remainder: 1.2 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
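
In SAM/BAM files, the distinction between "both mates mapped" and "singleton" can be read from the FLAG field. The sketch below uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function and category names are hypothetical, and how this report derives its own counts is an assumption.

```python
PAIRED = 0x1         # read is part of a pair
READ_UNMAPPED = 0x4  # this read did not map
MATE_UNMAPPED = 0x8  # the other mate did not map


def classify_read(flag):
    """Classify one read from its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"  # mapped, but its mate is not
    return "unmapped"


for flag in (99, 73, 133, 77):
    print(flag, classify_read(flag))
```

Here flag 73 is a mapped read whose mate is unmapped (the singleton), and flag 133 is that unmapped mate.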

Singletons: 176 184 reads (0.1 %); remainder: 99.9 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
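
Strand is recorded in the SAM FLAG field as bit 0x10 (read mapped to the reverse strand). The sketch below computes a forward-strand fraction from a list of flags. Using mapped reads as the denominator is an assumption here, as is the function name.

```python
READ_UNMAPPED = 0x4    # this read did not map
REVERSE_STRAND = 0x10  # read aligned to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & READ_UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE_STRAND)
    return forward / len(mapped)


# Hypothetical flags: two forward, two reverse, one unmapped.
print(forward_fraction([99, 147, 0, 16, 4]))
```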

Forward strand: 88 844 151 reads (50 %); reverse strand: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, the mates can map at an unexpected separation (for example, much farther apart across a large deletion). This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
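
The expected separation in the example above is plain arithmetic: fragment length minus the two read lengths. The sketch below works it through and adds a simplified distance check. The tolerance is a hypothetical parameter, and real aligners decide the proper-pair flag (SAM FLAG bit 0x2) from their own insert-size model and read orientation, which this sketch does not reproduce.

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced bases between two mates read from each fragment end."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed_gap, expected, tolerance):
    """Simplified distance check; orientation is ignored here."""
    return abs(observed_gap - expected) <= tolerance


gap = expected_gap(fragment_length=500, read_length=100)
print(gap)
print(within_expected_distance(320, gap, tolerance=100))
print(within_expected_distance(5000, gap, tolerance=100))
```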

Proper pairs: 171 593 774 reads (96.6 %); remainder: 3.4 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
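
The coordinate-based idea can be sketched as follows: pairs that share the same positions and strands for both mates are grouped, the first one seen is kept, and the rest are flagged. This is a deliberate simplification under stated assumptions; the tuple layout and function name are hypothetical, and production tools apply further rules (for example, choosing which copy to keep by base quality).

```python
def mark_duplicates(pairs):
    """Return the indices of pairs flagged as duplicates.

    Each pair is a tuple:
    (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    The first pair seen at a given set of coordinates is kept.
    """
    seen = set()
    duplicates = []
    for index, key in enumerate(pairs):
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


# Hypothetical pairs; the third repeats the first.
pairs = [
    ("1", 1000, "+", "1", 1400, "-"),
    ("1", 1000, "+", "1", 1450, "-"),
    ("1", 1000, "+", "1", 1400, "-"),
]
dups = mark_duplicates(pairs)
print(dups, len(dups) / len(pairs))
```

Note that the second pair starts at the same position as the first but has a different mate position, so it is not flagged.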

Duplicates: 2 939 793 reads (1.7 %); remainder: 98.3 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
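
One way to check the expected skew is to summarise a mapping quality histogram by two numbers: the share of reads at score 0 (ambiguous placement) and the share at or above a high threshold. The sketch below does this for a made-up histogram; the function name, the threshold default and the counts are hypothetical.

```python
def mapq_summary(hist, threshold=60):
    """Return (fraction at score 0, fraction at or above threshold).

    `hist` maps a mapping quality score to a read count.
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    at_zero = hist.get(0, 0) / total
    high = sum(n for q, n in hist.items() if q >= threshold) / total
    return at_zero, high


# Hypothetical histogram: most reads at the top score, a few at zero.
at_zero, high = mapq_summary({0: 5, 30: 5, 60: 90})
print(at_zero, high)
```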

[Chart: number of reads (20M to 140M) by Phred mapping quality score (0 to 60)]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read should have the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
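
A per-chromosome tally along these lines can be sketched from (chromosome, FLAG) pairs: each read is counted under the chromosome it carries, and the SAM FLAG bit 0x4 decides whether it counts as unmapped. Reads with no chromosome at all (RNAME `*` in SAM) are skipped. The function name and records are hypothetical, and this is not necessarily how the chart itself was computed.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # this read did not map


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chrom, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue  # read has no placement at all
        counts[chrom][1 if flag & READ_UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Hypothetical records; flag 133 is an unmapped read placed with its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
print(per_chromosome_mapped_rate(records))
```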

[Chart: percentage of mapped vs unmapped reads per chromosome (x-axis labels 1 to 21, X, Y, M; y-axis 0% to 100%). Mapped values range from 99.88% to 99.93%; unmapped values range from 0.07% to 0.12%.]