European Genome-Phenome Archive

File Quality

File Information: EGAF00004855695

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The data is plotted on a log scale so that very large and very small counts remain readable on the same chart. A high peak at the beginning (low coverage) followed by a descending curve is expected.
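
As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions share each coverage value and pools everything above a cap into one bin, as the chart's final ">1000" bin does. The function name, the `cap` parameter and the toy depths are assumptions for illustration, not part of the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Toy per-position depths (hypothetical, not taken from this file)
depths = [0, 1, 1, 2, 2, 2, 30, 1500]
hist = coverage_histogram(depths)
```

Plotting the resulting counts on a log-scaled y-axis would give a chart of the same shape as the one described above.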

[Chart: Base Coverage Distribution. X-axis: coverage value (up to >1000). Y-axis: number of bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error has occurred. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
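
The relationship between score and error probability can be checked with a few lines of code. This is a minimal sketch using the standard Phred definition Q = -10·log10(P); the function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)    # about 1 error in 10 000 calls
q30 = error_prob_to_phred(0.001) # about 30
```

A score of 40 therefore corresponds to roughly one wrong call in ten thousand, and a score of 20 to one in a hundred.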

[Chart: Base Quality. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 16G).]

Mapped Reads

Number of reads in the sample that successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
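
The calculation above is simple enough to sketch directly. The function name and the toy counts below are illustrative assumptions, not values from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Toy counts (hypothetical): 997 mapped reads, 3 unmapped reads
rate = mapped_rate(997, 3)
```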

Mapped: 99.7 % (165 524 010 reads). Unmapped: 0.3 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this fraction also includes pairs whose mates do not map at the expected distance; those pairs are distinguished in the Proper Pairs chart.

Both mates mapped: 99.3 % (165 009 710 reads). Remainder: 0.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
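
In SAM/BAM files, the categories above can be read off the FLAG field: bit 0x1 marks a paired read, 0x4 an unmapped read and 0x8 an unmapped mate. The sketch below classifies a read from its flag; the function name and category labels are illustrative, and it is not a description of how the EGA report itself is computed.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify a read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # mapped, but its mate is not
    return "unmapped"

labels = [classify(f) for f in (99, 73, 133, 77)]
```

Here flag 73 (paired, mate unmapped, first in pair) is the singleton case described in the text.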

Singletons: 0.3 % (514 300 reads). Remainder: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them should map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
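
In SAM/BAM files, strand is recorded in FLAG bit 0x10 (read mapped to the reverse strand). A minimal sketch of the forward-strand fraction, assuming it is computed over mapped reads only (an assumption; the report does not state its denominator):

```python
UNMAPPED = 0x4  # this read is unmapped
REVERSE = 0x10  # read maps to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads whose reverse-strand bit is not set."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Toy flags (hypothetical): two forward, two reverse, one unmapped
fraction = forward_fraction([0, 16, 99, 147, 4])
```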

Forward strand: 50 % (83 047 965 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map at an unexpected separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
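
The ~300 base pair figure in the example above is the fragment length minus the two read lengths. A small sketch of that arithmetic follows; the function name is illustrative.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

gap = expected_inner_distance(500, 100)  # the example from the text
```

A negative result would mean the two reads overlap, which happens when fragments are shorter than twice the read length.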

Proper pairs: 97.9 % (162 556 558 reads). Remainder: 2.1 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.
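
The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration: real duplicate-marking tools typically also take read orientation into account and choose which copy to keep by quality, and the tuple layout used here is an assumption.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates were already seen.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples.
    The first pair at a given set of coordinates is kept as the original.
    """
    seen = set()
    is_duplicate = []
    for key in pairs:
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

# Toy pairs (hypothetical coordinates)
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same read and mate positions: duplicate
    ("1", 100, "1", 420),  # same read position, different mate: kept
    ("2", 100, "2", 400),
]
dups = mark_duplicates(pairs)
duplicate_rate = sum(dups) / len(dups)
```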

Duplicates: 4.4 % (7 336 299 reads). Remainder: 95.6 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
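
One way to summarise such a distribution is the share of reads at or above a chosen mapping-quality threshold. The sketch below assumes the histogram is available as a score-to-count mapping; the function name, the threshold and the toy counts are illustrative and not taken from this file.

```python
def fraction_at_or_above(hist, threshold):
    """Share of reads whose mapping quality is >= threshold."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    good = sum(count for score, count in hist.items() if score >= threshold)
    return good / total

# Toy histogram (hypothetical): mapping quality -> number of reads
hist = {0: 5, 30: 5, 60: 90}
well_mapped = fraction_at_or_above(hist, 30)
```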

[Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (up to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
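
The per-chromosome percentages in such a chart follow from the mapped and unmapped counts for each chromosome. A minimal sketch follows; the data layout and the toy counts are assumptions, not values from this file.

```python
def per_chromosome_rates(counts):
    """Return {chrom: (mapped %, unmapped %)} from {chrom: (mapped, unmapped)}."""
    rates = {}
    for chrom, (mapped, unmapped) in counts.items():
        total = mapped + unmapped
        if total == 0:
            rates[chrom] = (0.0, 0.0)  # no reads assigned to this chromosome
        else:
            rates[chrom] = (100 * mapped / total, 100 * unmapped / total)
    return rates

# Toy counts (hypothetical)
counts = {"1": (9967, 33), "X": (9972, 28)}
rates = per_chromosome_rates(counts)
```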

[Chart: Mapped vs Unmapped, stacked columns per chromosome (autosomes, X, Y, M). Y-axis: 0% to 100%. Mapped ranges from 99.59% to 99.79% per chromosome; unmapped from 0.21% to 0.41%.]