European Genome-Phenome Archive

File Quality

File Information: EGAF00000643308

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.
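As an illustration of how such a distribution can be tabulated, here is a minimal Python sketch. It assumes the per-position depths are already available as a list of integers. The function name `coverage_histogram` and the `cap` parameter are hypothetical and are not part of the EGA pipeline. Depths above `cap` are pooled into a single overflow bin, mirroring the ">1000" bin on the chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths greater than `cap` are pooled into one overflow bin,
    stored under the key `cap + 1` (read it as ">cap").
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical depths for seven reference positions.
example = coverage_histogram([0, 0, 1, 3, 3, 3, 2000])
for coverage, n_bases in sorted(example.items()):
    label = ">1000" if coverage == 1001 else str(coverage)
    print(label, n_bases)
```

Plotting the counts on a log-scaled y-axis then gives a chart of the kind described above.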

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases (log scale, 100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.
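The relationship between a Phred score and an error probability can be sketched in a few lines of Python. The sketch uses the standard Phred definition, Q = -10 log10(P); the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly 1 error in 10,000 calls,
# a score of 20 to roughly 1 error in 100.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```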

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0M to 800M).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
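The calculation can be written out as a short Python sketch. The function name and the counts in the example are hypothetical; they are not the counts of this file.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 983 mapped and 17 unmapped reads.
rate = mapped_rate(983, 17)
print(f"{rate:.1%}")
```

Note the parentheses around the denominator: without them, mapped/mapped+unmapped would evaluate to 1 + unmapped.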

Mapped: 98.3 % (46 937 884 reads). Unmapped: 1.7 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart.

Both mates mapped: 97.7 % (46 665 860 reads). Other: 2.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
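One way to tell these cases apart is to inspect the FLAG field of each alignment record. The sketch below assumes the reads are stored in SAM/BAM format with the standard FLAG bits; the source does not state how EGA computes its statistics, and the function name and category labels are illustrative.

```python
# Standard SAM FLAG bits used in this sketch.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify_pair_status(flag):
    """Classify one read from its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate is unmapped
    return "unmapped"


print(classify_pair_status(99))   # paired, both mates mapped
print(classify_pair_status(73))   # paired, mate unmapped
print(classify_pair_status(133))  # paired, read itself unmapped
```

Counting the reads in each category and dividing by the total number of reads would give rates comparable to the ones reported in these charts.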

Singletons: 0.6 % (272 024 reads). Other: 99.4 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
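A minimal sketch of this check in Python, again assuming standard SAM FLAG bits. It also assumes the fraction is taken over mapped reads only, which the text above does not specify. The function name is hypothetical.

```python
UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped
REVERSE = 0x10  # SAM FLAG bit: read maps to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads that map to the forward strand.

    Assumption: unmapped reads are excluded from the denominator.
    """
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Hypothetical flags: two forward reads, two reverse reads, one unmapped.
print(forward_fraction([0, 16, 0, 16, 4]))
```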

Forward strand: 50 % (23 873 880 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpectedly large separation are a signal for that variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
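The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. The Python sketch below works through it and adds a simple distance check. The `tolerance` parameter and both function names are assumptions made for illustration; real aligners estimate the acceptable range from the observed insert size distribution.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len


def at_expected_distance(observed_gap, fragment_len, read_len, tolerance=100):
    """True if the observed gap is within `tolerance` bp of the expectation.

    `tolerance` is a hypothetical cut-off chosen for this example.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance


print(expected_inner_distance(500, 100))     # gap for the example above
print(at_expected_distance(320, 500, 100))   # close to expectation
print(at_expected_distance(5000, 500, 100))  # far larger than expected
```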

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.6 % (46 605 324 reads). Other: 2.4 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
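The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the method of any particular tool: it keys only on the mapping coordinates of a read and its mate, whereas production duplicate markers also consider details such as strand and which copy to keep. The function name and the tuple layout are assumptions.

```python
def mark_duplicates(pairs):
    """Flag read pairs whose coordinates were already seen.

    `pairs` is an iterable of (chrom, pos, mate_chrom, mate_pos) tuples.
    Returns one boolean per pair; the first pair at a given set of
    coordinates is kept (False), later ones are flagged (True).
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


# Hypothetical pairs: the third repeats the coordinates of the first.
pairs = [
    ("1", 1000, "1", 1300),
    ("1", 1005, "1", 1310),
    ("1", 1000, "1", 1300),
    ("2", 500, "2", 820),
]
flags = mark_duplicates(pairs)
print(flags)
print(sum(flags) / len(flags))  # duplicate rate for this toy input
```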

Duplicates: 2.8 % (1 316 212 reads). Other: 97.2 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (5M to 35M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

Chromosome   Mapped    Unmapped
1            99.3%     0.7%
2            99.41%    0.59%
3            99.63%    0.37%
4            99.13%    0.87%
5            99.58%    0.42%
6            99.62%    0.38%
7            99.3%     0.7%
8            99.44%    0.56%
9            99.16%    0.84%
10           99.25%    0.75%
11           99.56%    0.44%
12           99.68%    0.32%
13           99.63%    0.37%
14           99.64%    0.36%
15           99.27%    0.73%
16           99.16%    0.84%
17           99.45%    0.55%
18           99.51%    0.49%
19           99.58%    0.42%
20           99.61%    0.39%
21           99.34%    0.66%
X            99.36%    0.64%
Y            92.29%    7.71%
M            99.36%    0.64%