European Genome-Phenome Archive

File Quality

File Information: EGAF00002308469

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
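As an illustration of how such a distribution can be tabulated, here is a minimal Python sketch. It assumes per-position depths are already available as a list of integers. The function name `coverage_histogram` and the sample depths are made up for this example, and the overflow bin mirrors the ">1000" bucket on the chart. It is not the archive's own pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed by cap + 1,
    similar to the ">1000" bucket on the chart.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Made-up per-position depths, for illustration only.
depths = [0, 1, 1, 2, 30, 30, 30, 1500]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{cap_label}" if (cap_label := 1000) < value else str(value)
    print(label, hist[value])
```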

Chart: Base Coverage Distribution (x-axis: Coverage value, up to >1000; y-axis: # Bases, log scale from 1k to 100M).

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
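The relationship between score and error probability can be checked with a short Python sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Each step of 10 in Q lowers the error probability by a factor of 10.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition, a score of 40 corresponds to roughly one wrong call in 10 000.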

Chart: Base Quality (x-axis: Phred quality score, 0 to 40; y-axis: # Bases, 0G to 14G).

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads over the total number of reads: mapped / (mapped + unmapped).
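The rate calculation described above can be sketched in a few lines of Python. The function name and the counts in the usage line are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
print(f"{mapped_rate(978, 22):.1%}")
```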

Mapped reads: 97.8 % (141 580 817 reads); remainder: 2.2 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here, in contrast to the Proper Pairs chart.

Both mates mapped: 97.4 % (140 913 780 reads); remainder: 2.6 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
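In SAM/BAM files this distinction is encoded in the FLAG field. The following Python sketch classifies a read from its flag, using the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function name and category labels are made up for this example, and the archive's exact counting rules are not stated here.

```python
PAIRED = 0x1         # read is part of a pair
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # the other mate is unmapped

def classify_read(flag):
    """Return a coarse mapping category for one SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"   # mapped, but its mate is not
    return "both_mapped"

for flag in (99, 73, 133):
    print(flag, classify_read(flag))
```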

Singletons: 0.5 % (667 037 reads); remainder: 99.5 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
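A minimal Python sketch of this fraction, assuming the input is a list of SAM FLAG values. The reverse-strand bit (0x10) and the unmapped bit (0x4) come from the SAM specification; the function name and sample flags are illustrative only.

```python
READ_UNMAPPED = 0x4  # this read is unmapped
REVERSE = 0x10       # read maps to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that lie on the forward strand."""
    mapped = [f for f in flags if not f & READ_UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Two forward reads (0, 99), two reverse reads (16, 147), one unmapped (4).
print(forward_fraction([0, 16, 99, 147, 4]))
```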

Forward strand: 50 % (72 361 186 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants (e.g. a large insertion), the mates can map with a separation that differs from the expected one; such reads are a signal for the variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
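The separation arithmetic for a proper pair can be written out as a small Python sketch. The function names and the `tolerance` parameter are assumptions made for this example. Aligners apply their own rules, including read orientation, when setting the proper-pair flag.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len

def distance_looks_proper(observed_fragment_len, expected_fragment_len, tolerance):
    """True if the observed fragment length is close to the expected one.

    `tolerance` is a hypothetical cut-off chosen by the caller.
    """
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance

# ~500 bp fragments with 100 bp reads leave a gap of ~300 bp.
print(expected_inner_distance(500, 100))
print(distance_looks_proper(520, 500, tolerance=100))
print(distance_looks_proper(5000, 500, tolerance=100))
```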

Proper pairs: 96.2 % (139 286 704 reads); remainder: 3.8 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
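The coordinate-based detection described above can be sketched in Python as follows. This is a deliberately simplified illustration: the record layout `(name, chrom, pos, strand, mate_chrom, mate_pos)` and the function name are assumptions for this example, and real duplicate-marking tools apply more refined rules.

```python
def mark_duplicates(reads):
    """Flag reads whose own and mate coordinates were already seen.

    `reads` is a list of (name, chrom, pos, strand, mate_chrom, mate_pos)
    tuples. The first read at each key is kept; later ones are flagged.
    """
    seen = set()
    is_duplicate = []
    for _name, chrom, pos, strand, mate_chrom, mate_pos in reads:
        key = (chrom, pos, strand, mate_chrom, mate_pos)
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

# Made-up records: r2 shares all coordinates with r1, r3 does not.
reads = [
    ("r1", "1", 1000, "+", "1", 1300),
    ("r2", "1", 1000, "+", "1", 1300),
    ("r3", "1", 1000, "+", "1", 1350),
]
flags = mark_duplicates(reads)
print(sum(flags) / len(flags))  # duplicate rate in this toy input
```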

Duplicates: 25.9 % (37 449 997 reads); remainder: 74.1 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

Chart: Mapping Quality Distribution (x-axis: Phred quality score, 0 to 60; y-axis: # Reads, up to 120M).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.
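A per-chromosome tally of this kind can be sketched in Python. It assumes each record is a `(rname, flag)` pair taken from the RNAME and FLAG fields of an alignment, with unmapped reads carrying their mate's chromosome as described above. The function name and sample records are made up for this example.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        counts[rname][1 if flag & READ_UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}

# Made-up records: flag 133 is an unmapped read placed on its mate's chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
rates = per_chromosome_mapped_rate(records)
print(rates)
```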

Chart: Mapped vs Unmapped (x-axis: chromosomes, including X, Y and M; y-axis: 0% to 100%; series: mapped, unmapped). Per-chromosome mapped fractions range from 99.44% to 99.69%.