European Genome-Phenome Archive

File Quality

File Information: EGAF00000140968

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
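As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value and pools everything above 1000 into one bin, as the chart's last tick does. The function names and the input list of per-position depths are hypothetical; this is not the archive's actual pipeline.

```python
from collections import Counter
import math


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist


def log10_counts(hist):
    """Return log10 of each bin count, as plotted on a log-scale y-axis."""
    return {key: math.log10(count) for key, count in hist.items()}


# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 0, 1, 1, 2, 5, 1500]
example_hist = coverage_histogram(example_depths)
```

On a log scale, a bin holding 10 times more positions is drawn only one unit higher, which keeps the low-coverage peak and the long tail visible on the same chart.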

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases, log scale (100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
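To make the scale concrete, here is a minimal sketch of the conversion in both directions, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 20 to 1 in 100.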

[Chart: base quality distribution. X-axis: Phred quality score (0 to 45). Y-axis: # bases (0G to 1.2G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
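The calculation described above can be sketched as follows. The function name is hypothetical and the counts in the usage note are toy values, not figures from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total
```

For example, 984 mapped and 16 unmapped reads give a rate of 0.984, i.e. 98.4%. The parentheses matter: dividing by the mapped count alone and then adding the unmapped count would give a meaningless number.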

Mapped: 98.4 % (42 146 039 reads). Unmapped: 1.6 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance from each other are also counted here; that criterion is assessed in the Proper Pairs chart.

Both mates mapped: 98.1 % (41 978 224 reads). Remainder: 1.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
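The definition above can be expressed as a small classification step. This is a minimal sketch: the function names and the representation of a pair as two booleans (one per mate) are assumptions made for illustration.

```python
from collections import Counter


def classify_pair(mate1_mapped, mate2_mapped):
    """Label a pair by how many of its two mates mapped."""
    if mate1_mapped and mate2_mapped:
        return "both_mapped"
    if mate1_mapped or mate2_mapped:
        return "singleton"  # exactly one mate mapped
    return "both_unmapped"


def singleton_read_rate(pairs):
    """Singleton reads divided by all reads, for (mate1, mate2) booleans."""
    counts = Counter(classify_pair(a, b) for a, b in pairs)
    total_reads = 2 * sum(counts.values())  # two reads per pair
    if total_reads == 0:
        return 0.0
    # Each singleton pair contributes one singleton read (the mapped mate).
    return counts["singleton"] / total_reads
```

As a consistency check on this report's own figures, the 41 978 224 reads with both mates mapped plus the 167 815 singletons sum to 42 146 039, which matches the count shown in the Mapped Reads chart.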

Singletons: 0.4 % (167 815 reads). Remainder: 99.6 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (21 405 718 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads spanning it map with an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.
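The arithmetic behind the ~300 base pair figure can be written out as a short sketch. The function names and the `tolerance` parameter are hypothetical; real aligners typically estimate the acceptable range from the observed insert-size distribution rather than using a fixed tolerance.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced bases between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len


def has_expected_separation(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bases of the expected gap."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance
```

With 500 base pair fragments and 100 base pair reads, `expected_gap(500, 100)` is 300. This sketch checks distance only; the orientation requirement mentioned above would need a separate test.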

Proper pairs: 97.3 % (41 642 484 reads). Remainder: 2.7 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, so it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. Consequently, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
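The coordinate-matching idea described above can be sketched as follows. This is a simplified illustration, not the method used to produce this report: the tuple layout and function names are assumptions, and the sketch keeps the first pair seen at each coordinate set, whereas production tools usually keep the best-quality one.

```python
def mark_duplicates(pairs):
    """Flag pairs whose coordinates match an earlier pair.

    `pairs` is a list of hashable keys such as
    (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Returns one boolean per pair: True if it duplicates an earlier one.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates."""
    flags = mark_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0
```

Because the test is purely positional, two independent fragments that happen to share both coordinates are flagged as well, which is the over-counting at high depth that the paragraph above warns about.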

Duplicates: 0.7 % (285 091 reads). Remainder: 99.3 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (5M to 30M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
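To illustrate how an unmapped read can still be counted against a chromosome, the sketch below tallies records per reference name. It assumes SAM-style records, in which flag bit 0x4 marks a read as unmapped and an unplaced read has reference name `*`. The function names and the (reference name, flag) tuple layout are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: this read is unmapped


def per_chromosome_counts(records):
    """Tally mapped and unmapped reads per chromosome.

    `records` is an iterable of (rname, flag) tuples. Reads with
    rname "*" have no chromosome assigned and are skipped.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue
        status = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][status] += 1
    return counts


def mapped_percentages(counts):
    """Percentage of mapped reads for each chromosome."""
    return {
        chrom: 100.0 * c["mapped"] / (c["mapped"] + c["unmapped"])
        for chrom, c in counts.items()
    }
```

In this scheme an unmapped mate that carries its partner's chromosome and position lands in that chromosome's "unmapped" tally, while a pair with neither mate placed is left out of the per-chromosome chart.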

[Chart: stacked columns of mapped vs. unmapped reads per chromosome (axis labels 1 to 21, X, Y, M). Y-axis: 0% to 100%. All 24 columns show 100% mapped and 0% unmapped.]