European Genome-Phenome Archive

File Quality

File Information: EGAF00002339241

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are shown on a log scale to compress the wide range of values. A high peak at the start (low coverage) followed by a descending curve is expected.
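As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value. The function name, the input format (a plain list of per-position depths), and the pooling of values above 1000 into a single `>1000` bin are assumptions made for this example; they are not taken from the archive's own pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value.

    depths: iterable of per-position coverage values (hypothetical input).
    Values above `cap` are pooled under the key ">cap", mirroring the
    ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return dict(hist)

# Two positions with no coverage, one covered once, three covered
# three times, and one very deep position that lands in the pooled bin.
example = coverage_histogram([0, 0, 1, 3, 3, 3, 2000])
```

Plotting the counts on a logarithmic y-axis keeps both the large low-coverage bins and the small high-coverage bins visible.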

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases, log scale (100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
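The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly a 1 in 10,000 chance of error,
# and a 1 in 1,000 chance of error corresponds to a score of about 30.
p40 = phred_to_error_prob(40)
q_for_1_in_1000 = error_prob_to_phred(0.001)
```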

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 3G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
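The calculation is simple enough to write down directly. The sketch below is only an illustration of the ratio of mapped reads to all reads; the function name and the example counts are hypothetical.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped reads as a percentage of all reads."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return 100.0 * mapped / total

# Hypothetical counts: 999 mapped reads and 1 unmapped read
# give a mapped rate of about 99.9 %.
rate = mapped_rate(999, 1)
```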

Mapped reads: 34 494 864 (99.9 %). Unmapped: 0.1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here. The expected-distance requirement is covered by the Proper Pairs chart.

Both mates mapped: 34 457 218 (99.8 %). Remainder: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
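In SAM/BAM files, the distinction between a read whose mate is also mapped and a singleton can be read from the FLAG field. The sketch below assumes the input is a SAM FLAG integer and uses the standard bit values (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). Whether this report is computed in exactly this way is an assumption, and the function name and category labels are made up for the example.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify_read(flag):
    """Classify a read by its SAM FLAG (labels are illustrative only)."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        # This read is mapped but its mate is not: a singleton.
        return "singleton"
    return "both_mapped"

# FLAG 99 (paired, proper pair, mate reverse, first in pair): both mapped.
# FLAG 73 (paired, mate unmapped, first in pair): singleton.
labels = [classify_read(f) for f in (99, 73, 133)]
```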

Singletons: 37 646 (0.1 %). Remainder: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 17 269 320 (50 %). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads that map with an unexpected separation are a signal of the variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).
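The arithmetic behind the fragment-length example can be sketched as below. The function names and the `tolerance` parameter are hypothetical. Real aligners decide what counts as a proper pair from the observed insert-size distribution and read orientation, which this sketch does not model.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length

def has_expected_separation(observed, fragment_length, read_length, tolerance):
    """True if the observed gap is within `tolerance` bp of the expectation.

    `tolerance` is an illustrative parameter, not a value from the report.
    """
    expected = expected_inner_distance(fragment_length, read_length)
    return abs(observed - expected) <= tolerance

# A ~500 bp fragment with 100 bp reads from either end leaves ~300 bp
# between the mates.
gap = expected_inner_distance(500, 100)
```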

Proper pairs: 34 076 068 (98.7 %). Remainder: 1.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
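The coordinate-matching idea described above can be sketched in a few lines. This is a deliberately simplified illustration: the input format (tuples of chromosome, position, mate chromosome, mate position) and the function name are assumptions, and real duplicate-marking tools also take strand and orientation into account and choose which copy to keep based on quality.

```python
def mark_duplicates(pairs):
    """Flag pairs whose coordinates repeat an earlier pair.

    pairs: iterable of (chrom, pos, mate_chrom, mate_pos) tuples
    (a hypothetical, simplified representation of a read pair).
    Returns a list of booleans, True where the pair duplicates an
    earlier one.
    """
    seen = set()
    flags = []
    for pair in pairs:
        key = tuple(pair)
        flags.append(key in seen)
        seen.add(key)
    return flags

# The second pair has the same read and mate coordinates as the first,
# so it is flagged. The third differs in its mate position and is not.
flags = mark_duplicates([
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),
    ("1", 100, "1", 450),
])
```

The sketch also shows why very deep sequencing inflates the rate: two independent fragments that happen to share both coordinates are indistinguishable from a true duplicate under this rule.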

Duplicates: 2 612 883 (7.6 %). Remainder: 92.4 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (2M to 30M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
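A per-chromosome tally of this kind can be sketched as follows, assuming each record is reduced to its reference name and SAM FLAG, and that an unmapped read carries its mate's reference name as described above. The record format and function name are assumptions for the example; only the FLAG bit 0x4 (read unmapped) comes from the SAM format.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: read unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: percentage of reads that are mapped}.

    records: iterable of (rname, flag) tuples, where an unmapped read
    is assumed to carry the chromosome of its mapped mate.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        index = 1 if flag & UNMAPPED else 0
        counts[rname][index] += 1
    return {
        chrom: 100.0 * mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }

# Chromosome 1 has three mapped reads and one unmapped read placed
# with its mate. Chromosome X has a single mapped read.
rates = per_chromosome_mapped_rate(
    [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
)
```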

[Chart: mapped vs unmapped reads per chromosome (1 to 21, X, Y, M), stacked columns from 0% to 100%. Mapped: 99.85% to 99.92% per chromosome. Unmapped: 0.08% to 0.15% per chromosome.]