European Genome-Phenome Archive

File Quality

File Information: EGAF00002146210

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale to compress its wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases (log scale, 1k to 100M).]
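A distribution like the one described above could be built as follows. This is a minimal sketch, not the archive's actual pipeline: the function name `coverage_histogram`, the `cap` parameter, and the example depths are all assumptions made for illustration. Depths above the cap are pooled into one overflow bin, mirroring the ">1000" bin on the chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed by cap + 1.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500, 2000]
hist = coverage_histogram(example_depths)

for coverage in sorted(hist):
    label = f">{1000}" if coverage == 1001 else str(coverage)
    print(label, hist[coverage])
```

Plotting the counts on a log axis is then a presentation choice; the histogram itself stores the raw counts.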

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 12G).]
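To make the Phred scale concrete, the sketch below converts between a score and an error probability using the standard definition Q = -10 log10(P). The function names are chosen here for illustration and do not come from the archive.

```python
import math


def phred_to_error_prob(q):
    """Return the error probability P for a Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Return the Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition, each increase of 10 in the score corresponds to a tenfold drop in the error probability, so a score of 40 means an expected error rate of 1 in 10 000.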

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.6 % (114 115 180 reads). Unmapped: 0.4 %.
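The calculation can be sketched as below. The counts for both mates mapped and singletons are the ones reported in the later sections of this page, and they add up to the mapped count shown here. The unmapped count is not given on the page, so `hypothetical_unmapped` is an assumed value used only to exercise the formula.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Counts reported on this page.
both_mates_mapped = 113_821_562
singletons = 293_618
mapped = both_mates_mapped + singletons  # matches the 114 115 180 shown above

# Assumed value for illustration; the page does not report this count.
hypothetical_unmapped = 458_000

print(f"{100 * mapped_rate(mapped, hypothetical_unmapped):.1f} %")
```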

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as is the case with the Proper Pairs chart.

Both mates mapped: 99.3 % (113 821 562 reads). Remainder: 0.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (293 618 reads). Remainder: 99.7 %.
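The distinction between both mates mapped and singletons can be illustrated with the FLAG field of SAM/BAM alignments. This is a sketch under the assumption that the file follows the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); how the archive itself derives these counts is not stated on this page, and the function name and category labels are illustrative.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_pair_status(flag):
    """Classify one alignment record by its own and its mate's mapping status."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"


# Example FLAG values: 99 (mapped, mate mapped), 73 (mapped, mate unmapped),
# 133 (unmapped, mate mapped).
for flag in (99, 73, 133):
    print(flag, classify_pair_status(flag))
```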

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping to the reference genome approximately 50% of the reads map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (57 305 628 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs (500 - 2 × 100). If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it map with a separation far from the expected one; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.8 % (112 125 518 reads). Remainder: 2.2 %.
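The expected separation in the example above is simple arithmetic, sketched below. The function names and the `tolerance` parameter are assumptions made for illustration; real aligners decide what counts as "the expected distance" from the observed insert-size distribution, which this sketch does not model.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: the fragment minus one read at each end."""
    return fragment_length - 2 * read_length


def has_expected_separation(observed_gap, expected_gap, tolerance):
    """True if the observed gap is within `tolerance` bases of the expected gap."""
    return abs(observed_gap - expected_gap) <= tolerance


expected = expected_inner_distance(fragment_length=500, read_length=100)
print(expected)

# Hypothetical observed gaps and tolerance, for illustration only.
print(has_expected_separation(320, expected, tolerance=100))
print(has_expected_separation(5000, expected, tolerance=100))
```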

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 7.2 % (8 246 511 reads). Non-duplicates: 92.8 %.
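The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration, not the method used to produce the figure: the record layout `(chrom, pos, strand, mate_chrom, mate_pos)` and the function name are assumptions, and real duplicate-marking tools apply more detailed criteria.

```python
from collections import Counter


def count_duplicates(pairs):
    """Count reads that repeat an already-seen coordinate signature.

    `pairs` is an iterable of (chrom, pos, strand, mate_chrom, mate_pos)
    tuples. The first read with a given signature is kept; every further
    read with the same signature is counted as a duplicate.
    """
    seen = Counter(pairs)
    return sum(n - 1 for n in seen.values())


# Hypothetical records, for illustration only.
records = [
    ("1", 1000, "+", "1", 1300),
    ("1", 1000, "+", "1", 1300),  # same coordinates as the first record
    ("1", 1000, "+", "1", 1350),  # same start, different mate position
    ("2", 500, "-", "2", 200),
    ("2", 500, "-", "2", 200),    # same coordinates as the previous record
]
duplicates = count_duplicates(records)
print(duplicates, duplicates / len(records))
```

Because the signature depends only on coordinates, two independent fragments that happen to share coordinates are also counted, which is the depth effect the paragraph above warns about.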

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 100M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads may be assigned to the Y chromosome, because the X and Y chromosomes share regions known as pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%).]

Chromosome  Mapped   Unmapped
1           99.74%   0.26%
2           99.72%   0.28%
3           99.73%   0.27%
4           99.73%   0.27%
5           99.73%   0.27%
6           99.73%   0.27%
7           99.74%   0.26%
8           99.74%   0.26%
9           99.73%   0.27%
10          99.73%   0.27%
11          99.73%   0.27%
12          99.74%   0.26%
13          99.73%   0.27%
14          99.74%   0.26%
15          99.73%   0.27%
16          99.77%   0.23%
17          99.75%   0.25%
18          99.73%   0.27%
19          99.77%   0.23%
20          99.73%   0.27%
21          99.74%   0.26%
X           99.74%   0.26%
Y           99.59%   0.41%
M           99.8%    0.2%
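A per-chromosome breakdown of this kind could be tallied as sketched below, assuming SAM-style records in which an unmapped read with a mapped mate carries its mate's reference name, and reads with no assigned chromosome have the reference name `*`. The function name and the `(rname, flag)` record layout are assumptions for illustration, not the archive's implementation.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: the read itself is unmapped


def per_chromosome_mapped_pct(records):
    """Return {chromosome: percentage of its reads that are mapped}.

    `records` is an iterable of (rname, flag) tuples. Reads with no
    assigned chromosome (rname "*") are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: 100 * mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }


# Hypothetical records, for illustration only.
example = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("*", 77)]
print(per_chromosome_mapped_pct(example))
```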