European Genome-Phenome Archive

File Quality

File Information: EGAF00002027608

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of bases (positions in the reference file) covered that many times.

The y-axis uses a log scale so that the wide range of counts stays readable. A high peak at the start (low coverage) followed by a descending curve is expected.
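As a rough illustration of how such a distribution could be tallied, the sketch below assumes the per-position depths are already available as a list of integers. The function name `coverage_histogram` and the input values are illustrative; the cap of 1000 mirrors the chart's final ">1000" bin.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into a single overflow bin.
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Illustrative depths, not taken from this file.
hist = coverage_histogram([0, 0, 1, 1, 1, 2, 5, 1500])
```

Plotting the counts on a log-scaled axis then keeps both the large low-coverage bins and the small high-coverage bins visible.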

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been called incorrectly, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
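To make the scale concrete, here is a small sketch of the standard Phred conversion, Q = -10 log10(P), in both directions. The function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)     # about 1 error in 10,000 calls
q_001 = error_prob_to_phred(0.001)  # about 30
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability.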

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 14G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapping rate is therefore not expected to reach 100%, although it should be high (usually >90%). The rate is the proportion of mapped reads out of all reads: mapped / (mapped + unmapped).
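The calculation can be sketched as follows. The function name and the read counts are illustrative and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total

# Illustrative counts: 992 mapped reads and 8 unmapped reads.
rate = mapped_rate(mapped=992, unmapped=8)
```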

Mapped: 99.2 % (162 284 708 reads). Unmapped: 0.8 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates map successfully to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance, which the Proper Pairs chart excludes.

Both mates mapped: 98.4 % (161 005 634 reads). Other: 1.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair maps successfully to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls inside the inserted sequence and cannot be mapped. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
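One way to tell these cases apart is sketched below. It assumes the alignments are stored in SAM/BAM format, where each record carries a FLAG bit field (0x1 = read is paired, 0x4 = read unmapped, 0x8 = mate unmapped). The function name `pair_status` and its return labels are illustrative, not part of any tool.

```python
PAIRED = 0x1         # read comes from a paired-end fragment
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the other mate is unmapped

def pair_status(flag):
    """Classify a read from its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"    # mapped read whose mate is unmapped
    return "both_mapped"

status_singleton = pair_status(73)  # 73 = 0x1 + 0x8 + 0x40
status_both = pair_status(99)       # 99 = 0x1 + 0x2 + 0x20 + 0x40
```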

Singletons: 0.8 % (1 279 074 reads). Other: 99.2 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping to the reference genome approximately 50% of the reads will map to the forward strand. Deviations from 50% may indicate problems with the library preparation step.

Forward strand: 50 % (81 818 543 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it map with an unexpected separation; this is a signal for the variant, and those reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
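The arithmetic behind the ~500 base pair example, together with a simplified distance check, might look like the sketch below. The function names and the fixed `tolerance` are assumptions made for illustration; aligners typically derive the accepted range from the observed fragment size distribution, and they also check read orientation, which is omitted here.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read from each end of a fragment."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_fragment_len, expected_fragment_len, tolerance):
    """Simplified check that a pair spans roughly the expected fragment length."""
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance

gap = expected_gap(fragment_len=500, read_len=100)   # 500 - 2 * 100
ok = within_expected_distance(520, 500, tolerance=100)
too_far = within_expected_distance(5000, 500, tolerance=100)
```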

Proper pairs: 96.8 % (158 404 932 reads). Other: 3.2 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
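The coordinate-matching idea can be sketched as follows. This is a deliberately simplified illustration: each fragment is represented as a hypothetical tuple `(name, chrom, start, mate_start)`, and every fragment after the first one seen at a given set of coordinates is flagged. Real duplicate-marking tools also take orientation into account and choose which copy to keep, usually by quality.

```python
def find_duplicates(fragments):
    """Return the names of fragments whose coordinates were already seen.

    fragments: iterable of (name, chrom, start, mate_start) tuples.
    """
    seen = set()
    duplicates = set()
    for name, chrom, start, mate_start in fragments:
        key = (chrom, start, mate_start)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Illustrative fragments: "b" shares both coordinates with "a".
frags = [
    ("a", "1", 100, 400),
    ("b", "1", 100, 400),
    ("c", "1", 100, 450),
    ("d", "2", 100, 400),
]
dups = find_duplicates(frags)
```

Fragment "c" shares a start with "a" but not a mate start, so it is not flagged. At very high depth, independent fragments increasingly collide on both coordinates, which is why the rate becomes less informative.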

Duplicates: 6.1 % (9 953 637 reads). Other: 93.9 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.
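A quick way to summarise such a distribution is to look at the share of reads at a score of 0 and the share at or above a high score. The sketch below assumes the mapping qualities are available as a list of integers; the function name and the default threshold of 60 (taken from the "typically around 60" expectation above) are illustrative, since the maximum score depends on the aligner.

```python
def mapq_summary(mapqs, high=60):
    """Fractions of reads with mapping quality 0 and with quality >= high."""
    if not mapqs:
        raise ValueError("no mapping qualities given")
    n = len(mapqs)
    zero = sum(1 for q in mapqs if q == 0)
    confident = sum(1 for q in mapqs if q >= high)
    return {"frac_zero": zero / n, "frac_high": confident / n}

# Illustrative scores, not taken from this file.
summary = mapq_summary([0, 60, 60, 30])
```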

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (20M to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read then carries the same chromosome and POS as its mate. Unmapped reads can also arise when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
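A per-chromosome tally along these lines could be sketched as below. It assumes SAM-style records reduced to hypothetical `(rname, flag)` tuples, where flag bit 0x4 marks an unmapped read and an unmapped read with a mapped mate carries its mate's reference name. Reads with no reference name (`*`) are skipped.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: fraction of its reads that are mapped}."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # read has no chromosome assigned
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Illustrative records: flag 133 (0x1 + 0x4 + 0x80) is an unmapped second mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```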

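[Chart: mapped vs unmapped reads per chromosome, stacked columns. X-axis: chromosome (1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%).]

Mapped, in axis order: 99.18%, 99.19%, 99.21%, 99.24%, 99.21%, 99.21%, 99.2%, 99.2%, 99.18%, 99.17%, 99.17%, 99.19%, 99.22%, 99.2%, 99.17%, 99.22%, 99.15%, 99.2%, 99.15%, 99.11%, 99.21%, 99.21%, 99.36%, 98.83%

Unmapped, in axis order: 0.82%, 0.81%, 0.79%, 0.76%, 0.79%, 0.79%, 0.8%, 0.8%, 0.82%, 0.83%, 0.83%, 0.81%, 0.78%, 0.8%, 0.83%, 0.78%, 0.85%, 0.8%, 0.85%, 0.89%, 0.79%, 0.79%, 0.64%, 1.17%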