European Genome-Phenome Archive

File Quality

File Information

EGAF00000660075

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of positions (bases) in the reference file that are covered that many times.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (10 to 10M).]
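The tabulation behind a chart like this can be sketched as follows. This is only an illustration, assuming per-position depths are already available as a list of integers; the function name `coverage_histogram`, the `cap` parameter, and the sample depths are hypothetical and not taken from the archive's own pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed by cap + 1,
    mirroring the '>1000' bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2300]
hist = coverage_histogram(depths)

for coverage in sorted(hist):
    label = ">1000" if coverage > 1000 else str(coverage)
    print(label, hist[coverage])
```

Plotting these counts on a log-scaled y-axis keeps the long tail of high-coverage positions visible next to the large low-coverage peak.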

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0M to 180M).]
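For reference, the standard Phred definition is Q = -10 × log10(P), so a score of 40 corresponds to an error probability of 1 in 10 000. A minimal sketch of the conversion in both directions follows; the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

The same conversion applies to mapping quality scores, where P is the probability that a read is placed at the wrong location.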

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads against the total number of reads: mapped / (mapped + unmapped).

Mapped: 99 % (8 352 228 reads). Unmapped: 1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; that criterion is covered by the Proper Pairs chart.

Both mates mapped: 98.8 % (8 330 460 reads). Remainder: 1.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (21 768 reads). Remainder: 99.7 %.
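The categories described so far (unmapped, both mates mapped, singleton) can be sketched in terms of the standard SAM FLAG bits. This is an illustration under the assumption that the alignments are stored in SAM/BAM format; the report does not state how its figures are computed, and the `classify` function and example flags below are hypothetical.

```python
from collections import Counter

# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # this read's mate is unmapped

def classify(flag):
    """Assign a read to a mapping category from its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

# Hypothetical FLAG values: two mates of a mapped pair (99, 147),
# a mapped read whose mate is unmapped (73), and that unmapped mate (133).
flags = [99, 147, 73, 133]
counts = Counter(classify(f) for f in flags)

mapped = sum(n for category, n in counts.items() if category != "unmapped")
mapped_rate = mapped / len(flags)  # mapped / (mapped + unmapped)
print(counts, f"mapped rate = {mapped_rate:.0%}")
```

Under this classification, the mapped count is the sum of the both-mates-mapped reads and the singletons.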

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (4 217 803 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates map with an unexpected separation; this is a signal of the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.4 % (8 301 462 reads). Remainder: 1.6 %.
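The arithmetic in the example above (500 bp fragments, 100 bp reads, ~300 bp between mates) can be written out as a short sketch. The distance check is a simplification: the `tolerance` parameter and the function names are hypothetical, and real aligners typically derive the accepted range from the observed insert-size distribution and also check read orientation.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_fragment_len, expected_fragment_len,
                             tolerance):
    """True if the observed fragment length is close to the expected one.

    `tolerance` is a hypothetical fixed margin in base pairs.
    """
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance

print(expected_inner_distance(500, 100))              # 300
print(within_expected_distance(520, 500, 100))        # True
print(within_expected_distance(5000, 500, 100))       # False
```

A pair spanning 5000 bp on the reference when ~500 bp is expected would fail this check, which is the kind of signal a structural variant produces.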

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 27.6 % (2 332 326 reads). Remainder: 72.4 %.
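The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm used to produce this report: the record layout (dictionaries with `chrom`, `pos`, `strand`, `mate_chrom`, `mate_pos` keys) is hypothetical, and production tools such as Picard MarkDuplicates or `samtools markdup` apply more refined rules.

```python
def mark_duplicates(reads):
    """Flag every read whose coordinates and mate coordinates were seen before.

    The first read at a given set of coordinates is kept as the original;
    later reads with the same key are flagged as duplicates.
    """
    seen = set()
    is_duplicate = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"],
               read["mate_chrom"], read["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

# Hypothetical reads: the second has the same coordinates as the first.
reads = [
    {"chrom": "1", "pos": 1000, "strand": "+", "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 1000, "strand": "+", "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 1000, "strand": "+", "mate_chrom": "1", "mate_pos": 1350},
]
flags = mark_duplicates(reads)
duplicate_rate = sum(flags) / len(flags)
print(flags, f"duplicate rate = {duplicate_rate:.1%}")
```

The third read shares a start position with the first but has a different mate position, so it is not flagged. At very high depth, independent fragments can share both coordinates by chance, which is why the rate becomes less informative.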

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. So the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so the distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (1M to 7M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is a similar representation to the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can map to the Y chromosome because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given a chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (labels 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Every column is labelled 100% mapped and 0% unmapped.]