European Genome-Phenome Archive

File Quality

File Information: EGAF00003293025

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases covered that many times.

The number of bases is plotted on a log scale so that very large and very small counts remain readable on the same chart. A high peak at the beginning (low coverage) followed by a descending curve is expected.
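
As an illustration of how such a distribution can be tabulated, the sketch below counts how many bases have each coverage value and groups everything above 1000 into a single overflow bin, mirroring the ">1000" label on the chart. The function name, the `cap` parameter and the toy depths are assumptions made for this example; they are not taken from the EGA pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count bases per coverage value; depths above `cap` share one overflow bin (cap + 1)."""
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Toy per-position depths (hypothetical data, not taken from this file).
depths = [0, 1, 1, 2, 2, 2, 30, 30, 1500, 2000]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log axis is then a display choice made by the charting layer.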

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
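
The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1): Q = -10 * log10(p)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly 1 error in 10 000 calls,
# and a score of 20 to roughly 1 error in 100 calls.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

The same conversion applies to mapping quality scores, where P is the probability that a read is placed at the wrong location.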

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 26G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated but, with more significant variation, a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
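
A minimal sketch of the rate calculation described above, written with explicit parentheses. The counts in the example are hypothetical and serve only to show the arithmetic.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads among all reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 997 mapped reads and 3 unmapped reads.
rate = mapped_rate(997, 3)
print(f"{rate:.1%}")
```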

Mapped: 99.7 % (231 805 916 reads). Unmapped: 0.3 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included here; the Proper Pairs chart describes that criterion.

Both mates mapped: 99.6 % (231 463 982 reads). Other: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
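
The distinction between a singleton and a read whose mate is also mapped can be sketched using the FLAG bits defined in the SAM format (0x1 read paired, 0x4 read unmapped, 0x8 mate unmapped). The `classify` function and its category names are assumptions made for this example; how the EGA pipeline itself counts reads is not stated here.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify(flag):
    """Classify one SAM FLAG value into a mapping category."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "unpaired_mapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"


# FLAG 99: paired, mapped, mate mapped. FLAG 73: paired, mapped, mate unmapped.
# FLAG 133: paired, read itself unmapped.
for flag in (99, 73, 133):
    print(flag, classify(flag))
```

Under this classification, the number of mapped reads equals the number of singletons plus the number of reads with both mates mapped (plus any unpaired mapped reads).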

Singletons: 0.1 % (341 934 reads). Other: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
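
As a sketch of how this fraction can be derived, the SAM format marks a read aligned to the reverse strand with FLAG bit 0x10, so a mapped read without that bit is on the forward strand. The function name and the toy flags are assumptions made for this example.

```python
UNMAPPED = 0x4  # read itself is unmapped
REVERSE = 0x10  # read is aligned to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads aligned to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Toy flags: 99 and 163 are forward, 147 and 83 are reverse, 77 is unmapped.
flags = [99, 147, 83, 163, 77]
print(forward_fraction(flags))
```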

Forward strand: 50 % (116 236 913 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).
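
The arithmetic behind the ~300 base pair separation can be sketched as below: the gap between the two mates is the fragment length minus both read lengths. The `is_expected_distance` helper and its `tolerance` parameter are hypothetical; aligners apply their own criteria, typically based on the observed insert size distribution.

```python
def inner_distance(fragment_len, read_len):
    """Gap between the two mates of a fragment: fragment length minus both reads."""
    return fragment_len - 2 * read_len


def is_expected_distance(observed_fragment_len, expected_fragment_len, tolerance):
    """Hypothetical check: is the observed fragment length within a tolerance of the expected one?"""
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance


print(inner_distance(500, 100))              # 300
print(is_expected_distance(520, 500, 100))   # True
print(is_expected_distance(5000, 500, 100))  # False
```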

Proper pairs: 98.1 % (228 103 650 reads). Other: 1.9 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads are sampled. The true duplicate rate is therefore not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that independent reads map to the same location and are marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
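
The coordinate-based idea described above can be sketched as follows: pairs are grouped by the coordinates and strands of both mates, and every pair after the first in a group is flagged as a duplicate. This is a simplified illustration under stated assumptions (the tuple layout and the keep-the-first rule are choices made for this example); production duplicate-marking tools use more elaborate criteria.

```python
def find_duplicates(pairs):
    """Return names of pairs whose read and mate coordinates match an earlier pair.

    Each pair is (name, chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    The first pair seen at a given set of coordinates is kept as the original.
    """
    seen = set()
    duplicates = []
    for name, *coords in pairs:
        key = tuple(coords)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Toy pairs (hypothetical data): r2 has the same coordinates as r1.
pairs = [
    ("r1", "1", 1000, "+", "1", 1300, "-"),
    ("r2", "1", 1000, "+", "1", 1300, "-"),
    ("r3", "1", 1000, "+", "1", 1350, "-"),
]
print(find_duplicates(pairs))
```

At high depth, two independent fragments can share identical coordinates by chance, which is why a pair flagged this way is not always a true PCR duplicate.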

Duplicates: 12.4 % (28 933 770 reads). Non-duplicates: 87.6 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (20M to 200M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given a chromosome and position but is also classified as unmapped.
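
A per-chromosome tally of this kind can be sketched as below, relying on the SAM convention that an unmapped read with a mapped mate carries its mate's reference name. The record layout `(rname, flag)`, the function name and the toy records are assumptions made for this example; reads with no reference name (`*`) are skipped because they cannot be assigned to a chromosome.

```python
from collections import defaultdict

UNMAPPED = 0x4  # read itself is unmapped


def per_chromosome_rates(records):
    """Return {chromosome: (mapped %, unmapped %)} from (rname, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # no placement at all, cannot be assigned to a chromosome
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (100 * m / (m + u), 100 * u / (m + u))
        for chrom, (m, u) in counts.items()
    }


# Toy records (hypothetical data): three mapped reads and one unmapped read
# on chromosome 1, one mapped read on X, and one read without any placement.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
rates = per_chromosome_rates(records)
print(rates)
```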

Mapped and unmapped reads per chromosome (y-axis: 0% to 100%):

Chromosome  Mapped   Unmapped
1           99.85%   0.15%
2           99.84%   0.16%
3           99.84%   0.16%
4           99.84%   0.16%
5           99.84%   0.16%
6           99.84%   0.16%
7           99.85%   0.15%
8           99.85%   0.15%
9           99.85%   0.15%
10          99.85%   0.15%
11          99.85%   0.15%
12          99.85%   0.15%
13          99.84%   0.16%
14          99.85%   0.15%
15          99.85%   0.15%
16          99.87%   0.13%
17          99.86%   0.14%
18          99.84%   0.16%
19          99.88%   0.12%
20          99.86%   0.14%
21          99.85%   0.15%
X           99.84%   0.16%
Y           99.72%   0.28%
M           99.76%   0.24%