European Genome-Phenome Archive

File Quality

File Information: EGAF00002386197

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered, and the y-axis shows the number of bases with that coverage.

Data are shown on a log scale so that the wide range of counts remains readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 1k to 100M).]
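As a rough illustration of how such a distribution can be tabulated, the sketch below counts positions per coverage value and pools everything above 1000 into a single bin, mirroring the chart's ">1000" category. The function names and the toy depth list are hypothetical; the archive's own pipeline is not described on this page.

```python
from collections import Counter
import math


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin (cap + 1)."""
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


def log10_counts(hist):
    """Return log10 of each bin count, as plotted on a log-scaled y-axis."""
    return {cov: math.log10(n) for cov, n in sorted(hist.items())}


# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500]
hist = coverage_histogram(depths)
print(dict(sorted(hist.items())))
print(log10_counts(hist))
```

The key `cap + 1` is only a convenient label for the overflow bin; any sentinel value would do.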

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 18G).]
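To make the scale concrete, the following minimal sketch converts between Phred scores and error probabilities using the standard Phred definition, Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of about 1 in 10 000, and each step of 10 divides the error probability by ten.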

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. It is therefore not expected that the mapped read rate will reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.7 % (161 595 192 reads). Unmapped: 0.3 %.
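The ratio mapped / (mapped + unmapped) can be sketched from SAM flag values, where bit 0x4 marks an unmapped read. This is an illustrative sketch, not the archive's implementation; the flag list is hypothetical example input.

```python
UNMAPPED = 0x4  # SAM flag bit: this read is unmapped


def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM flag integers."""
    mapped = unmapped = 0
    for flag in flags:
        if flag & UNMAPPED:
            unmapped += 1
        else:
            mapped += 1
    total = mapped + unmapped
    return mapped / total if total else 0.0


# Hypothetical flags: five mapped reads (99, 147, 83, 163, 73), three unmapped (77, 141, 133).
flags = [99, 147, 83, 163, 77, 141, 73, 133]
print(mapped_rate(flags))  # 0.625
```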

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; the Proper Pairs chart covers that criterion.

Both mates mapped: 99.5 % (161 225 178 reads). Remaining reads: 0.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (370 014 reads). Remaining reads: 99.8 %.
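The distinction between pairs with both mates mapped and singletons can be sketched with the SAM flag bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function name and the category labels are hypothetical, chosen for illustration.

```python
PAIRED = 0x1         # read comes from a paired-end fragment
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def pair_status(flag):
    """Classify one read by the mapping state of itself and its mate."""
    if not (flag & PAIRED):
        return "unpaired"
    read_mapped = not (flag & UNMAPPED)
    mate_mapped = not (flag & MATE_UNMAPPED)
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"


for flag in (99, 73, 133, 77):
    print(flag, pair_status(flag))
```

Flag 73 (paired, mate unmapped, first in pair) is the singleton case described above, and flag 133 is its unmapped mate.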

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may indicate problems with the library preparation step.

Forward strand: 50 % (81 011 834 reads). Remaining reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from each end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion or deletion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.1 % (158 968 668 reads). Remaining reads: 1.9 %.
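The fragment-length arithmetic in the example above (a ~500 bp fragment with 100 bp reads from each end) can be written out as a small sketch. The `tolerance` parameter is a hypothetical stand-in for whatever window an aligner accepts; in practice aligners decide this themselves and record the result in SAM flag bit 0x2.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read inward from both fragment ends."""
    return fragment_len - 2 * read_len


def has_expected_distance(observed_gap, expected, tolerance):
    """True if the observed gap lies within +/- tolerance of the expected gap."""
    return abs(observed_gap - expected) <= tolerance


gap = expected_gap(500, 100)
print(gap)  # 300
print(has_expected_distance(320, gap, tolerance=100))   # within the assumed window
print(has_expected_distance(5000, gap, tolerance=100))  # far outside it
```

This checks distance only; a full proper-pair test would also check that the two mates have the expected orientation.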

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 9.9 % (16 074 776 reads). Remaining reads: 90.1 %.
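The coordinate-based idea described above can be sketched as follows: a read is flagged as a duplicate when another read with the same position, mate position and strand has already been seen. This is a deliberately simplified illustration with a hypothetical record layout; production duplicate-marking tools apply further rules (for instance, choosing which copy to keep) that are not modelled here.

```python
def find_duplicates(reads):
    """Return names of reads whose coordinates repeat an earlier read.

    reads: iterable of (name, chrom, pos, mate_chrom, mate_pos, is_reverse).
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos, is_reverse in reads:
        key = (chrom, pos, mate_chrom, mate_pos, is_reverse)
        if key in seen:
            duplicates.append(name)  # same coordinates as an earlier read
        else:
            seen.add(key)
    return duplicates


# Hypothetical reads: r2 repeats r1; r3 and r4 differ in mate position or chromosome.
reads = [
    ("r1", "1", 100, "1", 350, False),
    ("r2", "1", 100, "1", 350, False),
    ("r3", "1", 100, "1", 360, False),
    ("r4", "2", 100, "2", 350, False),
]
dups = find_duplicates(reads)
print(dups, len(dups) / len(reads))
```

The sketch also shows why very deep sequencing inflates the rate: two independent fragments that happen to share coordinates are indistinguishable from a true PCR duplicate under this rule.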

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (20M to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as in the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is attributed to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (1 to 21, X, Y, M); y-axis: percentage of reads (0% to 100%); series: mapped, unmapped. Mapped fractions range from 99.65% to 99.79% and unmapped fractions from 0.21% to 0.35%.]
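A per-chromosome breakdown of this kind can be sketched by tallying reads by reference name and by SAM flag bit 0x4. The sketch assumes, as described above, that an unmapped read with a mapped mate carries its mate's chromosome. The record layout `(chromosome, flag)` and the function name are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: this read is unmapped


def per_chromosome_mapped_rate(records):
    """records: iterable of (chromosome, flag). Returns {chromosome: mapped fraction}."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }


# Hypothetical records: chromosome 1 has one unmapped mate (flag 133), X has none.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
print(per_chromosome_mapped_rate(records))
```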