European Genome-Phenome Archive

File Quality

File Information: EGAF00000643109

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage across the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered. The y-axis shows the number of bases (positions) that have that coverage.

Counts are plotted on a log scale to compress their wide range. A high peak at the beginning (low coverage) followed by a descending curve is expected.
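
As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions fall at each coverage value and gathers everything above a cap into one overflow bin, similar to the ">1000" bin on the chart. The function name, the `cap` parameter and the toy depths are assumptions made for illustration; this is not necessarily how the archive computes the chart.

```python
import math
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one overflow bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Toy per-position depths (hypothetical data).
depths = [0, 0, 0, 1, 1, 2, 5, 5, 1200, 3000]
hist = coverage_histogram(depths)

# A log10 scale compresses the gap between very common and very rare bins.
log_counts = {value: math.log10(count) for value, count in hist.items()}
```

With real data the low-coverage bins typically hold orders of magnitude more positions than the high-coverage ones, which is why a log scale keeps the tail visible.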

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: number of bases, log scale (100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
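
To make the scale concrete, the following sketch converts between Phred scores and error probabilities using the standard Phred definition, Q = -10 log10(P). The function names are illustrative assumptions.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error_prob(q)):,} base calls")
```

Under this definition a score of 40 corresponds to an error probability of 0.0001, so every 10-point increase makes an error ten times less likely.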

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0M to 650M).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
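
The calculation can be sketched as follows. The function name and the counts are hypothetical and serve only to show the arithmetic and the placement of the parentheses.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total

# Hypothetical counts, chosen only to illustrate the arithmetic.
rate = mapped_rate(mapped=9_880, unmapped=120)
print(f"{rate:.1%}")
```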

Mapped: 98.8 % (37 381 584 reads). Unmapped: 1.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart.

Both mates mapped: 98.5 % (37 255 486 reads). Other: 1.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
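
In SAM/BAM files these categories can be read from each read's FLAG field. The sketch below uses flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function name and category labels are illustrative assumptions, and the report itself may be computed differently.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def pair_status(flag):
    """Classify one read of a pair from its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"

# 99  = paired + proper pair + mate reverse + first in pair
# 73  = paired + mate unmapped + first in pair
# 133 = paired + read unmapped + second in pair
for flag in (99, 73, 133):
    print(flag, pair_status(flag))
```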

Singletons: 0.3 % (126 098 reads). Other: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (18 913 926 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpectedly large separation are a signal for this variant, and those reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
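
The ~300 base pair figure follows from subtracting both read lengths from the fragment length, as the sketch below shows. The helper names and the tolerance-based check are hypothetical simplifications; real aligners infer the expected distance from the data and also test read orientation.

```python
def expected_inner_distance(fragment_length, read_length):
    """Unsequenced gap between two mates read inward from both fragment ends."""
    return fragment_length - 2 * read_length

def within_expected_distance(observed_fragment, expected_fragment, tolerance):
    """Hypothetical distance-only check; orientation is ignored here."""
    return abs(observed_fragment - expected_fragment) <= tolerance

gap = expected_inner_distance(fragment_length=500, read_length=100)
print(gap)

# A pair spanning 5000 bp against an expected 500 bp would fail the check.
print(within_expected_distance(520, 500, tolerance=100))
print(within_expected_distance(5000, 500, tolerance=100))
```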

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.4 % (37 205 678 reads). Other: 1.6 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. To identify duplicates, algorithms typically look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and consequently it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
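
The coordinate-based idea can be sketched in a few lines. This is a deliberately simplified illustration with a hypothetical read representation; production duplicate-marking tools apply additional rules, such as choosing which copy to keep.

```python
def mark_duplicates(reads):
    """Return indices of reads flagged as duplicates.

    Each read is a tuple (chrom, pos, strand, mate_chrom, mate_pos).
    The first read seen with a given key is kept; later ones are flagged.
    """
    seen = set()
    duplicates = []
    for index, read in enumerate(reads):
        if read in seen:
            duplicates.append(index)
        else:
            seen.add(read)
    return duplicates

# Hypothetical reads for illustration.
reads = [
    ("1", 1000, "+", "1", 1300),
    ("1", 1000, "+", "1", 1300),  # same coordinates as the first read
    ("1", 1000, "+", "1", 1450),  # same start, different mate position
    ("2", 5000, "-", "2", 4700),
]
dups = mark_duplicates(reads)
duplicate_rate = len(dups) / len(reads)
```

Note that the second read is flagged purely because its coordinates coincide with the first, which is why distinct fragments can be mislabelled as duplicates at high depth.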

Duplicates: 2.6 % (994 140 reads). Non-duplicates: 97.4 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (up to 28M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as in the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference onto another, it is given the right chromosome and position, but it is also classified as unmapped.
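
A per-chromosome tally along these lines might look like the sketch below. It assumes each record is reduced to a reference name and a SAM FLAG value, uses the SAM flag bit 0x4 for "read unmapped", and skips reads with no reference name ("*"). The function name and the sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped

def per_chromosome_rates(records):
    """Return the mapped fraction per reference name.

    `records` yields (rname, flag) pairs. An unmapped read can still carry
    an rname when it inherits the coordinates of its mapped mate.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned at all
        key = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][key] += 1
    return {
        chrom: c["mapped"] / (c["mapped"] + c["unmapped"])
        for chrom, c in counts.items()
    }

# Hypothetical records: (reference name, FLAG).
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("Y", 99), ("*", 77)]
rates = per_chromosome_rates(records)
```

Here the read with FLAG 133 is unmapped but still counted against chromosome 1, because it carries its mate's reference name.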

Chromosome   Mapped    Unmapped
1            99.65%    0.35%
2            99.69%    0.31%
3            99.82%    0.18%
4            99.45%    0.55%
5            99.77%    0.23%
6            99.82%    0.18%
7            99.59%    0.41%
8            99.66%    0.34%
9            99.41%    0.59%
10           99.39%    0.61%
11           99.82%    0.18%
12           99.88%    0.12%
13           99.84%    0.16%
14           99.82%    0.18%
15           99.51%    0.49%
16           99.41%    0.59%
17           99.69%    0.31%
18           99.73%    0.27%
19           99.81%    0.19%
20           99.85%    0.15%
21           99.66%    0.34%
X            99.66%    0.34%
Y            93.86%    6.14%
M            99.52%    0.48%