European Genome-Phenome Archive

File Quality

File Information

EGAF00000660599

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The data is shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
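As a rough illustration of how such a distribution could be tallied, the sketch below counts how many positions have each coverage value and pools everything above 1000 into a single bin, mirroring the ">1000" category on the chart's x-axis. The function name, the `cap` parameter, and the input depths are all hypothetical; this is not the archive's own implementation.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    `depths` is an iterable of per-position coverage values (hypothetical input).
    The overflow bin is stored under the key `cap + 1`.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Made-up per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2300]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```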

[Chart: base coverage histogram. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, e.g. because of a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
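The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by Phred score q (standard definition)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10,000, and each additional 10 points divides the error probability by ten.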

[Chart: base quality histogram. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 1.8G).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it is aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
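The calculation described above is simple arithmetic; a minimal sketch is shown here. The function name and the example counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Made-up counts, for illustration only.
rate = mapped_rate(mapped=990, unmapped=10)
print(f"{rate:.1%}")
```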

Mapped: 99 % (92 483 166 reads). Remainder: 1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 98.7 % (92 215 462 reads). Remainder: 1.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
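One way to distinguish singletons from reads with both mates mapped is to inspect the bitwise FLAG field of each alignment record. The sketch below uses the SAM flag bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped); the function name and category labels are hypothetical, and whether this report was produced this way is an assumption.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify_read(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"    # mapped, but its mate is not
    return "both_mapped"      # mapped, and so is its mate


for flag in (99, 73, 133):
    print(flag, classify_read(flag))
```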

Singletons: 0.3 % (267 704 reads). Remainder: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (46 731 245 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with a large separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
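The separation quoted in the example above follows from subtracting both read lengths from the fragment length. A minimal sketch, with a hypothetical function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when each end of the fragment is read once."""
    return fragment_length - 2 * read_length


# The example from the text: ~500 bp fragments with 100 bp reads.
print(expected_inner_distance(500, 100))  # 300
```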

Proper pairs: 98.5 % (92 084 944 reads). Remainder: 1.5 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
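The coordinate-based detection described above can be sketched as follows: read pairs are keyed by the mapping coordinates of both mates, and every pair after the first with the same key is flagged. This is a deliberately simplified illustration; the tuple layout and function name are hypothetical, and production duplicate-marking tools apply further criteria that are not modelled here.

```python
def flag_duplicates(pairs):
    """Return the names of pairs whose coordinates repeat an earlier pair.

    `pairs` is a list of (name, chrom, start, mate_chrom, mate_start) tuples,
    a hypothetical layout chosen for this sketch.
    """
    seen = set()
    duplicates = set()
    for name, chrom, start, mate_chrom, mate_start in pairs:
        key = (chrom, start, mate_chrom, mate_start)
        if key in seen:
            duplicates.add(name)  # same coordinates as an earlier pair
        else:
            seen.add(key)         # first pair seen at these coordinates
    return duplicates


# Made-up pairs, for illustration only.
pairs = [
    ("pair1", "1", 1000, "1", 1300),
    ("pair2", "1", 1000, "1", 1300),  # same coordinates as pair1
    ("pair3", "1", 1000, "1", 1450),  # same start, different mate position
    ("pair4", "2", 5000, "2", 5300),
]
duplicates = flag_duplicates(pairs)
print(sorted(duplicates), len(duplicates) / len(pairs))
```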

Duplicates: 3.9 % (3 602 178 reads). Remainder: 96.1 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
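A mapping quality histogram can be reduced to two quick indicators: the share of reads with a score of 0 and the share at or above some confidence threshold. The sketch below assumes a hypothetical histogram given as a score-to-count mapping; the function name, the threshold of 30, and the counts are illustrative and do not come from this file.

```python
def summarise_mapq(hist, threshold=30):
    """Return (fraction with score 0, fraction with score >= threshold)."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    zero_fraction = hist.get(0, 0) / total
    high_fraction = sum(n for q, n in hist.items() if q >= threshold) / total
    return zero_fraction, high_fraction


# Made-up histogram: mapping quality score -> number of reads.
hist = {0: 50, 10: 50, 40: 100, 60: 800}
zero_fraction, high_fraction = summarise_mapq(hist)
print(zero_fraction, high_fraction)
```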

[Chart: mapping quality histogram. X-axis: Phred quality score (0 to 60). Y-axis: # reads (10M to 70M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads may still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when the alignment is performed with bwa, because it concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
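A per-chromosome tally along these lines could be sketched as below. Each record is a hypothetical (chromosome, FLAG) tuple; the SAM flag bit 0x4 marks a read as unmapped, and "*" is the SAM placeholder for a read with no chromosome assigned. The function name and the input records are made up for illustration.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: read is unmapped


def per_chromosome_mapped_fraction(records):
    """Return {chromosome: fraction of its reads that are mapped}.

    Unmapped reads that carry a chromosome (e.g. placed with their mapped
    mate) count against that chromosome; reads with "*" are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Made-up records, for illustration only.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("*", 77)]
fractions = per_chromosome_mapped_fraction(records)
print(fractions)
```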

[Chart: stacked columns of mapped and unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome  Mapped   Unmapped
1           99.71%   0.29%
2           99.65%   0.35%
3           99.83%   0.17%
4           99.42%   0.58%
5           99.81%   0.19%
6           99.85%   0.15%
7           99.7%    0.3%
8           99.73%   0.27%
9           99.48%   0.52%
10          99.45%   0.55%
11          99.81%   0.19%
12          99.87%   0.13%
13          99.83%   0.17%
14          99.83%   0.17%
15          99.63%   0.37%
16          99.62%   0.38%
17          99.73%   0.27%
18          99.7%    0.3%
19          99.8%    0.2%
20          99.83%   0.17%
21          99.65%   0.35%
X           99.75%   0.25%
Y           94.87%   5.13%
M           99.66%   0.34%