European Genome-Phenome Archive

File Quality

File Information: EGAF00003608599

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
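As a minimal sketch of how such a distribution could be tabulated, assume the per-position depths are already available as a list of integers. The function names and the overflow bin (modelled on the chart's ">1000" bin) are illustrative assumptions, not the archive's actual pipeline.

```python
from collections import Counter
import math


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`,
    an assumption modelled on the chart's ">1000" bin.
    """
    hist = Counter(min(d, cap + 1) for d in depths)
    return dict(sorted(hist.items()))


def log10_counts(hist):
    """Return log10 of each non-zero count, as a log-scaled axis would show it."""
    return {cov: math.log10(n) for cov, n in hist.items() if n > 0}


# Hypothetical depths for ten reference positions (not taken from this file).
depths = [0, 0, 0, 1, 1, 2, 2, 2, 5, 1500]
hist = coverage_histogram(depths)
```

Pooling the high-coverage tail keeps the x-axis bounded, while the log scale keeps the sparse high-coverage bins visible next to the dominant low-coverage peak.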

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (1k to 10M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
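The relationship between score and error probability can be sketched as follows, using the standard Phred definition Q = -10·log10(P). The function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P, with 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)
```

Under this definition, a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 20 to 1 in 100.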

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 18G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but a read with more significant variation can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
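The calculation can be written out as a small sketch. The counts in the example are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Hypothetical counts: 999 mapped reads and 1 unmapped read.
rate = mapped_rate(mapped=999, unmapped=1)
```

Note that the denominator is the total number of reads, so the parentheses around `mapped + unmapped` matter.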

[Chart: mapped reads. 99.9 % (133 978 802 reads); remainder 0.1 %.]

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart, which counts only pairs mapped at the expected distance.

[Chart: both mates mapped. 99.8 % (133 864 448 reads); remainder 0.2 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
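One way to tell these cases apart, assuming the alignments are stored in SAM/BAM format, is to inspect the standard SAM flag bits of each read. The sketch below is illustrative. The category names are invented here, and the source does not state which tool the archive actually uses.

```python
# Standard SAM flag bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify(flag):
    """Classify one read by its SAM flag (category names are hypothetical)."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"
```

For example, a read with flag 73 (paired, mate unmapped, first in pair) is classified as a singleton, while its mate with flag 133 (paired, unmapped, second in pair) is classified as unmapped.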

[Chart: singletons. 0.1 % (114 354 reads); remainder 99.9 %.]

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

[Chart: forward strand. 50 % (67 061 839 reads); remainder 50 %.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping at an unexpected separation are a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).
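The arithmetic behind the ~300 base pair figure can be sketched as follows. The tolerance-based check is a simplifying assumption for illustration. It ignores read orientation, and real aligners decide proper pairing from an estimated insert-size distribution.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced bases between the two mates of one fragment."""
    return fragment_len - 2 * read_len


def within_expected_distance(observed_gap, expected, tolerance):
    """Hypothetical check: is the observed gap close enough to the expected one?"""
    return abs(observed_gap - expected) <= tolerance


# The example from the text: ~500 bp fragments with 100 bp reads at each end.
gap = expected_gap(fragment_len=500, read_len=100)  # 500 - 2 * 100
```

A pair whose mates map several kilobases apart would fail such a check, which is the kind of signal a large structural variant produces.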

[Chart: proper pairs. 99.1 % (132 892 364 reads); remainder 0.9 %.]

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.
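The coordinate-based approach described above can be sketched in a few lines. This is a deliberately simplified illustration and not the algorithm of any particular tool. The tuple layout is an assumption, and production duplicate markers apply further criteria.

```python
def mark_duplicates(pairs):
    """Return the indices of pairs flagged as duplicates.

    Each pair is a tuple (chrom, pos, mate_chrom, mate_pos), a hypothetical
    layout. The first pair seen at a given set of coordinates is kept and
    every later pair with identical coordinates is flagged.
    """
    seen = set()
    duplicates = set()
    for index, key in enumerate(pairs):
        if key in seen:
            duplicates.add(index)
        else:
            seen.add(key)
    return duplicates


# Hypothetical pairs: the second has the same coordinates as the first.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),
    ("1", 100, "1", 450),
    ("2", 100, "2", 400),
]
flagged = mark_duplicates(pairs)
duplicate_rate = len(flagged) / len(pairs)
```

Because only coordinates are compared, two independent fragments that happen to share both endpoints are flagged as well, which is why the rate becomes less informative at very high depth.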

[Chart: duplicates. 78.1 % (104 699 932 reads); remainder 21.9 %.]

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample comes from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read should then have the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.
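A per-chromosome tally along these lines can be sketched as follows, assuming each record carries the SAM reference name and flag. The record layout and function name are illustrative assumptions.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM flag bit: the read itself is unmapped


def per_chromosome_rates(records):
    """Return {chromosome: (mapped_fraction, unmapped_fraction)}.

    `records` is an iterable of (rname, flag) tuples. Reads with no
    reference name ("*") have no chromosome and are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (mapped / (mapped + unmapped), unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }


# Hypothetical records: flag 133 is an unmapped read placed at its mate's position.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
rates = per_chromosome_rates(records)
```

The unmapped read in this example is counted under chromosome 1 only because it inherits its mate's reference name, which is the convention the paragraph above describes.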

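[Chart: mapped vs unmapped reads per chromosome, stacked columns, y-axis 0% to 100%. Values recovered from the chart labels:]

Chromosome  Mapped   Unmapped
1           99.92%   0.08%
2           99.91%   0.09%
3           99.92%   0.08%
4           99.93%   0.07%
5           99.93%   0.07%
6           99.92%   0.08%
7           99.91%   0.09%
8           99.92%   0.08%
9           99.91%   0.09%
10          99.92%   0.08%
11          99.91%   0.09%
12          99.92%   0.08%
13          99.92%   0.08%
14          99.92%   0.08%
15          99.92%   0.08%
16          99.91%   0.09%
17          99.91%   0.09%
18          99.92%   0.08%
19          99.9%    0.1%
20          99.91%   0.09%
21          99.92%   0.08%
X           99.91%   0.09%
Y           99.91%   0.09%
M           99.78%   0.22%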