European Genome-Phenome Archive

File Quality

File Information: EGAF00005283997

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage across the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The number of bases is plotted on a log scale so that the wide range of counts remains readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: number of bases (y-axis, log scale, 100 to 100M) against coverage value (x-axis, up to >1000).]
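As a rough illustration of how such a distribution can be derived, the sketch below counts how many positions share each coverage value, given a list of per-position depths. The function name `coverage_histogram`, the `cap` parameter (mirroring the chart's >1000 bucket) and the toy depths are assumptions made for illustration; they are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bucket."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return dict(hist)

# Toy per-position depths (hypothetical, not taken from this report).
toy_depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500]
hist = coverage_histogram(toy_depths)
print(hist)
```

Plotting the counts on a log-scaled y-axis then gives a chart of the kind described above.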

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that the base call is a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: number of bases (y-axis, 0G to 16G) against Phred quality score (x-axis, 0 to 40). Non-zero bins, in order of increasing score: 383 998; 320 158 195; 590 567 977; 17 931 482 814 bases.]
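The relationship between a Phred score and an error probability can be made concrete with a short sketch. It assumes the standard Phred scale, Q = -10 × log10(P); the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, using P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P, using Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this scale a score of 40 corresponds to roughly one expected error in 10 000 base calls, and a score of 20 to one in 100.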

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped: 124 784 043 reads (100 %); unmapped: 0 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; the expected-distance criterion is covered by the Proper Pairs chart.

Both mates mapped: 124 782 820 reads (100 %); remainder: 0 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 1 223 reads (0 %); remainder: 100 %
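In SAM/BAM files, the categories above can be read off the FLAG field of each alignment record. The following is a minimal sketch, assuming the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function name `classify` and the example flags are illustrative, and it is an assumption that the report's figures are computed this way.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Assign a read to one mapping category based on its SAM flag."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

# Example flags: 99 = mapped in a pair, 73 = mapped with mate unmapped, 133 = unmapped.
for flag in (99, 73, 133):
    print(flag, classify(flag))
```

The counts in this report are consistent with such a split: 124 784 043 mapped reads minus 124 782 820 reads with both mates mapped leaves 1 223 singletons.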

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 62 392 692 reads (50 %); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates may map with an unexpected separation; this would be a signal for the variant, and the reads would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 123 687 426 reads (99.1 %); remainder: 0.9 %
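The distance arithmetic in the example above can be written out as a small sketch. The function names, the default expected fragment length and the `tolerance` window are hypothetical values chosen for illustration; real aligners typically estimate the expected fragment size from the data rather than using a fixed window.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced bases between the two mates of one fragment."""
    return fragment_len - 2 * read_len

def looks_proper(observed_fragment_len, expected_len=500, tolerance=150):
    """Hypothetical check: is the observed fragment length near the expected one?"""
    return abs(observed_fragment_len - expected_len) <= tolerance

print(expected_gap(500, 100))  # 500 bp fragment, 100 bp reads from either end
print(looks_proper(520))       # close to the expected fragment length
print(looks_proper(5000))      # far larger than expected
```

The orientation criterion mentioned above is not modelled here.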

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. Analyses of sequencing data assume that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it also becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 30 998 309 reads (24.8 %); remainder: 75.2 %
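The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration, not the method used for this report: the tuple layout (chromosome, read start, mate start, strand) and the function name `mark_duplicates` are assumptions, and real duplicate-marking tools apply more refined rules.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates were already seen earlier in the list.

    pairs: list of (chrom, read_start, mate_start, strand) tuples.
    Returns a list of booleans, True where the pair is marked as a duplicate.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags

# Toy pairs (hypothetical): the second and fourth repeat the first.
toy_pairs = [
    ("1", 1000, 1300, "+"),
    ("1", 1000, 1300, "+"),
    ("1", 1000, 1450, "+"),
    ("1", 1000, 1300, "+"),
    ("2", 500, 800, "-"),
]
dup_flags = mark_duplicates(toy_pairs)
dup_rate = sum(dup_flags) / len(dup_flags)
print(dup_flags, dup_rate)
```

Because only coordinates are compared, two independent fragments that happen to share the same start positions are also flagged, which is the depth-related effect described above.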

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: number of reads (y-axis, 10M to 110M) against Phred mapping quality score (x-axis, 0 to 60).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is attributed to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: stacked columns of mapped and unmapped reads (y-axis, 0% to 100%) per chromosome (x-axis); every column shows 100% mapped and 0% unmapped.]
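A per-chromosome tally of this kind can be sketched from SAM-style records. The sketch assumes each record is reduced to a (reference name, flag) tuple and uses the standard SAM unmapped bit (0x4); the function name and the toy records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM flag bit: this read is unmapped

def per_chromosome_counts(records):
    """Count mapped and unmapped reads per reference name.

    records: iterable of (rname, flag) tuples. Unmapped reads that carry their
    mate's chromosome are counted under that chromosome; reads with no
    reference name ("*") are skipped.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue
        status = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][status] += 1
    return dict(counts)

# Toy records (hypothetical): flag 133 is an unmapped read placed with its mate on X.
toy_records = [("1", 99), ("1", 147), ("X", 73), ("X", 133), ("*", 77)]
counts = per_chromosome_counts(toy_records)
print(counts)
```

Dividing each count by the per-chromosome total gives the stacked percentages shown in the chart.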