European Genome-Phenome Archive

File Quality

File Information

EGAF00001414897

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of bases (positions in the reference file) covered that many times.

The data are shown on a log scale to compress the wide range of values. A high peak at the start (low coverage) followed by a descending curve is expected.
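As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value. The input list of per-position depths and the function name `coverage_histogram` are hypothetical; the overflow bin mirrors the ">1000" label on the chart's x-axis. This is not necessarily how the archive computes the chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single overflow bin stored
    under the key cap + 1 (the ">1000" bin when cap is 1000).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 0, 1, 1, 2, 30, 30, 31, 1500, 2500]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a logarithmic y-axis keeps both the large low-coverage bins and the sparse high-coverage bins visible on the same chart.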

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
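The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Each step of 10 in the score divides the error probability by 10.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to one expected error in 10 000 calls.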

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 100G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
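The calculation described above amounts to a single division. A minimal sketch, with hypothetical read counts chosen only to make the arithmetic easy to follow:

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts, not taken from this file.
rate = mapped_rate(mapped=959, unmapped=41)
print(f"{rate:.1%}")
```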

95.9 % (871 144 153 reads) mapped; 4.1 % unmapped.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart is the one that applies the expected-distance criterion.

94.4 % (858 114 638 reads) with both mates mapped; 5.6 % without.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
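Assuming the alignments are stored in SAM/BAM format (an assumption; the report itself does not state this), the distinction between singletons and reads with both mates mapped can be read from the FLAG field. The bit values below come from the SAM specification; the function `classify` and its category names are illustrative only.

```python
# FLAG bits defined by the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify(flag):
    """Place a read in one of four illustrative categories."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


# 99  = paired, proper pair, mate on reverse strand, first in pair
# 73  = paired, mate unmapped, first in pair
# 133 = paired, read unmapped, second in pair
for flag in (99, 73, 133):
    print(flag, classify(flag))
```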

1.5 % (13 029 515 reads) singletons; 98.5 % not singletons.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

50 % (454 336 288 reads) on the forward strand; 50 % not on the forward strand.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates of a fragment spanning it map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
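The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. A small sketch, where the function names and the tolerance value are hypothetical and real aligners estimate the expected range from the data:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment minus both reads."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed, expected, tolerance):
    """Hypothetical check: is the observed gap close to the expected one?"""
    return abs(observed - expected) <= tolerance


expected = expected_inner_distance(fragment_length=500, read_length=100)
print(expected)  # 300

# A gap of 320 bp passes with a tolerance of 50 bp; a gap of 5000 bp does not.
print(within_expected_distance(320, expected, tolerance=50))
print(within_expected_distance(5000, expected, tolerance=50))
```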

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

93.2 % (846 451 278 reads) in proper pairs; 6.8 % not in proper pairs.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
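The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration: the record layout and the function name `find_duplicates` are hypothetical, and real duplicate-marking tools also take strand and orientation into account and choose which copy to keep.

```python
def find_duplicates(pairs):
    """Return names of pairs whose coordinates repeat an earlier pair.

    Each pair is (name, chrom, start, mate_chrom, mate_start).
    The first pair seen at a set of coordinates is kept; later
    pairs at the same coordinates are reported as duplicates.
    """
    seen = set()
    duplicates = []
    for name, chrom, start, mate_chrom, mate_start in pairs:
        key = (chrom, start, mate_chrom, mate_start)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical pairs: f2 shares both coordinates with f1.
pairs = [
    ("f1", "1", 1000, "1", 1400),
    ("f2", "1", 1000, "1", 1400),
    ("f3", "1", 1000, "1", 1450),
]
duplicates = find_duplicates(pairs)
print(duplicates, len(duplicates) / len(pairs))
```

Note that f3 starts at the same position as f1 but is not flagged, because its mate maps elsewhere. At high depth, independent fragments increasingly share both coordinates by chance, which is why the rate becomes less informative.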

27.4 % (249 134 356 reads) duplicates; 72.6 % not duplicates.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (100M to 800M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read should have the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped vs unmapped reads per chromosome. X-axis labels as extracted: 1 to 21, X, Y, M; y-axis: 0% to 100%.]

Mapped (%), in chart order: 98.88, 98.55, 98.98, 98.59, 98.93, 98.98, 98.42, 98.46, 98.09, 97.58, 98.32, 98.78, 99.13, 98.83, 98.96, 98.31, 98.57, 97.64, 97.82, 98.57, 97.82, 98.62, 78.96, 99.2

Unmapped (%), in chart order: 1.12, 1.45, 1.02, 1.41, 1.07, 1.02, 1.58, 1.54, 1.91, 2.42, 1.68, 1.22, 0.87, 1.17, 1.04, 1.69, 1.43, 2.36, 2.18, 1.43, 2.18, 1.38, 21.04, 0.8