European Genome-Phenome Archive

File Quality

File Information: EGAF00000644153

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases covered at each of those values.

The number of bases is plotted on a log scale to reduce the visual effect of the large range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

Figure: base coverage distribution. X-axis: coverage value (1 to >1000). Y-axis: # bases, log scale (100 to 100M).

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
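As a minimal sketch of the relationship between a Phred score and an error probability, the conversion can be written in both directions. It assumes the standard Phred definition Q = -10·log10(P); the function names are illustrative and not part of any EGA tool.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Return the error probability P for a Phred score Q (Q = -10 * log10(P))."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Return the Phred score Q for an error probability P, with 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print the error probability for a few common scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

Under this definition a score of 40 corresponds to a 1 in 10,000 chance that the base call is wrong, and a score of 20 to a 1 in 100 chance.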

Figure: base quality distribution. X-axis: Phred quality score (0 to 50). Y-axis: # bases (0M to 400M).

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated but, where the variation is more significant, a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
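The calculation described above can be sketched as a small function. The counts in the example are purely illustrative and are not taken from this report; the function name is hypothetical.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total


# Illustrative counts only (not taken from this report).
example_mapped = 950
example_unmapped = 50
print(f"{mapped_rate(example_mapped, example_unmapped):.1%}")
```

Note the parentheses around the denominator: the rate is the mapped count divided by the sum of both counts.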

Mapped: 36 703 128 reads (99.4 %). Remainder: 0.6 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected-distance criterion is applied only in the Proper Pairs chart.

Both mates mapped: 36 610 010 reads (99.1 %). Remainder: 0.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
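One way to picture the categories is through the FLAG field of a SAM/BAM record. The sketch below assumes the alignments follow the SAM specification, where bit 0x1 marks a paired read, 0x4 an unmapped read and 0x8 an unmapped mate. Whether the EGA pipeline derives its counts this way is an assumption, and the function and category names are illustrative.

```python
# Standard SAM FLAG bits (from the SAM specification).
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify_read(flag: int) -> str:
    """Classify one read of a paired-end library by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate did not map
    return "both_mates_mapped"


for flag in (99, 73, 133):
    print(flag, classify_read(flag))
```

In this report the both-mates count and the singleton count add up to the mapped read count (36 610 010 + 93 118 = 36 703 128), which is consistent with a partition of this kind.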

Singletons: 93 118 reads (0.3 %). Remainder: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so that after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 18 466 280 reads (50 %). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates map at a separation that differs markedly from the expected one. Such reads are a signal for the variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).
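The ~300 base pair figure in the example above is the fragment length minus the two read lengths. A minimal sketch of that arithmetic follows; the tolerance-based check is a hypothetical illustration, since aligners use their own criteria to decide what counts as a proper pair.

```python
def expected_separation(fragment_length: int, read_length: int) -> int:
    """Gap between the two mates when one read is taken from each end of the fragment."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed: int, expected: int, tolerance: int) -> bool:
    """Hypothetical check: is the observed separation close enough to the expected one?"""
    return abs(observed - expected) <= tolerance


expected = expected_separation(fragment_length=500, read_length=100)
print(expected)
print(within_expected_distance(observed=320, expected=expected, tolerance=100))
print(within_expected_distance(observed=5000, expected=expected, tolerance=100))
```

A pair separated by several thousand base pairs would fail such a check, which is the kind of signal a large structural variant produces.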

Proper pairs: 36 564 730 reads (99 %). Remainder: 1 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
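The coordinate-based approach described above can be sketched as follows. This is a simplified illustration and not the algorithm of any particular tool: the `Pair` type and `mark_duplicates` function are hypothetical, and real duplicate markers typically also consider strand and orientation and choose which copy to keep by quality.

```python
from typing import Iterable, List, NamedTuple


class Pair(NamedTuple):
    name: str
    chrom: str
    pos: int       # mapped coordinate of the read
    mate_pos: int  # mapped coordinate of its mate


def mark_duplicates(pairs: Iterable[Pair]) -> List[str]:
    """Return the names of pairs flagged as duplicates.

    The first pair seen at a given (chrom, pos, mate_pos) key is kept;
    every later pair with the same key is flagged.
    """
    seen = set()
    duplicates = []
    for pair in pairs:
        key = (pair.chrom, pair.pos, pair.mate_pos)
        if key in seen:
            duplicates.append(pair.name)
        else:
            seen.add(key)
    return duplicates


pairs = [
    Pair("r1", "1", 1000, 1300),
    Pair("r2", "1", 1000, 1300),  # same coordinates as r1
    Pair("r3", "1", 1000, 1450),  # same start, different mate position
]
print(mark_duplicates(pairs))
```

Because the key is built only from coordinates, two independent fragments that happen to share both coordinates are also flagged, which is why the rate becomes less informative at high depth.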

Duplicates: 1 623 923 reads (4.4 %). Remainder: 95.6 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

Figure: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (2M to 28M).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read should have the same chromosome and POS as its mate. Unmapped reads can also appear on a chromosome when the alignment is performed with bwa, because it concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
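A per-chromosome tally of this kind can be sketched from the reference name and FLAG of each record. The sketch assumes SAM-style records in which an unmapped read carries its mate's chromosome, as described above, and in which bit 0x4 marks an unmapped read. The function name and input layout are illustrative.

```python
from collections import Counter

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def per_chromosome_rates(records):
    """Compute mapped and unmapped fractions per chromosome.

    records: iterable of (rname, flag) tuples, as in SAM columns 3 and 2.
    Returns {chromosome: (mapped_fraction, unmapped_fraction)}.
    Reads with no chromosome assigned ("*") are skipped.
    """
    mapped = Counter()
    unmapped = Counter()
    for rname, flag in records:
        if rname == "*":
            continue
        if flag & UNMAPPED:
            unmapped[rname] += 1
        else:
            mapped[rname] += 1
    rates = {}
    for chrom in set(mapped) | set(unmapped):
        total = mapped[chrom] + unmapped[chrom]
        rates[chrom] = (mapped[chrom] / total, unmapped[chrom] / total)
    return rates


records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
print(per_chromosome_rates(records))
```

In the example, the record with flag 133 is unmapped but still counted under chromosome 1 because it carries its mate's reference name.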

Figure: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%). X-axis labels as extracted: 1 to 21, X, Y, M.

Mapped (%), in the order extracted: 99.73, 99.77, 99.86, 99.62, 99.78, 99.85, 99.66, 99.71, 99.58, 99.67, 99.85, 99.86, 99.86, 99.84, 99.63, 99.65, 99.73, 99.79, 99.83, 99.85, 99.75, 99.74, 96.65, 99.52

Unmapped (%), in the order extracted: 0.27, 0.23, 0.14, 0.38, 0.22, 0.15, 0.34, 0.29, 0.42, 0.33, 0.15, 0.14, 0.14, 0.16, 0.37, 0.35, 0.27, 0.21, 0.17, 0.15, 0.25, 0.26, 3.35, 0.48