European Genome-Phenome Archive

File Quality

File Information: EGAF00004855891

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
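As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions have each coverage value. The function name, the `cap` parameter and the input depths are hypothetical; pooling everything above 1000 into one bin is an assumption based on the ">1000" label on the chart's x-axis.

```python
import math
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed by cap + 1.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Made-up per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500, 2000]
hist = coverage_histogram(depths)

# Log-scaled counts, as they would be placed on a log y-axis.
log_counts = {cov: math.log10(n) for cov, n in hist.items()}
```

Here `hist[1]` is 3 and the overflow bin `hist[1001]` is 2.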

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 * log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
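The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 * log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)     # about 1 error in 10 000 calls
q30 = error_prob_to_phred(0.001)  # about 30
```

A score of 40 therefore corresponds to roughly one wrong base call in 10 000.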

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 28G).]

Mapped Reads

Number of reads successfully mapped to the reference genome in the sample (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
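The calculation can be written out as a short sketch. The function name and the counts are hypothetical; the point is that the denominator is the sum of mapped and unmapped reads.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads among all reads."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Made-up counts, for illustration only.
rate = mapped_rate(mapped=997, unmapped=3)
```

With these made-up counts the rate is 0.997, i.e. 99.7 %.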

Mapped: 269 052 570 reads (99.7 %). Unmapped: 0.3 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates map successfully to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other (compare with the Proper Pairs chart).

Both mates mapped: 268 344 268 reads (99.4 %). Other reads: 0.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair maps successfully to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
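The distinction between singletons and reads with both mates mapped can be sketched with alignment flags. This assumes the reads carry SAM-style FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped), which the report does not state explicitly; the function name and category labels are hypothetical.

```python
# SAM-style FLAG bits (assumed to apply to this file).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify(flag):
    """Classify one read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"       # this read mapped, its mate did not
    return "both_mapped"

labels = [classify(f) for f in (99, 73, 133, 0)]
```

For the four example flags, `labels` is `["both_mapped", "singleton", "unmapped", "unpaired"]`.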

Singletons: 708 302 reads (0.3 %). Other reads: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 134 917 015 reads (50 %). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with a separation that differs markedly from the expected one. This is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
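The arithmetic behind the expected separation is fragment length minus both read lengths. A minimal sketch follows; the function names and the `tolerance` parameter are hypothetical, and real aligners decide what counts as "expected" from the observed insert-size distribution rather than from a fixed tolerance.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

def separation_is_expected(observed_gap, expected_gap, tolerance):
    """True if the observed gap lies within `tolerance` of the expected gap."""
    return abs(observed_gap - expected_gap) <= tolerance

gap = expected_inner_distance(fragment_len=500, read_len=100)  # 500 - 2 * 100
```

Here `gap` is 300, matching the ~300 base pairs in the example above.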

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 263 608 506 reads (97.7 %). Other reads: 2.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data are analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
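The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the method used to produce this report: the tuple layout and function name are hypothetical, and production tools typically apply further criteria, such as read orientation and which copy to keep.

```python
def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    `reads` is an iterable of (name, chrom, pos, mate_chrom, mate_pos).
    The first read seen at a given coordinate key is kept; later reads
    with the same key are flagged.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, mate_chrom, mate_pos in reads:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Made-up reads, for illustration only.
reads = [
    ("r1", "1", 1000, "1", 1400),
    ("r2", "1", 1000, "1", 1450),  # same start, different mate position
    ("r3", "1", 1000, "1", 1400),  # same coordinates as r1
]
dups = mark_duplicates(reads)
duplicate_rate = len(dups) / len(reads)
```

Only `r3` is flagged here, because `r2` shares a start position with `r1` but its mate maps elsewhere.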

Duplicates: 17 333 119 reads (6.4 %). Non-duplicates: 93.6 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 * log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (20M to 240M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not. The unmapped read is then given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.
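A per-chromosome tally of this kind can be sketched as below. It assumes each record carries a chromosome name and a SAM-style FLAG in which bit 0x4 marks an unmapped read; the record layout and function name are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM-style FLAG bit (assumed to apply to this file)

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chrom, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Made-up records. The read with flag 133 is unmapped but still carries
# chromosome "1" because it inherits the coordinates of its mapped mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_mapped_rate(records)
```

With these records, chromosome "1" has three mapped reads out of four and chromosome "X" has one out of one.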

Chromosome  Mapped   Unmapped
1           99.73%   0.27%
2           99.72%   0.28%
3           99.72%   0.28%
4           99.72%   0.28%
5           99.72%   0.28%
6           99.72%   0.28%
7           99.73%   0.27%
8           99.72%   0.28%
9           99.72%   0.28%
10          99.72%   0.28%
11          99.72%   0.28%
12          99.73%   0.27%
13          99.71%   0.29%
14          99.72%   0.28%
15          99.72%   0.28%
16          99.77%   0.23%
17          99.75%   0.25%
18          99.71%   0.29%
19          99.78%   0.22%
20          99.73%   0.27%
21          99.72%   0.28%
X           99.73%   0.27%
Y           99.83%   0.17%
M           99.68%   0.32%