European Genome-Phenome Archive

File Quality

File Information: EGAF00000644698

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis represents the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis represents the number of bases (positions) that have that coverage.

The data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
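As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value and pools everything above a cap into a single overflow bin, similar to the ">1000" label on the chart. The function name, the `cap` parameter, and the example depths are hypothetical and are not taken from the EGA pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed ">cap".
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(example_depths)
```

Plotting the counts on a log-scaled y-axis would then give a curve of the kind described above.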

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect a distribution skewed towards larger (more confident) scores, typically around 40.
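The standard Phred definition is Q = -10 log10(P), so a score of 40 corresponds to an error probability of 1 in 10 000. A minimal sketch of the conversion in both directions follows; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)
```

The same conversion applies to mapping quality scores, where P is the probability that the read is placed at the wrong location.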

[Chart: base quality distribution. X-axis: Phred quality score (0 to 45). Y-axis: # bases (0G to 1G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
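The calculation can be sketched as below. The function name and the example counts are hypothetical; the report itself states only the mapped count and the resulting percentage.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total


# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=994, unmapped=6)
```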

Mapped: 99.4 % (57 931 539 reads). Other: 0.6 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; such pairs are not counted as proper pairs (see the Proper Pairs chart).

Both mates mapped: 99.1 % (57 783 808 reads). Other: 0.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
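Assuming the alignments are stored in SAM/BAM format (an assumption; the report does not state the file format here), the distinction between singletons and reads with both mates mapped can be read off the FLAG bits defined in the SAM specification. The sketch below is illustrative; the `classify` function and its category labels are hypothetical.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify(flag):
    """Classify a read by its SAM FLAG value (labels are illustrative)."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate is unmapped
    return "both_mapped"
```

For instance, a FLAG of 73 (paired, mate unmapped, first in pair) would be classified as a singleton, while 99 (paired, proper pair, mate on reverse strand, first in pair) would count towards both mates mapped.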

Singletons: 0.3 % (147 731 reads). Other: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (29 148 881 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation can be a signal for that variant, and those reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
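The expected separation in the example above is simply the fragment length minus the two read lengths. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Expected gap between the two mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length


# The example from the text: ~500 bp fragments with 100 bp reads from each end.
gap = expected_inner_distance(fragment_length=500, read_length=100)
```

A negative result would indicate that the two reads overlap, which happens when the fragment is shorter than twice the read length.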

Proper pairs: 98.9 % (57 639 854 reads). Other: 1.1 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
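The coordinate-based idea described above can be sketched as follows. This is a simplified illustration, not the method of any particular tool: real duplicate markers also take strand, clipping, and base quality into account. The tuple layout `(chrom, pos, mate_pos)` and the example data are hypothetical.

```python
def mark_duplicates(pairs):
    """Flag read pairs whose coordinates repeat an earlier pair.

    `pairs` is a list of (chrom, pos, mate_pos) tuples. The first pair seen
    at a given set of coordinates is kept; later ones are flagged True.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


# Hypothetical pairs, for illustration only.
example_pairs = [
    ("1", 1000, 1300),
    ("1", 1000, 1300),  # same coordinates as the first pair
    ("1", 1000, 1350),  # same start, different mate position
    ("2", 1000, 1300),  # different chromosome
]
dup_flags = mark_duplicates(example_pairs)
duplicate_rate = sum(dup_flags) / len(dup_flags)
```

At high depth, independent fragments are more likely to share coordinates by chance, which is why a scheme like this would flag some reads that are not true duplicates.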

Duplicates: 6.4 % (3 719 324 reads). Other: 93.6 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (5M to 45M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as in the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (autosomes, X, Y, M). Y-axis: percentage of reads (0% to 100%). Series: mapped, unmapped.]