European Genome-Phenome Archive

File Quality

File Information: EGAF00006165067

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered, and the y-axis shows the number of bases with that coverage.

The data is shown on a log scale to reduce the visual effect of the large variability in counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]
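As an illustration of how such a distribution can be tabulated, the following minimal sketch counts how many positions have each coverage value and groups everything above 1000 into a single bin, mirroring the ">1000" label on the chart. The function names and the input list of per-position depths are hypothetical; how the archive actually computes the chart is not stated here.

```python
from collections import Counter
import math

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin (cap + 1)."""
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

def log10_counts(hist):
    """Return the log10 of each bin count, as used for a log-scaled y-axis."""
    return {cov: math.log10(n) for cov, n in hist.items()}

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 0, 1, 1, 2, 30, 30, 31, 1500]
hist = coverage_histogram(depths)
for cov in sorted(hist):
    label = ">1000" if cov == 1001 else str(cov)
    print(label, hist[cov])
```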

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 65G).]
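To make the relationship between Phred score and error probability concrete, here is a small sketch using the standard Phred definition, Q = -10 × log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to the probability that the call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10 000 base calls.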

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (472 385 923 reads). Unmapped: 0.2 %.
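The rate calculation described above can be sketched as follows. The counts in the usage line are made-up round numbers, not values from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
print(f"{mapped_rate(998, 2):.1%}")
```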

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; see the Proper Pairs chart for pairs that meet the expected distance.

Both mates mapped: 99.7 % (471 993 160 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (392 763 reads). Other: 99.9 %.
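The distinction between "both mates mapped" and "singleton" can be sketched from the FLAG field of a SAM/BAM record, whose bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped) are defined in the SAM specification. That these statistics are derived from SAM flags is an assumption here, and the function name is hypothetical.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def pair_status(flag):
    """Classify one read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"

# Example flags: 99 = paired, proper pair, mate reverse, first in pair;
# 73 = paired, mate unmapped, first in pair; 133 = paired, read unmapped, second in pair.
for flag in (99, 73, 133):
    print(flag, pair_status(flag))
```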

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (236 755 097 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with a separation that differs markedly from the expected one are a signal for the variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).

Proper pairs: 97.6 % (462 208 036 reads). Other: 2.4 %.
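The fragment-length arithmetic from the description (a ~500 bp fragment with 100 bp reads at either end leaves ~300 bp between the reads) can be written out as below. The tolerance check is a deliberate simplification with a hypothetical threshold; real aligners decide what counts as a proper pair using their own criteria, including read orientation.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

def at_expected_distance(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is a hypothetical parameter chosen for illustration.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance

print(expected_inner_distance(500, 100))           # 300
print(at_expected_distance(320, 500, 100, 50))     # True
print(at_expected_distance(5000, 500, 100, 50))    # False
```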

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Duplicate-marking algorithms typically look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 51.6 % (244 318 689 reads). Non-duplicates: 48.4 %.
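The coordinate-based approach described above can be sketched as follows: pairs that share the same read and mate coordinates are grouped, and every pair after the first in a group is flagged. This is a simplified illustration with a hypothetical input format; production duplicate-marking tools apply additional criteria (for instance, which pair in a group to keep).

```python
def mark_duplicates(pairs):
    """Return the names of pairs whose coordinates repeat an earlier pair.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos) tuples,
    a hypothetical format used only for this sketch.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Made-up pairs: "b" repeats the coordinates of "a"; "c" differs in mate position.
pairs = [
    ("a", "1", 100, "1", 400),
    ("b", "1", 100, "1", 400),
    ("c", "1", 100, "1", 450),
]
dups = mark_duplicates(pairs)
print(dups, len(dups) / len(pairs))
```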

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so the distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (50M to 400M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped values range from 99.67% to 99.92%; unmapped values range from 0.08% to 0.33%.]
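A per-chromosome breakdown of this kind can be sketched by tallying reads by the chromosome recorded for them and by the "read unmapped" FLAG bit (0x4 in the SAM specification); an unmapped read with a mapped mate carries its mate's chromosome, as described above. The input format and function name are hypothetical.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # FLAG bit from the SAM specification

def per_chromosome_rates(records):
    """Return {chrom: (mapped_fraction, unmapped_fraction)}.

    `records` is an iterable of (chrom, flag) tuples, a hypothetical format
    used only for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & READ_UNMAPPED else 0] += 1
    return {
        chrom: (mapped / (mapped + unmapped), unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }

# Made-up records: three mapped reads and one unmapped read on "1", one mapped read on "X".
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_rates(records)
print(rates)
```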