European Genome-Phenome Archive

File Quality

File Information: EGAF00007836675

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of reference positions (bases) covered that many times.

The number of bases is plotted on a log scale to compress its wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 20M).]
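The tally behind a coverage distribution like this one can be sketched as follows. This is a minimal illustration, not the archive's actual pipeline: the function name `coverage_histogram`, the `cap` parameter, and the toy depths are all assumptions made for the example. Only the pooling of values above 1000 into a single bin mirrors the ">1000" tick on the chart axis.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Coverage values above `cap` are pooled into one overflow bin,
    stored under the key cap + 1 (the ">1000" bin when cap is 1000).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)

# Toy per-position depths (illustrative only, not taken from this file).
toy_depths = [0, 0, 1, 30, 30, 31, 1500]
hist = coverage_histogram(toy_depths)
```

Plotting the counts on a log scale then keeps both the tall central peak and the long, sparse tail visible in one chart.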

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
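As a quick check of the scale, the standard Phred definition Q = -10·log10(P) can be inverted to recover the error probability. The sketch below is illustrative only; the function names are hypothetical and not part of any archive tooling.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)  # one expected error in 10 000 base calls
p20 = phred_to_error_prob(20)  # one expected error in 100 base calls
```

Under this definition every 10 points on the Phred scale corresponds to a tenfold drop in error probability.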

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 120G).]

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
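The rate described above is a simple proportion. A minimal sketch follows; the function name and the counts in the example are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts chosen for illustration.
rate = mapped_rate(mapped=998, unmapped=2)
```

Note the grouping in the denominator: the unmapped reads are added to the mapped reads before dividing.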

Mapped reads: 99.8 % (937 506 159 reads); unmapped: 0.2 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart, which counts only pairs mapped at the expected distance.

Both mates mapped: 99.7 % (936 309 086 reads); remainder: 0.3 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
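In SAM/BAM alignments these categories can be read off the FLAG field. The sketch below assumes the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the `classify` function and its labels are hypothetical, and this is not necessarily how the archive computes its statistics.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Label a read by its own and its mate's mapping status."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "unpaired_mapped"

# Example flag values: 99 (paired, both mapped), 73 (paired, mate unmapped),
# 133 (paired, read itself unmapped).
labels = [classify(f) for f in (99, 73, 133)]
```

Counting reads per label and dividing by the total gives rates comparable to the singleton and both-mates-mapped percentages.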

Singletons: 0.1 % (1 197 073 reads); remainder: 99.9 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (469 571 471 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads that map with an unexpectedly large separation are a signal for the variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
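The ~300 base pair figure follows from subtracting both read lengths from the fragment length. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between two mates sequenced inward from each end of a fragment."""
    return fragment_length - 2 * read_length

# Values from the example in the text: ~500 bp fragments, 100 bp reads.
gap = expected_inner_distance(fragment_length=500, read_length=100)
```

Aligners typically tolerate some spread around this value, since real libraries contain a distribution of fragment lengths rather than a single size.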

Proper pairs: 98.1 % (921 324 270 reads); remainder: 1.9 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
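The coordinate-based approach described above can be sketched as follows. This is a simplified illustration under stated assumptions: the tuple layout (chromosome, position, strand for the read and for its mate) and the function name are hypothetical, and the first pair seen at a location is kept as the original, whereas production tools usually apply their own rule for choosing which copy to keep.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates match an earlier pair.

    Each pair is a tuple:
    (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Returns a list of booleans, True where the pair is a duplicate.
    """
    seen = set()
    is_duplicate = []
    for key in pairs:
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

# Toy pairs (illustrative only): the second repeats the first exactly,
# the third shares the read position but has a different mate position.
toy_pairs = [
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 450, "-"),
]
flags = mark_duplicates(toy_pairs)
duplicate_rate = sum(flags) / len(flags)
```

Because the key is purely positional, independent fragments that happen to share both coordinates are also flagged, which is why the rate rises with depth even in a good library.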

Duplicates: 38.9 % (364 896 605 reads); remainder: 61.1 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores, which describe the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (100M to 800M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample comes from a female, reads may map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
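A per-chromosome tally of this kind can be sketched from two fields of each alignment record: the reference name and the FLAG. The example below assumes the standard SAM unmapped bit (0x4) and relies on the convention described above, where an unmapped read carries its mate's chromosome. The function name and the toy records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: this read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: mapped / (mapped + unmapped)
            for c, (mapped, unmapped) in counts.items()}

# Toy records (illustrative only): three mapped reads and one unmapped
# read placed on chromosome 1, plus one mapped read on chromosome X.
toy_records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
rates = per_chromosome_mapped_rate(toy_records)
```

Reads with no chromosome assigned at all would need separate handling, since they cannot be attributed to any column of the chart.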

[Chart: mapped vs unmapped reads per chromosome (1 to 21, X, Y, M). Y-axis: 0% to 100%. Mapped: 99.79% to 99.89% per chromosome; unmapped: 0.11% to 0.21% per chromosome.]