European Genome-Phenome Archive

File Quality

File Information: EGAF00008201833

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of bases (positions in the reference file) covered that many times.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (1k to 20M).]
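As a rough illustration of how a distribution like this could be tabulated, the sketch below counts how many positions have each coverage value. The function name, the `cap` parameter, and the input depths are hypothetical; the overflow bin only mirrors the ">1000" bin on the chart's x-axis and is not taken from the archive's actual pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single overflow bin keyed
    `cap + 1` (an assumption made here to mimic a '>1000' bin).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(sorted(hist.items()))


# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 0, 1, 1, 2, 30, 30, 31, 1500, 2000]
hist = coverage_histogram(depths)
```

Plotting the counts in `hist` on a logarithmic y-axis would give a chart of the same kind as the one described above.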

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 160G). Non-zero bar values as extracted, in order: 4 257 901; 7 787 467 965; 10 682 333 001; 169 805 877 831.]
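To make the scale concrete, here is a minimal sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


p40 = phred_to_error_prob(40)   # about 1 error in 10,000 base calls
q = error_prob_to_phred(0.001)  # about 30
```

Under this definition, every 10 points on the Phred scale corresponds to a tenfold drop in the error probability.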

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped: 99.7 % (1 243 323 542 reads). Unmapped: 0.3 %.
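The rate calculation, mapped / (mapped + unmapped), can be sketched as follows. The function name and the example counts are hypothetical and are not values from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Hypothetical counts: 997 mapped reads and 3 unmapped reads.
rate = mapped_rate(997, 3)
```

Note that the denominator is the sum of both counts; dividing by the mapped count first and then adding the unmapped count would give a meaningless figure.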

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance, which the Proper Pairs chart excludes.

Both mates mapped: 99.5 % (1 240 108 716 reads). Other reads: 0.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (3 214 826 reads). Other reads: 99.7 %.
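In an alignment file, the singleton definition can be expressed with the FLAG bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The sketch below is one possible reading of that definition; the function names are illustrative, and the archive's own tooling may apply additional filters.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def is_singleton(flag):
    """Paired read that is mapped while its mate is unmapped."""
    return bool(flag & PAIRED) and not flag & UNMAPPED and bool(flag & MATE_UNMAPPED)


def both_mates_mapped(flag):
    """Paired read where neither the read nor its mate is unmapped."""
    return bool(flag & PAIRED) and not flag & (UNMAPPED | MATE_UNMAPPED)


# Example flags: 99 = mapped pair, 73 = mapped read with unmapped mate,
# 133 = the unmapped mate itself.
results = [(f, is_singleton(f), both_mates_mapped(f)) for f in (99, 73, 133)]
```

The counts in this report are consistent with that split: 1 240 108 716 reads with both mates mapped plus 3 214 826 singletons gives the 1 243 323 542 mapped reads reported above.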

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that library preparation generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (623 443 499 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).

Proper pairs: 98.2 % (1 224 522 878 reads). Other reads: 1.8 %.
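The ~300 base pair figure in the example above is simply the fragment length minus the two read lengths. A small sketch of that arithmetic, with a hypothetical function name:

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths.

    A negative result means the two reads overlap within the fragment.
    """
    return fragment_len - 2 * read_len


gap = expected_inner_distance(500, 100)  # the example from the text
```

How far a pair may deviate from this expectation and still count as proper depends on the aligner and its settings, which this report does not state.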

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 5.8 % (72 561 088 reads). Non-duplicates: 94.2 %.
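The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: the tuple layout and function name are hypothetical, and the sketch simply keeps the first pair seen at each set of coordinates, whereas real duplicate-marking tools apply their own rules for choosing which copy to keep.

```python
def mark_duplicates(pairs):
    """Flag pairs whose read and mate coordinates were already seen.

    Each pair is a tuple:
    (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Returns a list of booleans, True where the pair is a duplicate.
    """
    seen = set()
    is_dup = []
    for key in pairs:
        is_dup.append(key in seen)
        seen.add(key)
    return is_dup


# Hypothetical pairs: the second repeats the first exactly; the third
# shares the read position but has a different mate position.
pairs = [
    ("1", 1000, "+", "1", 1300, "-"),
    ("1", 1000, "+", "1", 1300, "-"),
    ("1", 1000, "+", "1", 1350, "-"),
    ("2", 500, "+", "2", 800, "-"),
]
dups = mark_duplicates(pairs)
dup_rate = sum(dups) / len(dups)
```

The third pair shows why the mate's coordinates matter: sharing only one end with another pair is not enough to be flagged.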

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (0.1G to 1.1G).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads may map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome, stacked columns. X-axis: chromosome (tick labels 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Mapped is between 99.73% and 99.76% in every column except the last (99.32%); unmapped is between 0.24% and 0.27% in every column except the last (0.68%).]
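A per-chromosome tally of this kind could be sketched as below, relying on the convention described above that an unmapped read with a mapped mate carries its mate's chromosome. The record layout (reference name, FLAG) and the function name are hypothetical; 0x4 is the "read unmapped" FLAG bit from the SAM specification.

```python
from collections import defaultdict

UNMAPPED = 0x4  # "read unmapped" FLAG bit from the SAM specification


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) records.

    Reads with no chromosome assigned (rname '*') are skipped, since
    they cannot be placed in any column of the chart.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# Hypothetical records: chromosome 1 holds one unmapped mate (flag 133)
# that inherited its chromosome from its mapped partner (flag 73).
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
rates = per_chromosome_mapped_rate(records)
```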