European Genome-Phenome Archive

File Quality

File Information: EGAF00005004057

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases (reference positions) that have that coverage.

The data is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: Base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: number of bases, log scale (2k to 100M).]
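As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value and pools everything above a cap into one overflow bin, similar to the ">1000" bin on the chart. The function name, the `cap` parameter, and the input depths are hypothetical; the archive's actual pipeline is not described in this report.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin stored under the
    key cap + 1, mirroring the ">1000" bin on the chart's x-axis.
    """
    histogram = Counter()
    for depth in depths:
        histogram[min(depth, cap + 1)] += 1
    return dict(sorted(histogram.items()))


# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 30, 31, 30, 1500, 2000]
example_histogram = coverage_histogram(example_depths)
```

Plotting the resulting counts on a logarithmic y-axis would give a chart of the same shape as the one described above.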

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: Base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 140G).]
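The relationship between a Phred score and an error probability can be sketched in a few lines, using the standard Phred definition Q = -10 log10(P). The function names are illustrative and not part of any EGA tooling.

```python
import math


def phred_to_error_probability(q):
    """Return the error probability P for a Phred score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Return the Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.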

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but a read carrying more significant variation can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (1 087 117 828 reads). Unmapped: 0.2 %.
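The ratio mapped / (mapped + unmapped) can be written as a small helper. This is a minimal sketch; the function name and the example counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a mapped rate from")
    return mapped / total


# Hypothetical counts, for illustration only.
example_rate = mapped_rate(mapped=998, unmapped=2)
```

The explicit check for an empty input avoids a division by zero on files that contain no reads.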

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance from each other are also counted here; those pairs are assessed separately in the Proper Pairs chart.

Both mates mapped: 99.7 % (1 085 759 770 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (1 358 058 reads). Other: 99.9 %.
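The difference between a read with both mates mapped and a singleton can be sketched with the flag bits defined in the SAM format (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). This assumes the alignments are stored as SAM/BAM records; the function and category names are illustrative, and the report does not state how its counts were actually produced.

```python
# Flag bits from the SAM format specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag):
    """Classify a read by its SAM flag into one of four categories."""
    if flag & READ_UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        # The read itself mapped but its mate did not.
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "unpaired"
```

For example, flag 99 (paired, proper pair, mate on reverse strand, first in pair) is classified as `both_mates_mapped`, while flag 73 (paired, mate unmapped, first in pair) is classified as `singleton`.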

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after the reads are mapped to the reference genome approximately 50% of them should map to the forward strand. Deviations from 50% may indicate problems with the library preparation step.

Forward strand: 50 % (544 505 579 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads spanning the variant map at an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.8 % (1 065 395 226 reads). Other: 2.2 %.
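The arithmetic behind the ~300 base pair example in this section is the fragment length minus the two read lengths. The sketch below shows that calculation together with a simple tolerance check; the function names and the `tolerance` value are assumptions made for illustration, since aligners decide what counts as a proper pair using their own criteria.

```python
def expected_separation(fragment_length, read_length):
    """Return the expected gap between the two mates of a fragment."""
    return fragment_length - 2 * read_length


def has_expected_separation(observed, expected, tolerance):
    """Return True if the observed gap is within `tolerance` of the expected gap."""
    return abs(observed - expected) <= tolerance


# Values from the example in the text: ~500 bp fragments, 100 bp reads.
gap = expected_separation(fragment_length=500, read_length=100)
```

With these inputs `gap` is 300, matching the figure quoted in the text.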

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As a result, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 5.3 % (57 964 989 reads). Non-duplicates: 94.7 %.
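The coordinate-based approach described in this section can be sketched as follows: pairs are grouped by the coordinates of the read and its mate, and every pair beyond the first in a group is counted as a duplicate. This is a simplified illustration with hypothetical names and data; real duplicate-marking tools also take strand, clipping, and other details into account.

```python
from collections import Counter


def count_duplicates(pairs):
    """Count pairs that share coordinates with an earlier pair.

    Each pair is a tuple (chromosome, read_position, mate_position).
    The first pair seen at a set of coordinates is kept as the original;
    every further pair at the same coordinates counts as a duplicate.
    """
    groups = Counter(pairs)
    return sum(count - 1 for count in groups.values())


# Hypothetical pairs, for illustration only.
example_pairs = [
    ("1", 100, 400),
    ("1", 100, 400),
    ("1", 100, 450),
    ("2", 100, 400),
    ("1", 100, 400),
]
duplicate_rate = count_duplicates(example_pairs) / len(example_pairs)
```

Here three pairs share the coordinates `("1", 100, 400)`, so two of them are counted as duplicates and the duplicate rate is 2 out of 5.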

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: Mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (100M to 900M).]
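One way to summarise a distribution like this is the fraction of reads at or above a chosen mapping quality. The sketch below assumes the distribution is available as a mapping from score to read count; the function name, the threshold, and the example counts are hypothetical and are not taken from this file.

```python
def fraction_at_or_above(histogram, threshold):
    """Return the fraction of reads whose mapping quality is >= threshold.

    `histogram` maps a mapping quality score to the number of reads with it.
    """
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("histogram contains no reads")
    passing = sum(count for score, count in histogram.items() if score >= threshold)
    return passing / total


# Hypothetical distribution, for illustration only.
example_histogram = {0: 5, 30: 5, 60: 90}
well_mapped = fraction_at_or_above(example_histogram, threshold=30)
```

With the example counts, 95 of 100 reads have a score of 30 or more, so `well_mapped` is 0.95.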

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It presents the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

Mapped vs unmapped reads per chromosome:

Chromosome  Mapped   Unmapped
1           99.89%   0.11%
2           99.86%   0.14%
3           99.89%   0.11%
4           99.89%   0.11%
5           99.88%   0.12%
6           99.88%   0.12%
7           99.88%   0.12%
8           99.88%   0.12%
9           99.88%   0.12%
10          99.88%   0.12%
11          99.88%   0.12%
12          99.88%   0.12%
13          99.89%   0.11%
14          99.88%   0.12%
15          99.87%   0.13%
16          99.87%   0.13%
17          99.86%   0.14%
18          99.88%   0.12%
19          99.85%   0.15%
20          99.87%   0.13%
21          99.85%   0.15%
X           99.88%   0.12%
Y           99.63%   0.37%
M           99.57%   0.43%