European Genome-Phenome Archive

File Quality

File Information: EGAF00004997244

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
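As a rough illustration of how such a distribution could be tallied, the sketch below counts how many positions have each coverage value and pools everything above 1000 into one bin, mirroring the ">1000" label on the chart. The function name, the input list of per-position depths, and the pooling rule are assumptions made for the example, not the archive's actual pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    `depths` is a hypothetical list with one coverage value per reference
    position. The pooled bin is stored under the key `cap + 1`.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))


# Toy input: seven positions, one of them with very high coverage.
depths = [0, 0, 1, 30, 30, 31, 2500]
hist = coverage_histogram(depths)
print(hist)
```

Plotting the counts on a logarithmic y-axis would then give a chart of the kind described above.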

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: number of bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
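To make the scale concrete, here is a minimal sketch of the conversion in both directions, using the standard Phred definition Q = -10 * log10(P). The function names are chosen for illustration only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to a 1 in 10 000 chance of error,
# and an error probability of 0.001 corresponds to a score of 30.
p40 = phred_to_error_prob(40)
q_for_1_in_1000 = error_prob_to_phred(0.001)
```

Each step of 10 on the Phred scale therefore means a tenfold drop in the error probability.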

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 90G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
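The calculation can be sketched as follows. The function name and the toy counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Toy counts: 999 mapped reads and 1 unmapped read.
rate = mapped_rate(999, 1)
print(f"{rate:.1%}")
```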

Mapped reads: 659 922 156 (99.9 %); unmapped: 0.1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected distance is the criterion covered by the Proper Pairs chart.

Both mates mapped: 659 448 556 reads (99.8 %); remainder: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
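One way to tell these categories apart, assuming the alignments are stored in SAM/BAM format (an assumption, since the report does not state the file format), is to inspect the FLAG bits defined in the SAM specification. The sketch below is illustrative only; the function name and category labels are invented for the example.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify_read(flag):
    """Classify a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"


# 99 = paired, proper pair, mate on reverse strand, first in pair.
# 73 = paired, mate unmapped, first in pair.
# 69 = paired, read unmapped, first in pair.
labels = [classify_read(f) for f in (99, 73, 69, 0)]
print(labels)
```

Counting the reads in each category and dividing by the total would give rates comparable to those shown in the charts.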

Singletons: 473 600 reads (0.1 %); remainder: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 330 241 147 reads (50 %); remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. In particular, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with a separation far from the expected one would be a signal for this variant, and would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
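The ~300 base pair figure in the example above is simply the fragment length minus the two read lengths. The sketch below reproduces that arithmetic and adds a simple distance check. The tolerance parameter is a hypothetical stand-in, since real aligners typically estimate the acceptable range from the data, and the check ignores read orientation.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read from either end."""
    return fragment_len - 2 * read_len


def within_expected_distance(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expectation."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance


gap = expected_gap(500, 100)                            # 500 - 2 * 100
close = within_expected_distance(320, 500, 100, 50)     # near the expectation
distant = within_expected_distance(5000, 500, 100, 50)  # far from it
```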

Proper pairs: 644 231 316 reads (97.5 %); remainder: 2.5 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
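The coordinate-based approach described above can be sketched in a few lines. This is a deliberately simplified illustration: it keys each pair on a hypothetical (chromosome, position, mate chromosome, mate position) tuple and ignores strand, clipping, and base quality, which production tools take into account.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates were already seen.

    `pairs` is an iterable of hypothetical
    (chrom, pos, mate_chrom, mate_pos) tuples. The first pair at a given
    set of coordinates is kept; later ones are flagged as duplicates.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


def duplicate_rate(flags):
    """Fraction of pairs flagged as duplicates."""
    return sum(flags) / len(flags) if flags else 0.0


pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates for both mates: duplicate
    ("1", 100, "1", 450),  # mate maps elsewhere: not a duplicate
    ("2", 100, "2", 400),
]
flags = mark_duplicates(pairs)
rate = duplicate_rate(flags)
```

The sketch also shows why deep sequencing inflates the rate: two independent fragments that happen to share both coordinates are indistinguishable from a true PCR duplicate under this rule.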

Duplicates: 46 878 885 reads (7.1 %); remainder: 92.9 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (50M to 550M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
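A per-chromosome tally of this kind could be sketched as below. The input is a hypothetical list of (chromosome, is_unmapped) records, where an unmapped read carries the chromosome of its mate as described above. The record layout and function name are assumptions made for the example.

```python
from collections import defaultdict


def per_chromosome_mapped_rate(records):
    """Return {chromosome: fraction of its reads that are mapped}."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, is_unmapped in records:
        counts[chrom][1 if is_unmapped else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Toy records: chromosome 1 has three mapped reads and one unmapped read
# placed there because its mate mapped to chromosome 1.
records = [("1", False)] * 3 + [("1", True), ("X", False)]
rates = per_chromosome_mapped_rate(records)
print(rates)
```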

[Chart: stacked columns of mapped and unmapped reads per chromosome (x-axis labels 1 to 21, X, Y, M; y-axis 0% to 100%). Mapped fractions range from 99.73% to 99.94%, unmapped fractions from 0.06% to 0.27%.]