European Genome-Phenome Archive

File Quality

File Information

EGAF00008064904

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The counts are plotted on a log scale to compress their wide range. A high peak at the beginning (low coverage) followed by a descending curve is expected.
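As a rough illustration of how such a distribution could be tabulated, the sketch below counts positions per coverage value and log-scales the counts. The function names, the `cap` parameter and the sample depths are hypothetical and are not taken from the EGA pipeline; the overflow bin only mirrors the ">1000" tick on the chart.

```python
from collections import Counter
import math

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed as cap + 1,
    mirroring a ">1000" bucket (the cap value is an assumption).
    """
    hist = Counter()
    for d in depths:
        hist[min(d, cap + 1)] += 1
    return dict(hist)

def log10_counts(hist):
    """Log-scale the counts for plotting; every stored bin has a count >= 1."""
    return {cov: math.log10(n) for cov, n in hist.items()}

# Hypothetical per-position depths, not values from this file.
depths = [0, 0, 0, 1, 1, 30, 30, 30, 31, 1500, 2000]
hist = coverage_histogram(depths)
print(hist)
```

Pooling the tail into one bin keeps a handful of extremely deep positions from stretching the x-axis.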

Chart: base coverage distribution. X-axis: coverage value (ticks from 100 to >1000). Y-axis: # bases (log scale, 1k to 20M).

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
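The relationship between score and error probability can be checked with a small sketch that uses the standard Phred definition, Q = -10 log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (p must be > 0)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    # Each step of 10 in Q divides the error probability by 10.
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one wrong base call in 10 000.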

Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 160G).

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
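A minimal sketch of the rate calculation, assuming each read is represented only by its SAM FLAG value and that secondary and supplementary alignments have already been excluded. The function name and the sample flags are hypothetical.

```python
UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM flags."""
    mapped = unmapped = 0
    for flag in flags:
        if flag & UNMAPPED:
            unmapped += 1
        else:
            mapped += 1
    total = mapped + unmapped
    return mapped / total if total else 0.0

# Hypothetical flags: three mapped reads and one unmapped read.
print(mapped_rate([99, 147, 73, 133]))  # 0.75
```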

Mapped: 99.8 % (1 297 234 528 reads). Unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance from each other are also counted here; the Proper Pairs chart covers the expected-distance criterion.

Both mates mapped: 99.6 % (1 295 005 304 reads). Other: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
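The distinction between singletons and reads with both mates mapped can be sketched from the SAM FLAG bits. This is an illustration under the assumption that only primary alignments are considered; the `classify` function and its category names are hypothetical.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify one read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"   # mapped read whose mate did not map
    return "both_mapped"

# Hypothetical flags: a mapped read with a mapped mate, a singleton, and its unmapped mate.
for flag in (99, 73, 133):
    print(flag, classify(flag))
```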

Singletons: 0.2 % (2 229 224 reads). Other: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
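A short sketch of the strand fraction, assuming reads are given as SAM FLAG values and that the fraction is taken over mapped reads only (the denominator used by the report is not stated, so this is an assumption). The function name is illustrative.

```python
UNMAPPED = 0x4   # read is unmapped
REVERSE = 0x10   # read is aligned to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: two forward reads and two reverse reads.
print(forward_fraction([99, 147, 83, 163]))  # 0.5
```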

Forward strand: 50 % (650 064 308 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with an unexpected separation are a signal of that variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.
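The separation in the example above is simple arithmetic, sketched here together with a check of the SAM proper-pair bit. The aligner decides when to set that bit using its own insert-size model, so this sketch only reads it; the function names are hypothetical.

```python
PROPER_PAIR = 0x2  # SAM flag bit set by the aligner for properly paired reads

def expected_gap(fragment_len, read_len):
    """Unsequenced bases between the two mates of one fragment."""
    return fragment_len - 2 * read_len

def is_proper_pair(flag):
    """True if the aligner marked the read as part of a proper pair."""
    return bool(flag & PROPER_PAIR)

print(expected_gap(500, 100))  # 300
```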

Proper pairs: 98.1 % (1 275 896 332 reads). Other: 1.9 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
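The coordinate-based idea described above can be sketched as follows. This is a deliberate simplification, assuming each read is a tuple of name, chromosome, position, mate chromosome and mate position; production duplicate markers also take orientation and clipping into account. All names and sample reads are hypothetical.

```python
def mark_duplicates(reads):
    """Return names of reads whose coordinates and mate coordinates repeat.

    reads: iterable of (name, chrom, pos, mate_chrom, mate_pos) tuples.
    The first read seen at a given key is kept; later ones are duplicates.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in reads:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical reads: r2 repeats r1; r3 differs in its mate position.
reads = [
    ("r1", "1", 100, "1", 350),
    ("r2", "1", 100, "1", 350),
    ("r3", "1", 100, "1", 400),
    ("r4", "2", 100, "2", 350),
]
dups = mark_duplicates(reads)
print(dups, len(dups) / len(reads))
```

Requiring the mate coordinates to match as well is what keeps r3 from being flagged, and it reduces false positives at high depth.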

Duplicates: 4.3 % (55 790 859 reads). Other: 95.7 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
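One way to summarise such a distribution is to report the share of ambiguous reads (score 0) and the share at or above a chosen score. The sketch below assumes the histogram is a plain mapping from score to read count; the threshold of 30 and the sample counts are arbitrary illustrations, not values from this report.

```python
def mapq_summary(hist, threshold=30):
    """Summarise a {mapping quality: read count} histogram.

    Returns the fraction of reads with score 0 (multi-mapping) and the
    fraction with a score at or above `threshold`.
    """
    total = sum(hist.values())
    if total == 0:
        return {"ambiguous": 0.0, "confident": 0.0}
    ambiguous = hist.get(0, 0) / total
    confident = sum(n for q, n in hist.items() if q >= threshold) / total
    return {"ambiguous": ambiguous, "confident": confident}

# Hypothetical histogram with most reads at the top score.
print(mapq_summary({0: 5, 20: 5, 60: 90}))
```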

Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (0.1G to 1.2G).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when its mate is aligned and it is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
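A per-chromosome tally along these lines could look like the sketch below, assuming each record is a pair of reference name and SAM FLAG, with unmapped reads carrying the reference name of their mapped mate as described above. The function name and sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped

def per_chromosome_rates(records):
    """Return {chromosome: (mapped fraction, unmapped fraction)}.

    records: iterable of (reference name, SAM flag) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (mapped / (mapped + unmapped), unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }

# Hypothetical records: the unmapped read (flag 133) is counted under X
# because it carries the reference name of its mapped mate (flag 73).
records = [("X", 99), ("X", 147), ("X", 73), ("X", 133), ("Y", 0)]
print(per_chromosome_rates(records))
```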

Chromosome  Mapped   Unmapped
1           99.83%   0.17%
2           99.82%   0.18%
3           99.85%   0.15%
4           99.85%   0.15%
5           99.84%   0.16%
6           99.84%   0.16%
7           99.84%   0.16%
8           99.84%   0.16%
9           99.83%   0.17%
10          99.83%   0.17%
11          99.83%   0.17%
12          99.84%   0.16%
13          99.85%   0.15%
14          99.83%   0.17%
15          99.83%   0.17%
16          99.82%   0.18%
17          99.81%   0.19%
18          99.84%   0.16%
19          99.79%   0.21%
20          99.82%   0.18%
21          99.8%    0.2%
X           99.83%   0.17%
Y           99.56%   0.44%
M           99.28%   0.72%