European Genome-Phenome Archive

File Quality

File Information: EGAF00008051959

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale so that the wide range of values remains readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: number of bases (y-axis "# Bases", log scale, 1k to 100M) against coverage value (x-axis "Coverage value", up to >1000).]
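As a rough illustration of how such a distribution can be built, the sketch below counts how many positions have each coverage value. The input list of per-position depths and the `cap` parameter (which pools very deep positions into one final bin, similar to the ">1000" bin on the chart) are assumptions made for this example, not a description of the pipeline used by the archive.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    depths: iterable of per-position read depths (hypothetical input).
    cap: depths above this value are pooled into one bin keyed cap + 1.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))


# Toy depths chosen for illustration, not taken from this file.
toy_depths = [0, 0, 1, 30, 30, 31, 29, 30, 1500]
hist = coverage_histogram(toy_depths)
for coverage, n_bases in hist.items():
    label = f">{1000}" if coverage > 1000 else str(coverage)
    print(label, n_bases)
```

Plotting `n_bases` on a log axis then gives a chart of the same shape as the one described above.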

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly called, for example because of a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.

[Chart: number of bases (y-axis "# Bases", 0G to 160G) against Phred quality score (x-axis, 0 to 40).]
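The relationship between a Phred score and an error probability can be checked with a short sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to one expected error in 10 000 base calls.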

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (1 170 954 793 reads). Unmapped: 0.2 %.
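The rate described above is a simple proportion, and the parentheses matter: it is mapped divided by the sum of mapped and unmapped reads. A minimal sketch, with toy counts that are not taken from this file:

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Toy counts chosen for illustration only.
rate = mapped_rate(mapped=998, unmapped=2)
print(f"{rate:.1%}")
```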

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart.

Both mates mapped: 99.7 % (1 169 336 304 reads). Remaining reads: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls within the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (1 618 489 reads). Remaining reads: 99.9 %.
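In SAM/BAM files, the distinction between a singleton and a read whose mate is also mapped can be read from the FLAG field. The sketch below uses the FLAG bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the category names and the function itself are assumptions for this example, and whether the report is computed exactly this way is not stated in the source.

```python
PAIRED = 0x1         # template has multiple segments
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped


def classify_read(flag):
    """Classify a read by its SAM FLAG value (illustrative categories)."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"


# Toy FLAG values: 99 and 147 form a fully mapped pair,
# 73 is a mapped read whose mate (133) is unmapped.
for flag in (99, 147, 73, 133):
    print(flag, classify_read(flag))
```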

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (586 393 494 reads). Remaining reads: 50 %.
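A forward-strand fraction can be derived from SAM FLAG values, where bit 0x10 marks a read aligned to the reverse strand and bit 0x4 marks an unmapped read. The sketch below is one possible way to compute it; restricting the denominator to mapped reads is an assumption of this example.

```python
UNMAPPED = 0x4   # this read is unmapped
REVERSE = 0x10   # read aligned to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads aligned to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Toy FLAG values: 99 and 163 are forward, 147 and 83 are reverse,
# 77 is unmapped and therefore ignored.
print(forward_fraction([99, 147, 83, 163, 77]))
```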

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it map with an unexpected separation. This is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.8 % (1 147 318 890 reads). Remaining reads: 2.2 %.
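The arithmetic behind the 500/100/300 example can be written out explicitly: the gap between the two reads is the fragment length minus both read lengths. In the sketch below, the `tolerance` parameter is a hypothetical threshold introduced for illustration; aligners typically derive the acceptable range from the observed fragment size distribution, and the orientation check mentioned above is not modelled here.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates when both ends of a fragment are read."""
    return fragment_len - 2 * read_len


def within_expected(observed_inner, expected_inner, tolerance):
    """True if the observed gap is within +/- tolerance of the expected gap.

    tolerance is a hypothetical cut-off used only for this illustration.
    """
    return abs(observed_inner - expected_inner) <= tolerance


expected = expected_inner_distance(fragment_len=500, read_len=100)
print(expected)
print(within_expected(320, expected, tolerance=50))
print(within_expected(5000, expected, tolerance=50))
```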

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. Consequently, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 4.6 % (53 878 330 reads). Non-duplicates: 95.4 %.
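The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration: the record layout is hypothetical, the first pair seen at a given set of coordinates is kept, and everything after it is marked as a duplicate. Production tools apply further rules (for example, choosing which copy to keep by base quality) that are not modelled here.

```python
def find_duplicates(pairs):
    """Return names of pairs whose coordinates were already seen.

    pairs: iterable of tuples
        (name, chrom, pos, strand, mate_chrom, mate_pos, mate_strand)
    This layout is a hypothetical one chosen for the example.
    """
    seen = set()
    duplicates = []
    for name, *coords in pairs:
        key = tuple(coords)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Toy pairs: r1 and r3 share read and mate coordinates.
toy_pairs = [
    ("r1", "1", 1000, "+", "1", 1400, "-"),
    ("r2", "1", 1000, "+", "1", 1450, "-"),
    ("r3", "1", 1000, "+", "1", 1400, "-"),
]
dups = find_duplicates(toy_pairs)
print(dups, len(dups) / len(toy_pairs))
```

Note that r2 starts at the same position as r1 but is not marked, because its mate maps elsewhere. This is why both mates are compared.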

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: number of reads (y-axis "# Reads", 0.1G to 1G) against Phred quality score (x-axis, 0 to 60).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is then given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped and unmapped read percentages (y-axis, 0% to 100%) per chromosome (x-axis: 1 to 21, X, Y, M). Mapped values range from 99.63% to 99.87%; unmapped values range from 0.13% to 0.37%.]
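Per-chromosome percentages like these can be derived from per-reference read counts. The sketch below parses text in the four-column, tab-separated layout produced by `samtools idxstats` (reference name, sequence length, mapped count, unmapped count) and computes the mapped percentage for each reference. Using `idxstats` as the source of the counts is an assumption of this example, not something the report states, and the toy counts are invented.

```python
def mapped_percent_per_reference(idxstats_text):
    """Map each reference name to its percentage of mapped reads.

    Expects tab-separated lines: name, length, mapped, unmapped.
    The "*" line (reads with no assigned reference) is skipped.
    """
    result = {}
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, unmapped = line.split("\t")
        mapped, unmapped = int(mapped), int(unmapped)
        total = mapped + unmapped
        if name == "*" or total == 0:
            continue
        result[name] = 100.0 * mapped / total
    return result


# Toy counts chosen for illustration, not taken from this file.
toy = "1\t1000\t9987\t13\nM\t100\t9963\t37\n*\t0\t0\t500\n"
for name, pct in mapped_percent_per_reference(toy).items():
    print(f"{name}: {pct:.2f}% mapped, {100 - pct:.2f}% unmapped")
```

The unmapped counts on named references in such a table are the reads that were placed next to a mapped mate, as described above.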