European Genome-Phenome Archive

File Quality

File Information: EGAF00008050152

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The data are plotted on a log scale so that the wide range of values remains readable. A high peak at the start (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (ticks from 100 to >1000). Y-axis: # bases (log scale, 2k to 20M).]
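As a rough illustration of how a distribution like this could be tabulated, the sketch below counts how many positions share each coverage value and then log-scales the counts. The function names, the `cap` parameter and the sample depths are assumptions made for this example; they are not taken from the EGA pipeline or from this file.

```python
from collections import Counter
import math


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # The key cap + 1 stands for the ">cap" bin (">1000" on the chart).
        hist[min(depth, cap + 1)] += 1
    return hist


def log10_counts(hist):
    """Log-scale the bin counts, mirroring the chart's logarithmic y-axis."""
    return {cov: math.log10(n) for cov, n in hist.items() if n > 0}


# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 30, 30, 30, 31, 1500, 2000]
hist = coverage_histogram(depths)
scaled = log10_counts(hist)
```

Here `hist[30]` holds the number of positions covered exactly 30 times, and `hist[1001]` collects every position covered more than 1000 times.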

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 160G).]
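To make the scale concrete, the following minimal sketch converts between a Phred score and an error probability using the standard Phred definition, Q = -10·log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that the call is wrong, given Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly 1 error in 10 000 calls,
# and a score of 30 to roughly 1 error in 1 000.
p40 = phred_to_error_prob(40)
q_for_1_in_1000 = error_prob_to_phred(0.001)
```

Every step of 10 in the score therefore divides the error probability by 10.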

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.9 % (1 203 397 837 reads). Unmapped: 0.1 %.
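The rate can be reproduced with a one-line calculation. In the sketch below, the mapped count is the one reported above, while the unmapped count is a hypothetical value chosen only to be consistent with the reported 0.1 %; the report does not state it.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


mapped = 1_203_397_837   # from the chart above
unmapped = 1_200_000     # assumption: not given in the report
rate_percent = round(100 * mapped_rate(mapped, unmapped), 1)
```

Note the parentheses in the denominator: the rate divides by all reads, not by the mapped reads alone.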

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart, by contrast, excludes them.

Both mates mapped: 99.8 % (1 202 155 918 reads). Remaining reads: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (1 241 919 reads). Remaining reads: 99.9 %.
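One way to separate these categories, assuming the alignments are stored in SAM/BAM format, is to inspect the standard SAM flag bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The sketch below is an illustration under that assumption, not the method used to build this report; the function name and category labels are made up for the example.

```python
# Standard SAM flag bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify(flag):
    """Classify one read by its SAM flag."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"      # this read maps, its mate does not
    return "both_mapped"


# Example flags: 99 = paired and mapped with a mapped mate,
# 73 = paired and mapped with an unmapped mate, 133 = paired and unmapped.
labels = [classify(f) for f in (99, 73, 133)]

# The counts reported on this page are consistent with
# mapped reads = both mates mapped + singletons.
assert 1_202_155_918 + 1_241_919 == 1_203_397_837
```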

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (602 406 793 reads). Remaining reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it map with an unexpected separation. This is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.5 % (1 174 829 546 reads). Remaining reads: 2.5 %.
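The ~300 base pair figure in the example follows from subtracting both read lengths from the fragment length. The sketch below spells out that arithmetic and adds a deliberately simple distance check. The `tolerance` parameter is a hypothetical stand-in: how an aligner actually decides what counts as the expected distance is not described on this page.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when each end of the fragment is read inward."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed_gap, expected_gap, tolerance):
    """Hypothetical check: is the observed gap within `tolerance` bp of the expected gap?"""
    return abs(observed_gap - expected_gap) <= tolerance


gap = expected_inner_distance(500, 100)
print(gap)  # 300, the separation quoted in the text
```

With an illustrative tolerance of 50 bp, a pair separated by 320 bp would pass this check, whereas a pair separated by 5 000 bp would not and could hint at a structural variant.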

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. Consequently, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 10.3 % (123 675 033 reads). Remaining reads: 89.7 %.
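The coordinate-based idea described above can be sketched in a few lines. This is a simplified illustration, not the algorithm of any particular duplicate-marking tool: it ignores strand, base quality and clipping, and the record layout and function name are assumptions made for the example.

```python
def mark_duplicates(pairs):
    """Flag read pairs whose own and mate coordinates repeat an earlier pair.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos).
    The first pair seen at a given coordinate key is kept; later ones
    are returned as duplicates.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical read pairs, for illustration only.
pairs = [
    ("r1", "1", 1000, "1", 1300),
    ("r2", "1", 1000, "1", 1300),   # same coordinates as r1
    ("r3", "1", 1000, "1", 1350),   # same start, different mate position
    ("r4", "2", 1000, "2", 1300),
]
dups = mark_duplicates(pairs)
```

Because the key is purely positional, two independent fragments that happen to share coordinates are flagged too, which is the depth effect the paragraph above warns about.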

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (0.1G to 1G).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for every chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

Mapped vs unmapped reads per chromosome (chart y-axis: 0% to 100%):

Chromosome   Mapped    Unmapped
1            99.9%     0.1%
2            99.89%    0.11%
3            99.91%    0.09%
4            99.91%    0.09%
5            99.91%    0.09%
6            99.9%     0.1%
7            99.9%     0.1%
8            99.9%     0.1%
9            99.9%     0.1%
10           99.9%     0.1%
11           99.9%     0.1%
12           99.9%     0.1%
13           99.91%    0.09%
14           99.9%     0.1%
15           99.9%     0.1%
16           99.89%    0.11%
17           99.89%    0.11%
18           99.91%    0.09%
19           99.87%    0.13%
20           99.89%    0.11%
21           99.87%    0.13%
X            99.89%    0.11%
Y            99.85%    0.15%
M            99.64%    0.36%
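A per-chromosome breakdown like this one could be tallied as sketched below, assuming SAM/BAM input in which an unmapped read with a mapped mate carries its mate's chromosome, and using the standard SAM flag bit 0x4 (read unmapped). The function name and the sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM flag bit: read unmapped


def per_chromosome_rates(records):
    """Return {chrom: (mapped %, unmapped %)} from (chrom, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (100 * m / (m + u), 100 * u / (m + u))
        for chrom, (m, u) in counts.items()
    }


# Hypothetical records: the read with flag 133 is unmapped but is
# listed under chromosome "1" because that is where its mate aligned.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_rates(records)
```

In this toy input, chromosome 1 ends up with 75 % mapped and 25 % unmapped reads, and the two percentages for each chromosome always sum to 100, as in the stacked columns.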