European Genome-Phenome Archive

File Quality

File Information: EGAF00007988944

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
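As an illustration of what such a distribution counts, the following is a minimal sketch that tallies how many positions have each coverage value. The function name, the `cap` parameter and the input depths are assumptions made for this example, not part of the EGA report pipeline; the overflow bin mirrors the ">1000" tick on the chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths greater than `cap` are pooled into a single overflow bin,
    stored under the key cap + 1 (hypothetical convention for this sketch).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))


# Hypothetical per-position depths for a tiny reference
depths = [0, 0, 1, 30, 30, 31, 2500]
print(coverage_histogram(depths))
```

Plotting the resulting counts with a logarithmic y-axis gives a chart of the kind described above.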

[Chart: base coverage distribution. X-axis: Coverage value (ticks from 100 to >1000); y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
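The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10·log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to one expected error in 10,000 base calls.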

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 110G).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are accepted but, where the variation is more significant, the read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
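The rate described above is a simple proportion. A minimal sketch follows; the function name and the read counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 998 mapped reads and 2 unmapped reads
print(f"{mapped_rate(998, 2):.1%}")
```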

99.8 % mapped (834 702 411 reads); 0.2 % unmapped

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected-distance criterion is covered by the Proper Pairs chart.

99.7 % both mates mapped (833 646 700 reads); 0.3 % other

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome, but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome; one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
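If the alignments are stored in SAM/BAM format, the singleton definition above can be expressed with the standard SAM flag bits. This is a sketch under that assumption; the page does not state how EGA computes the figure, and the function names are made up for this example.

```python
# Standard SAM flag bits (from the SAM specification)
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def is_singleton(flag):
    """Mapped read in a pair whose mate is unmapped."""
    return bool(flag & PAIRED) and not flag & UNMAPPED and bool(flag & MATE_UNMAPPED)


def both_mates_mapped(flag):
    """Read in a pair where this read and its mate are both mapped."""
    return bool(flag & PAIRED) and not flag & UNMAPPED and not flag & MATE_UNMAPPED


for flag in (73, 99, 133):
    print(flag, is_singleton(flag), both_mates_mapped(flag))
```

Flag 73 (paired, mate unmapped, first in pair) is a singleton, 99 has both mates mapped, and 133 is the unmapped mate itself, so it falls in neither category.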

0.1 % singletons (1 055 711 reads); 99.9 % other

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

50 % forward strand (417 982 229 reads); 50 % other

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it will map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
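The ~300 base pair figure in the example above is the fragment length minus the two read lengths. A minimal sketch of that arithmetic, with a function name chosen for illustration:

```python
def expected_inner_distance(fragment_len, read_len):
    """Expected gap between the two mates of a pair, in base pairs."""
    return fragment_len - 2 * read_len


# The example from the text: ~500 bp fragments, 100 bp reads from each end
print(expected_inner_distance(500, 100))
```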

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

97.9 % proper pairs (818 142 588 reads); 2.1 % other

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data are analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate, and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
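The coordinate-based approach described above can be sketched as a toy example: a pair is flagged as a duplicate if an earlier pair has the same coordinates for both the read and its mate. The tuple layout and the function name are assumptions for this sketch. Real duplicate-marking tools apply further criteria, such as choosing which copy to keep, that are not modelled here.

```python
def mark_duplicates(pairs):
    """Flag each pair that repeats the coordinates of an earlier pair.

    Each pair is a tuple:
    (chrom, pos, strand, mate_chrom, mate_pos, mate_strand)
    The first pair seen at a given set of coordinates is kept as the original.
    """
    seen = set()
    is_dup = []
    for key in pairs:
        is_dup.append(key in seen)
        seen.add(key)
    return is_dup


# Hypothetical alignments
pairs = [
    ("1", 1000, "+", "1", 1300, "-"),
    ("1", 1000, "+", "1", 1300, "-"),  # same coordinates for read and mate
    ("1", 1000, "+", "1", 1350, "-"),  # mate maps elsewhere: not a duplicate
    ("2", 500, "-", "2", 200, "+"),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)
```

The sketch also shows why the rate grows with depth: the more pairs are sampled, the more likely two independent fragments are to share the same coordinates.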

8.5 % duplicates (70 821 091 reads); 91.5 % other

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (100M to 700M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%).]

Chromosome  Mapped   Unmapped
1           99.88%   0.12%
2           99.86%   0.14%
3           99.88%   0.12%
4           99.88%   0.12%
5           99.88%   0.12%
6           99.88%   0.12%
7           99.88%   0.12%
8           99.88%   0.12%
9           99.87%   0.13%
10          99.87%   0.13%
11          99.87%   0.13%
12          99.88%   0.12%
13          99.89%   0.11%
14          99.87%   0.13%
15          99.87%   0.13%
16          99.87%   0.13%
17          99.87%   0.13%
18          99.88%   0.12%
19          99.86%   0.14%
20          99.87%   0.13%
21          99.85%   0.15%
X           99.87%   0.13%
Y           99.8%    0.2%
M           99.74%   0.26%