European Genome-Phenome Archive

File Quality

File Information: EGAF00003613383

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (100 to >1000); y-axis: # bases (log scale, 1k to 100M).]
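A minimal sketch of how such a distribution could be tallied from per-position depths. The function name, the `cap` parameter and the example depths are assumptions made for illustration; this is not the archive's own pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed `cap + 1`,
    mirroring the ">1000" bin on the chart (an assumption about its meaning).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)


# Hypothetical per-position depths, not taken from this file.
example_depths = [1, 1, 1, 2, 2, 5, 1500, 2000]
print(coverage_histogram(example_depths))
```

Plotting the resulting counts on a logarithmic y-axis gives the shape described above: a high peak at low coverage followed by a long descending tail.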

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # bases (0G to 1.8G).]
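As a small illustration of the standard Phred definition, Q = -10·log10(P), the sketch below converts between scores and error probabilities. The function names are chosen here for illustration only.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10,000 base calls, and each additional 10 points lowers the error probability tenfold.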

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted but, for more significant variation, the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.9 % (18 550 373 reads). Unmapped: 0.1 %.
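The rate calculation can be written out as a short sketch. The counts in the example are hypothetical, since the page reports the mapped count and the percentages but not the unmapped count.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Hypothetical counts chosen for illustration only.
print(f"{mapped_rate(999, 1):.1%}")
```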

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart excludes them.

Both mates mapped: 99.7 % (18 524 316 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (26 057 reads). Other: 99.9 %.
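A minimal sketch of how a read could be sorted into these categories, assuming the alignments carry standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function and category names are illustrative and are not taken from the archive's pipeline.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify(flag):
    """Return a category name for one read, given its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


for flag in (99, 73, 77):
    print(flag, classify(flag))
```

With this partition, singletons plus reads with both mates mapped add up to the mapped total for paired data, which matches the counts reported on this page (18 524 316 + 26 057 = 18 550 373).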

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (9 288 809 reads). Reverse strand: 50 %.
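One way to compute this fraction, sketched under the assumption that strand is read from the standard SAM FLAG bit 0x10 (reverse strand) and that unmapped reads (0x4) are excluded from the denominator. Whether the archive uses the same denominator is not stated on this page.

```python
UNMAPPED = 0x4
REVERSE = 0x10


def forward_fraction(flags):
    """Fraction of mapped reads whose FLAG lacks the reverse-strand bit."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Hypothetical FLAG values: two forward, two reverse, one unmapped.
print(forward_fraction([99, 147, 83, 163, 77]))
```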

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant (e.g. a large insertion or deletion), the mates map at an unexpected separation, which is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 98.9 % (18 367 998 reads). Other: 1.1 %.
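The ~300 base pair figure in the example above follows from subtracting both read lengths from the fragment length. A short sketch of that arithmetic, with an illustrative function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both ends of a fragment are read."""
    return fragment_length - 2 * read_length


print(expected_inner_distance(500, 100))
```

A negative result would mean the two reads overlap, which happens when the fragment is shorter than twice the read length.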

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. To detect duplicates, algorithms typically look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, so depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 43.9 % (8 158 964 reads). Non-duplicates: 56.1 %.
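The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration: the tuple layout and function names are assumptions, and real duplicate-marking tools typically also take read orientation and base qualities into account.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates were already seen.

    `pairs` is a list of (chrom, start, mate_chrom, mate_start) tuples;
    the first occurrence of each tuple is kept as the non-duplicate.
    """
    seen = set()
    is_duplicate = []
    for key in pairs:
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates."""
    flags = mark_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical coordinates chosen for illustration only.
example = [("1", 100, "1", 400), ("1", 100, "1", 400),
           ("1", 100, "1", 450), ("2", 5, "2", 300)]
print(mark_duplicates(example), duplicate_rate(example))
```

Because the key is purely positional, two independent fragments that happen to share both coordinates would also be flagged, which is the depth-dependent effect the paragraph above warns about.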

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # reads (2M to 16M).]
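A brief sketch of how such a histogram could be summarised: the share of reads at score 0 (ambiguous placement) and the share at or above a chosen threshold. The threshold of 30 and the example counts are assumptions for illustration, not values from this file.

```python
def mapq_summary(hist, threshold=30):
    """Return (fraction with MAPQ 0, fraction with MAPQ >= threshold).

    `hist` maps a mapping quality score to the number of reads with it.
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    ambiguous = hist.get(0, 0) / total
    confident = sum(n for q, n in hist.items() if q >= threshold) / total
    return ambiguous, confident


# Hypothetical histogram: most reads at 60, a few at 0.
print(mapq_summary({0: 5, 20: 5, 60: 90}))
```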

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.

Chromosome  Mapped   Unmapped
1           99.86%   0.14%
2           99.86%   0.14%
3           99.86%   0.14%
4           99.86%   0.14%
5           99.86%   0.14%
6           99.86%   0.14%
7           99.86%   0.14%
8           99.86%   0.14%
9           99.86%   0.14%
10          99.86%   0.14%
11          99.87%   0.13%
12          99.86%   0.14%
13          99.86%   0.14%
14          99.87%   0.13%
15          99.84%   0.16%
16          99.86%   0.14%
17          99.87%   0.13%
18          99.87%   0.13%
19          99.86%   0.14%
20          99.86%   0.14%
21          99.86%   0.14%
X           99.86%   0.14%
Y           99.73%   0.27%
M           99.88%   0.12%
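The per-chromosome split described above can be sketched as a tally over (chromosome, FLAG) records, assuming that an unmapped read with a mapped mate carries its mate's chromosome and that bit 0x4 of the SAM FLAG marks a read as unmapped. The record layout and function name are illustrative.

```python
from collections import defaultdict

UNMAPPED = 0x4


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped] per chromosome
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Hypothetical records: FLAG 133 is an unmapped mate placed on chromosome 1.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
print(per_chromosome_mapped_rate(records))
```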