European Genome-Phenome Archive

File Quality

File Information: EGAF00003612730

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of bases (reference positions) covered that many times.

Data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
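As a rough illustration of how such a distribution can be built, the sketch below counts how many positions have each coverage value and groups everything above 1000 into a single overflow bin, as the chart does. The function name, the input list, and the choice of `cap + 1` as the overflow key are assumptions made for this example, not the archive's actual implementation.

```python
import math
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # The key cap + 1 stands for the ">cap" overflow bin (an assumption).
        hist[min(depth, cap + 1)] += 1
    return hist

# Hypothetical per-position depths, not taken from this file.
depths = [0, 0, 1, 1, 1, 2, 5, 1200, 3000]
hist = coverage_histogram(depths)

# Log-scale view of the counts, matching the log-scaled y-axis.
log_counts = {cov: math.log10(n) for cov, n in hist.items()}
```

With real data, the per-position depths would come from an alignment file rather than a hand-written list.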

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
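The relationship between a Phred score and an error probability can be checked with a short sketch. It uses the standard Phred definition Q = -10·log10(P); the function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q into the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P into the Phred score Q = -10*log10(P)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)     # about 1 error in 10,000 base calls
q_milli = error_prob_to_phred(0.001)  # about Q30
```

Under this definition, a score of 40 corresponds to roughly one wrong base call in 10,000, and a score of 20 to one in 100.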

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 1.4G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
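The rate calculation described above can be sketched as follows. The function name and the read counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts for illustration only.
rate = mapped_rate(mapped=998, unmapped=2)
rate_percent = round(rate * 100, 1)
```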

Mapped reads: 99.8 % (15 538 453 reads); unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; the distance criterion is covered by the Proper Pairs chart.

Both mates mapped: 99.7 % (15 515 376 reads); remainder: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
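In SAM/BAM files, the mapping state of a read and its mate is recorded in the FLAG field (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The sketch below classifies a read from its flag. The category labels and function name are assumptions for this example, and the tool behind this report may count reads differently.

```python
# SAM FLAG bits, as defined in the SAM format specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify_read(flag):
    """Classify a read by the mapping state of itself and its mate."""
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"          # mapped read whose mate is unmapped
    if flag & PAIRED:
        return "both_mates_mapped"
    return "single_end_mapped"      # mapped, but not part of a pair

labels = [classify_read(f) for f in (73, 99, 133, 0)]
```

Here flag 73 (paired, mate unmapped, first in pair) is a singleton, 99 has both mates mapped, and 133 is the unmapped mate of a pair.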

Singletons: 0.1 % (23 077 reads); remainder: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them should map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
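A minimal sketch of this calculation from SAM flags is shown below, assuming that the fraction is taken over mapped reads only and that a read is on the forward strand when the reverse-strand bit (0x10) is not set. The function name and the example flags are illustrative.

```python
READ_UNMAPPED = 0x4
REVERSE_STRAND = 0x10

def forward_fraction(flags):
    """Fraction of mapped reads whose reverse-strand bit is not set."""
    mapped = [f for f in flags if not f & READ_UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE_STRAND)
    return forward / len(mapped)

# 99 and 163 are forward-strand reads, 83 and 147 are reverse-strand reads,
# and 77 is unmapped, so it is ignored.
fraction = forward_fraction([99, 147, 83, 163, 77])
```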

Forward strand: 50 % (7 781 250 reads); remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads spanning it map with an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
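The expected separation in the fragment example above is simple arithmetic: the fragment length minus the two read lengths. A minimal sketch, with a function name chosen for this example:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length

gap = expected_inner_distance(fragment_length=500, read_length=100)
```

For a 500 bp fragment with 100 bp reads this gives 300 bp. Aligners generally accept a range around the expected value, because fragment lengths in a library vary.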

Proper pairs: 98.7 % (15 361 306 reads); remainder: 1.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
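The coordinate-based approach described above can be sketched in a few lines: a pair is flagged as a duplicate if an earlier pair had the same read and mate coordinates. This is a simplified illustration with hypothetical input tuples; production duplicate-marking tools apply more elaborate rules.

```python
def mark_duplicates(pairs):
    """Flag each pair whose coordinates were already seen.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples.
    Returns a list of booleans, True where the pair is a duplicate.
    """
    seen = set()
    is_duplicate = []
    for key in pairs:
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

# Hypothetical coordinates for illustration only.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same read and mate coordinates as the first pair
    ("1", 100, "1", 450),  # same read position, different mate position
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
```

Only the second pair is flagged here; the third shares a read position with the first, but its mate maps elsewhere.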

Duplicates: 47.2 % (7 350 444 reads); remainder: 52.8 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (2M to 14M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads may map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped fractions range from 99.84% to 99.91%; unmapped fractions range from 0.09% to 0.16%.]