European Genome-Phenome Archive

File Quality

File Information: EGAF00003612351

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The data are plotted on a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.
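As an illustration of how such a distribution can be built, the following minimal sketch counts how many positions have each coverage value and pools everything above a cap into one overflow bin, similar to the ">1000" bin on the chart. The function names, the `cap` parameter and the sample depths are hypothetical and are not taken from the EGA pipeline.

```python
from collections import Counter
import math

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed cap + 1,
    mirroring a ">1000" bin. `cap` is an assumed parameter.
    """
    hist = Counter()
    for d in depths:
        hist[d if d <= cap else cap + 1] += 1
    return hist

def log10_counts(hist):
    """Return log10 of each non-empty bin count, as used for a log-scaled axis."""
    return {cov: math.log10(n) for cov, n in hist.items()}

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 0, 1, 1, 2, 5, 5, 1500, 2000]
hist = coverage_histogram(depths)
```

Here `hist[1001]` holds the pooled count of all positions covered more than 1000 times.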

Chart: Base Coverage Distribution. X-axis: Coverage value (ticks from 100 to >1000). Y-axis: # Bases (log scale, 1k to 20M).

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
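The relationship between score and error probability can be checked with a short sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score Q."""
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10,000, and a score of 30 to 1 in 1,000.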

Chart: Base Quality. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 1.2G).

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
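The ratio mapped / (mapped + unmapped) can be sketched as follows, assuming the reads come from a SAM/BAM file and that one FLAG value per read is available (secondary and supplementary records already excluded). The SAM format marks an unmapped read with flag bit 0x4; the function name is hypothetical.

```python
UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM FLAG values."""
    flags = list(flags)
    if not flags:
        raise ValueError("no reads given")
    mapped = sum(1 for f in flags if not f & UNMAPPED)
    return mapped / len(flags)

# Hypothetical flags: three mapped reads and one unmapped read.
rate = mapped_rate([0, 16, 4, 99])
```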

Mapped: 99.7 % (12 613 061 reads). Unmapped: 0.3 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the distance criterion is applied in the Proper Pairs chart.

Both mates mapped: 99.4 % (12 573 056 reads). Remainder: 0.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
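The distinction between reads with both mates mapped, singletons and unmapped reads can be sketched from SAM flag bits, assuming SAM/BAM input. The bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped) come from the SAM format; the `classify` function and its labels are hypothetical.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate is unmapped
    return "unmapped"
```

For example, flag 73 (paired, mate unmapped, first in pair) is classified as a singleton, while flag 99 is a read whose mate is also mapped.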

Singletons: 0.3 % (40 005 reads). Remainder: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (6 326 918 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the mates can map with an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.
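The arithmetic in the example above (500 bp fragment, two 100 bp reads, ~300 bp gap) can be written as a small sketch. The `tolerance` parameter is a hypothetical stand-in: real aligners typically derive the acceptable range from the observed fragment-size distribution rather than from a fixed number.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_fragment_len, expected_fragment_len, tolerance):
    """Check whether an observed fragment length is close to the expected one.

    `tolerance` is an assumed, illustrative threshold in base pairs.
    """
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance

gap = expected_inner_distance(500, 100)
```

A pair spanning 5000 bp on the reference would fail this check for an expected 500 bp fragment and any modest tolerance, which is the kind of signal described for structural variants.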

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.6 % (12 471 748 reads). Remainder: 1.4 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
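The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm used to produce this report: it keeps the first read seen at each coordinate signature and flags later ones, whereas production duplicate-marking tools apply more refined rules. The tuple layout and function name are hypothetical.

```python
def mark_duplicates(reads):
    """Flag reads whose position and mate position were already seen.

    `reads` is an iterable of tuples:
    (name, chrom, pos, strand, mate_chrom, mate_pos).
    Returns the set of read names flagged as duplicates.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, strand, mate_chrom, mate_pos in reads:
        key = (chrom, pos, strand, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)  # same coordinates as an earlier read
        else:
            seen.add(key)
    return duplicates

# Hypothetical reads: "b" repeats the coordinates of "a"; "c" has a different mate position.
reads = [
    ("a", "1", 100, "+", "1", 300),
    ("b", "1", 100, "+", "1", 300),
    ("c", "1", 100, "+", "1", 350),
]
dups = mark_duplicates(reads)
duplicate_rate = len(dups) / len(reads)
```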

Duplicates: 71.4 % (9 034 670 reads). Non-duplicates: 28.6 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (1M to 11M).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a position can also arise when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.
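A per-chromosome tally of this kind can be sketched as below, assuming SAM/BAM records reduced to (RNAME, FLAG) pairs. Unmapped reads that carry their mate's chromosome are counted under that chromosome; reads with no chromosome at all (RNAME "*") are skipped. The function name and record layout are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned, cannot be placed on the chart
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records: flag 133 is an unmapped read placed at its mate's chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```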

Chart: Mapped vs Unmapped, per chromosome (y-axis: 0% to 100%).

Chromosome   Mapped    Unmapped
1            99.69%    0.31%
2            99.66%    0.34%
3            99.68%    0.32%
4            99.67%    0.33%
5            99.68%    0.32%
6            99.67%    0.33%
7            99.69%    0.31%
8            99.67%    0.33%
9            99.68%    0.32%
10           99.7%     0.3%
11           99.69%    0.31%
12           99.7%     0.3%
13           99.65%    0.35%
14           99.69%    0.31%
15           99.67%    0.33%
16           99.7%     0.3%
17           99.69%    0.31%
18           99.7%     0.3%
19           99.72%    0.28%
20           99.69%    0.31%
21           99.64%    0.36%
X            99.68%    0.32%
Y            99.15%    0.85%
M            99.82%    0.18%