European Genome-Phenome Archive

File Quality

File Information

EGAF00003613788

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. x-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 1k to 2M).]
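
As a minimal sketch of how a distribution like the one described above could be tabulated, the snippet below counts how many positions have each coverage value and pools everything above 1000 into one overflow bin, mirroring the ">1000" label on the chart. The function name, the `cap` parameter and the sample depths are hypothetical; the per-position depths are assumed to be already available as a list of integers.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value.

    Depths above `cap` are pooled into a single overflow bin stored
    under the key cap + 1 (hypothetical convention for ">cap").
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Illustrative per-position depths, not taken from the report.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2000]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{value - 1}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis would then give the kind of shape described in the text, assuming most positions have low coverage.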

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but we expect it to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. x-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 1.2G).]
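
To make the score concrete, here is a small sketch of the standard Phred conversion, Q = -10 log10(P), in both directions. The function names are hypothetical and chosen only for illustration.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly 1 error in 10,000 calls,
# a score of 20 to roughly 1 error in 100.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```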

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

[Chart: mapped reads 99.9 % (13 857 548 reads); remainder 0.1 %.]
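
The rate calculation described above can be sketched as follows. The function name is hypothetical and the counts in the example are illustrative, not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Illustrative counts only.
rate = mapped_rate(mapped=9_990, unmapped=10)
print(f"{rate:.1%}")
```

The parentheses matter: the denominator is the total read count, not the mapped count alone.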

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here (compare the Proper Pairs chart).

[Chart: both mates mapped 99.8 % (13 842 380 reads); remainder 0.2 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

[Chart: singletons 0.1 % (15 168 reads); remainder 99.9 %.]
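
As a sketch of how a singleton could be recognised in practice, the snippet below tests the FLAG bits of an alignment record. It assumes the alignments are stored in SAM/BAM format, which this page does not state; the bit values are those of the SAM specification, while the function names are hypothetical.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def is_singleton(flag):
    """True for a mapped read whose mate is unmapped."""
    return (bool(flag & PAIRED)
            and not (flag & UNMAPPED)
            and bool(flag & MATE_UNMAPPED))


def both_mates_mapped(flag):
    """True for a paired read where the read and its mate are both mapped."""
    return (bool(flag & PAIRED)
            and not (flag & UNMAPPED)
            and not (flag & MATE_UNMAPPED))


# 73 = paired + mate unmapped + first in pair
# 99 = paired + proper pair + mate reverse + first in pair
# 133 = paired + unmapped + second in pair
for flag in (73, 99, 133):
    print(flag, is_singleton(flag), both_mates_mapped(flag))
```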

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

[Chart: forward strand 50 % (6 936 751 reads); remainder 50 %.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the affected reads map with an unexpected separation. This is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

[Chart: proper pairs 98.3 % (13 639 950 reads); remainder 1.7 %.]
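
The arithmetic behind the expected separation in the example above (500 bp fragments, 100 bp reads, ~300 bp gap) can be sketched as below. The function names and the `tolerance` parameter are hypothetical; real aligners estimate the acceptable range from the observed insert size distribution rather than from a fixed tolerance.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both reads.

    A negative result means the two reads overlap.
    """
    return fragment_length - 2 * read_length


def at_expected_distance(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap."""
    return abs(observed - expected) <= tolerance


expected = expected_inner_distance(fragment_length=500, read_length=100)
print(expected)
print(at_expected_distance(320, expected, tolerance=50))
print(at_expected_distance(5_000, expected, tolerance=50))
```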

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

[Chart: duplicates 89.1 % (12 367 161 reads); remainder 10.9 %.]
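
A minimal sketch of the coordinate-based approach described above: read pairs are grouped by the coordinates of both mates, and every pair after the first in a group is flagged. This is a deliberate simplification, not the method used to produce this report; real duplicate-marking tools also consider strand and orientation and usually keep the best-quality pair. The record layout and names are hypothetical.

```python
def mark_duplicates(pairs):
    """Return the names of pairs flagged as duplicates.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos).
    The first pair seen at a given coordinate key is kept; later pairs
    with the same key are flagged.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates


# Illustrative records only.
pairs = [
    ("r1", "1", 1000, "1", 1400),
    ("r2", "1", 1000, "1", 1400),  # same coordinates as r1
    ("r3", "1", 1000, "1", 1450),  # same start, different mate position
    ("r4", "2", 1000, "2", 1400),  # different chromosome
    ("r5", "1", 1000, "1", 1400),  # same coordinates as r1
]
dups = mark_duplicates(pairs)
print(sorted(dups))
print(len(dups) / len(pairs))
```

The sketch also illustrates the caveat in the text: at high depth, independent fragments can share both coordinates by chance and would be flagged here even though they are not true PCR duplicates.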

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 60); y-axis: # Reads (1M to 13M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given a chromosome and position, but it is also classified as unmapped.

[Chart: stacked columns of mapped vs unmapped reads per chromosome. x-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%. Mapped fractions range from 99.85% to 99.94%; unmapped fractions range from 0.06% to 0.15%.]