European Genome-Phenome Archive

File Quality

File Information: EGAF00005296031

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale so that the large differences between counts remain readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.
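As a minimal sketch of how such a distribution could be tabulated, the snippet below counts how many positions have each coverage value and takes the log10 of each count, which is what a log-scaled axis displays. The function names and the `depths` list are hypothetical and are not taken from the EGA pipeline.

```python
import math
from collections import Counter


def coverage_distribution(depths):
    """Count how many positions have each coverage value."""
    return Counter(depths)


def log10_counts(distribution):
    """Return {coverage: log10(number of bases)}, sorted by coverage."""
    return {cov: math.log10(n) for cov, n in sorted(distribution.items())}


# Hypothetical per-position depths: many low-coverage positions,
# progressively fewer at higher coverage.
depths = [1] * 1000 + [2] * 100 + [3] * 10
dist = coverage_distribution(depths)
logs = log10_counts(dist)

for cov, value in logs.items():
    print(cov, dist[cov], round(value, 2))
```

On the log scale, counts that differ by a factor of ten are one unit apart, which keeps the long tail visible next to the initial peak.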

[Chart: base coverage distribution. X-axis: coverage value (ticks from 100 to >1000); y-axis: number of bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
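The relationship between a Phred score and its error probability can be sketched as below, using the standard definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition, a score of 40 corresponds to one expected error in 10 000 base calls, and a score of 20 to one in 100.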

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0G to 9G). Non-zero bins, in order of increasing score: 244 058; 164 553 619; 310 097 905; 9 295 102 098 bases.]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
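The calculation described above is a simple ratio. A minimal sketch, with a hypothetical function name and illustrative counts rather than the counts of this file:

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total


# Illustrative counts only: 976 mapped reads and 24 unmapped reads.
rate = mapped_rate(976, 24)
print(f"{rate:.1%}")
```

The parentheses matter: the denominator is the total number of reads, not the number of mapped reads alone.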

[Chart: mapped reads. Mapped: 97.6 % (94 402 082 reads); unmapped: 2.4 %.]

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are still counted here, unlike in the Proper Pairs chart.

[Chart: both mates mapped. Both mates mapped: 96.2 % (93 051 166 reads); remainder: 3.8 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls inside the inserted sequence and cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

[Chart: singletons. Singletons: 1.4 % (1 350 916 reads); remainder: 98.6 %.]
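One way to separate singletons from reads whose mate also mapped is to inspect the SAM flag bits 0x4 (read unmapped) and 0x8 (mate unmapped). The sketch below assumes every flag belongs to a paired-end read. The category names and the sample flags are hypothetical, and this is not necessarily how the report itself was generated.

```python
from collections import Counter

READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM flag bit: the mate of this read is unmapped


def classify(flag: int) -> str:
    """Classify a paired-end read from its SAM flag."""
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped, but its mate is not
    return "both_mapped"


# Hypothetical flags for six paired-end reads.
flags = [99, 147, 73, 133, 83, 163]
counts = Counter(classify(f) for f in flags)
print(counts)
```

With this classification, the mapped reads are the sum of the two mapped categories. The counts shown in this report are consistent with that: 93 051 166 reads with both mates mapped plus 1 350 916 singletons gives 94 402 082 mapped reads.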

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
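A minimal sketch of this fraction, assuming strand is read from the SAM flag bit 0x10 (read mapped to the reverse strand) and unmapped reads (bit 0x4) are excluded. The function name and the sample flags are hypothetical.

```python
READ_UNMAPPED = 0x4    # SAM flag bit: this read is unmapped
REVERSE_STRAND = 0x10  # SAM flag bit: read maps to the reverse strand


def forward_fraction(flags):
    """Return the fraction of mapped reads that lie on the forward strand."""
    mapped = [f for f in flags if not f & READ_UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE_STRAND)
    return forward / len(mapped)


# Hypothetical flags: two forward reads, two reverse reads, one unmapped.
fraction = forward_fraction([0, 16, 0, 16, 4])
print(fraction)
```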

[Chart: forward strand. Forward strand: 50 % (48 382 169 reads); remainder: 50 %.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, mates flanking the variant map with an unexpected separation. That separation is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
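The fragment arithmetic in the example above can be sketched as follows. The tolerance check is a simplification introduced here for illustration; real aligners typically estimate the acceptable range from the observed insert-size distribution, and the function names are hypothetical.

```python
def expected_gap(fragment_length: int, read_length: int) -> int:
    """Unsequenced distance between two mates read from either end of a fragment."""
    return fragment_length - 2 * read_length


def within_expected(observed_gap: int, expected: int, tolerance: int) -> bool:
    """Simplified check: is the observed gap close enough to the expected one?"""
    return abs(observed_gap - expected) <= tolerance


gap = expected_gap(fragment_length=500, read_length=100)
print(gap)
print(within_expected(320, gap, tolerance=100))
print(within_expected(5000, gap, tolerance=100))
```

A pair whose mates map several kilobases apart fails this check, which is the kind of signal described above for a structural variant.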

[Chart: proper pairs. Proper pairs: 95 % (91 965 214 reads); remainder: 5 %.]

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate therefore depends on the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
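The coordinate-based approach described above can be sketched as below. This is a simplified illustration, not the algorithm of any particular tool: the tuple layout `(chrom, pos, mate_chrom, mate_pos)` and the function names are assumptions, and strand and base qualities are ignored.

```python
def mark_duplicates(pairs):
    """Flag each pair as a duplicate if an identical coordinate key was seen before.

    pairs: list of (chrom, pos, mate_chrom, mate_pos) tuples (hypothetical layout).
    Returns a list of booleans, one per input pair.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # first occurrence is kept, later ones flagged
        seen.add(key)
    return flags


def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates."""
    flags = mark_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0


pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same start and same mate start: flagged
    ("1", 100, "1", 450),  # same start, different mate start: kept
    ("2", 100, "2", 400),
]
dup_flags = mark_duplicates(pairs)
print(dup_flags, duplicate_rate(pairs))
```

The sketch also shows why very deep data inflates the rate: two independent fragments that happen to share both coordinates are indistinguishable from a PCR duplicate under this rule.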

[Chart: duplicates. Duplicates: 0 % (0 reads); remainder: 100 %.]

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (10M to 80M). Labelled values, in the order shown: 3 038 841; 17 423 401; 87 649 102 reads.]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not. In that case the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
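A per-chromosome tally of this kind can be sketched as follows, assuming each record carries a chromosome name and a SAM flag, and that an unmapped read with a mapped mate carries its mate's chromosome. The record layout, function name and sample data are hypothetical.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped


def per_chromosome_rates(records):
    """Return {chrom: (mapped %, unmapped %)} from (chrom, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        index = 1 if flag & READ_UNMAPPED else 0
        counts[chrom][index] += 1
    return {
        chrom: (100 * m / (m + u), 100 * u / (m + u))
        for chrom, (m, u) in counts.items()
    }


# Hypothetical records: the unmapped read (flag 133) is listed under
# chromosome "1" because that is where its mate (flag 73) aligned.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
rates = per_chromosome_rates(records)
print(rates)
```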

[Chart: mapped vs unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome   Mapped    Unmapped
1            98.73%    1.27%
2            98.62%    1.38%
3            98.8%     1.2%
4            98.42%    1.58%
5            98.57%    1.43%
6            98.96%    1.04%
7            98.35%    1.65%
8            98.48%    1.52%
9            99%       1%
10           98.55%    1.45%
11           98.59%    1.41%
12           98.2%     1.8%
13           98.35%    1.65%
14           99.39%    0.61%
15           98.58%    1.42%
16           98.76%    1.24%
17           98.42%    1.58%
18           98.46%    1.54%
19           97.98%    2.02%
20           98.19%    1.81%
21           97.78%    2.22%
X            98.44%    1.56%
Y            97.29%    2.71%
M            99.08%    0.92%