European Genome-Phenome Archive

File Quality

File Information

EGAF00003611395

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Counts are plotted on a log scale to compress their wide range. A high peak at the start (low coverage) followed by a descending curve is expected.
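As a rough illustration of how such a distribution could be tabulated, the sketch below counts positions per coverage value and takes log10 of the counts. The function names, the `cap` parameter, and the sample depths are hypothetical; this is not the archive's own implementation.

```python
import math
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one overflow bin."""
    hist = Counter()
    for depth in depths:
        # cap + 1 acts as the ">cap" bin (hypothetical convention for this sketch)
        hist[min(depth, cap + 1)] += 1
    return hist


def log_counts(hist):
    """Return log10 of each bin count, as would be plotted on a log-scale y-axis."""
    return {coverage: math.log10(count) for coverage, count in hist.items()}


# Hypothetical per-position depths, not taken from the file above.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
logged = log_counts(hist)
```

Here the depth of 1500 lands in the overflow bin, mirroring the ">1000" category on the chart's x-axis.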

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: # bases (log scale, 1 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but we expect it to be skewed towards larger (more confident) scores, typically around 40.
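The relationship between a Phred score and the error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly a 1 in 10,000 chance of a wrong call.
p_q40 = phred_to_error_prob(40)
q_from_p = error_prob_to_phred(0.001)
```

Each increase of 10 in the score divides the error probability by 10, so Q=20 is 1 in 100 and Q=30 is 1 in 1,000.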

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 4.5G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
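The rate calculation is simple enough to write down directly. The sketch below assumes the mapped and unmapped read counts are already known; the function name and the example counts are hypothetical.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 998 mapped reads and 2 unmapped reads.
rate = mapped_rate(998, 2)
print(f"{rate:.1%}")
```

Note the parentheses around the denominator: the ratio is taken against the total read count, not against the mapped count alone.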

Mapped: 99.8 % (50 376 156 reads). Unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart addresses that distinction.

Both mates mapped: 99.6 % (50 276 016 reads). Other: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
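In SAM/BAM files this situation is visible in the FLAG field of each read. The sketch below tests the standard SAM flag bits for "paired" (0x1), "read unmapped" (0x4), and "mate unmapped" (0x8); the function name and the example flag values are illustrative, and how this report derives its figures is not stated in the source.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def is_singleton(flag):
    """True if the read is paired and mapped, but its mate is unmapped."""
    return (
        bool(flag & PAIRED)
        and not (flag & UNMAPPED)
        and bool(flag & MATE_UNMAPPED)
    )


# Example flags: 73 = paired + mate unmapped + first in pair (a singleton);
# 99 = paired + proper pair + mate reverse + first in pair (both mates mapped);
# 69 = paired + read unmapped + first in pair (the unmapped mate itself).
examples = {flag: is_singleton(flag) for flag in (73, 99, 69)}
```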

Singletons: 0.2 % (100 140 reads). Not singletons: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
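A minimal sketch of this fraction, assuming reads are described by SAM FLAG values: bit 0x10 marks a read mapped to the reverse strand and bit 0x4 marks an unmapped read. The function name and example flags are hypothetical.

```python
# Standard SAM FLAG bits.
UNMAPPED = 0x4
REVERSE = 0x10


def forward_fraction(flags):
    """Fraction of mapped reads whose FLAG does not have the reverse-strand bit set."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Two typical proper pairs: flags 99 and 163 are forward, 147 and 83 are reverse.
fraction = forward_fraction([99, 147, 83, 163])
```

For standard paired-end libraries each properly oriented pair contributes one forward and one reverse read, which is one reason a value close to 50% is expected.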

Forward strand: 50 % (25 241 393 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
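The ~300 base pair figure follows from subtracting both read lengths from the fragment length. The sketch below works through that arithmetic and adds a simple separation check; the `tolerance` parameter and the function names are assumptions for illustration, since real aligners estimate the acceptable range from the data.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length


def has_expected_separation(observed, expected, tolerance=100):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    The fixed tolerance is a hypothetical simplification for this sketch.
    """
    return abs(observed - expected) <= tolerance


expected = expected_inner_distance(500, 100)   # 500 - 2 * 100 = 300
typical = has_expected_separation(320, expected)
suspicious = has_expected_separation(5000, expected)
```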

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 99.1 % (50 011 754 reads). Not proper pairs: 0.9 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
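The coordinate-based approach described above can be sketched as follows. Each pair is represented as a hypothetical tuple of (chromosome, position, mate chromosome, mate position); this is a simplified illustration, and production duplicate-marking tools take further details into account.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates were already seen earlier in the input.

    `pairs` is an iterable of (chrom, pos, mate_chrom, mate_pos) tuples.
    Returns a list of booleans, True where the pair is a duplicate.
    """
    seen = set()
    is_duplicate = []
    for pair in pairs:
        is_duplicate.append(pair in seen)
        seen.add(pair)
    return is_duplicate


def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates."""
    flags = mark_duplicates(pairs)
    if not flags:
        raise ValueError("no pairs given")
    return sum(flags) / len(flags)


# Hypothetical pairs: the second repeats the first; the third differs in mate position.
pairs = [("1", 100, "1", 400), ("1", 100, "1", 400), ("1", 100, "1", 450)]
flags = mark_duplicates(pairs)
rate = duplicate_rate(pairs)
```

Because only coordinates are compared, two independent fragments that happen to share start and mate positions are also flagged, which is the depth-related overcounting the paragraph above describes.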

Duplicates: 30.4 % (15 321 532 reads). Non-duplicates: 69.6 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (5M to 45M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%).]

Chromosome   Mapped    Unmapped
1            99.8%     0.2%
2            99.8%     0.2%
3            99.8%     0.2%
4            99.79%    0.21%
5            99.79%    0.21%
6            99.79%    0.21%
7            99.8%     0.2%
8            99.8%     0.2%
9            99.8%     0.2%
10           99.8%     0.2%
11           99.8%     0.2%
12           99.8%     0.2%
13           99.79%    0.21%
14           99.8%     0.2%
15           99.8%     0.2%
16           99.81%    0.19%
17           99.81%    0.19%
18           99.79%    0.21%
19           99.83%    0.17%
20           99.81%    0.19%
21           99.81%    0.19%
X            99.8%     0.2%
Y            99.45%    0.55%
M            99.76%    0.24%