European Genome-Phenome Archive

File Quality

File Information: EGAF00002336635

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: Base Coverage Distribution. X-axis: Coverage value (ticks from 100 to >1000); y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
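As a small illustration of the Phred scale, the sketch below converts between a score and an error probability using the standard definition Q = -10 × log10(P). The function names are made up for this example and are not part of any EGA tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the error probability for a few common scores.
    for q in (10, 20, 30, 40):
        print(f"Q={q} -> P={phred_to_error_prob(q):g}")
```

Under this definition, a score of 40 corresponds to an error probability of 1 in 10 000.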

[Chart: Base Quality. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 9G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
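The rate calculation can be sketched as follows. This is a minimal illustration of the ratio mapped / (mapped + unmapped); the function name and the counts in the example are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Return the fraction of reads that are mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


if __name__ == "__main__":
    # Hypothetical counts, chosen only to make the arithmetic easy to follow.
    rate = mapped_rate(mapped=992, unmapped=8)
    print(f"{rate:.1%}")
```

Note the parentheses around the denominator: without them the expression would read as (mapped / mapped) + unmapped.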

Mapped: 99.2 % (106 110 622 reads). Unmapped: 0.8 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; such pairs are distinguished in the Proper Pairs chart.

Both mates mapped: 98.7 % (105 661 320 reads). Remainder: 1.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
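The distinction between pairs with both mates mapped and singletons can be sketched with the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). How this report actually derives its counts is not stated, so the function below is an assumption and its category names are invented for this example.

```python
# Standard SAM FLAG bits used in this sketch.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag: int) -> str:
    """Classify a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"  # this read maps, its mate does not
    if mate_mapped:
        return "unmapped_mate_of_singleton"
    return "pair_unmapped"


if __name__ == "__main__":
    for flag in (99, 73, 133, 77):
        print(flag, classify_read(flag))
```

Here flag 99 is a paired read whose mate is also mapped, 73 is a mapped read with an unmapped mate, 133 is the unmapped mate of such a read, and 77 belongs to a pair in which neither mate maps.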

Singletons: 0.4 % (449 302 reads). Remainder: 99.6 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
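A forward-strand fraction can be sketched from SAM FLAG values, where bit 0x10 marks a read mapped to the reverse strand and bit 0x4 marks an unmapped read. Restricting the denominator to mapped reads is an assumption made for this example; the report does not say which denominator it uses.

```python
READ_UNMAPPED = 0x4  # standard SAM FLAG bit: read is unmapped
READ_REVERSE = 0x10  # standard SAM FLAG bit: read maps to the reverse strand


def forward_fraction(flags: list[int]) -> float:
    """Return the fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & READ_UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & READ_REVERSE)
    return forward / len(mapped)


if __name__ == "__main__":
    # Two forward reads (99, 163), two reverse reads (147, 83), one unmapped read (77).
    print(forward_fraction([99, 147, 83, 163, 77]))
```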

Forward strand: 50 % (53 505 416 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpectedly large separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
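The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. The sketch below shows that calculation, plus a simple distance check. The tolerance parameter and both function names are hypothetical; real aligners estimate the acceptable range from the data.

```python
def expected_inner_distance(fragment_length: int, read_length: int) -> int:
    """Gap between the two mates: the fragment minus one read at each end."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed_fragment_length: int,
                             expected_fragment_length: int,
                             tolerance: int) -> bool:
    """Return True if the observed fragment length is close to the expected one.

    `tolerance` is a hypothetical fixed window used only for illustration.
    """
    return abs(observed_fragment_length - expected_fragment_length) <= tolerance


if __name__ == "__main__":
    print(expected_inner_distance(500, 100))         # 500 - 2 * 100
    print(within_expected_distance(520, 500, 100))   # close to expectation
    print(within_expected_distance(5000, 500, 100))  # far larger than expected
```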

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 93.2 % (99 721 928 reads). Remainder: 6.8 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
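The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: it treats a pair as a duplicate when an earlier pair has the same read and mate coordinates, and it ignores details such as orientation and base quality that real tools may also consider. The `Pair` record and its field names are invented for the example.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    """Hypothetical minimal record for one mapped read pair."""
    name: str
    chrom: str
    pos: int
    mate_chrom: str
    mate_pos: int


def find_duplicates(pairs: list[Pair]) -> list[str]:
    """Return names of pairs whose coordinates repeat those of an earlier pair."""
    seen: set[tuple[str, int, str, int]] = set()
    duplicates: list[str] = []
    for pair in pairs:
        key = (pair.chrom, pair.pos, pair.mate_chrom, pair.mate_pos)
        if key in seen:
            duplicates.append(pair.name)
        else:
            seen.add(key)
    return duplicates


if __name__ == "__main__":
    pairs = [
        Pair("r1", "1", 1000, "1", 1300),
        Pair("r2", "1", 1000, "1", 1300),  # same coordinates as r1
        Pair("r3", "1", 1000, "1", 1350),  # mate position differs
    ]
    dups = find_duplicates(pairs)
    print(dups, f"duplicate rate = {len(dups) / len(pairs):.2f}")
```

Because the key depends only on coordinates, two independent fragments that happen to share both positions would also be flagged, which is the false-positive effect at high depth mentioned above.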

Duplicates: 8 % (8 573 209 reads). Remainder: 92 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (10M to 100M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
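A per-chromosome tally of this kind can be sketched as below, assuming each record carries a reference name and a SAM FLAG, and that an unmapped read placed with its mate carries the mate's reference name. The input format and function name are assumptions made for this example; reads with no reference name (`*` in SAM) are skipped.

```python
from collections import Counter

READ_UNMAPPED = 0x4  # standard SAM FLAG bit: read is unmapped


def mapped_rate_per_chromosome(records: list[tuple[str, int]]) -> dict[str, float]:
    """Return the mapped fraction for each chromosome from (chrom, flag) records."""
    mapped: Counter[str] = Counter()
    unmapped: Counter[str] = Counter()
    for chrom, flag in records:
        if chrom == "*":
            continue  # no chromosome assigned, cannot be attributed
        if flag & READ_UNMAPPED:
            unmapped[chrom] += 1
        else:
            mapped[chrom] += 1
    chroms = set(mapped) | set(unmapped)
    return {c: mapped[c] / (mapped[c] + unmapped[c]) for c in chroms}


if __name__ == "__main__":
    records = [
        ("1", 99), ("1", 147),   # a pair with both mates mapped
        ("1", 73), ("1", 133),   # a singleton and its unmapped mate
        ("X", 99),
        ("*", 77),               # unplaced unmapped read, skipped
    ]
    for chrom, rate in sorted(mapped_rate_per_chromosome(records).items()):
        print(chrom, f"{rate:.2%}")
```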

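[Chart: Mapped vs Unmapped, stacked columns per chromosome. X-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%; series: mapped, unmapped.]

Mapped, in chart order: 99.61%, 99.59%, 99.6%, 99.61%, 99.61%, 99.61%, 99.61%, 99.6%, 99.6%, 99.6%, 99.6%, 99.61%, 99.61%, 99.61%, 99.6%, 99.67%, 99.62%, 99.6%, 99.65%, 99.59%, 99.61%, 99.62%, 99.68%, 99.57%

Unmapped, in chart order: 0.39%, 0.41%, 0.4%, 0.39%, 0.39%, 0.39%, 0.39%, 0.4%, 0.4%, 0.4%, 0.4%, 0.39%, 0.39%, 0.39%, 0.4%, 0.33%, 0.38%, 0.4%, 0.35%, 0.41%, 0.39%, 0.38%, 0.32%, 0.43%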