European Genome-Phenome Archive

File Quality

File Information: EGAF00001074966

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

Data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (1 to 100M).]
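As an illustration of how a distribution like this could be tallied, here is a minimal Python sketch. It assumes per-position depths are already available as a list of integers. The function names, the `cap` parameter and the sample depths are hypothetical and are not taken from the EGA pipeline.

```python
from collections import Counter
import math

def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` share a single overflow bin (stored under cap + 1),
    mirroring the ">1000" bin on the chart's x-axis.
    """
    hist = Counter(min(d, cap + 1) for d in depths)
    return dict(sorted(hist.items()))

def log10_counts(hist):
    """Log-scale the per-bin counts, as a log-scaled y-axis would."""
    return {cov: math.log10(n) for cov, n in hist.items()}

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 0, 1, 1, 2, 5, 1500]
hist = coverage_distribution(depths)
```

The overflow bin keeps a handful of very deep positions from stretching the x-axis.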

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 1.8G).]
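The relationship between a Phred score and an error probability can be checked with a short Python sketch. It uses the standard Phred definition, Q = -10 log10(P), which includes a factor of 10. The function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly one error in 10 000 calls.
p40 = phred_to_error_prob(40)
```

Under this definition every 10-point increase in Q makes an error ten times less likely.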

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated but, with more significant variation, a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads over the total number of reads: mapped / (mapped + unmapped).

Mapped: 100 % (30 075 729 reads). Unmapped: 0 %.
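The rate calculation described above can be sketched in Python as follows. The sketch assumes reads are represented only by their SAM FLAG values and uses the SAM flag bit 0x4 (segment unmapped). The helper names and the sample flags are hypothetical.

```python
UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def count_mapped(flags):
    """Return (mapped, unmapped) counts for an iterable of SAM FLAG values."""
    flags = list(flags)
    mapped = sum(1 for f in flags if not f & UNMAPPED)
    return mapped, len(flags) - mapped

def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical flags: three mapped reads and one unmapped read.
flags = [99, 147, 73, 133]
mapped, unmapped = count_mapped(flags)
rate = mapped_rate(mapped, unmapped)
```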

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart counts only pairs that map at the expected distance.

Both mates mapped: 100 % (30 072 810 reads). Remaining: 0 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0 % (2 919 reads). Remaining: 100 %.
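The distinction between reads with both mates mapped and singletons can be sketched in Python from SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The `classify` helper and its category labels are hypothetical and are not the report's own implementation. The final lines check that the counts shown in this report are consistent with each other.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped, but its mate is not
    return "both_mates_mapped"

# Counts reported in this document: both mates mapped + singletons
# should add up to the total number of mapped reads.
both_mates_mapped = 30_072_810
singletons = 2_919
total_mapped = both_mates_mapped + singletons
```

With the figures above, the sum is 30 075 729, which matches the count shown in the Mapped Reads section.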

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (15 039 411 reads). Reverse strand: 50 %.
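A forward-strand fraction of this kind could be computed from SAM FLAG values, as in the Python sketch below. It relies on flag bit 0x10 (read mapped to the reverse strand) and assumes that unmapped reads are left out of the denominator. The function name and the sample flags are hypothetical.

```python
UNMAPPED = 0x4  # this read is unmapped
REVERSE = 0x10  # this read is mapped to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: two forward reads, two reverse reads, one unmapped read.
fraction = forward_fraction([0, 16, 99, 147, 133])
```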

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large deletion relative to the reference, the reads flanking it map with an unexpectedly large separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 99.3 % (29 877 972 reads). Not proper pairs: 0.7 %.
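The separation arithmetic in the example above (a ~500 bp fragment with 100 bp reads at each end) can be written out as a small Python sketch. The `expected_gap` helper is hypothetical. `is_proper_pair` reads SAM flag bit 0x2, which the aligner sets according to its own criteria, so the exact distance thresholds depend on the aligner.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: aligner considers the pair properly aligned

def expected_gap(fragment_len, read_len):
    """Unsequenced bases between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len

def is_proper_pair(flag):
    """True if the aligner flagged this read as part of a proper pair."""
    return bool(flag & PROPER_PAIR)

# The worked example from the text: 500 bp fragments, 100 bp reads.
gap = expected_gap(500, 100)
```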

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 5.9 % (1 783 808 reads). Non-duplicates: 94.1 %.
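The coordinate-matching idea described above can be illustrated with a deliberately simplified Python sketch. Each read is a hypothetical tuple of (chromosome, position, strand, mate chromosome, mate position). A read counts as a duplicate if an earlier read has the same tuple. Dedicated tools such as Picard MarkDuplicates or samtools markdup apply more elaborate rules, so this is not a substitute for them.

```python
def duplicate_rate(reads):
    """Fraction of reads whose coordinates repeat those of an earlier read.

    Each read is a tuple: (chrom, pos, strand, mate_chrom, mate_pos).
    The first read seen at a given key is kept; later ones are duplicates.
    """
    if not reads:
        raise ValueError("no reads")
    seen = set()
    duplicates = 0
    for key in reads:
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(reads)

# Hypothetical reads: the third repeats the first, so 1 of 4 is a duplicate.
reads = [
    ("1", 1000, "+", "1", 1300),
    ("1", 1000, "+", "1", 1350),  # same start, different mate position
    ("1", 1000, "+", "1", 1300),
    ("2", 500, "-", "2", 200),
]
rate = duplicate_rate(reads)
```

Because the key includes only coordinates, two genuinely independent fragments that happen to share them would also be counted, which is the depth effect the paragraph above warns about.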

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (2M to 28M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (labelled 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Every column shows 99.99% mapped and 0.01% unmapped.]
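The per-chromosome percentages in a stacked chart like this could be derived as in the Python sketch below. It assumes a hypothetical mapping from chromosome name to a (mapped, unmapped) pair of read counts. The counts used here are invented for illustration and are not taken from this file.

```python
def per_chromosome_percent(counts):
    """Convert {chrom: (mapped, unmapped)} counts into percentages.

    Chromosomes with no reads at all are skipped to avoid dividing by zero.
    """
    result = {}
    for chrom, (mapped, unmapped) in counts.items():
        total = mapped + unmapped
        if total == 0:
            continue
        result[chrom] = (100 * mapped / total, 100 * unmapped / total)
    return result

# Invented counts, for illustration only.
counts = {"1": (9999, 1), "X": (500, 0), "Y": (0, 0)}
percentages = per_chromosome_percent(counts)
```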