European Genome-Phenome Archive

File Quality

File Information

EGAF00000658742

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: # bases, log scale (1 to 10M).]
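As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions share each coverage value and pools everything above a cap into one overflow bin, similar to the ">1000" bin on the chart. The function name, the `cap` parameter, and the sample depths are hypothetical and are not taken from the EGA pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into a single overflow bin keyed
    `cap + 1`, mirroring a ">1000" bin on the x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))


# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 5, 1500, 2000]
hist = coverage_histogram(depths)
```

Plotting the counts in `hist` on a log-scaled y-axis would give a curve of the kind described above.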

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred at that base. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0M to 130M).]
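The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10·log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)


p40 = phred_to_error_prob(40)    # about 1 error in 10,000 base calls
q30 = error_prob_to_phred(0.001)  # about Q30
```

A score of 40 therefore corresponds to roughly one wrong base call in ten thousand, which is why a distribution peaking near 40 indicates confident base calls.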

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99 % (6 922 950 reads). Unmapped: 1 %.
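The rate calculation, mapped / (mapped + unmapped), can be written out as a small sketch. The counts used here are hypothetical and are not the counts of this file.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 990 mapped reads and 10 unmapped reads.
rate = mapped_rate(mapped=990, unmapped=10)
```

Note that the denominator is the total number of reads, so the parentheses around `mapped + unmapped` matter.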

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; such pairs are not counted in the Proper Pairs chart.

Both mates mapped: 98.7 % (6 899 626 reads). Remainder: 1.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls inside the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (23 324 reads). Remainder: 99.7 %.
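In SAM/BAM files, the distinction between reads with both mates mapped and singletons is usually read from the FLAG field. The sketch below uses the standard SAM flag bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function and category names are made up for illustration, and this is not necessarily how the EGA report derives its numbers.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify_read(flag):
    """Classify one read from its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate is unmapped
    if mate_mapped:
        return "unmapped_mate_of_singleton"
    return "pair_unmapped"


labels = [classify_read(f) for f in (99, 73, 133, 77)]
```

Here flag 99 describes a mapped read with a mapped mate, 73 a singleton, 133 the unmapped mate of a singleton, and 77 a read from a pair in which neither mate mapped.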

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (3 496 210 reads). Remainder: 50 %.
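A strand balance check can be sketched from SAM FLAG values, where bit 0x10 marks a read aligned to the reverse strand. This example assumes the fraction is taken over mapped reads only; the report does not state its denominator, so treat that choice, and the function name, as assumptions.

```python
UNMAPPED = 0x4  # this read is unmapped
REVERSE = 0x10  # read is aligned to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads aligned to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Hypothetical flags: two forward reads, two reverse reads, one unmapped read.
fraction = forward_fraction([99, 147, 83, 163, 4])
```

A value far from 0.5 on real data would be a prompt to look at the library preparation.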

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map with an unexpected separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).

Proper pairs: 98.4 % (6 881 640 reads). Remainder: 1.6 %.
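The arithmetic behind the expected separation in the example above (500 bp fragments, 100 bp reads, ~300 bp between the reads) can be sketched as follows. The tolerance check is a simplified, hypothetical stand-in for what an aligner does when it decides whether a pair is proper; real aligners typically estimate the acceptable range from the data.

```python
def expected_separation(fragment_len, read_len):
    """Gap between the two reads: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len


def has_expected_separation(observed, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap."""
    expected = expected_separation(fragment_len, read_len)
    return abs(observed - expected) <= tolerance


gap = expected_separation(fragment_len=500, read_len=100)
ok = has_expected_separation(320, 500, 100, tolerance=50)
far = has_expected_separation(5000, 500, 100, tolerance=50)
```

A pair like the last one, with a gap of several kilobases, would not count as proper and could hint at a structural variant.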

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 24.7 % (1 727 290 reads). Remainder: 75.3 %.
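The coordinate-based idea described above can be sketched in a few lines: a pair is flagged as a duplicate when an earlier pair had the same read coordinate and the same mate coordinate. This is a deliberately simplified illustration with hypothetical names and data; production duplicate-marking tools generally also consider details such as read orientation and which copy to keep.

```python
def mark_duplicates(pairs):
    """Flag pairs whose (chrom, pos, mate_chrom, mate_pos) was seen before.

    Returns one boolean per pair; the first pair at a given set of
    coordinates is kept and later ones are flagged True.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


# Hypothetical pairs as (chrom, pos, mate_chrom, mate_pos).
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates as the first pair
    ("1", 100, "1", 450),  # same start, different mate position
    ("2", 50, "2", 300),
    ("1", 100, "1", 400),  # third copy of the first pair
]
dup_flags = mark_duplicates(pairs)
dup_rate = sum(dup_flags) / len(pairs)
```

The sketch also shows why deep sequencing inflates the rate: two independent fragments that happen to share both coordinates are indistinguishable from a PCR duplicate under this rule.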

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (0.5M to 5.5M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given a chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Every chromosome is labelled 100% mapped and 0% unmapped.]
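A per-chromosome tally of this kind can be sketched from (chromosome, FLAG) pairs, relying on the convention described above that an unmapped read with a mapped mate carries its mate's chromosome. The record layout, function name, and sample data are hypothetical; reads with no chromosome assigned (RNAME `*` in SAM) are skipped.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: this read is unmapped


def per_chromosome_counts(records):
    """Count mapped and unmapped reads per chromosome.

    `records` is an iterable of (rname, flag) tuples.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue  # read has no chromosome assigned
        key = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][key] += 1
    return dict(counts)


# Hypothetical records: chromosome 1 has a singleton (73) and its unmapped mate (133).
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("*", 77)]
counts = per_chromosome_counts(records)
```

Dividing each chromosome's `mapped` and `unmapped` counts by their sum gives the percentages stacked in the chart.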