European Genome-Phenome Archive

File Quality

File Information: EGAF00004856820

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (1k to 100M).]
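As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many bases have each coverage value and pools everything above a cap into a single ">1000" bin, as the chart axis does. The function name, the cap parameter and the sample depths are hypothetical; this is not the archive's own pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count bases per coverage value, pooling values above `cap` into one bin."""
    hist = Counter()
    for depth in depths:
        # Coverage values above the cap share one label, e.g. ">1000".
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Hypothetical per-base depths for eight reference positions.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
```

Plotting the resulting counts on a logarithmic y-axis gives the shape described above: a tall bar at low coverage and a long descending tail.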

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 14G).]
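To make the scale concrete, here is a minimal sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)    # about 1 error in 10 000 calls
q30 = error_prob_to_phred(0.001)  # about 30
```

Every 10 points on the Phred scale corresponds to a tenfold drop in the error probability.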

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads in the total number of reads: mapped / (mapped + unmapped).

Mapped: 94.1 % (137 272 284 reads). Unmapped: 5.9 %.
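The rate is the simple ratio mapped / (mapped + unmapped). A minimal sketch, with a hypothetical function name and made-up read counts chosen only to give a round result:

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, not taken from this file.
rate = mapped_rate(mapped=941, unmapped=59)
```

Keeping the total in the denominator, rather than dividing mapped by unmapped, is what bounds the result between 0 and 1.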

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are still counted here, unlike in the Proper Pairs chart.

Both mates mapped: 94.1 % (137 165 212 reads). Other reads: 5.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (107 072 reads). Other reads: 99.9 %.
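The distinction between singletons and reads with both mates mapped can be sketched from alignment flags. The example below assumes the alignments are stored in SAM/BAM format and uses the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function and category names are hypothetical.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"      # this read mapped, its mate did not
    return "both_mapped"

labels = [classify(f) for f in (99, 73, 133, 0)]
```

Counting the "singleton" and "both_mapped" labels over all reads and dividing by the total gives the two rates reported in these charts.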

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (72 915 826 reads). Other reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from each end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 92.5 % (134 890 358 reads). Other reads: 7.5 %.
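The distance arithmetic in the example above (a 500 bp fragment minus two 100 bp reads leaves ~300 bp between the mates) can be sketched as follows. The tolerance value and both function names are assumptions made for illustration; real aligners typically estimate the expected distance from the data and also check read orientation, which this sketch ignores.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len

def looks_proper(observed_inner, fragment_len, read_len, tolerance=100):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is an arbitrary illustrative threshold, not an aligner default.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_inner - expected) <= tolerance

gap = expected_inner_distance(500, 100)   # 300
near = looks_proper(320, 500, 100)        # close to the expected gap
far = looks_proper(5000, 500, 100)        # far larger than expected
```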

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 20.9 % (30 449 216 reads). Other reads: 79.1 %.
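The coordinate-based approach described above can be sketched in a few lines: a pair is flagged when another pair with the same read and mate coordinates has already been seen. This is a deliberately simplified illustration with a hypothetical record layout (chromosome, position and strand for the read and for its mate); production duplicate-marking tools apply more refined criteria.

```python
def mark_duplicates(pairs):
    """Return one bool per pair: True if an identical key was seen earlier.

    Each pair is a tuple (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # first occurrence is kept, later ones flagged
        seen.add(key)
    return flags

# Hypothetical pairs; the second is an exact coordinate copy of the first.
pairs = [
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 400, "-"),
    ("2", 50, "+", "2", 350, "-"),
    ("1", 100, "+", "1", 401, "-"),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is why the rate becomes less informative at high depth.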

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads may map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped vs unmapped reads per reference sequence (axis labels 1 to 21, X, Y, M). Y-axis: 0% to 100%. Mapped fractions range from 99.74% to 99.93%; unmapped fractions range from 0.07% to 0.26%.]
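A per-chromosome tally of this kind can be sketched as below. It assumes SAM-style records in which an unmapped read placed with its mate carries the mate's chromosome name, and uses the standard SAM FLAG bit 0x4 for "read unmapped". The record layout (chromosome, flag) and the function name are hypothetical.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # standard SAM FLAG bit

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue  # read with no chromosome assigned at all
        counts[chrom][1 if flag & READ_UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records: the flag-133 read is unmapped but placed with its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```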