European Genome-Phenome Archive

File Quality

File Information: EGAF00002492768

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of reference positions (bases) covered that many times.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
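As an illustration of how such a distribution can be tallied, the following is a minimal sketch, not the archive's actual pipeline. The function name `coverage_histogram`, the `cap` parameter, and the toy depth values are assumptions made for this example; the overflow bin mirrors the ">1000" bin on the chart.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin stored
    under the key cap + 1 (shown as ">1000" when cap is 1000).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Toy per-position depths, made up for illustration only.
depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500, 2500]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a logarithmic y-axis keeps both the very common low-coverage bins and the rare high-coverage bins visible on the same chart.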

[Chart: base coverage distribution. X-axis: Coverage value (1 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
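To make the relationship between score and error probability concrete, here is a small sketch using the standard Phred definition, Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Each step of 10 in the score divides the error probability by 10.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

A score of 40 therefore corresponds to roughly one expected error in 10 000 base calls.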

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 10G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
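The calculation can be sketched as follows. The mapped count is the one reported for this file; the unmapped count is a hypothetical value chosen for illustration, since the report does not state it, and the function name is an assumption.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total

mapped = 106_134_661   # mapped reads reported for this file
unmapped = 750_000     # hypothetical, not taken from the report

rate = mapped_rate(mapped, unmapped)
print(f"{rate:.1%}")
```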

Mapped: 99.3 % (106 134 661 reads); unmapped: 0.7 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates map successfully to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; such pairs are not counted as proper pairs in the Proper Pairs chart.

Both mates mapped: 99.1 % (105 944 840 reads); remainder: 0.9 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair maps successfully to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
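In SAM/BAM files, this distinction can be read from the FLAG field of each alignment record. The sketch below uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are assumptions for this example, and secondary or supplementary alignments are not treated specially here.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify one alignment record by its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "mapped_unpaired"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"

# Example flags: 99 (paired, both mapped), 73 (paired, mate unmapped),
# 77 (paired, read and mate unmapped).
for flag in (99, 73, 77):
    print(flag, classify(flag))
```

Counting the records in each category and dividing by the total number of reads gives rates comparable to those shown in the Both Mates Mapped and Singletons charts.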

Singletons: 0.2 % (189 821 reads); remainder: 99.8 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (53 453 742 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map with an unexpected separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
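The arithmetic in the example above can be written out as a short sketch. The `tolerance` parameter and the function names are hypothetical and chosen for illustration; real aligners typically judge proper pairing from the insert size distribution they estimate from the data, together with read orientation.

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced gap between two mates read from opposite fragment ends."""
    return fragment_length - 2 * read_length

def has_expected_separation(observed_gap, fragment_length, read_length,
                            tolerance=150):
    """Simplified distance check; `tolerance` is a made-up threshold."""
    expected = expected_gap(fragment_length, read_length)
    return abs(observed_gap - expected) <= tolerance

print(expected_gap(500, 100))                    # 500 - 2 * 100 = 300
print(has_expected_separation(320, 500, 100))    # close to 300
print(has_expected_separation(5000, 500, 100))   # far from 300
```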

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).

Proper pairs: 96.1 % (102 790 530 reads); remainder: 3.9 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
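The coordinate-based idea described above can be sketched as follows. This is a simplified illustration, not the method used to produce this report: the record layout and function name are assumptions, the first pair seen at a location is kept, and dedicated duplicate-marking tools apply further refinements.

```python
def find_duplicates(pairs):
    """Return names of read pairs whose coordinates were already seen.

    Each pair is (name, chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Pairs with identical coordinates for both mates count as duplicates
    of the first pair observed at that location.
    """
    seen = set()
    duplicates = []
    for name, *coords in pairs:
        key = tuple(coords)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates

# Toy records, made up for illustration only.
pairs = [
    ("r1", "1", 1000, "+", "1", 1400, "-"),
    ("r2", "1", 1000, "+", "1", 1400, "-"),  # same coordinates as r1
    ("r3", "1", 1000, "+", "1", 1450, "-"),  # mate maps elsewhere
    ("r4", "2", 1000, "+", "2", 1400, "-"),  # different chromosome
]
print(find_duplicates(pairs))
```

Because the test relies only on coordinates, two independent fragments that happen to share both end positions are also flagged, which is why the rate becomes less informative at very high depth.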

Duplicates: 3.1 % (3 364 491 reads); remainder: 96.9 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, and so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (10M to 90M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
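A per-chromosome tally of this kind can be sketched as below. The input layout (a chromosome name and a SAM FLAG per record) and the function name are assumptions for this example; the only fact relied on is that bit 0x4 of the FLAG marks an unmapped read, which may still carry the chromosome of its mate.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def mapped_fraction_by_chromosome(records):
    """Return {chromosome: mapped / (mapped + unmapped)}.

    `records` is an iterable of (chromosome, flag) tuples. An unmapped
    read counts towards the chromosome recorded for it, which is
    usually the chromosome of its mapped mate.
    """
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Toy records, made up for illustration only.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
fractions = mapped_fraction_by_chromosome(records)
print(fractions)
```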

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome. Y-axis: percentage of reads (0% to 100%). Series: mapped, unmapped. Mapped fractions range from 99.66% to 99.84%; unmapped fractions range from 0.16% to 0.34%.]