European Genome-Phenome Archive

File Quality

File Information: EGAF00006165017

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
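As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value. The input depths are made up, and the `cap` parameter is a hypothetical stand-in for the ">1000" overflow bin on the chart; this is not the archive's own pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed by cap + 1
    (`cap` is a hypothetical parameter mirroring a '>1000' bin).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))

# Made-up per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
```

Plotting the counts on a logarithmic y-axis keeps both the large low-coverage bins and the small high-coverage bins visible.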

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.
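The relationship between score and error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative and do not come from any particular library.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score to the probability that the call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred score."""
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of about 1 in 10 000, and a score of 20 to about 1 in 100.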

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 14G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
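A minimal sketch of the mapped / (mapped + unmapped) calculation is shown below. It assumes the reads are available as SAM FLAG integers, where bit 0x4 marks an unmapped read. The example flags are made up, and the sketch ignores details such as secondary or supplementary alignments that a real tool would need to handle.

```python
UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for a list of SAM FLAG values."""
    total = len(flags)
    if total == 0:
        return 0.0
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / total

# Made-up flags: five mapped reads (99, 147, 73, 83, 163), three unmapped.
flags = [99, 147, 73, 133, 83, 163, 77, 141]
rate = mapped_rate(flags)
```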

[Chart: mapped reads. 105 924 270 reads (99.8 %); remainder 0.2 %.]

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; the Proper Pairs chart is the one that applies the distance criterion.

[Chart: both mates mapped. 105 804 824 reads (99.7 %); remainder 0.3 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
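The distinction between singletons and reads with both mates mapped can be sketched with SAM FLAG bits: 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function and category names below are illustrative assumptions, not the archive's implementation.

```python
PAIRED = 0x1         # SAM FLAG bit: read is part of a pair
UNMAPPED = 0x4       # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM FLAG bit: the mate is unmapped

def classify_read(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate did not map
    return "both_mates_mapped"
```

Since every mapped read in a pair falls into exactly one of the two mapped categories, their counts should add up to the mapped-read total. The counts shown in this report are consistent with that: 105 804 824 + 119 446 = 105 924 270.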

[Chart: singletons. 119 446 reads (0.1 %); remainder 99.9 %.]

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

[Chart: forward strand. 53 055 053 reads (50 %); remainder 50 %.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation are a signal for this variant, and those reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
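The fragment arithmetic in the example above (500 - 2 x 100 = 300) can be written as a small sketch. The `tolerance` value is a hypothetical placeholder: real aligners typically derive the acceptable range from the observed insert-size distribution and also check read orientation, neither of which is modelled here.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates when both ends of a fragment are read."""
    return fragment_len - 2 * read_len

def looks_properly_spaced(observed, fragment_len, read_len, tolerance=150):
    """Check whether an observed gap is close to the expected one.

    `tolerance` is an illustrative assumption, not a value from any aligner.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed - expected) <= tolerance
```

With these assumptions, a pair observed 320 base pairs apart would pass the check, while a pair 5 000 base pairs apart would not and might point to a structural variant.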

[Chart: proper pairs. 104 216 168 reads (98.2 %); remainder 1.8 %.]

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
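The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any specific tool: it keeps the first read seen at each coordinate key and marks later ones, whereas real duplicate markers also consider strand and orientation and usually keep the highest-quality read. The tuple layout and read names are assumptions.

```python
def mark_duplicates(reads):
    """Return the names of reads whose coordinates repeat an earlier read.

    `reads` is a list of (name, chrom, pos, mate_chrom, mate_pos) tuples,
    a hypothetical layout chosen for this sketch.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, mate_chrom, mate_pos in reads:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Made-up reads: r2 repeats r1's coordinates; r3 has a different mate position.
reads = [
    ("r1", "1", 1000, "1", 1300),
    ("r2", "1", 1000, "1", 1300),
    ("r3", "1", 1000, "1", 1450),
]
dups = mark_duplicates(reads)
```

The example also shows why depth matters: the more reads are sampled, the more often two independent fragments will share the same key by chance.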

[Chart: duplicates. 12 613 242 reads (11.9 %); remainder 88.1 %.]

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (10M to 90M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped fractions range from 99.52% to 99.9%; unmapped fractions range from 0.1% to 0.48%.]