European Genome-Phenome Archive

File Quality

File Information: EGAF00008412972

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Data is shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (ticks from 100 to >1000). Y-axis: number of bases, log scale (1k to 100M).]
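
As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions share each coverage value. The function name, the `cap` parameter, and the pooling of depths above 1000 into a single overflow bin are assumptions made for this example, not a description of how the archive builds the chart.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`
    (an assumed way of building a '>1000' bin).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(hist)

# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
```

Plotting the resulting counts on a logarithmic y-axis would give a curve of the kind described above.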

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 14G).]
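
The relationship between a Phred score and an error probability can be sketched as follows, using the standard definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly 1 error in 10,000 calls.
p40 = phred_to_error_prob(40)
q_for_1_in_1000 = error_prob_to_phred(0.001)
```

Under this definition each step of 10 in the score corresponds to a tenfold drop in the error probability.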

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped: 99.6 % (135 572 572 reads). Unmapped: 0.4 %.
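
The calculation mapped / (mapped + unmapped) can be written as a minimal sketch. The counts below are illustrative values invented for the example, not the counts for this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Illustrative counts only (hypothetical, not taken from this file).
rate = mapped_rate(996, 4)
```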

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart.

Both mates mapped: 99.4 % (135 311 284 reads). Remainder: 0.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (261 288 reads). Remainder: 99.8 %.
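
As a sketch of how reads could be sorted into these categories, the example below inspects the FLAG field of a SAM/BAM record. It assumes the alignments use the standard SAM flag bits; the `classify` function and its category labels are hypothetical and are not taken from the archive's pipeline.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped

def classify(flag):
    """Return a category label for one read (labels are illustrative)."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"   # mapped read whose mate is unmapped
    return "both_mapped"

# Example flags: 99 (both mates mapped), 73 (mate unmapped), 133 (read unmapped).
labels = [classify(f) for f in (99, 73, 133)]
```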

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (68 090 649 reads). Remainder: 50 %.
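
A forward-strand fraction could be estimated from SAM flags along these lines. The sketch assumes the standard reverse-strand bit (0x10) and takes mapped reads as the denominator, which is an assumption; the report does not state which denominator it uses.

```python
UNMAPPED = 0x4   # standard SAM bit: read is unmapped
REVERSE = 0x10   # standard SAM bit: read maps to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads on the forward strand (assumed denominator)."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: two forward reads, two reverse reads, one unmapped read.
flags = [99, 147, 83, 163, 77]
fraction = forward_fraction(flags)
```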

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.9 % (133 257 310 reads). Remainder: 2.1 %.
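
The distance arithmetic from the example above (a ~500 bp fragment with 100 bp reads leaving ~300 bp between the mates) can be sketched as follows. The fixed `tolerance` is a hypothetical simplification; aligners typically derive the acceptable range from the observed fragment size distribution instead.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates when both reads lie inside the fragment."""
    return fragment_len - 2 * read_len

def is_proper_distance(observed_inner, fragment_len, read_len, tolerance=100):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is an assumed fixed window used only for illustration.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_inner - expected) <= tolerance

gap = expected_inner_distance(500, 100)
```

A pair whose mates map several kilobases apart would fail this check, which is the kind of signal described for structural variants.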

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 11.6 % (15 824 873 reads). Non-duplicates: 88.4 %.
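
The coordinate-based idea described above can be illustrated with a minimal sketch: a pair is flagged as a duplicate if an earlier pair had the same coordinates for both the read and its mate. The tuple layout and function name are assumptions for this example, and production duplicate-marking tools apply further rules that are not modelled here.

```python
def mark_duplicates(pairs):
    """Flag pairs whose read and mate coordinates were already seen.

    Each pair is a tuple (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    This layout is hypothetical; the first occurrence is never flagged.
    """
    seen = set()
    is_duplicate = []
    for key in pairs:
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

# Hypothetical pairs: the second repeats the first, the third has a different mate position.
pairs = [
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 450, "-"),
]
dups = mark_duplicates(pairs)
duplicate_rate = sum(dups) / len(pairs)
```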

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis labels as extracted: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped: 99.79% to 99.88% per chromosome. Unmapped: 0.12% to 0.21% per chromosome.]
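
A per-chromosome breakdown of this kind could be tallied as in the sketch below. It assumes each record is reduced to a (reference name, SAM flag) tuple, that an unmapped read carries its mate's reference name as described above, and that reads with no reference name ("*") are skipped; the function name and record layout are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM bit: read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) tuples."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # read placed on no chromosome at all
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}

# Hypothetical records: flag 133 is an unmapped read placed next to its mate on chromosome 1.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```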