European Genome-Phenome Archive

File Quality

File Information: EGAF00004837944

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value (the number of times a position in the reference file is covered) and the y-axis shows the number of bases with that coverage.

Data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (1k to 100M).]
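As a rough illustration of how a distribution like this could be tabulated, the sketch below counts how many reference positions fall at each coverage value and pools everything above a cap into a single bin, as the chart's ">1000" bin does. The function name `coverage_histogram` and the input depths are hypothetical; the archive's actual pipeline is not described on this page.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500]
hist = coverage_histogram(example_depths)
```

Plotting the counts in `hist` on a logarithmic y-axis would give a chart of the same kind as the one described above.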

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 14G).]
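To make the score scale concrete, here is a minimal sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10 × log10(P). The function names are illustrative and not part of any archive tooling.

```python
import math

def phred_to_error_prob(q):
    """Probability that the call is wrong, given Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly 1 error in 10 000 calls.
p40 = phred_to_error_prob(40)
```

The same conversion applies to mapping quality scores, where P is the probability that the read is placed at the wrong location.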

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads against the total number of reads: mapped / (mapped + unmapped).

Mapped reads: 146 229 136 (99.5 %); remainder: 0.5 %
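The rate calculation described above can be sketched as follows. The function name and the example counts are hypothetical and chosen only to show the arithmetic; they are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=995, unmapped=5)
```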

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; the expected-distance criterion is the subject of the Proper Pairs chart.

Both mates mapped: 145 931 174 reads (99.3 %); remainder: 0.7 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 297 962 reads (0.2 %); remainder: 99.8 %
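The distinction between singletons and reads with both mates mapped can be sketched with SAM flag bits. This assumes the file is a SAM/BAM-style alignment, which this page does not state; the bit values are those defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped), while the function name and category labels are hypothetical.

```python
# Flag bits from the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify_read(flag):
    """Place a read in one of four categories based on its SAM flag."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        # This read mapped, but its mate did not.
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

# Flag 73 = paired + mate unmapped + first in pair.
category = classify_read(73)
```

Counting the categories over all reads would give the numerators for the Mapped Reads, Both Mates Mapped and Singletons charts under this assumption.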

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 73 462 555 reads (50 %); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the reads will map with an unexpected separation; this is a signal for the variant, and the reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 142 546 362 reads (97 %); remainder: 3 %
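The separation arithmetic in the example above (a ~500 base pair fragment with 100 base pair reads at each end) can be sketched as below. The function names and the tolerance value are hypothetical; real aligners apply their own criteria for what counts as a proper pair, including read orientation.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

def separation_as_expected(observed_gap, expected_gap, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    return abs(observed_gap - expected_gap) <= tolerance

gap = expected_inner_distance(fragment_len=500, read_len=100)  # 300
```

A pair whose observed gap is far outside the tolerance, for instance several kilobases, would fail this check and point to a possible structural variant.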

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data are analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. Consequently, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 13 960 298 reads (9.5 %); remainder: 90.5 %
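The coordinate-based detection described above can be sketched as follows: a pair is flagged when an earlier pair has the same read and mate coordinates. This is a simplified illustration with hypothetical names and data; production tools also take strand and orientation into account, and this is not presented as the archive's method.

```python
def mark_duplicates(pairs):
    """Flag pairs whose (chrom, pos, mate_chrom, mate_pos) was already seen.

    Returns a list of booleans, one per input pair, in input order.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags

# Hypothetical pairs: the second repeats the first, the third differs.
example_pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),
    ("1", 100, "1", 450),
]
dup_flags = mark_duplicates(example_pairs)
dup_rate = sum(dup_flags) / len(dup_flags)
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is the over-counting effect at high depth noted above.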

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location that it has been assigned to (specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 130M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

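[Chart: mapped vs unmapped reads per chromosome, stacked columns (0% to 100%). Chromosome labels are as shown on the chart axis.]

Chromosome  Mapped   Unmapped
1           99.8%    0.2%
2           99.77%   0.23%
3           99.79%   0.21%
4           99.79%   0.21%
5           99.79%   0.21%
6           99.79%   0.21%
7           99.79%   0.21%
8           99.79%   0.21%
9           99.79%   0.21%
10          99.79%   0.21%
11          99.79%   0.21%
12          99.8%    0.2%
13          99.79%   0.21%
14          99.79%   0.21%
15          99.79%   0.21%
16          99.81%   0.19%
17          99.8%    0.2%
18          99.79%   0.21%
19          99.81%   0.19%
20          99.79%   0.21%
21          99.79%   0.21%
X           99.79%   0.21%
Y           99.86%   0.14%
M           99.71%   0.29%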