European Genome-Phenome Archive

File Quality

File Information

EGAF00002307892

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale so that both very large and very small counts remain readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.
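As a rough illustration of how such a distribution can be built, the sketch below counts how many bases fall at each coverage value. The function name, the input list of per-position depths, and the pooling of everything above 1000 into a single bin are assumptions made for this example; they are not taken from the archive's pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many bases have each coverage value.

    Depths above `cap` are pooled into one ">cap" bin (an assumed
    binning, chosen to resemble the chart's final tick).
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return dict(hist)

# Hypothetical per-position depths for a tiny reference
example_depths = [0, 0, 1, 1, 1, 2, 5, 1200]
histogram = coverage_histogram(example_depths)
```

Plotting the resulting counts on a logarithmic y-axis would give the shape described above.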

Chart: Base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
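The relationship between a Phred score and an error probability can be checked with a few lines of code. This is a minimal sketch using the standard Phred definition Q = -10 log10(P); the function names are chosen for this example only.

```python
import math

def phred_to_error_probability(q):
    """Convert a Phred score Q into the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Convert an error probability P into a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly one expected error in 10,000 calls
p_q40 = phred_to_error_probability(40)
```

The same conversion applies to mapping quality scores, where P is the probability that a read is placed at the wrong location.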

Chart: Base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 18G).

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more substantial variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
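The rate calculation is simple arithmetic, sketched below. The function name and the example counts are hypothetical and serve only to show where the parentheses belong in mapped / (mapped + unmapped).

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts: 995 mapped reads and 5 unmapped reads
example_rate = mapped_rate(995, 5)
```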

Mapped reads: 99.5 % (164 537 071 reads); unmapped: 0.5 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; the Proper Pairs chart applies that additional criterion.

Both mates mapped: 99.4 % (164 260 450 reads); remaining reads: 0.6 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
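One way to tell singletons apart from pairs with both mates mapped is to inspect the FLAG field of each alignment record. The sketch below assumes the reads are stored in SAM/BAM format and uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); whether the archive computes its statistics this way is an assumption, and the function name is chosen for this example.

```python
# Standard SAM FLAG bits
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify_pair_status(flag):
    """Classify one alignment record by the mapping status of it and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"  # this read mapped, its mate did not
    return "unmapped"

# Flag 73 = paired + mate unmapped + first in pair
status_73 = classify_pair_status(73)
# Flag 99 = paired + proper pair + mate reverse strand + first in pair
status_99 = classify_pair_status(99)
```

Counting the records in each category and dividing by the total number of reads would give rates comparable to the ones charted here.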

Singletons: 0.2 % (276 621 reads); remaining reads: 99.8 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (82 652 937 reads); remaining reads: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the reads spanning the variant map with an unexpected separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
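The ~300 base pair figure follows from subtracting both read lengths from the fragment length. The sketch below works through that arithmetic; the function names and the tolerance parameter are hypothetical, since the text does not say how far from the expected distance a pair may be before it stops counting as proper.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length

def separation_is_expected(observed, expected, tolerance):
    """Check an observed gap against the expected one (tolerance is an assumed cutoff)."""
    return abs(observed - expected) <= tolerance

# The example from the text: ~500 bp fragments with 100 bp reads
gap = expected_inner_distance(500, 100)
```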

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.9 % (161 759 170 reads); remaining reads: 2.1 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
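The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration: the function name and the tuple layout are assumptions, and real duplicate-marking tools typically also take read orientation and base quality into account when deciding which copy to keep.

```python
def mark_duplicates(pairs):
    """Flag a pair as duplicate if an earlier pair has the same coordinates.

    Each pair is a (chromosome, read_position, mate_position) tuple.
    Returns one boolean per input pair, in input order.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # duplicate if these coordinates were seen before
        seen.add(key)
    return flags

# Hypothetical pairs: the second repeats the coordinates of the first
example_pairs = [
    ("1", 100, 400),
    ("1", 100, 400),
    ("1", 100, 450),
    ("2", 100, 400),
]
duplicate_flags = mark_duplicates(example_pairs)
```

The sketch also shows why high depth inflates the rate: two independent fragments that happen to share coordinates are indistinguishable from a true PCR duplicate under this rule.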

Duplicates: 5.7 % (9 490 298 reads); remaining reads: 94.3 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

Chart: Mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (20M to 140M).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample comes from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Another cause arises when the alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

Chart: Mapped vs unmapped reads per chromosome. X-axis: chromosome (labelled 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Mapped fractions range from 99.77% to 99.89%; unmapped fractions range from 0.11% to 0.23%.