European Genome-Phenome Archive

File Quality

File Information

EGAF00008413133

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to minimise the variability. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (up to >1000); y-axis: # Bases (log scale, 1k to 100M).]
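As an illustration of how such a distribution could be tabulated, here is a minimal Python sketch. It assumes a list of per-position depths is already available; the function name, the sample depths and the pooling of everything above 1000 into one bin (mirroring the ">1000" axis label) are assumptions for illustration, not the archive's actual pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Return {coverage value: number of positions}, pooling values above `cap`."""
    overflow = cap + 1  # single bin standing in for ">cap"
    return Counter(min(d, overflow) for d in depths)

# Hypothetical per-position depths, not taken from the report.
depths = [0, 1, 1, 2, 35, 35, 1500, 3000]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a logarithmic y-axis would then give a chart of the kind described above.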

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred at that base. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 14G).]
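The relation between a Phred score and an error probability follows the standard definition Q = -10·log10(P). A small Python sketch of the conversion in both directions (the function names are illustrative only):

```python
import math

def phred_to_error_prob(q):
    """Probability that the call is wrong, given Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly one error in 10,000 calls,
# a score of 20 to roughly one error in 100.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```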

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.5 % (126 193 567 reads). Unmapped: 0.5 %.
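The mapped reads rate is the ratio mapped / (mapped + unmapped). A short Python sketch, using hypothetical counts because the report does not list the number of unmapped reads:

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that were mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total

# Hypothetical counts chosen for illustration only.
rate = mapped_rate(mapped=995, unmapped=5)
print(f"{rate:.1%}")
```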

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this figure also includes pairs whose mates do not map at the expected distance; such pairs are excluded from the Proper Pairs chart.

Both mates mapped: 99.3 % (125 845 374 reads). Other: 0.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (348 193 reads). Other: 99.7 %.
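The distinction between pairs with both mates mapped, singletons and unmapped reads can be sketched from the FLAG field of a SAM/BAM record. The bit values below come from the SAM format (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); that the report derives its categories in exactly this way is an assumption, and the function name is illustrative.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the other mate is unmapped

def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"

# Example FLAG values: 99 (mapped pair), 73 (mapped, mate unmapped),
# 133 (unmapped mate of a mapped read).
for flag in (99, 73, 133):
    print(flag, classify(flag))
```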

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (63 391 003 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with a separation far from the expected one would be a signal for this variant, and they would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97 % (123 018 444 reads). Other: 3 %.
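The arithmetic behind the expected separation can be written out as a short Python sketch. The tolerance check is a hypothetical illustration: aligners decide what counts as "the expected distance" from the observed insert size distribution, and the threshold used here is not taken from the report.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length

def within_expected(observed, expected, tolerance):
    """Hypothetical check: is the observed gap close to the expected one?"""
    return abs(observed - expected) <= tolerance

expected = expected_inner_distance(fragment_length=500, read_length=100)
print(expected)  # 300, as in the example in the text

# A pair separated by 5,000 bp is far from the expected gap and would
# not be counted as a proper pair under this illustrative threshold.
print(within_expected(5000, expected, tolerance=150))
```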

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 12.1 % (15 373 195 reads). Other: 87.9 %.
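The coordinate-based idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of any particular tool: each pair is reduced to a hypothetical tuple (chromosome, position, strand, mate chromosome, mate position, mate strand), the first pair seen with a given tuple is kept, and later ones are flagged. Real duplicate markers are more careful about which copy to keep and how positions are measured.

```python
def mark_duplicates(pairs):
    """Return one bool per pair: True if an identical pair was seen before."""
    seen = set()
    flags = []
    for pair in pairs:
        flags.append(pair in seen)
        seen.add(pair)
    return flags

# Hypothetical pairs: the second is a coordinate-identical copy of the first.
pairs = [
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 450, "-"),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, f"{duplicate_rate:.1%}")
```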

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores, which describe the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (10M to 110M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (x-axis labels 1 to 21, X, Y, M; y-axis 0% to 100%). Mapped fractions range from 99.7% to 99.81%; unmapped fractions range from 0.19% to 0.3%.]
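A per-chromosome tally of this kind could be sketched as follows in Python. It assumes each record has been reduced to a (chromosome, FLAG) pair, where an unmapped read carries the chromosome of its mapped mate as described above, and it uses the SAM FLAG bit 0x4 to identify unmapped reads. The function name and the sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_rates(records):
    """Return {chromosome: (mapped %, unmapped %)} from (chromosome, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    rates = {}
    for chrom, (mapped, unmapped) in counts.items():
        total = mapped + unmapped
        rates[chrom] = (100 * mapped / total, 100 * unmapped / total)
    return rates

# Hypothetical records: three mapped reads and one unmapped mate on
# chromosome 1, and one mapped read on chromosome X.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
for chrom, (mapped_pct, unmapped_pct) in per_chromosome_rates(records).items():
    print(chrom, mapped_pct, unmapped_pct)
```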