European Genome-Phenome Archive

File Quality

File Information: EGAF00000065099

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases, log scale (100 to 100M).]
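As a rough illustration of how such a distribution can be built, the sketch below counts how many positions have each coverage value and pools everything above 1000 into one overflow bin, as the ">1000" tick on the chart suggests. The function name, the `CAP` constant and the input depths are hypothetical; this is not the archive's own pipeline.

```python
from collections import Counter

CAP = 1000  # assumption: values above this are pooled into a ">1000" bin


def coverage_distribution(depths, cap=CAP):
    """Count how many positions have each coverage value.

    Any depth above `cap` is stored under the single key `cap + 1`.
    """
    return Counter(min(depth, cap + 1) for depth in depths)


# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2300]
dist = coverage_distribution(depths)

for value in sorted(dist):
    label = f">{CAP}" if value > CAP else str(value)
    print(label, dist[value])
```

Plotting the counts on a log axis then gives the shape described above: a tall bar at low coverage and a long descending tail.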

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0 to 6G).]
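The standard Phred relation is Q = -10 × log10(P), so a score can be converted to an error probability and back. The helper names below are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Q = 40 corresponds to roughly 1 wrong call in 10,000;
# Q = 20 corresponds to roughly 1 wrong call in 100.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Each additional 10 points on the x-axis therefore means a tenfold lower error probability.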

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and pairs with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 95.3 % (191 046 331 reads). Unmapped: 4.7 %.
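The rate is a plain ratio, but the grouping matters: the denominator is the sum of mapped and unmapped reads. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Illustrative counts only, not taken from this file.
rate = mapped_rate(mapped=953, unmapped=47)
print(f"{rate:.1%}")
```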

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here, unlike in the Proper Pairs chart.

Both mates mapped: 94.4 % (189 333 021 reads). Other: 5.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.9 % (1 713 310 reads). Other: 99.1 %.
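In SAM/BAM files the information needed to tell these cases apart is carried in the FLAG field: bit 0x1 marks a paired read, 0x4 an unmapped read and 0x8 an unmapped mate. The sketch below classifies a single read from its flag. The category names and the function are illustrative, and it is an assumption that the charts on this page are derived in this way.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def pair_status(flag):
    """Classify one read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"  # this read mapped, its mate did not
    return "unmapped"


for flag in (99, 73, 133):
    print(flag, pair_status(flag))
```

Flag 99 describes a mapped read with a mapped mate, 73 a mapped read whose mate is unmapped, and 133 the unmapped mate itself.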

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (100 263 802 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with a separation markedly different from the expected one would be a signal for this variant and would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 93.8 % (188 188 191 reads). Other: 6.2 %.
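The ~300 base pair figure in the example above follows from subtracting both read lengths from the fragment length: 500 - 2 × 100 = 300. A small sketch of that arithmetic, with a hypothetical function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both are read from the fragment ends.

    A negative result means the two reads overlap.
    """
    return fragment_length - 2 * read_length


# The example from the text: ~500 bp fragments, 100 bp reads.
print(expected_inner_distance(500, 100))
```

Aligners typically allow a tolerance around this expected value rather than requiring an exact match, since real fragment lengths vary within a library.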

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. Consequently, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 1 % (2 093 306 reads). Other: 99 %.
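The coordinate-based idea described above can be sketched as follows: reads are grouped by their own position and strand together with their mate's position, and every read after the first in a group is flagged. The record layout and function name are hypothetical, and real duplicate-marking tools apply more refined criteria than this simplified version.

```python
def mark_duplicates(reads):
    """Return the names of reads that repeat an already-seen coordinate key.

    Each read is a dict with the keys: name, chrom, pos, strand,
    mate_chrom, mate_pos. The first read seen for a key is kept.
    """
    seen = set()
    duplicates = set()
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"],
               read["mate_chrom"], read["mate_pos"])
        if key in seen:
            duplicates.add(read["name"])
        else:
            seen.add(key)
    return duplicates


# Hypothetical reads: r1 and r2 share all coordinates, r3 does not.
reads = [
    {"name": "r1", "chrom": "1", "pos": 100, "strand": "+",
     "mate_chrom": "1", "mate_pos": 400},
    {"name": "r2", "chrom": "1", "pos": 100, "strand": "+",
     "mate_chrom": "1", "mate_pos": 400},
    {"name": "r3", "chrom": "1", "pos": 100, "strand": "+",
     "mate_chrom": "1", "mate_pos": 420},
]
print(mark_duplicates(reads))
```

The sketch also shows why very deep data inflates the rate: two independent fragments that happen to share both coordinates are indistinguishable from a PCR duplicate under this rule.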

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (up to 140M).]
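One way to put a number on "the majority of reads should be well mapped" is to compute, from a mapping quality histogram, the share of reads at or above a chosen score. The function name, the threshold of 30 and the histogram below are all illustrative assumptions, not values taken from this file.

```python
def fraction_at_or_above(histogram, threshold):
    """Share of reads whose mapping quality is >= threshold.

    `histogram` maps a mapping quality score to a read count.
    """
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    good = sum(count for score, count in histogram.items() if score >= threshold)
    return good / total


# Hypothetical histogram: most reads at 60, a few ambiguous reads at 0.
histogram = {0: 50, 20: 50, 60: 900}
well_mapped = fraction_at_or_above(histogram, 30)
ambiguous = histogram[0] / sum(histogram.values())
print(f"well mapped: {well_mapped:.1%}, score 0: {ambiguous:.1%}")
```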

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read should have the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

Chromosome   Mapped    Unmapped
1            99.1%     0.9%
2            99.14%    0.86%
3            99.51%    0.49%
4            99.4%     0.6%
5            99.46%    0.54%
6            99.55%    0.45%
7            99.27%    0.73%
8            99.37%    0.63%
9            98.87%    1.13%
10           98.16%    1.84%
11           98.9%     1.1%
12           99.52%    0.48%
13           99.59%    0.41%
14           99.48%    0.52%
15           99.16%    0.84%
16           98.61%    1.39%
17           98.77%    1.23%
18           99.38%    0.62%
19           98.69%    1.31%
20           99.45%    0.55%
21           99.2%     0.8%
X            99.21%    0.79%
Y            89.32%    10.68%
M            99.03%    0.97%
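Because an unmapped read with a mapped mate carries its mate's chromosome and position, a per-chromosome tally only needs each record's reference name and FLAG. The sketch below assumes records are available as (chromosome, flag) tuples. The function name and the sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def per_chromosome_rates(records):
    """Return {chromosome: (mapped %, unmapped %)} from (chromosome, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (100 * mapped / (mapped + unmapped),
                100 * unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }


# Hypothetical records: flag 133 is an unmapped read placed with its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("Y", 99)]
rates = per_chromosome_rates(records)
for chrom, (mapped_pct, unmapped_pct) in rates.items():
    print(chrom, mapped_pct, unmapped_pct)
```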