European Genome-Phenome Archive

File Quality

File Information

EGAF00002061308

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale so that both very large and very small counts remain visible. A high peak at the beginning (low coverage) followed by a descending curve is expected.
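As a rough illustration of how such a distribution could be built, the sketch below counts how many positions share each coverage value and pools everything above a cap into one overflow bin, similar to a ">1000" bucket. The function name, the cap and the sample depths are hypothetical; this is not the archive's own pipeline.

```python
import math
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value.

    Values above `cap` are pooled into one overflow bin keyed `cap + 1`.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))

# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)

# Log-scaled counts, as they would appear on a logarithmic y-axis.
log_counts = {cov: math.log10(n) for cov, n in hist.items()}
```

The overflow bin keeps the x-axis bounded even when a few positions have extreme coverage.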

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
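The relationship between score and error probability can be checked with a short sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to one expected error in 10,000 calls.
p40 = phred_to_error_prob(40)
q_for_1_in_1000 = error_prob_to_phred(0.001)
```

Each step of 10 in the score therefore means a tenfold drop in the error probability.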

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 18G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
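The rate calculation is simple enough to sketch directly. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts: 994 mapped reads and 6 unmapped reads.
rate = mapped_rate(994, 6)
```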

Mapped: 99.4 % (182 476 514 reads); unmapped: 0.6 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; the Proper Pairs chart covers only pairs that do.

Both mates mapped: 98.9 % (181 622 016 reads); remainder: 1.1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
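A minimal sketch of how a read could be sorted into these categories, assuming the alignments are stored in SAM/BAM and using the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The category names are illustrative, and the source does not state that the report is computed this way.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "mapped_unpaired"
    if flag & MATE_UNMAPPED:
        return "singleton"          # mapped, but its mate is not
    return "both_mates_mapped"

# Example FLAG values: 99 = 1+2+32+64, 73 = 1+8+64, 133 = 1+4+128.
labels = [classify(f) for f in (99, 73, 133, 0)]
```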

Singletons: 0.5 % (854 498 reads); remainder: 99.5 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (91 810 441 reads); remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the mates can map with a separation far from the expected one; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
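The expected separation in the example is the fragment length minus the two read lengths. The sketch below reproduces that arithmetic and adds a simple distance check; the tolerance value is a hypothetical choice, not one taken from the report.

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced bases between two mates read from either end of a fragment."""
    return fragment_length - 2 * read_length

def at_expected_distance(observed_gap, expected, tolerance=150):
    """True if the observed gap is within `tolerance` bases of the expected gap.

    The default tolerance is an illustrative assumption.
    """
    return abs(observed_gap - expected) <= tolerance

gap = expected_gap(500, 100)             # 500 - 2 * 100
ok = at_expected_distance(320, gap)      # close to the expected gap
far = at_expected_distance(5000, gap)    # e.g. mates spanning a structural variant
```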

Proper pairs: 97.5 % (179 044 406 reads); remainder: 2.5 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.
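The coordinate-based idea described above can be sketched as follows: pairs sharing the same read and mate coordinates are grouped, and every pair after the first in a group is flagged. This is a deliberately simplified illustration with hypothetical data; real duplicate-marking tools typically take more information into account.

```python
def find_duplicates(pairs):
    """Return indices of pairs whose coordinates were already seen.

    Each pair is a tuple (chrom, pos, mate_chrom, mate_pos).
    """
    seen = set()
    duplicates = []
    for index, key in enumerate(pairs):
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates

# Hypothetical read pairs.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),   # same coordinates as the first pair
    ("1", 100, "1", 450),   # same start, different mate position
    ("2", 100, "2", 400),
    ("1", 100, "1", 400),   # another copy of the first pair
]
dups = find_duplicates(pairs)
duplicate_rate = len(dups) / len(pairs)
```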

Duplicates: 15.2 % (27 823 968 reads); remainder: 84.8 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so the distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (20M to 160M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads may still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
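Because an unmapped read carries its mate's chromosome, a per-chromosome breakdown can be tallied from the chromosome field and the unmapped FLAG bit alone. The sketch below assumes SAM-style records reduced to (chromosome, FLAG) tuples; the record layout and sample values are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: fraction of its reads that are mapped}."""
    counts = defaultdict(lambda: [0, 0])   # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records; FLAG 133 marks an unmapped read placed with its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
rates = per_chromosome_mapped_rate(records)
```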

Mapped and unmapped reads per chromosome (chart values):

Chromosome  Mapped   Unmapped
1           99.51%   0.49%
2           99.5%    0.5%
3           99.52%   0.48%
4           99.53%   0.47%
5           99.52%   0.48%
6           99.51%   0.49%
7           99.52%   0.48%
8           99.52%   0.48%
9           99.51%   0.49%
10          99.5%    0.5%
11          99.51%   0.49%
12          99.52%   0.48%
13          99.52%   0.48%
14          99.53%   0.47%
15          99.51%   0.49%
16          99.55%   0.45%
17          99.51%   0.49%
18          99.5%    0.5%
19          99.55%   0.45%
20          99.5%    0.5%
21          99.51%   0.49%
X           99.53%   0.47%
Y           99.18%   0.82%
M           99.39%   0.61%