European Genome-Phenome Archive

File Quality

File Information: EGAF00002831469

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis is the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis is the number of bases with that coverage.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: number of bases (y-axis, log scale from 1k to 100M) against coverage value (x-axis, 0 to >1000).]
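As a rough illustration of how such a distribution could be tabulated, the sketch below counts positions per coverage value and pools everything above a cap into one overflow bin, similar to the ">1000" bin on the chart. The function name, the `cap` parameter, and the sample depths are assumptions made for this example; they are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into a single overflow bin keyed as
    cap + 1, mirroring a ">1000" bin on the x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist

# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2300]
hist = coverage_histogram(depths)
print(hist[1])     # positions covered exactly once
print(hist[1001])  # positions in the overflow bin
```

Plotting the counts on a log-scaled y-axis would then give a curve of the shape described above.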

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly called, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong; for example, Q = 40 corresponds to P = 0.0001. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it should be skewed towards larger (more confident) scores, typically around 40.

[Chart: number of bases (y-axis, 0G to 14G) against Phred quality score (x-axis, 0 to 40).]
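The relationship between a Phred score and an error probability can be sketched in a few lines. This uses the standard Phred definition, Q = -10 log10(P); the function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    # Each step of 10 in Q divides the error probability by 10.
    print(q, phred_to_error_prob(q))
```

A score of 40 therefore means roughly one wrong call in 10,000 bases.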

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 140 265 762 reads (99.5 %). Unmapped: 0.5 %.
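The rate calculation described above, mapped / (mapped + unmapped), is sketched below. The function name and the read counts in the usage line are hypothetical and are not this file's actual totals.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=995, unmapped=5)
print(f"{rate:.1%}")
```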

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected-distance criterion is applied in the Proper Pairs chart.

Both mates mapped: 140 060 462 reads (99.3 %). Other: 0.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 205 300 reads (0.1 %). Other: 99.9 %.
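In SAM/BAM files, whether a read and its mate are mapped is recorded in the FLAG field (bit 0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The sketch below uses those bits to separate the categories discussed above; the function name and the category labels are assumptions for this example and may not match how the EGA report computes its figures.

```python
PAIRED = 0x1         # read is part of a pair
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped

def pair_status(flag):
    """Classify a read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate is unmapped
    return "unmapped"

for flag in (99, 73, 133, 77):
    print(flag, pair_status(flag))
```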

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 70 498 881 reads (50 %). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpectedly large separation would be a signal for this variant, and such reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 138 267 302 reads (98.1 %). Other: 1.9 %.
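The ~300 base pair figure in the example above follows from subtracting both read lengths from the fragment length. A minimal sketch of that arithmetic (the function name is chosen for this example):

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both reads.

    The result is negative when the reads are long enough to overlap.
    """
    return fragment_length - 2 * read_length

# ~500 bp fragments sequenced with 100 bp reads from either end.
print(expected_inner_distance(500, 100))
```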

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 17 772 026 reads (12.6 %). Non-duplicates: 87.4 %.
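The coordinate-based idea described above can be sketched as follows: a pair is flagged as a duplicate when an earlier pair has the same read and mate coordinates. This is a deliberately simplified illustration with hypothetical input tuples; production duplicate-marking tools apply further criteria (for example strand and base quality), and this is not necessarily the method used for this report.

```python
def mark_duplicates(pairs):
    """Flag pairs whose (chrom, pos, mate_chrom, mate_pos) was already seen.

    Returns one boolean per input pair; the first pair at each set of
    coordinates is kept and later ones are flagged as duplicates.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags

# Hypothetical pairs: (chrom, pos, mate_chrom, mate_pos).
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates as the first pair
    ("1", 100, "1", 420),  # same start, different mate position
    ("2", 100, "2", 400),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)
```

The example also shows why depth matters: the more pairs are sampled, the more often independent fragments share coordinates by chance and get flagged.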

Mapping Quality Distribution

The mapping quality distribution shows the Phred-scaled quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: number of reads (y-axis, 10M to 120M) against Phred quality score (x-axis, 0 to 60).]
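One way to summarise such a distribution is the fraction of reads at or above a mapping quality threshold. The sketch below assumes the histogram is available as a mapping from score to read count; the function name, the threshold, and the counts are hypothetical.

```python
def fraction_at_or_above(mapq_counts, threshold):
    """Fraction of reads whose mapping quality is >= threshold.

    `mapq_counts` maps a mapping quality score to a number of reads.
    """
    total = sum(mapq_counts.values())
    if total == 0:
        raise ValueError("empty histogram")
    passing = sum(n for q, n in mapq_counts.items() if q >= threshold)
    return passing / total

# Hypothetical histogram: most reads at 60, a few at 0 (multi-mapping).
mapq_counts = {0: 5, 30: 10, 60: 85}
print(fraction_at_or_above(mapq_counts, 30))
```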

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: percentage of mapped and unmapped reads (y-axis, 0% to 100%) per chromosome (x-axis: 1 to 21, X, Y, M). Mapped values range from 99.81% to 99.9%; unmapped values range from 0.1% to 0.19%.]
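Because an unmapped read with a mapped mate carries its mate's chromosome and POS, a per-chromosome tally can group reads by the reference name and split them on the SAM "read unmapped" flag bit (0x4). The sketch below assumes records are available as (reference name, FLAG) tuples; the function name and the sample records are hypothetical.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for rname, flag in records:
        counts[rname][1 if flag & READ_UNMAPPED else 0] += 1
    return {
        chrom: mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }

# Hypothetical records; flag 133 is an unmapped read placed at its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_mapped_rate(records)
print(rates)
```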