European Genome-Phenome Archive

File Quality

File Information

EGAF00003441135

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The number of bases is plotted on a log scale to compress its wide range. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases (log scale, 1k to 100M).]
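The distribution described above can be sketched as a simple tally of per-position depths. This is a minimal illustration, not the archive's implementation: the function name `coverage_histogram`, the `cap` parameter, and the example depths are assumptions made for the sketch. The overflow bin mirrors the ">1000" category on the chart's x-axis.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed cap + 1,
    mirroring the ">1000" bin on the chart's x-axis (an assumption about
    how the chart was built).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500, 2000]
hist = coverage_histogram(example_depths)
```

Plotting `hist` with a logarithmic y-axis would give a chart of the same kind as the one in this section.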

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 11G).]
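As a small worked example of the Phred relationship, the sketch below converts between a score Q and an error probability P using the standard definition Q = -10 log10(P). The function names are illustrative, not taken from any particular library.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)    # about 1 error in 10,000 base calls
q30 = error_prob_to_phred(0.001) # about 30
```

Under this definition, each increase of 10 in the score corresponds to a tenfold drop in the error probability.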

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.3 % (112 598 442 reads). Unmapped: 0.7 %.
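The rate mapped / (mapped + unmapped) can be sketched from SAM flags, where bit 0x4 marks an unmapped read. This is a minimal sketch: the function name and the example flags are hypothetical, and whether the report excludes secondary or supplementary alignments is not stated, so none are filtered here.

```python
UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM flags."""
    mapped = unmapped = 0
    for flag in flags:
        if flag & UNMAPPED:
            unmapped += 1
        else:
            mapped += 1
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads supplied")
    return mapped / total

# Hypothetical flags: three mapped reads and one unmapped read.
rate = mapped_rate([99, 147, 73, 133])
```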

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; the Proper Pairs chart is the one that takes the expected distance into account.

Both mates mapped: 99.1 % (112 336 074 reads). Remainder: 0.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (262 368 reads). Remainder: 99.8 %.
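The distinction between "both mates mapped" and "singleton" can be sketched from the standard SAM flag bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function name and category labels below are illustrative assumptions, not the archive's own terminology or code.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the other mate is unmapped

def classify_read(flag):
    """Classify a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped, but its mate is not
    return "both_mates_mapped"

# Example SAM flags and the classes they fall into.
labels = [classify_read(f) for f in (99, 73, 133, 0)]
```

Here flag 73 (paired, mate unmapped, first in pair) is the singleton case described above, and flag 133 is its unmapped mate.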

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (56 693 745 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads that map with a separation far from the expected one are a signal of the variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.6 % (110 633 602 reads). Remainder: 2.4 %.
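The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. The sketch below shows that calculation together with a deliberately simplified distance check; the `tolerance` parameter and both function names are assumptions for illustration. Real aligners typically derive the accepted range from the observed insert size distribution and also check read orientation.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read from either end."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_fragment_len, expected_fragment_len,
                             tolerance):
    """Simplified check: is the observed fragment length close to expected?

    `tolerance` is a hypothetical fixed margin in base pairs.
    """
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance

gap = expected_gap(500, 100)                       # 500 - 2 * 100
ok = within_expected_distance(520, 500, 100)       # close to expected
far = within_expected_distance(5000, 500, 100)     # candidate structural variant
```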

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 6.4 % (7 267 741 reads). Remainder: 93.6 %.
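The coordinate-based approach described above can be sketched as follows. This is a simplified illustration under stated assumptions: each pair is represented by a hypothetical tuple `(chrom, pos, mate_chrom, mate_pos)`, and the first pair seen at a given set of coordinates is kept while later ones are marked. Production tools generally also account for strand and clipping and choose which copy to keep by quality.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if it repeats earlier coordinates.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples, a
    hypothetical representation chosen for this sketch.
    """
    seen = set()
    is_duplicate = []
    for pair in pairs:
        is_duplicate.append(pair in seen)
        seen.add(pair)
    return is_duplicate

# Hypothetical pairs: the third repeats the coordinates of the first.
pairs = [
    ("1", 1000, "1", 1300),
    ("1", 1000, "1", 1350),
    ("1", 1000, "1", 1300),
    ("2", 500, "2", 800),
]
dups = mark_duplicates(pairs)
duplicate_rate = sum(dups) / len(dups)
```

The second pair shares only one coordinate with the first, so it is not marked; both the read and its mate must coincide.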

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 100M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads may still be placed on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome, as stacked columns on a 0% to 100% scale. Mapped fractions range from 99.74% to 99.85% and unmapped fractions from 0.15% to 0.26%.]