European Genome-Phenome Archive

File Quality

File Information

EGAF00002340493

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The data are plotted on a log scale to compress the wide range of counts. A high peak at the start (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. x-axis: Coverage value (ticks from 100 to >1000); y-axis: # Bases (log scale, 1k to 100M).]
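As an illustration of what the chart counts, the following minimal sketch builds a coverage histogram from a list of per-position depths. The function name, the `cap` parameter, and the toy depths are assumptions made for this example; the report does not state how the archive computes the chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many reference positions have each coverage value.

    Depths above `cap` are pooled into a single bin keyed `cap + 1`,
    mirroring the '>1000' bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))


# Toy per-position depths (made-up values, not taken from this file).
depths = [0, 1, 1, 2, 2, 2, 3, 1500]
hist = coverage_histogram(depths)
```

Each key of `hist` is a point on the x-axis and each value is the height of the curve at that point, before the log scaling.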

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
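The relationship between a Phred score and an error probability can be checked with a few lines of Python. This sketch uses the standard Phred definition, Q = -10 log10(P); the function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly one expected error in 10,000 base calls.
p40 = phred_to_error_prob(40)
q_from_p = error_prob_to_phred(0.001)
```

Under this definition, every 10 points of Phred score reduces the error probability by a factor of 10.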

[Chart: base quality distribution. x-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 16G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
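The rate described above is a simple ratio. The sketch below spells it out; the function name and the toy counts are assumptions for illustration and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Toy counts (hypothetical): 995 mapped reads and 5 unmapped reads.
rate = mapped_rate(995, 5)
```

Writing the denominator as an explicit sum avoids the operator-precedence ambiguity of `mapped/mapped+unmapped`.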

Mapped: 99.5 % (166 202 145 reads). Unmapped: 0.5 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; that distance is assessed in the Proper Pairs chart.

Both mates mapped: 99.1 % (165 479 662 reads). Other: 0.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
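For alignments stored in SAM/BAM, the difference between a singleton and a read with both mates mapped can be read from the FLAG field. The sketch below assumes the FLAG bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the category names are made up for this example, and the report does not say that its counts are derived this way.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the other mate is unmapped


def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"


# Example FLAG values: 99 (paired, both mapped), 73 (mapped, mate unmapped),
# 133 (unmapped second mate), 0 (unpaired read).
labels = [classify(f) for f in (99, 73, 133, 0)]
```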

Singletons: 0.4 % (722 483 reads). Other: 99.6 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
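As a rough sketch, the forward-strand fraction can be computed from SAM FLAG values, assuming the SAM specification's bits 0x10 (read on the reverse strand) and 0x4 (read unmapped). Excluding unmapped reads from the denominator is an assumption of this example, not something the report states.

```python
UNMAPPED = 0x4  # this read is unmapped
REVERSE = 0x10  # read maps to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads whose FLAG does not have the reverse-strand bit set."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Two typical proper pairs: flags 99/147 and 163/83 (one forward, one reverse each).
fraction = forward_fraction([99, 147, 163, 83])
```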

Forward strand: 50 % (83 526 073 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map at an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
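The ~300 base pair figure in the example above follows from subtracting both read lengths from the fragment length. The sketch below makes that arithmetic explicit; the function names and the `tolerance` parameter are hypothetical, since aligners estimate the acceptable range from the data rather than from a fixed value.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len


def at_expected_distance(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap."""
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance


# 500 bp fragments sequenced with 100 bp reads from either end.
gap = expected_inner_distance(500, 100)
near = at_expected_distance(320, 500, 100, tolerance=50)
far = at_expected_distance(5000, 500, 100, tolerance=50)
```

A pair whose gap is far outside the expected range, as in the last call, would not count as a proper pair even though both mates are mapped.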

Proper pairs: 97.7 % (163 234 682 reads). Other: 2.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
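The coordinate-based approach described above can be sketched as grouping read pairs by the positions of both mates. This is a simplified illustration, not the method used by any particular tool: the tuple layout and function name are assumptions, and real duplicate markers apply further criteria (for example strand orientation) that are omitted here.

```python
def find_duplicates(pairs):
    """Return indices of pairs whose coordinates repeat an earlier pair.

    Each pair is a tuple (chrom, pos, mate_chrom, mate_pos). The first pair
    seen at a given set of coordinates is kept; later ones are flagged.
    """
    seen = set()
    duplicates = []
    for index, key in enumerate(pairs):
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


# Toy read pairs (hypothetical coordinates).
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates as the first pair
    ("1", 100, "1", 420),  # mate maps elsewhere, so not a duplicate
    ("2", 100, "2", 400),
    ("1", 100, "1", 400),  # third copy of the first pair
]
dup_indices = find_duplicates(pairs)
dup_rate = len(dup_indices) / len(pairs)
```

Because the key depends only on coordinates, two independent fragments that happen to start and end at the same positions are also flagged, which is why the rate becomes less informative at high depth.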

Duplicates: 11.1 % (18 495 401 reads). Other: 88.9 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
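One way to summarise such a distribution is to report the share of reads at score 0 and the share at or above a chosen cut-off. The sketch below does this for a histogram given as a dictionary; the function name, the default `threshold` of 30, and the toy counts are assumptions for illustration only.

```python
def mapq_summary(hist, threshold=30):
    """Return (fraction of reads with MAPQ 0, fraction with MAPQ >= threshold).

    `hist` maps a mapping quality score to the number of reads with that score.
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    zero = hist.get(0, 0) / total
    high = sum(count for q, count in hist.items() if q >= threshold) / total
    return zero, high


# Toy histogram (made-up counts): most reads at 60, a few at 0.
zero_frac, high_frac = mapq_summary({0: 5, 20: 5, 60: 90})
```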

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 60); y-axis: # Reads (20M to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
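A per-chromosome tally of this kind can be sketched from the reference name and FLAG of each record. The example assumes SAM conventions (FLAG bit 0x4 for an unmapped read, `*` as the reference name of a read with no assigned chromosome); the function name and toy records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # this read is unmapped


def mapped_fraction_by_chromosome(records):
    """Return {chromosome: mapped / (mapped + unmapped)}.

    `records` is an iterable of (rname, flag). An unmapped read placed with
    its mate carries the mate's chromosome, so it is counted against it.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned at all
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# Toy records: chromosome 1 has one singleton (73) with its unmapped mate (133).
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147), ("*", 77)]
fractions = mapped_fraction_by_chromosome(records)
```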

[Chart: stacked columns of mapped vs unmapped reads per chromosome. x-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%. Across the 24 columns the mapped fraction ranges from 99.38% to 99.72% and the unmapped fraction from 0.28% to 0.62%.]