European Genome-Phenome Archive

File Quality

File Information: EGAF00004840359

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Counts are plotted on a log scale to compress their wide range. A high peak at the start (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value, 0 to >1000; y-axis: number of bases, log scale from 1k to 100M.]
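As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions share each coverage value and pools everything above 1000 into a single bin, mirroring the ">1000" label on the chart. The function name, the `cap` parameter, and the toy depth list are assumptions made for this example; they are not taken from the archive's pipeline.

```python
import math
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    The pooled bin is stored under the key `cap + 1` (hypothetical convention).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Toy per-position depths (illustrative only, not data from the report).
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2000]
hist = coverage_histogram(depths)

# Log-scaled counts, as used for the y-axis of the chart.
log_counts = {value: math.log10(count) for value, count in hist.items()}
```

Here the two positions with depth 1500 and 2000 both land in the pooled bin stored under key 1001.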

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score, 0 to 40; y-axis: number of bases, 0G to 18G.]
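To make the score scale concrete, the following minimal sketch converts between a Phred score and an error probability using the standard Phred definition, Q = -10·log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (P must be > 0)."""
    return -10 * math.log10(p)


p40 = phred_to_error_prob(40)   # about 1 error in 10,000 base calls
q30 = error_prob_to_phred(0.001)
```

Under this definition a score of 40 corresponds to an error probability of about 0.0001, and each additional 10 points makes an error ten times less likely.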

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.7 % (162 958 990 reads); unmapped: 0.3 %
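The calculation described above can be sketched as follows. The mapped count is rebuilt from the both-mates and singleton read counts shown on this page; the unmapped count is NOT given in the report, so the value used here is a purely illustrative assumption.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Counts reported on this page.
both_mates_mapped = 162_814_584
singletons = 144_406
mapped = both_mates_mapped + singletons  # matches the reported mapped count

# Hypothetical unmapped count, chosen only to illustrate the formula.
assumed_unmapped = 500_000

rate_percent = round(100 * mapped_rate(mapped, assumed_unmapped), 1)
```

The sum of the both-mates and singleton counts reproduces the mapped read count shown in the chart, which is consistent with the description of mapped reads as "singletons and both mates".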

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included, as in the proper pairs chart.

Both mates mapped: 99.6 % (162 814 584 reads); remainder: 0.4 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (144 406 reads); remainder: 99.9 %
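One common way to tell these cases apart is to inspect the FLAG field of each alignment record. The sketch below uses the bit values defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); whether this report was produced in exactly this way is an assumption, and the function name and category labels are hypothetical.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def pair_status(flag):
    """Classify a read by its SAM FLAG (labels are illustrative)."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"          # this read maps, its mate does not
    return "both_mates_mapped"


statuses = [pair_status(f) for f in (99, 73, 133, 0)]
```

Flag 99 describes a mapped read with a mapped mate, 73 a mapped read whose mate is unmapped, and 133 the unmapped mate itself.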

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (81 761 795 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with a large separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98 % (160 320 632 reads); remainder: 2 %
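The distance arithmetic in the example above (a ~500 base pair fragment with 100 base pair reads at either end leaves ~300 base pairs between the reads) can be sketched as below. The tolerance check is a simplified, hypothetical stand-in: aligners apply their own criteria for flagging proper pairs, and the `expected` and `tolerance` values here are assumptions for illustration.

```python
def inner_distance(fragment_length, read_length):
    """Gap between the two mates when each end of the fragment is read."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed_fragment_length,
                             expected=500, tolerance=150):
    """Hypothetical check: is the observed fragment length near the expected one?"""
    return abs(observed_fragment_length - expected) <= tolerance


gap = inner_distance(500, 100)
typical = within_expected_distance(520)      # close to the library size
stretched = within_expected_distance(5000)   # far larger separation
```

A pair whose mates map about 5000 base pairs apart falls outside the assumed window and would not count as proper in this simplified model.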

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 17.8 % (29 075 786 reads); remainder: 82.2 %
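The coordinate-based idea described above can be sketched in a few lines: pairs are grouped by the position of the read and of its mate, and every pair after the first in a group is reported as a duplicate. This is a deliberately simplified illustration with hypothetical field names and toy data; production duplicate-marking tools apply more refined rules.

```python
def find_duplicates(pairs):
    """Return names of pairs whose coordinates repeat an earlier pair.

    Each pair is (name, chrom, pos, strand, mate_pos); the field layout
    is an assumption made for this example.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, strand, mate_pos in pairs:
        key = (chrom, pos, strand, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


pairs = [
    ("r1", "1", 1000, "+", 1300),
    ("r2", "1", 1000, "+", 1300),  # same coordinates as r1
    ("r3", "1", 1000, "+", 1350),  # same start, different mate position
    ("r4", "2", 1000, "+", 1300),  # different chromosome
]
dups = find_duplicates(pairs)
duplicate_rate = len(dups) / len(pairs)
```

Because only coordinates are compared, two independent fragments that happen to share both positions would also be flagged, which is the depth-related effect noted above.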

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score, 0 to 60; y-axis: number of reads, up to 140M.]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (autosomes, X, Y and M), stacked to 100 %. Mapped: 99.88 % to 99.93 % per chromosome; unmapped: 0.07 % to 0.12 %.]
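A per-chromosome breakdown of this kind could be tallied roughly as follows, assuming each record carries a chromosome name and a SAM FLAG, and that unmapped reads have already been given their mate's chromosome as described above. The 0x4 bit is the "read unmapped" flag from the SAM specification; the function name, record layout, and toy records are assumptions for this sketch.

```python
from collections import defaultdict

UNMAPPED = 0x4  # "read unmapped" bit from the SAM specification


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Toy records (illustrative only): three mapped reads and one unmapped
# mate placed on chromosome 1, and one mapped read on chromosome X.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_mapped_rate(records)
```

Each column in the chart corresponds to one entry of such a mapping, with the unmapped share being one minus the mapped fraction.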