European Genome-Phenome Archive

File Quality

File Information

EGAF00002492833

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (1 to >1000); y-axis: number of bases (log scale, 1k to 100M).]
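How such a distribution could be tabulated can be sketched as follows. This is a minimal illustration, not the archive's own pipeline: the function name `coverage_histogram`, the `cap` parameter, and the toy depth values are all assumptions made for the example. Depths above the cap share one overflow bin, mirroring the ">1000" bin on the chart's x-axis.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single overflow bin keyed by
    cap + 1 (the ">1000" bin when cap is 1000).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Toy per-position depths (made-up values, not taken from this file).
depths = [0, 1, 1, 2, 2, 2, 30, 1500, 2000]
hist = coverage_histogram(depths)

for coverage in sorted(hist):
    label = f">{1000}" if coverage == 1001 else str(coverage)
    print(label, hist[coverage])
```

Plotting the counts on a logarithmic y-axis then gives the shape described above: a high peak at low coverage and a long descending tail.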

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect a distribution skewed towards larger (more confident) scores, typically around 40.
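As a worked illustration of the standard Phred definition, Q = -10 log10(P), the conversion in both directions can be sketched as below. The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly 1 error in 10,000 calls,
# and an error probability of 0.001 corresponds to a score of 30.
p40 = phred_to_error_prob(40)
q_for_1_in_1000 = error_prob_to_phred(0.001)
```

Each step of 10 in the score is a tenfold drop in error probability, which is why a distribution concentrated near 40 indicates confident base calls.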

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0G to 10G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.5 % (102 889 538 reads); unmapped: 0.5 %.
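The rate calculation described above can be written out explicitly; note that the sum in the denominator must be grouped, i.e. mapped / (mapped + unmapped). The function name and the read counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 995 mapped reads and 5 unmapped reads.
rate = mapped_rate(995, 5)
print(f"{rate:.1%}")
```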

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, in contrast to the Proper Pairs chart.

Both mates mapped: 99.2 % (102 585 544 reads); remainder: 0.8 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (303 994 reads); remainder: 99.7 %.
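The distinction between singletons and pairs with both mates mapped can be sketched using the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). This assumes the alignments are stored in SAM/BAM format; how the archive itself derives these statistics is not stated here, and the function name and category labels are chosen for this example.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def classify_read(flag):
    """Classify one alignment record by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"


# Example FLAG values: 99 = 0x1 + 0x2 + 0x20 + 0x40, 73 = 0x1 + 0x8 + 0x40,
# 133 = 0x1 + 0x4 + 0x80.
labels = [classify_read(f) for f in (99, 73, 133, 0)]
print(labels)
```

Counting the labels over all records would give the numerators for the Both Mates Mapped and Singletons charts under these assumptions.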

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (51 724 131 reads); reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the reads map with an unexpected separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.5 % (100 823 668 reads); remainder: 2.5 %.
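The expected separation in the example above follows from simple arithmetic: the fragment length minus the two read lengths. A minimal sketch, with a function name chosen for this illustration:

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced bases between two mates read from opposite ends of a fragment."""
    return fragment_length - 2 * read_length


# ~500 bp fragments sequenced with 100 bp reads from either end.
gap = expected_gap(500, 100)
print(gap)  # 300
```

Aligners typically allow some tolerance around this value, since fragment lengths in a real library vary; the exact criterion depends on the aligner and is not specified here.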

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 4.1 % (4 195 920 reads); non-duplicates: 95.9 %.
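The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any particular tool: the record layout (name, chromosome, position, strand, mate chromosome, mate position) and the function name are assumptions, and the first read seen at each coordinate key is kept, whereas real duplicate markers usually keep the read with the best base qualities.

```python
def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    `reads` is an iterable of tuples:
    (name, chrom, pos, strand, mate_chrom, mate_pos).
    A read is a duplicate if an earlier read has the same coordinates
    and strand, and its mate also has the same coordinates.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, strand, mate_chrom, mate_pos in reads:
        key = (chrom, pos, strand, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates


# Toy records (made-up coordinates).
reads = [
    ("r1", "1", 1000, "+", "1", 1300),
    ("r2", "1", 1000, "+", "1", 1300),  # same key as r1: duplicate
    ("r3", "1", 1000, "+", "1", 1350),  # mate maps elsewhere: kept
    ("r4", "2", 1000, "+", "2", 1300),  # different chromosome: kept
]
dups = mark_duplicates(reads)
duplicate_rate = len(dups) / len(reads)
```

The sketch also shows why deep sequencing inflates the rate: with enough reads, independent fragments will eventually share the same key by chance.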

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (10M to 90M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

Chromosome   Mapped    Unmapped
1            99.7%     0.3%
2            99.67%    0.33%
3            99.69%    0.31%
4            99.7%     0.3%
5            99.7%     0.3%
6            99.7%     0.3%
7            99.7%     0.3%
8            99.7%     0.3%
9            99.7%     0.3%
10           99.69%    0.31%
11           99.7%     0.3%
12           99.7%     0.3%
13           99.69%    0.31%
14           99.7%     0.3%
15           99.69%    0.31%
16           99.71%    0.29%
17           99.71%    0.29%
18           99.69%    0.31%
19           99.74%    0.26%
20           99.69%    0.31%
21           99.7%     0.3%
X            99.69%    0.31%
Y            99.79%    0.21%
M            99.64%    0.36%