European Genome-Phenome Archive

File Quality

File Information: EGAF00002340469

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered; the y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
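As an illustration of how such a distribution can be tallied, here is a minimal Python sketch. It assumes per-position depths are already available as a list of integers; the function name, the `cap` parameter and the toy depths are hypothetical and are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed by cap + 1,
    similar to the '>1000' bin on the chart's x-axis.
    """
    hist = Counter()
    for d in depths:
        hist[d if d <= cap else cap + 1] += 1
    return hist

# Toy per-position depths (hypothetical values, not from this file).
depths = [0, 0, 0, 1, 1, 2, 5, 5, 1500]
hist = coverage_distribution(depths)
for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a logarithmic y-axis keeps the small high-coverage bins visible next to the very large low-coverage bins.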

[Chart: base coverage distribution. x-axis: coverage value (0 to >1000); y-axis: # bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect a distribution skewed towards larger (more confident) scores, typically around 40.
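The relationship between score and error probability can be made concrete with a short Python sketch. It assumes the standard Phred definition Q = -10·log10(P); the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

Under this definition a score of 40 corresponds to one expected error in 10 000 base calls.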

[Chart: base quality distribution. x-axis: Phred quality score (0 to 40); y-axis: # bases (0G to 5.5G).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
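The calculation described above can be sketched in a few lines of Python. The function name and the example counts are hypothetical and chosen only to illustrate the formula.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, not taken from this file.
rate = mapped_rate(mapped=995, unmapped=5)
print(f"{rate:.1%}")
```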

Mapped: 99.5 % (61 664 782 reads); unmapped: 0.5 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 99.1 % (61 413 870 reads); other: 0.9 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.4 % (250 912 reads); other: 99.6 %
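The distinction between singletons and reads with both mates mapped can be sketched in Python. This example assumes the alignments are stored in SAM/BAM format, which the report does not state explicitly, and uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function name and category labels are illustrative.

```python
# Standard SAM flag bits (assumes SAM/BAM input).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def mate_status(flag):
    """Classify a read by its SAM flag into one of four categories."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"      # this read mapped, its mate did not
    if flag & PAIRED:
        return "both_mapped"    # this read and its mate both mapped
    return "unpaired"

print(mate_status(73))   # paired, mate unmapped, first in pair
print(mate_status(99))   # paired, proper pair, mate reverse, first in pair

# Consistency check on the counts reported on this page:
# both mates mapped + singletons should equal the mapped-read count.
assert 61_413_870 + 250_912 == 61_664_782
```

The final check shows that the counts reported on this page are internally consistent: reads with both mates mapped plus singletons add up exactly to the number of mapped reads.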

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (30 974 748 reads); reverse strand: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it map with an unexpected separation. This is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
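The arithmetic in the example above (500 bp fragment, two 100 bp reads, ~300 bp between them) can be sketched in Python. The function names and the `tolerance` parameter are hypothetical; real aligners typically derive the accepted range from the observed insert-size distribution rather than from a fixed tolerance.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

def has_expected_separation(observed, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap."""
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed - expected) <= tolerance

print(expected_inner_distance(500, 100))
print(has_expected_separation(320, 500, 100, tolerance=50))
print(has_expected_separation(5000, 500, 100, tolerance=50))
```

This sketch covers only the distance criterion; the orientation of the two mates would need a separate check.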

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.9 % (60 647 930 reads); other: 2.1 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
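The coordinate-based approach described above can be sketched in Python. This is a simplified illustration, not the method used to produce this report: it represents each pair as a hypothetical `(chrom, pos, mate_chrom, mate_pos)` tuple and marks every pair after the first one seen at the same coordinates. Real duplicate-marking tools consider additional properties.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if an identical pair was seen before.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # duplicate if the coordinates were seen already
        seen.add(key)
    return flags

# Toy pairs (hypothetical coordinates, not from this file).
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same read and mate coordinates: duplicate
    ("1", 100, "1", 450),  # same read position, different mate: not a duplicate
]
flags = mark_duplicates(pairs)
print(flags, f"duplicate rate = {sum(flags) / len(flags):.1%}")
```

Because the test relies only on coordinates, independent fragments that happen to share both positions are also marked, which is why the rate becomes less informative at high depth.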

Duplicates: 8.5 % (5 244 441 reads); other: 91.5 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 60); y-axis: # reads (5M to 55M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads may still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (stacked columns). x-axis: chromosome (autosomes, X, Y, M); y-axis: 0% to 100%. Mapped fractions range from 99.44% to 99.73%; unmapped fractions range from 0.27% to 0.56%.]