European Genome-Phenome Archive

File Quality

File Information

EGAF00000643364

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The number of bases is plotted on a log scale so that the large differences between coverage values remain readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.
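As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions have each coverage value and groups everything above a cap into a single bin, mirroring the ">1000" label on the chart axis. The function name, the cap handling, and the sample depths are assumptions made for illustration; they are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value.

    Values above `cap` are pooled into one overflow bin stored under
    the key cap + 1 (an arbitrary choice for this sketch).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a logarithmic axis keeps both the large low-coverage bins and the sparse high-coverage bins visible.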

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases, log scale (100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.
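To make the scale concrete, the following sketch converts between Phred scores and error probabilities using the standard Phred definition, Q = -10 * log10(P). The function names are chosen here for illustration.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability, so a score of 40 means roughly one wrong call in 10 000.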

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0M to 800M).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
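The calculation can be sketched as follows. The two read counts are the ones reported in the Both Mates Mapped and Singletons sections of this page; the unmapped count is not given on the page, so the rate example uses small hypothetical numbers instead.

```python
def mapping_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Counts reported on this page.
both_mates_mapped = 49_861_070
singletons = 120_426

# Mapped reads are the sum of both categories.
mapped = both_mates_mapped + singletons
print(mapped)

# Hypothetical counts, for illustration only.
print(mapping_rate(mapped=989, unmapped=11))
```

The sum of the two reported categories matches the mapped read count shown in the Mapped Reads chart.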

[Chart: mapped reads. Mapped: 98.9 % (49 981 496 reads). Unmapped: 1.1 %.]

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, whereas the Proper Pairs chart excludes them.

[Chart: both mates mapped. Both mates mapped: 98.7 % (49 861 070 reads). Other: 1.3 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome, but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
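For alignments stored in SAM/BAM format, these categories can be read from the FLAG field of each record. The sketch below assumes SAM-style flags and uses the standard bit values from the SAM specification; whether the statistics on this page are computed exactly this way is an assumption, and the function name is chosen for illustration.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify a read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"

# Example FLAG values.
for flag in (99, 73, 133, 77):
    print(flag, classify(flag))
```

Here 99 is a mapped read whose mate is also mapped, 73 is a mapped read whose mate is unmapped (a singleton), and 133 and 77 are unmapped reads.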

[Chart: singletons. Singletons: 0.2 % (120 426 reads). Other: 99.8 %.]

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

[Chart: forward strand. Forward strand: 50 % (25 256 607 reads). Remainder: 50 %.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large deletion, the reads map with a larger separation than expected; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
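The expected separation in the example above is simple arithmetic: the fragment length minus the two read lengths. A minimal sketch, with a function name chosen for illustration:

```python
def expected_gap(fragment_length, read_length):
    """Expected number of unsequenced bases between the two mates.

    A negative result means the two reads overlap.
    """
    return fragment_length - 2 * read_length

print(expected_gap(fragment_length=500, read_length=100))
```

In practice, aligners estimate an acceptable range around this value from the data rather than requiring an exact distance.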

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

[Chart: proper pairs. Proper pairs: 98.6 % (49 784 222 reads). Other: 1.4 %.]

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate, and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates, even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
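The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the method of any particular tool: the record layout is hypothetical, and the first pair seen at each set of coordinates is kept, whereas real duplicate-marking tools usually keep the pair with the best base qualities.

```python
def find_duplicates(pairs):
    """Return the names of pairs flagged as duplicates.

    Each pair is a tuple (name, chrom, pos, strand, mate_chrom,
    mate_pos, mate_strand). Pairs sharing all coordinates and strands
    with an earlier pair are flagged as duplicates.
    """
    seen = set()
    duplicates = set()
    for name, *key in pairs:
        key = tuple(key)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical read pairs, for illustration only.
pairs = [
    ("r1", "1", 1000, "+", "1", 1400, "-"),
    ("r2", "1", 1000, "+", "1", 1400, "-"),  # same coordinates as r1
    ("r3", "1", 1000, "+", "1", 1450, "-"),  # mate maps elsewhere
]
duplicates = find_duplicates(pairs)
print(duplicates)
print(len(duplicates) / len(pairs))
```

Because the test is purely positional, two independent fragments that happen to share coordinates are also flagged, which is why the rate becomes less informative at very high depth.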

[Chart: duplicates. Duplicates: 2.9 % (1 484 573 reads). Other: 97.1 %.]

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (5M to 40M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it will be given the right chromosome and position, but it will also be classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome  Mapped   Unmapped
1           99.78%   0.22%
2           99.74%   0.26%
3           99.85%   0.15%
4           99.47%   0.53%
5           99.85%   0.15%
6           99.9%    0.1%
7           99.76%   0.24%
8           99.79%   0.21%
9           99.52%   0.48%
10          99.56%   0.44%
11          99.85%   0.15%
12          99.91%   0.09%
13          99.88%   0.12%
14          99.89%   0.11%
15          99.72%   0.28%
16          99.7%    0.3%
17          99.75%   0.25%
18          99.79%   0.21%
19          99.84%   0.16%
20          99.86%   0.14%
21          99.69%   0.31%
X           99.75%   0.25%
Y           93.78%   6.22%
M           99.61%   0.39%