European Genome-Phenome Archive

File Quality

File Information: EGAF00004857550

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered. The y-axis shows the number of bases that have that coverage.

The data are plotted on a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.
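As an illustration of how such a distribution can be built, here is a minimal sketch that counts positions per coverage value and pools everything above 1000 into one overflow bin, as the chart's ">1000" bucket suggests. The function name, the `cap` parameter, and the toy depths are assumptions for illustration, not the archive's actual pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many reference positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed `cap + 1`,
    mirroring a ">1000" bucket on the x-axis.
    """
    hist = Counter()
    for d in depths:
        hist[d if d <= cap else cap + 1] += 1
    return hist

# Toy per-position depths (hypothetical, not taken from this file).
depths = [0, 0, 1, 1, 1, 2, 5, 5, 1200, 3000]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis keeps both the very large low-coverage bins and the sparse high-coverage tail visible.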

[Chart: base coverage histogram. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect a distribution skewed towards larger (more confident) scores, typically around 40.
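The standard Phred definition is Q = -10 log10(P), so a score of 40 corresponds to an error probability of 1 in 10 000. The short sketch below converts in both directions; the function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

The same conversion applies to mapping quality scores, where P is the probability that a read is placed at the wrong location.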

[Chart: base quality histogram. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 14G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
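A small sketch of the rate calculation follows. The mapped count is the sum of the both-mates-mapped and singleton counts reported on this page, which matches the mapped-reads count shown for this chart. The unmapped count is not reported here, so the value below is a hypothetical placeholder chosen only to make the example run.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

both_mates_mapped = 141_097_650  # reported under "Both Mates Mapped"
singletons = 205_213             # reported under "Singletons"
mapped = both_mates_mapped + singletons

unmapped = 1_400_000             # hypothetical: not reported on this page

rate = mapped_rate(mapped, unmapped)
print(f"{rate:.1%}")
```

Note the parentheses around the denominator: without them, `mapped / mapped + unmapped` would evaluate to `1 + unmapped`.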

Mapped: 99 % (141 302 863 reads); unmapped: 1 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; the expected distance is the criterion applied in the Proper Pairs chart.

Both mates mapped: 98.9 % (141 097 650 reads); other: 1.1 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
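Assuming the reads come from a SAM/BAM alignment file, these categories can be derived from the bitwise FLAG field defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The sketch below is one way to classify a single flag value; the category names are illustrative, and nothing here is confirmed as the archive's own implementation.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Assign a SAM FLAG value to one of the report's read categories."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

for flag in (99, 73, 77, 0):
    print(flag, classify(flag))
```

Here 99 is a mapped read in a pair whose mate is also mapped, 73 is a mapped read whose mate is unmapped, and 77 is itself unmapped.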

Singletons: 0.1 % (205 213 reads); other: 99.9 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (71 354 757 reads); reverse strand: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large deletion relative to the reference, the mates map with an unexpectedly large separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
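The arithmetic in the example above can be written out as a short sketch: the expected gap between the mates is the fragment length minus both read lengths. The `tolerance` parameter is a hypothetical stand-in for whatever insert-size window an aligner applies; it is not specified on this page.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read from either end."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is close to the expected one.

    `tolerance` (in base pairs) is an illustrative assumption.
    """
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance

print(expected_gap(500, 100))
print(within_expected_distance(320, 500, 100, tolerance=100))
print(within_expected_distance(5000, 500, 100, tolerance=100))
```

A pair spanning a large structural variant would show up like the last call: a gap far outside the window, so it would not count as a proper pair.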

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.2 % (138 700 530 reads); other: 2.8 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
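The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration: each pair is reduced to a tuple of read and mate coordinates, and any pair whose tuple has been seen before is marked as a duplicate. Real duplicate-marking tools take more into account (such as strand and which copy to keep), and the tuple layout here is an assumption.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if an identical pair was seen earlier.

    Each pair is (chrom, pos, mate_chrom, mate_pos).
    """
    seen = set()
    marks = []
    for key in pairs:
        marks.append(key in seen)
        seen.add(key)
    return marks

# Toy pairs (hypothetical coordinates).
pairs = [
    ("1", 1000, "1", 1300),
    ("1", 2000, "1", 2310),
    ("1", 1000, "1", 1300),  # same coordinates as the first pair
    ("1", 1000, "1", 1350),  # same start, different mate position
]
marks = mark_duplicates(pairs)
duplicate_rate = sum(marks) / len(marks)
print(marks, duplicate_rate)
```

Because the test is purely positional, two independent fragments that happen to share coordinates are also marked, which is why the rate becomes less informative at very high depth.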

Duplicates: 16.1 % (23 030 508 reads); non-duplicates: 83.9 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality histogram. X-axis: Phred quality score (0 to 60). Y-axis: # reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto another, it is given a chromosome and position but is also classified as unmapped.
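As a rough sketch of the per-chromosome tally, the code below assumes each record carries the chromosome name it was assigned (for an unmapped read, that of its mate) and a SAM FLAG value, and uses the 0x4 bit to separate mapped from unmapped reads. The record layout and function name are assumptions for illustration.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: fraction of reads mapped}.

    `records` is an iterable of (chromosome, flag) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Toy records (hypothetical): flag 133 is an unmapped read placed with its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_mapped_rate(records)
print(rates)
```

Each chromosome's mapped and unmapped fractions sum to 100%, which is what the stacked columns display.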

[Chart: stacked columns of mapped and unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome  Mapped   Unmapped
1           99.85%   0.15%
2           99.84%   0.16%
3           99.85%   0.15%
4           99.85%   0.15%
5           99.85%   0.15%
6           99.85%   0.15%
7           99.85%   0.15%
8           99.85%   0.15%
9           99.85%   0.15%
10          99.85%   0.15%
11          99.85%   0.15%
12          99.85%   0.15%
13          99.85%   0.15%
14          99.85%   0.15%
15          99.85%   0.15%
16          99.87%   0.13%
17          99.87%   0.13%
18          99.84%   0.16%
19          99.88%   0.12%
20          99.86%   0.14%
21          99.85%   0.15%
X           99.85%   0.15%
Y           99.9%    0.1%
M           99.81%   0.19%