European Genome-Phenome Archive

File Quality

File Information: EGAF00003441191

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis represents the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis represents the number of bases with that coverage.

Data is represented on a log scale so that the wide range of counts remains readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.
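
As a rough illustration of what the chart counts, the sketch below tallies how many positions have each coverage value and pools everything above 1000 into a single overflow bin, mirroring the ">1000" label on the chart axis. The function name, the `cap` parameter and the example depths are assumptions made for illustration; they are not taken from this file or from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin labelled '>cap'.
    (Hypothetical helper; the binning used by the report is assumed.)
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Made-up per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 30, 1500]
hist = coverage_histogram(example_depths)
```

Plotting the counts on a logarithmic y-axis would then give a curve of the kind described above.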

Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: # bases (log scale, 1k to 100M).

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but we expect it to be skewed towards larger (more confident) scores, typically around 40.
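
The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative assumptions.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

Under this definition, a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.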

Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 11G).

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but for more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads over the total number of reads: mapped / (mapped + unmapped).
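
The calculation described above can be sketched as a small helper. The function name and the example counts are assumptions for illustration, not values from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Made-up counts: 990 mapped and 10 unmapped reads.
example_rate = mapped_rate(990, 10)
```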

Mapped reads: 99.2 % (112 939 796 reads). Unmapped: 0.8 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; that requirement is what the Proper Pairs chart assesses.

Both mates mapped: 99.1 % (112 740 378 reads). Remainder: 0.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (199 418 reads). Remainder: 99.8 %.
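
As a quick consistency check, the read counts reported in this file for Both Mates Mapped and Singletons add up to the count reported under Mapped Reads, in line with the definition of mapped reads as singletons plus both mates. The variable names below are illustrative; the three counts are the ones shown in this report.

```python
# Read counts as reported in this file's quality charts.
both_mates_mapped = 112_740_378
singletons = 199_418
mapped_reads = 112_939_796

# Mapped reads = reads with both mates mapped + singletons.
counts_are_consistent = (both_mates_mapped + singletons == mapped_reads)
```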

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (56 908 812 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for the variant, and those reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
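
The arithmetic in the example above (a ~500 bp fragment with 100 bp reads from each end leaves a ~300 bp gap) can be sketched as follows. The function names and the `tolerance` parameter are hypothetical; real aligners decide what counts as "expected" from the observed insert size distribution, which this sketch does not model.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates when both ends of a fragment are read."""
    return fragment_len - 2 * read_len

def separation_is_expected(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expectation.

    `tolerance` is an illustrative stand-in for an aligner's own criteria.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance
```

With these assumptions, a pair mapping 320 bp apart would pass a 50 bp tolerance, while a pair mapping 5 000 bp apart would not.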

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 97.4 % (110 888 344 reads). Remainder: 2.6 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
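
The coordinate-based approach described above can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm of any particular tool: it assumes each read pair is represented as a hypothetical `(chrom, pos, mate_chrom, mate_pos)` tuple and flags any pair whose coordinates have already been seen. Production duplicate-marking tools apply further criteria that are not modelled here.

```python
def mark_duplicates(pairs):
    """Flag pairs whose read and mate coordinates repeat an earlier pair.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples.
    Returns a list of booleans, True where the pair duplicates an earlier one.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # duplicate if these coordinates were seen
        seen.add(key)
    return flags

# Made-up pairs: the second repeats the first; the third differs at the mate.
example_pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),
    ("1", 100, "1", 450),
]
example_flags = mark_duplicates(example_pairs)
duplicate_rate = sum(example_flags) / len(example_pairs)
```

The sketch also shows why deeper sequencing raises the apparent duplicate rate: the more pairs are sampled, the more likely two independent fragments share the same coordinates by chance.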

Duplicates: 6.4 % (7 284 955 reads). Remainder: 93.6 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (10M to 100M).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it will be given the right chromosome and position, but it will also be classified as unmapped.

Chart: mapped vs unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M. Y-axis: percentage of reads (0% to 100%). Mapped fractions range from 99.81% to 99.88% per chromosome; unmapped fractions range from 0.12% to 0.19%.