European Genome-Phenome Archive

File Quality

File Information

EGAF00000101735

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. x-axis: coverage value (0 to >1000); y-axis: number of bases, log scale (100 to 100M).]
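As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions share each coverage value. The function name, the `cap` parameter and the sample depths are assumptions made for this example and are not taken from the EGA pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin stored under the
    key cap + 1, similar in spirit to the ">1000" bucket on the chart.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(sorted(hist.items()))


# Hypothetical per-base depths for a tiny reference of 10 positions.
depths = [0, 0, 1, 1, 1, 2, 3, 3, 1500, 2000]
hist = coverage_histogram(depths)
for value, n_bases in hist.items():
    label = ">1000" if value > 1000 else str(value)
    print(label, n_bases)
```

Plotting `n_bases` on a logarithmic axis would then give a curve of the kind described above.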

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. because of a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. x-axis: Phred quality score (0 to 50); y-axis: number of bases (0 to 3G).]
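The relationship between a Phred score and an error probability can be made concrete with a short sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are chosen here for illustration only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P into a Phred score Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to one expected error in 10 000 base calls.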

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 96.3 % (258 548 619 reads). Unmapped: 3.7 %.
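The rate calculation described above, mapped / (mapped + unmapped), can be sketched as follows. The function name and the counts in the example are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts, chosen only to illustrate the formula.
rate = mapped_rate(mapped=963, unmapped=37)
print(f"{rate:.1%}")
```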

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 95.6 % (256 874 270 reads). Other: 4.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.6 % (1 674 349 reads). Other: 99.4 %.
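One way to tell these categories apart is through the SAM FLAG field of each alignment record. The sketch below relies on the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function and the category names are assumptions made for this example, and the EGA report may compute its statistics differently.

```python
# Flag bits from the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag):
    """Classify one alignment record by its SAM flag."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"  # this mate mapped, its partner did not
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


# 99: paired, mapped, mate mapped. 73: paired, mapped, mate unmapped.
# 69: paired, this read unmapped.
for flag in (99, 73, 69):
    print(flag, classify_read(flag))
```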

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (134 298 061 reads). Other: 50 %.
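A strand fraction of this kind can be derived from the SAM reverse-strand flag bit (0x10). The sketch below is only one possible way to compute it: it assumes that unmapped reads are excluded from the denominator, which may differ from how the percentage in this report was calculated. The function name and the example flags are hypothetical.

```python
# Flag bits from the SAM specification.
UNMAPPED = 0x4
REVERSE = 0x10


def forward_fraction(flags):
    """Fraction of mapped reads whose flag lacks the reverse-strand bit.

    Assumption of this sketch: unmapped reads are ignored entirely.
    """
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Two forward reads (99, 163), two reverse reads (147, 83), one unmapped (77).
print(forward_fraction([99, 147, 83, 163, 77]))
```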

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from each end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with a large separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 95 % (255 064 686 reads). Other: 5 %.
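The arithmetic behind the ~300 base pair figure can be written out as a small sketch. The `tolerance` parameter and both function names are hypothetical; real aligners typically estimate the acceptable range from the observed insert size distribution rather than from a fixed cut-off.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap expected between two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len


def is_plausible_separation(observed, expected, tolerance):
    """True if the observed gap lies within +/- tolerance of the expected gap."""
    return abs(observed - expected) <= tolerance


expected = expected_inner_distance(fragment_len=500, read_len=100)
print(expected)  # 500 - 2 * 100 = 300
print(is_plausible_separation(320, expected, tolerance=50))
print(is_plausible_separation(5000, expected, tolerance=50))
```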

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 1.2 % (3 138 612 reads). Other: 98.8 %.
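The coordinate-based idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not the method of any particular tool: the record layout and function name are assumptions, and production duplicate markers also handle details such as clipping and which copy to keep.

```python
def find_duplicates(reads):
    """Return the names of reads whose coordinates repeat an earlier read.

    Each read is a tuple:
    (name, chrom, pos, strand, mate_chrom, mate_pos).
    The first read seen at a given coordinate key is kept; later ones
    with the same key are reported as duplicates.
    """
    seen = set()
    duplicates = []
    for name, *coords in reads:
        key = tuple(coords)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical records: r2 shares every coordinate with r1, r3 does not.
reads = [
    ("r1", "1", 1000, "+", "1", 1300),
    ("r2", "1", 1000, "+", "1", 1300),
    ("r3", "1", 1000, "+", "1", 1450),
]
print(find_duplicates(reads))
```

The example also shows why the rate rises with depth: the more reads are drawn, the more often two independent fragments land on the same key by chance.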

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 60); y-axis: number of reads (up to 220M).]
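A distribution like this is often summarised by two numbers: the share of reads with score 0 (ambiguous placement) and the share at or above some confidence threshold. The sketch below shows one way to compute both; the threshold of 30, the function name and the histogram values are assumptions made for the example, not values from this file.

```python
def mapq_summary(hist, threshold=30):
    """Summarise a mapping quality histogram {score: read_count}.

    Returns (fraction of reads with score 0,
             fraction of reads with score >= threshold).
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    zero = hist.get(0, 0) / total
    high = sum(n for q, n in hist.items() if q >= threshold) / total
    return zero, high


# Hypothetical histogram: most reads at 60, a few ambiguous reads at 0.
zero, high = mapq_summary({0: 5, 20: 5, 60: 90})
print(zero, high)
```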

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, broken down by chromosome. Although the sequenced sample may come from a female, reads can still map to the Y chromosome because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.

Mapped and unmapped reads per chromosome (values and chromosome labels in the order they appear on the chart):

Chromosome  Mapped   Unmapped
1           99.34%   0.66%
2           99.24%   0.76%
3           99.58%   0.42%
4           99.49%   0.51%
5           99.58%   0.42%
6           99.64%   0.36%
7           99.48%   0.52%
8           99.6%    0.4%
9           98.99%   1.01%
10          98.19%   1.81%
11          99.1%    0.9%
12          99.61%   0.39%
13          99.64%   0.36%
14          99.62%   0.38%
15          99.45%   0.55%
16          99.22%   0.78%
17          99.05%   0.95%
18          99.34%   0.66%
19          99.11%   0.89%
20          99.58%   0.42%
21          99.48%   0.52%
X           99.38%   0.62%
Y           92.66%   7.34%
M           99.55%   0.45%
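The per-chromosome breakdown can be sketched from (reference name, flag) pairs taken from alignment records. The sketch relies on the SAM convention described above, where an unmapped read with a mapped mate carries its mate's reference name while still having the unmapped bit (0x4) set. The function name and the example records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: this read is unmapped


def per_chromosome_rates(records):
    """Return {chrom: (mapped_pct, unmapped_pct)} from (rname, flag) pairs.

    Reads with no reference name ("*") cannot be assigned to a
    chromosome and are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (100 * m / (m + u), 100 * u / (m + u))
        for chrom, (m, u) in counts.items()
    }


# Hypothetical records: flag 133 is an unmapped read placed at its mate's position.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
print(per_chromosome_rates(records))
```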