European Genome-Phenome Archive

File Quality

File Information: EGAF00002273137

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage value.

The data is shown on a log scale to minimise the variability. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases, log scale (1k to 100M).]
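As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions have each coverage value and pools everything above a cap into one overflow bin, mirroring the ">1000" label on the chart's x-axis. The function name `coverage_histogram`, the `cap` parameter and the sample depths are hypothetical and are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into a single overflow bin keyed by
    cap + 1 (an assumption modelled on the chart's ">1000" bin).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Hypothetical per-position depths, not values from this file.
depths = [0, 0, 1, 30, 30, 31, 1500, 2000]
hist = coverage_histogram(depths)
print(hist[30])    # number of positions covered exactly 30 times
print(hist[1001])  # number of positions covered more than 1000 times
```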

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but we expect it to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 70G).]
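To make the scale concrete, here is a minimal sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (p must be > 0)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to an error probability of about 1 in 10,000.
print(phred_to_error_prob(40))
print(error_prob_to_phred(0.001))
```

The same conversion applies to mapping qualities, where P is the probability that the read is placed at the wrong location.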

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads against the total number of reads: mapped / (mapped + unmapped).

[Chart: mapped reads. Mapped: 99 % (727 644 718 reads). Unmapped: 1 %.]
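The mapped / (mapped + unmapped) calculation can be sketched from SAM FLAG values, where bit 0x4 marks an unmapped read. This is a minimal sketch that assumes one FLAG per read; `mapped_rate` and the example flags are hypothetical, and the EGA pipeline may count reads differently.

```python
UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM FLAGs."""
    flags = list(flags)
    if not flags:
        return 0.0
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / len(flags)

# Hypothetical flags: three mapped reads (99, 147, 73), three unmapped (133, 77, 141).
example_flags = [99, 147, 73, 133, 77, 141]
print(mapped_rate(example_flags))
```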

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected-distance criterion is covered by the Proper Pairs chart.

[Chart: both mates mapped. Both mates mapped: 98.7 % (725 007 954 reads). Other: 1.3 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

[Chart: singletons. Singletons: 0.4 % (2 636 764 reads). Other: 99.6 %.]
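The distinction between reads with both mates mapped and singletons can be illustrated with SAM FLAG bits: 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The sketch below is only an illustration of those definitions; the `classify` function and its category labels are hypothetical.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify one read by the mapping status of the read and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # read mapped, mate unmapped
    return "both_mates_mapped"

for flag in (99, 73, 133):
    print(flag, classify(flag))
```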

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

[Chart: forward strand. Forward strand: 50 % (367 450 886 reads). Other: 50 %.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. In particular, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the mates map with a separation that differs from the expected one. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

[Chart: proper pairs. Proper pairs: 96.5 % (709 304 912 reads). Other: 3.5 %.]
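The arithmetic behind the expected separation can be written out as a small sketch: the gap between the two reads is the fragment length minus both read lengths. The `tolerance` parameter is a hypothetical stand-in; real aligners typically derive the acceptable range from the observed insert-size distribution rather than from a fixed value.

```python
def expected_inner_distance(fragment_len, read_len):
    """Expected gap between the two mates of a fragment, in base pairs."""
    return fragment_len - 2 * read_len

def within_expected(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap lies within `tolerance` bp of the expected gap.

    `tolerance` is an illustrative assumption, not an aligner setting.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance

# The example from the text: ~500 bp fragments, 100 bp reads from each end.
print(expected_inner_distance(500, 100))
print(within_expected(320, 500, 100, tolerance=50))
print(within_expected(5000, 500, 100, tolerance=50))
```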

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

[Chart: duplicates. Duplicates: 10.2 % (75 054 211 reads). Other: 89.8 %.]
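The coordinate-based idea described above can be sketched as follows: reads sharing the same coordinates for both the read and its mate are grouped, and every read beyond the first in a group counts as a duplicate. This is a simplified illustration under that assumption only; dedicated duplicate-marking tools also consider details such as strand and read clipping. The `duplicate_rate` function and the example reads are hypothetical.

```python
from collections import Counter

def duplicate_rate(reads):
    """Fraction of reads flagged as duplicates.

    `reads` is an iterable of (chrom, pos, mate_chrom, mate_pos) tuples.
    Within each group of identical tuples, one read is kept as the
    original and the rest are counted as duplicates.
    """
    reads = list(reads)
    if not reads:
        return 0.0
    groups = Counter(reads)
    duplicates = sum(count - 1 for count in groups.values())
    return duplicates / len(reads)

# Hypothetical reads: three share identical coordinates, two are unique.
example_reads = [("1", 100, "1", 400)] * 3 + [("1", 100, "1", 450), ("2", 50, "2", 300)]
print(duplicate_rate(example_reads))
```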

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 70). Y-axis: # reads (50M to 600M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (1 to 21, X, Y, M as labelled). Y-axis: percentage of reads (0% to 100%). Mapped: 99.53% to 99.68% per chromosome. Unmapped: 0.32% to 0.47% per chromosome.]
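A per-chromosome breakdown of this kind can be sketched by grouping reads on the chromosome recorded for them and checking SAM FLAG bit 0x4 (read unmapped). An unmapped read with a placed mate carries its mate's chromosome, so it is counted under that chromosome. The function `per_chromosome_rates` and the example records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_rates(records):
    """Return {chrom: (mapped %, unmapped %)} from (chrom, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    rates = {}
    for chrom, (mapped, unmapped) in counts.items():
        total = mapped + unmapped
        rates[chrom] = (100 * mapped / total, 100 * unmapped / total)
    return rates

# Hypothetical records: chromosome 1 has one unmapped read placed with its mate.
example_records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
print(per_chromosome_rates(example_records))
```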