European Genome-Phenome Archive

File Quality

File Information

EGAF00000660982

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The number of bases is plotted on a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 100 to 100M).]
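As a rough illustration of how such a distribution can be built, the sketch below counts how many positions have each coverage value and pools everything above 1000 into one overflow bin, like the ">1000" tick on the chart. The function name, the keying of the overflow bin, and the toy depths are assumptions made for this example; it is not the archive's own pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    depths: iterable of non-negative per-position coverage values.
    cap: values above this are pooled into a single overflow bin,
         stored under the key cap + 1 (an assumption of this sketch).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Hypothetical toy input: seven positions with made-up depths.
toy_depths = [0, 0, 1, 1, 1, 2, 5000]
hist = coverage_histogram(toy_depths)

for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log axis, as the chart does, keeps the long low-count tail visible next to the large counts at low coverage.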

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0M to 800M).]
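To make the score scale concrete, the sketch below converts between a Phred score and an error probability using the standard Phred scaling, Q = -10 × log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_probability(q):
    """Return the error probability P for a Phred score Q (P = 10^(-Q/10))."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Return the Phred score Q for an error probability P (0 < P <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Each step of 10 in Q divides the error probability by 10.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

Under this scaling a score of 40 corresponds to an error probability of about 1 in 10 000, and a score of 20 to about 1 in 100.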

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 98.8 % (47 077 250 reads); unmapped: 1.2 %
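The rate calculation can be sketched as follows. The point worth making explicit is that the denominator is the sum of mapped and unmapped reads. The function name and the example counts are hypothetical.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts, chosen only to illustrate the formula.
rate = mapped_rate(mapped=988, unmapped=12)
print(f"{rate:.1%}")
```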

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart, which excludes them.

Both mates mapped: 98.4 % (46 902 708 reads); remainder: 1.6 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.4 % (174 542 reads); remainder: 99.6 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (23 825 667 reads); remainder: 50 %
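If the file is in SAM/BAM format (an assumption, since this report does not state the format), the categories above can be read from the bitwise FLAG field of each read. The sketch below uses the standard SAM flag bits 0x1 (paired), 0x4 (read unmapped), 0x8 (mate unmapped) and 0x10 (reverse strand). The function name and the way the categories are combined are illustrative and may differ from how the archive computes its statistics.

```python
# Standard SAM flag bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
REVERSE = 0x10


def describe_read(flag):
    """Classify one read from its SAM FLAG into the report's categories."""
    mapped = not flag & UNMAPPED
    paired = bool(flag & PAIRED)
    mate_mapped = paired and not flag & MATE_UNMAPPED
    return {
        "mapped": mapped,
        "both_mates_mapped": mapped and mate_mapped,
        # Assumption: only paired reads can be singletons.
        "singleton": mapped and paired and not mate_mapped,
        "forward_strand": mapped and not flag & REVERSE,
    }


# Flag 99: paired, properly paired, mate on reverse strand, first in pair.
print(describe_read(99))
# Flag 73: paired, mate unmapped, first in pair (a singleton).
print(describe_read(73))
```

In this report the mapped count equals the sum of the two mapped categories: 46 902 708 reads with both mates mapped plus 174 542 singletons gives 47 077 250 mapped reads.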

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. In particular, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).

Proper pairs: 98.2 % (46 812 738 reads); remainder: 1.8 %
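The arithmetic behind the expected separation can be sketched as follows: the gap between the two reads is the fragment length minus both read lengths. The tolerance check is a simplification invented for this example; aligners typically judge the distance against the insert size distribution observed in the data rather than a fixed window.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when one read is sequenced from each end."""
    return fragment_length - 2 * read_length


def has_expected_separation(observed, expected, tolerance):
    """True if the observed gap is within +/- tolerance of the expected gap.

    The fixed tolerance is a hypothetical stand-in for an aligner's own
    insert size model.
    """
    return abs(observed - expected) <= tolerance


expected = expected_inner_distance(fragment_length=500, read_length=100)
print(expected)  # 300, the ~300 base pairs quoted in the text

# Hypothetical observations with a made-up tolerance of 150 bp.
print(has_expected_separation(320, expected, tolerance=150))
print(has_expected_separation(5000, expected, tolerance=150))
```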

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 2.7 % (1 295 408 reads); remainder: 97.3 %
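The coordinate-based idea described above can be sketched in a few lines: a pair is flagged as a duplicate if an earlier pair had the same read position and the same mate position. The record layout and field names are hypothetical, and dedicated duplicate-marking tools apply further criteria (for example strand and base quality) that are left out here.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if an earlier pair shares its coordinates.

    pairs: list of dicts with the hypothetical keys
           "chrom", "pos", "mate_chrom" and "mate_pos".
    """
    seen = set()
    is_duplicate = []
    for pair in pairs:
        key = (pair["chrom"], pair["pos"], pair["mate_chrom"], pair["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


# Made-up example: the second pair repeats the coordinates of the first.
toy_pairs = [
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1400},
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1400},
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1450},
]
flags = mark_duplicates(toy_pairs)
print(flags)
print(sum(flags) / len(flags))  # duplicate rate for the toy data
```

The third pair shares a start position with the first two but has a different mate position, so it is not flagged. At very high depth, independent fragments increasingly share both coordinates by chance, which is why the rate becomes less informative.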

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (5M to 35M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M. Y-axis: percentage of reads (0% to 100%). Series: mapped, unmapped.]

Mapped (%), in chart order: 99.62, 99.67, 99.76, 99.35, 99.74, 99.8, 99.58, 99.63, 99.38, 99.32, 99.78, 99.86, 99.84, 99.79, 99.54, 99.45, 99.67, 99.61, 99.78, 99.76, 99.59, 99.55, 98.02, 99.6

Unmapped (%), in chart order: 0.38, 0.33, 0.24, 0.65, 0.26, 0.2, 0.42, 0.37, 0.62, 0.68, 0.22, 0.14, 0.16, 0.21, 0.46, 0.55, 0.33, 0.39, 0.22, 0.24, 0.41, 0.45, 1.98, 0.4