European Genome-Phenome Archive

File Quality

File Information: EGAF00000644112

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis represents the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis represents the number of bases (positions) with that coverage.

Data are shown on a log scale so that the wide range of values remains readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: number of bases (y-axis "# Bases", log scale from 100 to 100M) against coverage value (x-axis "Coverage value", in steps of 100 up to >1000).]
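As a minimal sketch of how such a distribution can be tallied, the snippet below counts how many positions have each coverage value. The function name `coverage_distribution` and the toy `depths` list are hypothetical and only illustrate the idea; this is not necessarily how the archive computes the chart.

```python
from collections import Counter


def coverage_distribution(depths):
    """Return a mapping: coverage value -> number of positions with that coverage."""
    return Counter(depths)


# Hypothetical per-position depths for a tiny 8-base reference.
depths = [0, 0, 1, 1, 1, 2, 5, 5]
hist = coverage_distribution(depths)

# Print the histogram in order of increasing coverage value.
for coverage in sorted(hist):
    print(coverage, hist[coverage])
```

Each printed pair corresponds to one point of the chart: the coverage value on the x-axis and the number of bases on the y-axis.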

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: number of bases (y-axis "# Bases", 0M to 350M) against Phred quality score (x-axis, 0 to 50).]
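The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10·log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score Q = -10*log10(P)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly one error in 10,000 base calls.
p40 = phred_to_error_prob(40)
# An error probability of 0.001 corresponds to a score of about 30.
q_for_1_in_1000 = error_prob_to_phred(0.001)
```

The same conversion applies to mapping quality scores, where P is the probability that the read is placed at the wrong location.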

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted but, for more significant variation, the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads against the total number of reads: mapped / (mapped + unmapped).

Mapped reads: 97.5 % (52 779 646 reads); unmapped: 2.5 %
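The rate calculation can be sketched as below. The helper `mapped_rate` and the toy counts are hypothetical; the last lines use the read counts shown in this report to check that both-mates-mapped reads plus singletons add up to the mapped total.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Toy counts (hypothetical): 975 mapped and 25 unmapped reads.
toy_rate = mapped_rate(975, 25)

# Counts reported in this document: reads with both mates mapped and singletons.
both_mates_mapped = 51_834_628
singletons = 945_018
mapped_total = both_mates_mapped + singletons
```

With the figures in this report, `mapped_total` equals the mapped read count of 52 779 646.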

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here, in contrast to the Proper Pairs chart.

Both mates mapped: 95.8 % (51 834 628 reads); remainder: 4.2 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 1.8 % (945 018 reads); remainder: 98.2 %
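In SAM/BAM files, the definition above can be expressed with the FLAG field bits defined in the SAM specification. The sketch below is one possible check; the function name `is_singleton` is hypothetical, and the archive's own pipeline may count singletons differently.

```python
# Bit values of the FLAG field, as defined in the SAM specification.
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def is_singleton(flag):
    """True if the read is paired and mapped, but its mate is unmapped."""
    return (
        bool(flag & PAIRED)
        and not (flag & UNMAPPED)
        and bool(flag & MATE_UNMAPPED)
    )


# FLAG 73 = 1 + 8 + 64: paired, mate unmapped, first in pair.
# FLAG 99 = 1 + 2 + 32 + 64: paired, proper pair, mate on reverse strand, first in pair.
# FLAG 133 = 1 + 4 + 128: paired, read itself unmapped, second in pair.
examples = {flag: is_singleton(flag) for flag in (73, 99, 133)}
```

Only the first example is a singleton: the read with FLAG 99 has a mapped mate, and the read with FLAG 133 is itself unmapped.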

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (27 056 719 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with an unexpected separation are a signal for such a variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 95.7 % (51 786 956 reads); remainder: 4.3 %
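The ~300 base pair figure in the example above follows from subtracting both read lengths from the fragment length. A minimal sketch, with a hypothetical helper name:

```python
def expected_inner_distance(fragment_len, read_len):
    """Expected gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len


# Values from the example in the text: ~500 bp fragments, 100 bp reads.
gap = expected_inner_distance(500, 100)
print(gap)
```

Aligners typically tolerate some spread around this expected value, since fragment lengths in a real library vary.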

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that independent reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 5.6 % (3 015 960 reads); remainder: 94.4 %
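The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: real duplicate markers usually also take read orientation into account and choose which copy to keep based on quality. The function `mark_duplicates` and the toy `pairs` are hypothetical.

```python
def mark_duplicates(pairs):
    """Flag read pairs that share coordinates with an earlier pair.

    pairs: iterable of (name, chrom, start, mate_start) tuples.
    Returns the set of names flagged as duplicates; the first pair seen
    at each coordinate key is kept as the non-duplicate copy.
    """
    seen = set()
    duplicates = set()
    for name, chrom, start, mate_start in pairs:
        key = (chrom, start, mate_start)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates


# Toy input: r1 and r2 share both coordinates; r3 differs in mate position;
# r4 has the same positions but on another chromosome.
pairs = [
    ("r1", "1", 100, 350),
    ("r2", "1", 100, 350),
    ("r3", "1", 100, 360),
    ("r4", "2", 100, 350),
]
dups = mark_duplicates(pairs)
duplicate_rate = len(dups) / len(pairs)
```

In this toy input only `r2` is flagged, giving a duplicate rate of 1 in 4 pairs.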

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: number of reads (y-axis "# Reads", 5M to 35M) against Phred quality score (x-axis, 0 to 60).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as in the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped and unmapped reads per chromosome; x-axis labels as extracted: 1 to 21, X, Y, M; y-axis: 0% to 100%.]

Mapped (%), in chart order: 97.66, 98.32, 98.81, 97.91, 98.5, 98.7, 98.06, 98.6, 97.85, 98.1, 98.69, 98.85, 98.81, 98.81, 97.86, 97.43, 98.08, 98.7, 98.58, 98.81, 98.34, 97.86, 93.67, 98.17

Unmapped (%), in chart order: 2.34, 1.68, 1.19, 2.09, 1.5, 1.3, 1.94, 1.4, 2.15, 1.9, 1.31, 1.15, 1.19, 1.19, 2.14, 2.57, 1.92, 1.3, 1.42, 1.19, 1.66, 2.14, 6.33, 1.83