European Genome-Phenome Archive

File Quality

File Information

EGAF00000642905

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis gives the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis gives the number of bases with that coverage.

Counts are plotted on a log scale so that the wide range of values stays readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.
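As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions share each coverage value and takes log10 of the counts. The per-position depths are made-up values, and `coverage_histogram` is a hypothetical helper rather than part of any EGA tool. The cap of 1000 mirrors the ">1000" bin on the chart axis.

```python
from collections import Counter
import math

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for d in depths:
        key = d if d <= cap else cap + 1  # cap + 1 stands for the ">cap" bin
        hist[key] += 1
    return dict(sorted(hist.items()))

# Made-up depths for ten reference positions (illustration only).
depths = [0, 0, 0, 1, 1, 2, 2, 2, 5, 1500]
hist = coverage_histogram(depths)

# Log-scaled counts, as on the chart's y-axis.
log_counts = {cov: math.log10(n) for cov, n in hist.items()}
```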

[Chart: base coverage distribution. x-axis: Coverage value (0 to >1000). y-axis: # Bases (log scale, 100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
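Using the standard Phred definition, Q = -10 log10(P), the conversion between score and error probability can be sketched as follows. The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that the call is wrong, given Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly one expected error in 10,000 calls.
p40 = phred_to_error_prob(40)
q_for_1_in_1000 = error_prob_to_phred(0.001)
```

The same conversion applies to any Phred-scaled quantity, whether the score describes a base call or a read placement.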

[Chart: base quality distribution. x-axis: Phred quality score (0 to 40). y-axis: # Bases (0M to 900M).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
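The rate calculation can be written out as a small sketch. The counts in the example are invented for illustration, and `mapped_rate` is a hypothetical name.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Invented counts: 990 mapped reads and 10 unmapped reads.
rate = mapped_rate(990, 10)
passes_usual_threshold = rate > 0.90  # ">90%" is the usual expectation
```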

[Chart: 99 % mapped (45 962 705 reads), 1 % unmapped.]

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart reports the pairs that do.

[Chart: 98.8 % both mates mapped (45 852 124 reads), 1.2 % other.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
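As a sketch of how the pair categories can be told apart, the function below classifies a read from its SAM FLAG value. It assumes the alignments follow the SAM flag conventions (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); `pair_status` and its category labels are hypothetical names, not taken from the report.

```python
# SAM FLAG bits used here (from the SAM format specification).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def pair_status(flag):
    """Classify one read of a pair from its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # this read mapped, its mate did not
    return "both_mates_mapped"

# Example flags: 99 (mapped pair), 73 (mapped, mate unmapped), 133 (unmapped).
statuses = [pair_status(f) for f in (99, 73, 133)]
```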

[Chart: 0.2 % singletons (110 581 reads), 99.8 % other.]

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

[Chart: 50 % forward strand (23 206 142 reads), 50 % other.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
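The fragment arithmetic in the example above (a ~500 bp fragment with 100 bp reads from each end) can be checked with a short sketch. `expected_inner_distance` is an illustrative helper, not a function from any aligner.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: the fragment minus both reads.

    A negative result would mean the two reads overlap.
    """
    return fragment_length - 2 * read_length

# Fragment of ~500 bp with 100 bp reads from each end.
gap = expected_inner_distance(500, 100)
```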

[Chart: 98.6 % proper pairs (45 784 806 reads), 1.4 % other.]

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
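A minimal sketch of the coordinate-based approach described above: read pairs that share the same position for both mates are grouped, and every pair after the first in a group is flagged. The tuple layout and the "keep the first one seen" rule are simplifying assumptions for illustration; production tools typically apply further rules, such as choosing which copy to keep by quality.

```python
def find_duplicates(pairs):
    """Return indices of pairs whose coordinates were already seen.

    Each pair is (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    """
    seen = set()
    duplicates = []
    for index, key in enumerate(pairs):
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates

# Invented pairs: the second repeats the first; the third differs in mate position.
pairs = [
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 450, "-"),
    ("2", 100, "+", "2", 400, "-"),
]
dup_indices = find_duplicates(pairs)
duplicate_rate = len(dup_indices) / len(pairs)
```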

[Chart: 2.6 % duplicates (1 205 974 reads), 97.4 % other.]

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 60). y-axis: # Reads (up to 35M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not. The unmapped read should then have the same chromosome and POS as its mate. Unmapped reads can also appear when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
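The per-chromosome tally can be sketched as below, assuming each record carries the reference name and the SAM FLAG, and that an unmapped read with a mapped mate inherits its mate's reference name as described above. The input records are invented, and `per_chromosome_mapped_rate` is a hypothetical helper.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped] per chromosome
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned, so it cannot be placed in a column
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Invented records: chromosome 1 has three mapped reads and one unmapped mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133),
           ("X", 99), ("X", 147), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```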

[Chart: mapped vs unmapped reads per reference sequence. x-axis labels: 1 to 21, X, Y, M. y-axis: 0% to 100%. Mapped fractions range from 98.75% to 99.9%; unmapped fractions range from 0.1% to 1.25%.]