European Genome-Phenome Archive

File Quality

File Information

EGAF00000644192

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis represents the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis represents the number of bases (positions) with that coverage.

The data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
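As a minimal sketch of how such a distribution could be tabulated, the following counts how many positions have each coverage value and takes log10 of the counts. The input list `depths`, the function name, and the single overflow bin above 1000 (mirroring the ">1000" tick on the chart) are assumptions for illustration, not the archive's actual pipeline.

```python
from collections import Counter
import math


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # cap + 1 is used as the key of the overflow (">cap") bin
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths, not taken from the file above
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2000]
hist = coverage_histogram(depths)

# Log-scaled counts, as plotted on the y-axis
log_counts = {cov: math.log10(n) for cov, n in sorted(hist.items())}
```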

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. through a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
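The relationship between score and error probability can be checked with a short sketch using the standard Phred definition, Q = -10·log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that the call is wrong, given Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.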

[Chart: base quality distribution. X-axis: Phred quality score (0 to 45). Y-axis: # Bases (0M to 450M).]

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but for more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
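A minimal sketch of this calculation, assuming the reads come from a SAM/BAM file and that "unmapped" means the 0x4 bit is set in the FLAG field (the page does not state how the archive derives the figure). Secondary and supplementary alignments are not treated specially here.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for a sequence of SAM FLAG values."""
    flags = list(flags)
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / len(flags)


# Hypothetical flags: two mapped mates (99, 147) and two unmapped mates (77, 141)
rate = mapped_rate([99, 147, 77, 141])
```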

Mapped: 99.2 % (52 528 130 reads). Unmapped: 0.8 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the distance criterion applies only to the Proper Pairs chart.

Both mates mapped: 98.9 % (52 376 778 reads). Remainder: 1.1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
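The distinction between "both mates mapped" and "singleton" can be sketched from SAM FLAG bits, assuming the statistics are derived from them (an assumption; the page does not say). The category names returned below are illustrative.

```python
PAIRED = 0x1         # template has multiple segments
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped


def pair_status(flag):
    """Classify one read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"
    if mate_mapped:
        return "unmapped_mate_of_singleton"
    return "both_unmapped"
```

For example, flag 73 (paired, mate unmapped, first in pair) is a singleton, while flag 99 belongs to a pair with both mates mapped.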

Singletons: 0.3 % (151 352 reads). Remainder: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
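A small sketch of the forward-strand fraction, assuming it is computed over mapped reads only and that strand is read from the SAM FLAG bit 0x10 (reverse strand). Both are assumptions about how the figure is derived.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped
REVERSE = 0x10  # SAM FLAG bit: read mapped to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads whose REVERSE bit is not set."""
    mapped = [flag for flag in flags if not flag & UNMAPPED]
    forward = sum(1 for flag in mapped if not flag & REVERSE)
    return forward / len(mapped)


# Hypothetical flags: 99 and 163 are forward, 147 and 83 reverse, 77 unmapped
fraction = forward_fraction([99, 147, 83, 163, 77])
```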

Forward strand: 50 % (26 485 464 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. In particular, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map at an unexpected separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
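The arithmetic behind the expected separation can be written out as a sketch. The function names and the `tolerance` parameter are hypothetical; aligners typically estimate the acceptable range from the data, which is not modelled here.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len


def looks_proper(observed_fragment_len, expected_fragment_len, tolerance):
    """True if the observed fragment length is within `tolerance` of the expectation."""
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance


# 500 bp fragments with 100 bp reads leave a 300 bp gap between the mates
gap = expected_inner_distance(500, 100)
```

With a hypothetical tolerance of 100 bp, a pair spanning 520 bp would pass the distance check, while one spanning 5 000 bp would not.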

Proper pairs: 98.8 % (52 322 098 reads). Remainder: 1.2 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
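The coordinate-matching idea can be illustrated with a deliberately simplified sketch: a pair is flagged as a duplicate if an earlier pair had the same coordinates for both mates. The dict keys are hypothetical, and real duplicate-marking tools apply further criteria that are not reproduced here.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if an identical pair was seen earlier."""
    seen = set()
    is_duplicate = []
    for pair in pairs:
        # Key on the coordinates of the read and of its mate
        key = (pair["chrom"], pair["pos"], pair["mate_chrom"], pair["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


# Hypothetical pairs: the third repeats the coordinates of the first
pairs = [
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 450},
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},
]
marks = mark_duplicates(pairs)
duplicate_rate = sum(marks) / len(marks)
```

Note that the second pair shares a start coordinate with the first but is not flagged, because its mate maps elsewhere.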

Duplicates: 5.7 % (3 023 249 reads). Remainder: 94.3 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (5M to 40M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads can map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also occur when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
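A minimal sketch of the per-chromosome tally, assuming each record is a (chromosome, FLAG) tuple taken from a SAM/BAM file and that unmapped reads with a placed mate carry the mate's chromosome, as described above. Reads with no chromosome at all (RNAME `*` in SAM) are skipped. The function name and record layout are illustrative.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped / (mapped + unmapped)} from (chrom, flag) tuples."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue  # unplaced read: cannot be attributed to a chromosome
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# Hypothetical records: flag 133 is an unmapped read placed at its mate's position
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```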

[Chart: mapped vs unmapped reads per chromosome (labelled 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Every column shows 100% mapped and 0% unmapped.]