European Genome-Phenome Archive

File Quality

File Information: EGAF00002441833

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The y-axis uses a log scale so that very large and very small counts remain readable on the same chart. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: number of bases (y-axis, log scale, 1k to 100M) against coverage value (x-axis, 0 to >1000).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
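As a rough illustration of the Phred relationship, the following Python sketch converts between scores and error probabilities using the standard definition Q = -10 × log10(P). The function names are hypothetical and are not part of any EGA tooling.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p, with 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)
```

Under this definition, a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.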

[Chart: number of bases (y-axis, 0G to 14G) per Phred quality score (x-axis, 0 to 40).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
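The calculation can be sketched in Python as below. The function name and the example counts are illustrative assumptions, not values taken from this file.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 998 mapped reads and 2 unmapped reads.
rate = mapped_rate(998, 2)
print(f"{rate:.1%}")  # prints 99.8%
```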

Mapped: 99.8 % (141 625 344 reads); unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here, unlike in the Proper Pairs chart.

Both mates mapped: 99.6 % (141 343 742 reads); other: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
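Assuming the alignments are stored in SAM/BAM format (an assumption; the report does not state how the statistic is computed), a singleton can be recognised from the FLAG field: the read is paired, it is itself mapped, and its mate is unmapped. The sketch below uses the FLAG bit values defined in the SAM specification; the function name is hypothetical.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def is_singleton(flag: int) -> bool:
    """True if the read is paired and mapped while its mate is unmapped."""
    return (
        bool(flag & PAIRED)
        and not flag & UNMAPPED
        and bool(flag & MATE_UNMAPPED)
    )
```

For example, a FLAG of 73 (paired, mate unmapped, first in pair) describes a singleton, whereas a FLAG of 99 (paired, proper pair, mate on reverse strand, first in pair) does not.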

Singletons: 0.2 % (281 602 reads); other: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (70 984 187 reads); reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads spanning it map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
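The arithmetic behind the ~300 base pair figure can be sketched as follows. The function names and the tolerance value are illustrative assumptions; real aligners typically derive the acceptable range from the observed fragment-size distribution rather than from a fixed tolerance.

```python
def expected_inner_distance(fragment_length: int, read_length: int) -> int:
    """Gap between two mates sequenced from either end of one fragment."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed: int, expected: int,
                             tolerance: int = 100) -> bool:
    """Hypothetical check: is the observed gap close to the expected one?"""
    return abs(observed - expected) <= tolerance


# 500 bp fragments with 100 bp reads from each end.
print(expected_inner_distance(500, 100))  # prints 300
```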

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.3 % (139 503 896 reads); other: 1.7 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
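The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the method used to produce this report: the `ReadPair` type and function names are hypothetical, and real duplicate-marking tools apply additional criteria such as strand orientation and which copy to keep.

```python
from typing import Iterable, NamedTuple, Set


class ReadPair(NamedTuple):
    name: str
    chrom: str
    pos: int
    mate_chrom: str
    mate_pos: int


def find_duplicates(pairs: Iterable[ReadPair]) -> Set[str]:
    """Names of pairs whose coordinates repeat those of an earlier pair."""
    seen = set()
    duplicates = set()
    for pair in pairs:
        key = (pair.chrom, pair.pos, pair.mate_chrom, pair.mate_pos)
        if key in seen:
            duplicates.add(pair.name)  # later copies count as duplicates
        else:
            seen.add(key)              # first occurrence is kept
    return duplicates


def duplicate_rate(pairs: Iterable[ReadPair]) -> float:
    """Fraction of pairs marked as duplicates."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return len(find_duplicates(pairs)) / len(pairs)
```

Because the key is only the pair of mapping coordinates, two independent fragments that happen to start and end at the same positions are also counted, which is why the rate rises with depth.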

Duplicates: 6.1 % (8 685 112 reads); non-duplicates: 93.9 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: number of reads (y-axis, 10M to 120M) per Phred mapping quality score (x-axis, 0 to 60).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: percentage of mapped and unmapped reads per chromosome (x-axis: chromosome; y-axis: 0% to 100%). Mapped fractions range from 99.67% to 99.83% and unmapped fractions from 0.17% to 0.33%.]