European Genome-Phenome Archive

File Quality

File Information

EGAF00000643975

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
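As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value. The function name, the `cap` parameter, and the use of `cap + 1` as the key for the ">1000" overflow bin are assumptions made for this example, not the archive's actual implementation.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value (hypothetical helper).

    depths: iterable of per-position coverage values (non-negative ints).
    cap:    values above this are pooled into a single overflow bin,
            stored under the key cap + 1 (mirrors a ">1000" bin).
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else cap + 1
        hist[key] += 1
    return dict(hist)


if __name__ == "__main__":
    # Five positions: two uncovered, one very deeply covered.
    print(coverage_histogram([0, 0, 1, 5, 2000]))
```

Plotting the resulting counts on a logarithmic y-axis would give the kind of descending curve described above.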

[Chart: base coverage distribution. X-axis: Coverage value (100 to >1000). Y-axis: # Bases (log scale, 100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect a distribution skewed towards larger (more confident) scores, typically around 40.
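The relationship between a Phred score and an error probability can be checked with a short sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and each additional 10 points divides the error probability by ten.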

[Chart: base quality distribution. X-axis: Phred quality score (0 to 50). Y-axis: # Bases (0M to 600M).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
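The calculation mapped / (mapped + unmapped) can be sketched as follows, assuming the reads are available as SAM-style integer flags, where bit 0x4 marks an unmapped read. The function name and the choice to count every record (without filtering secondary or supplementary alignments) are simplifying assumptions for illustration.

```python
UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for a collection of SAM flags.

    Simplification (assumption): every record counts as one read.
    """
    flags = list(flags)
    if not flags:
        return 0.0
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / len(flags)


if __name__ == "__main__":
    # Three mapped reads (flags 0, 16, 99) and one unmapped read (flag 4).
    print(mapped_rate([0, 16, 4, 99]))
```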

Mapped: 99.3 % (47 471 963 reads). Unmapped: 0.7 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; the expected distance is assessed in the Proper Pairs chart.

Both mates mapped: 99 % (47 343 972 reads). Other: 1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
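The distinction between singletons and pairs with both mates mapped can be illustrated with SAM-style flags, assuming the alignments carry the standard bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function name and the returned labels are hypothetical.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify_read(flag):
    """Label one read from its SAM flag (labels are illustrative only)."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped, but its mate is not
    return "both_mates_mapped"


if __name__ == "__main__":
    for flag in (73, 99, 133):
        print(flag, classify_read(flag))
```

Counting the reads labelled "singleton" and dividing by the total number of reads would give a singleton rate comparable to the one reported here, under these assumptions.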

Singletons: 0.3 % (127 991 reads). Other: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (23 906 210 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the affected reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
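The arithmetic behind the expected separation is simply the fragment length minus the two read lengths. The sketch below reproduces the 500 / 100 / 300 example; the `tolerance` parameter and both function names are assumptions for illustration, since the actual criteria are defined by the aligner.

```python
def expected_separation(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len


def within_expected_distance(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is a hypothetical parameter; real aligners derive the
    accepted range from the observed insert-size distribution.
    """
    expected = expected_separation(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance


if __name__ == "__main__":
    print(expected_separation(500, 100))
    print(within_expected_distance(5000, 500, 100, tolerance=50))
```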

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).

Proper pairs: 98.9 % (47 283 660 reads). Other: 1.1 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate, and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
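The coordinate-based idea described above can be sketched in a few lines: pairs whose read and mate coordinates have already been seen are flagged as duplicates. This is a deliberately simplified illustration with hypothetical function names and a hypothetical tuple layout; production duplicate-marking tools apply additional criteria that are not modelled here.

```python
def mark_duplicates(pairs):
    """Flag pairs whose coordinates repeat an earlier pair.

    pairs: iterable of (chrom, pos, mate_chrom, mate_pos) tuples
           (hypothetical layout for this example).
    Returns a list of booleans; the first occurrence is kept as the
    original and later identical pairs are flagged True.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates."""
    flags = mark_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    example = [
        ("1", 100, "1", 400),
        ("1", 100, "1", 400),  # same coordinates as the first pair
        ("1", 100, "1", 450),  # same start, different mate position
        ("2", 100, "2", 400),
    ]
    print(mark_duplicates(example), duplicate_rate(example))
```

The sketch also shows why deep coverage inflates the rate: two independent fragments that happen to share both coordinates are indistinguishable from a true PCR duplicate under this rule.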

Duplicates: 6.5 % (3 104 174 reads). Other: 93.5 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (5M to 35M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mapped mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

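[Chart: mapped vs unmapped reads per chromosome, stacked columns. Y-axis: 0% to 100%. Values are listed in chart order.]

Chromosomes: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, X, Y, M

Mapped: 99.72%, 99.72%, 99.85%, 99.49%, 99.75%, 99.84%, 99.65%, 99.81%, 99.58%, 99.56%, 99.85%, 99.88%, 99.82%, 99.86%, 99.58%, 99.63%, 99.72%, 99.77%, 99.79%, 99.81%, 99.65%, 99.68%, 95.96%, 99.64%

Unmapped: 0.28%, 0.28%, 0.15%, 0.51%, 0.25%, 0.16%, 0.35%, 0.19%, 0.42%, 0.44%, 0.15%, 0.12%, 0.18%, 0.14%, 0.42%, 0.37%, 0.28%, 0.23%, 0.21%, 0.19%, 0.35%, 0.32%, 4.04%, 0.36%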