European Genome-Phenome Archive

File Quality

File Information

EGAF00000644671

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage across the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The number of bases is plotted on a log scale so that the wide range of counts stays readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000); y-axis: number of bases (log scale, 100 to 100M).]
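As a minimal sketch of how such a distribution could be tallied, assume the per-position depths are already available as a list of integers. The function names and the overflow bin at 1001 (mirroring the chart's ">1000" label) are illustrative assumptions, not a description of how the archive computes the chart.

```python
from collections import Counter
import math

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed cap + 1,
    mirroring the chart's ">1000" label (an assumption about its binning).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist

def log10_counts(hist):
    """Return log10 of each non-zero count, as a log-scaled y-axis would show."""
    return {cov: math.log10(n) for cov, n in hist.items() if n > 0}

# Hypothetical depths for nine reference positions.
depths = [0, 0, 0, 1, 1, 2, 5, 1500, 2000]
hist = coverage_histogram(depths)
```

With these hypothetical depths, three positions fall in the zero-coverage bin and the two deepest positions share the overflow bin.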

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly called, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. For example, Q = 40 corresponds to an error probability of 1 in 10 000. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 45); y-axis: number of bases (0M to 900M).]
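The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10·log10(P). The function names are illustrative, and the decoder assumes the common Phred+33 ASCII encoding of quality strings, which may not apply to every file.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_phred33(quality_string):
    """Decode an ASCII quality string, assuming the Phred+33 offset."""
    return [ord(ch) - 33 for ch in quality_string]

scores = decode_phred33("I5!")
```

Under the Phred+33 assumption, the characters `I`, `5` and `!` decode to scores 40, 20 and 0.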

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and pairs with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.1 % (51 254 620 reads); unmapped: 0.9 %
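The rate calculation can be sketched as below. The function name is illustrative and the counts passed to it are hypothetical. The final check uses the read counts reported on this page for the Both Mates Mapped and Singletons charts, which together account for the mapped total.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts chosen only to illustrate the formula.
rate_percent = round(100 * mapped_rate(991, 9), 1)

# Counts reported on this page: both mates mapped + singletons = mapped reads.
both_mates_mapped = 51_046_582
singletons = 208_038
mapped_total = both_mates_mapped + singletons
```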

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, whereas they are not considered proper pairs in the Proper Pairs chart.

Both mates mapped: 98.7 % (51 046 582 reads); other: 1.3 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.4 % (208 038 reads); other: 99.6 %
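In SAM/BAM files, the categories described above can be derived from the FLAG field of each read. The sketch below uses the standard SAM flag bits; the function name and category labels are illustrative, and whether the archive computes its charts exactly this way is an assumption.

```python
# Standard SAM flag bits (from the SAM specification).
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # read itself is unmapped
MATE_UNMAPPED = 0x8   # the read's mate is unmapped

def classify_read(flag):
    """Classify a read by its SAM flag into one of four illustrative labels."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

labels = [classify_read(f) for f in (99, 73, 77, 0)]
```

Flag 99 is a mapped read whose mate is also mapped, 73 is a mapped read with an unmapped mate, 77 is an unmapped read, and 0 is a mapped single-end read.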

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (25 852 798 reads); other: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads around it will map with an unexpected separation. This is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.

Proper pairs: 98 % (50 691 522 reads); other: 2 %
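The expected separation in the example above is simple arithmetic: the fragment length minus the two read lengths. A minimal sketch, with an illustrative function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both read lengths.

    A negative result would mean the two reads overlap within the fragment.
    """
    return fragment_length - 2 * read_length

# Values from the example in the text: ~500 bp fragments, 100 bp reads.
gap = expected_inner_distance(500, 100)
```

For the values in the text this gives a gap of 300 base pairs.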

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 7 % (3 632 785 reads); non-duplicates: 93 %
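The coordinate-based approach described above can be sketched in a simplified form. Each pair is represented here as a hypothetical tuple of (chromosome, position, mate chromosome, mate position). Real duplicate-marking tools also consider strand, clipping and base quality, so this is an illustration of the idea rather than any particular tool's algorithm.

```python
def find_duplicate_pairs(pairs):
    """Return the indices of pairs whose coordinates were already seen.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples.
    The first pair at each coordinate set is kept; later ones are flagged.
    """
    seen = set()
    duplicates = []
    for index, key in enumerate(pairs):
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates

# Hypothetical read pairs: the third repeats the first pair's coordinates.
pairs = [
    ("1", 1000, "1", 1300),
    ("1", 1000, "1", 1350),  # same start, different mate position: kept
    ("1", 1000, "1", 1300),  # identical coordinates: flagged
    ("2", 500, "2", 800),
]
duplicates = find_duplicate_pairs(pairs)
duplicate_rate = len(duplicates) / len(pairs)
```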

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (5M to 40M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, because it concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (labels 1 to 21, X, Y, M); y-axis: percentage of reads (0% to 100%). Displayed values: 100% mapped and 0% unmapped for every chromosome shown.]
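A per-chromosome tally of this kind could be sketched as follows, assuming each read is available as a hypothetical (chromosome, SAM flag) record. An unmapped read that inherits its mate's chromosome is counted under that chromosome, as described above. The function name and record layout are illustrative.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM flag bit: read is unmapped

def per_chromosome_counts(records):
    """Tally mapped and unmapped reads per chromosome.

    `records` is an iterable of (chrom, flag) pairs.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for chrom, flag in records:
        key = "unmapped" if flag & UNMAPPED else "mapped"
        counts[chrom][key] += 1
    return dict(counts)

# Hypothetical records: flag 133 is an unmapped second mate that was
# given chromosome 1 because its mate (flag 73) aligned there.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
counts = per_chromosome_counts(records)
```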