European Genome-Phenome Archive

File Quality

File Information: EGAF00002551497

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The number of bases is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
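
As an illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value and pools everything above 1000 into one overflow bin, mirroring the ">1000" label on the chart. The function name `coverage_histogram` and the input format (a plain list of per-position depths) are assumptions made for this example; they do not describe the archive's own pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one overflow bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
hist = coverage_histogram([1, 1, 2, 30, 30, 30, 1500])
```

Plotting the counts on a log axis is then a display choice; the counts themselves are unchanged.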

[Chart: base coverage distribution. X-axis: coverage value (1 to >1000). Y-axis: number of bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the base call is wrong. For example, Q = 30 corresponds to P = 0.001. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
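
The relationship between a Phred score and its error probability can be sketched in a few lines. The example uses the standard Phred definition Q = -10 × log10(P); the function names are chosen for this illustration only.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P, with 0 < P <= 1."""
    return -10 * math.log10(p)

# Print the error probability for a few common scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in the score divides the error probability by ten, which is why a distribution centred near 40 indicates very confident base calls.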

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 14G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
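
A minimal sketch of that calculation follows, making the grouping of terms explicit: the denominator is the sum of mapped and unmapped reads. The function name and the counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=996, unmapped=4)
```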

Mapped: 99.6 % (148 281 179 reads). Unmapped: 0.4 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected distance is only taken into account in the Proper Pairs chart.

Both mates mapped: 99.4 % (148 072 718 reads). Remaining reads: 0.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
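
Assuming the alignments are stored in SAM/BAM format (an assumption; this page does not name the file format), the distinction between singletons and fully mapped pairs can be read from the standard FLAG bits. The sketch below is illustrative only, and the function name `classify_read` is hypothetical.

```python
# Standard SAM FLAG bits, as defined in the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify_read(flag):
    """Classify a SAM FLAG as 'unmapped', 'unpaired_mapped', 'singleton' or 'both_mapped'."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "unpaired_mapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"
```

For instance, `classify_read(73)` describes a mapped first mate whose partner is unmapped, so it is counted as a singleton.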

Singletons: 0.1 % (208 461 reads). Remaining reads: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
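
One way to compute such a fraction, assuming SAM-style FLAG values where bit 0x10 marks a read aligned to the reverse strand, is sketched below. Using mapped reads as the denominator is an assumption of this example; the page does not state which denominator the chart uses.

```python
UNMAPPED = 0x4  # standard SAM FLAG bit: read is unmapped
REVERSE = 0x10  # standard SAM FLAG bit: read aligned to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: two forward reads, two reverse reads, one unmapped read.
fraction = forward_fraction([99, 147, 83, 163, 77])
```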

Forward strand: 50 % (74 459 204 reads). Remaining reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map with an unexpected separation; this separation is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
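
The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. A minimal sketch, with a hypothetical function name:

```python
def expected_inner_distance(fragment_len, read_len):
    """Expected gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

# Values from the example in the text: 500 bp fragments, 100 bp reads.
gap = expected_inner_distance(fragment_len=500, read_len=100)
```

In practice fragment lengths vary around their nominal value, so aligners typically accept a range of separations rather than a single number.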

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.5 % (145 227 192 reads). Remaining reads: 2.5 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be considered when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
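
The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration: the record layout (dictionaries with `chrom`, `pos`, `mate_chrom` and `mate_pos` keys) and the function name are assumptions, and real duplicate-marking tools also consider details such as read orientation and clipping.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if an earlier pair has the same coordinates."""
    seen = set()
    is_duplicate = []
    for pair in pairs:
        key = (pair["chrom"], pair["pos"], pair["mate_chrom"], pair["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

# Hypothetical read pairs: the second one repeats the coordinates of the first.
pairs = [
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 420},
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
```

Because the key is built from coordinates alone, two independent fragments that happen to share both coordinates are also marked, which is the false-positive effect at high depth mentioned above.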

Duplicates: 6.3 % (9 398 861 reads). Remaining reads: 93.7 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so the distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 130M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Such reads can also arise when the alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
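
Assuming SAM-style records, where an unmapped read with a mapped mate carries its mate's reference name and position, a per-chromosome tally could be sketched as follows. The input format (pairs of reference name and FLAG) and the function name are assumptions made for this example.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: read is unmapped

def per_chromosome_counts(records):
    """Tally reads per chromosome from (rname, flag) records as {rname: [mapped, unmapped]}."""
    counts = defaultdict(lambda: [0, 0])
    for rname, flag in records:
        if rname == "*":
            # No chromosome assigned, e.g. both mates unmapped; skip.
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return dict(counts)

# Hypothetical records: the read with flag 133 is unmapped but placed with its mate on chromosome 1.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
counts = per_chromosome_counts(records)
```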

[Chart: stacked columns of mapped and unmapped reads per chromosome (axis labels 1 to 21, X, Y, M). Y-axis: 0% to 100%. Mapped fractions range from 99.8% to 99.91%; unmapped fractions range from 0.09% to 0.2%.]