European Genome-Phenome Archive

File Quality

File Information

EGAF00005190973

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
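As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value. The function name, the `cap` parameter and the sample depths are hypothetical, and this is not necessarily how the archive computes its chart.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into a single overflow bin (key cap + 1),
    similar in spirit to the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist

# Hypothetical per-position depths for a tiny reference.
example_depths = [0, 0, 1, 1, 1, 30, 30, 1500]
example_hist = coverage_histogram(example_depths)
for value in sorted(example_hist):
    print(value, example_hist[value])
```

Plotting the counts on a logarithmic y-axis would then give the shape described above.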

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
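To make the relationship between score and error probability concrete, here is a minimal sketch assuming the standard Phred scaling Q = -10 log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for a given error probability (0 < p <= 1)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly 1 error in 10 000 calls,
# a score of 20 to roughly 1 error in 100.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```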

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 30G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
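The calculation is a simple ratio. The sketch below uses hypothetical read counts and a hypothetical function name purely to illustrate it.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts: 996 mapped reads and 4 unmapped reads.
rate = mapped_rate(996, 4)
print(f"{rate:.1%}")
```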

Mapped: 99.6 % (239 329 563 reads). Unmapped: 0.4 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance (see the Proper Pairs chart).

Both mates mapped: 99.3 % (238 637 894 reads). Other: 0.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
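One way to tell these cases apart, assuming the alignments follow the SAM flag convention (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped), is sketched below. The `classify` function and its category labels are hypothetical, and the source does not state that the report is computed this way.

```python
# SAM FLAG bits used in this sketch.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify(flag):
    """Classify one read by its SAM flag (labels are illustrative)."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # this read mapped, its mate did not
    return "both_mapped"

# Example flags: 73 = paired + mate unmapped + first in pair,
# 99 = paired + proper pair + mate reverse + first in pair.
for flag in (73, 99, 133):
    print(flag, classify(flag))
```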

Singletons: 0.3 % (691 669 reads). Other: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (120 138 833 reads). Remaining: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant (e.g. a large insertion or deletion), the mates can map with a separation that differs substantially from the expected one. Such a separation is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
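The arithmetic behind the ~300 base pair figure can be sketched as follows. The function names and the `tolerance` parameter are hypothetical; real aligners estimate the acceptable range from the observed insert size distribution rather than from a fixed threshold.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len

def is_expected_distance(observed_gap, fragment_len, read_len, tolerance=100):
    """True if the observed gap is within `tolerance` bp of the expectation.

    `tolerance` is an arbitrary illustrative value.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance

print(expected_inner_distance(500, 100))     # 500 - 2 * 100
print(is_expected_distance(320, 500, 100))   # close to the expected gap
print(is_expected_distance(5000, 500, 100))  # far larger than expected
```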

Proper pairs: 97.9 % (235 320 624 reads). Other: 2.1 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and consequently it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
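The coordinate-based approach described above can be sketched in a few lines. This is a simplified illustration with a hypothetical record layout (chromosome, position, mate chromosome, mate position). Production tools also consider strand and orientation, and usually keep the best-quality copy rather than the first one seen.

```python
def mark_duplicates(pairs):
    """Flag a pair as duplicate if an earlier pair has the same coordinates.

    Each pair is a tuple: (chrom, pos, mate_chrom, mate_pos).
    Returns a list of booleans, one per input pair.
    """
    seen = set()
    flags = []
    for pair in pairs:
        flags.append(pair in seen)
        seen.add(pair)
    return flags

# Hypothetical pairs: the third repeats the coordinates of the first.
example_pairs = [
    ("1", 1000, "1", 1300),
    ("1", 1000, "1", 1350),
    ("1", 1000, "1", 1300),
    ("2", 500, "2", 800),
]
example_flags = mark_duplicates(example_pairs)
print(example_flags)
print(sum(example_flags) / len(example_flags))  # duplicate rate
```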

Duplicates: 3.7 % (8 801 339 reads). Non-duplicates: 96.3 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (20M to 220M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (stacked columns). X-axis: chromosome. Y-axis: percentage of reads (0% to 100%). Mapped fractions range from 99.24% to 99.73%; unmapped fractions range from 0.27% to 0.76%.]