European Genome-Phenome Archive

File Quality

File Information: EGAF00002386399

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage value.

Data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]
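The distribution described above can be sketched as a simple histogram of per-position depths. This is a minimal illustration, not the archive's actual pipeline: the function name `coverage_histogram`, the `cap` parameter (mirroring the ">1000" bin on the chart axis), and the toy depth values are all assumptions.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Every depth above `cap` is pooled into a single overflow bin,
    stored under the key cap + 1 (the ">1000" bin on the chart).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Toy per-position depths (hypothetical, not taken from this file).
depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500, 2000]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = f">{1000}" if value == 1001 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis would then give the kind of curve the chart shows, with most positions at low coverage values.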
024964990969930890894830929891815848904883919870850814852804838808814880789883839759717764775662640713701700691653697707737805692706722694602672660660715726662709735696781778800785742704722689663685680630583666679680622626626664592614625565614607570660656653683614657674628603695575617620654772621671671617686660680629620630631684656619680624625634645596577630559580601545567544559563506550580607564571564594561539568539554589599577582585548546601549600560491513521532553538557536513501502610567539535493515536482541489459464510465476484472483470491439499434439463432469448433419420405430417409424412381396400418368393381431479388359361372365348404374339352341370343333401387372372353361402329362332340351359325304355389392359353349373336329318355306336310320297309315335268315276313282318347314340340336363317321337336318324302301334340313354314347339297335294315335294320339344308326347304330318337307299326303293327281297277316269272221271267295269272274275257291300266238264283256271263270255300256244263258216245263267257269284239262234237240234237233216259252229210246252213239211226241265240265223221253222242248219229269246218267237247245254236224224215215194208255230207248233224250218244201220220238232202223244220238231191237219244226227228215233212199208216169229221193228225219200205203219187237220219222226206208226219226233256248197215239178221217233221198236234212215228242195228232231231209209213241214175210219201227221229226205231220229188210221215181202234219238242261221229198218234214210228218208194211236183207217211230215229218189208207233219215186206239244199212221205202205198243245193261192224211220229222253228209238216199221226220208208207198180218222200213240228227247184221211242216226221201218197216229210180191225222219211224229215236239215227238234217231233226227246230238225227244231237225239212226213218241204233174213218231235216223219241221219207227179223212215221229209218224177 817100200300400500600700800900>1000Coverage value1k10k100k1M10M100M# Bases

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. because of a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. For example, Q = 40 corresponds to an error probability of 1 in 10,000. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 7G).]
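As a small worked illustration of the score definition, the following sketch converts between Phred scores and error probabilities using the standard relation Q = -10·log10(P). The function names are chosen here for illustration only.

```python
import math

def phred_to_error_prob(q):
    """Probability that the call is wrong, given Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Each step of 10 in Q makes an error ten times less likely.
for q in (10, 20, 30, 40):
    print(f"Q={q}: P(error)={phred_to_error_prob(q):.4f}")
```

The same conversion applies to the mapping quality scores reported further down, where P is the probability that the read is placed at the wrong location.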

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped: 99.3 % (90 622 003 reads). Unmapped: 0.7 %.
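The rate calculation described above, mapped / (mapped + unmapped), can be written out as follows. The function name and the toy counts are hypothetical; the report does not state the unmapped read count, so it is not reproduced here.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Toy counts (hypothetical), chosen to give the same rate as the report.
rate = mapped_rate(mapped=993, unmapped=7)
print(f"{rate:.1%}")
```

Note the parentheses around the denominator: dividing by `mapped` alone and then adding `unmapped` would give a different, meaningless number.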

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; the expected distance is the subject of the Proper Pairs chart.

Both mates mapped: 98.8 % (90 075 974 reads). Remainder: 1.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.6 % (546 029 reads). Remainder: 99.4 %.
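The distinction between singletons and reads with both mates mapped can be sketched from the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). This is an illustration under the assumption that the reads come from a SAM/BAM file; the function name and category labels are made up here, and the archive's own counting method is not documented in this report.

```python
# Standard SAM flag bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify one read by its SAM flag (labels are illustrative)."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "mapped_unpaired"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"

# Flags 99, 73 and 133 are common values seen in paired-end alignments.
for flag in (99, 73, 133):
    print(flag, classify(flag))
```

With this split, mapped reads are the singletons plus the reads with both mates mapped, which matches the counts in this report: 90 075 974 + 546 029 = 90 622 003.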

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (45 607 985 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large deletion, the reads flanking the variant map with a much larger separation than expected; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 97.5 % (88 901 386 reads). Remainder: 2.5 %.
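The arithmetic behind the expected separation in the example above (500 bp fragments, 100 bp reads, ~300 bp between the reads) can be made explicit. The function names and the tolerance-based check are assumptions for illustration; real aligners apply their own criteria, usually based on the observed insert size distribution and read orientation.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len

def at_expected_distance(observed_gap, expected_gap, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    return abs(observed_gap - expected_gap) <= tolerance

gap = expected_inner_distance(fragment_len=500, read_len=100)
print(gap)

# Hypothetical tolerance of 150 bp around the expected gap.
print(at_expected_distance(320, gap, tolerance=150))   # close to expected
print(at_expected_distance(5000, gap, tolerance=150))  # far too wide
```

A pair that fails this kind of distance check would still count towards the Both Mates Mapped figure, but not towards Proper Pairs.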

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 5.5 % (5 014 100 reads). Remainder: 94.5 %.
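The coordinate-based approach described above can be sketched as follows: two pairs are treated as duplicates when both the read and its mate share the same chromosome, position and strand. This is a simplified illustration, not the algorithm of any particular tool; the record layout and names are hypothetical, and real duplicate markers typically also decide which copy to keep based on base quality.

```python
def find_duplicates(pairs):
    """Return the names of pairs whose coordinates were already seen.

    Each record is (name, chrom, pos, strand, mate_chrom, mate_pos,
    mate_strand). The first pair seen at a coordinate key is kept;
    later pairs with the same key are reported as duplicates.
    """
    seen = set()
    duplicates = set()
    for name, *coords in pairs:
        key = tuple(coords)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Toy records (hypothetical).
pairs = [
    ("r1", "1", 1000, "+", "1", 1400, "-"),
    ("r2", "1", 1000, "+", "1", 1400, "-"),  # same coordinates as r1
    ("r3", "1", 1000, "+", "1", 1450, "-"),  # mate differs: not a duplicate
    ("r4", "2", 1000, "+", "2", 1400, "-"),  # other chromosome
]
dups = find_duplicates(pairs)
print(sorted(dups), len(dups) / len(pairs))
```

The sketch also shows why depth matters: the more pairs are sampled, the more likely it is that two independent fragments share a key by chance and are flagged.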

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (10M to 80M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Series: mapped, unmapped.]

Mapped (%), in chart order: 99.37, 99.37, 99.37, 99.38, 99.37, 99.38, 99.39, 99.38, 99.36, 99.37, 99.37, 99.38, 99.38, 99.37, 99.36, 99.42, 99.37, 99.37, 99.43, 99.35, 99.38, 99.39, 99.59, 99.15

Unmapped (%), in chart order: 0.63, 0.63, 0.63, 0.62, 0.63, 0.62, 0.61, 0.62, 0.64, 0.63, 0.63, 0.62, 0.62, 0.63, 0.64, 0.58, 0.63, 0.63, 0.57, 0.65, 0.62, 0.61, 0.41, 0.85