European Genome-Phenome Archive

File Quality

File Information

EGAF00002386130

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

Counts are plotted on a log scale to compress their wide range. The expected shape is a high peak at the beginning (low coverage) followed by a descending curve.

[Chart: base coverage distribution. X-axis: coverage value (1 to >1000). Y-axis: number of bases, log scale (1k to 100M).]
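As a rough illustration of how a distribution like this can be tabulated, the sketch below counts how many positions have each coverage value and pools everything above a cap into one overflow bin, mirroring the ">1000" bin on the x-axis. The function name, the `cap` parameter and the example depths are assumptions made for illustration; the archive's actual pipeline is not described here.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single overflow bin keyed
    `cap + 1`, playing the role of the ">1000" bin on the chart.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
example_depths = [1, 1, 2, 2, 2, 30, 1500, 2000]
hist = coverage_histogram(example_depths)

for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the resulting counts on a log-scaled y-axis gives the kind of curve described above.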

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 16G).]
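To make the scale concrete, here is a minimal sketch of the standard Phred conversion, Q = -10·log10(P), in both directions. The function names are illustrative and not part of any particular library.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and each additional 10 points divides the error probability by 10.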

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.5 % (154 024 540 reads). Unmapped: 0.5 %.
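The rate calculation can be sketched as follows. The mapped count is the one reported above; the unmapped count is a hypothetical value chosen only so that the example reproduces a rate of about 99.5 %, since the report does not state it.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1.

    The parentheses matter: mapped / mapped + unmapped would evaluate
    to 1 + unmapped.
    """
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


mapped = 154_024_540   # mapped reads, taken from the report
unmapped = 774_000     # hypothetical: not stated in the report

print(f"{mapped_rate(mapped, unmapped):.1%}")
```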

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates map at an unexpected distance from each other are also counted here; the Proper Pairs chart is the one that takes that distance into account.

Both mates mapped: 99.3 % (153 582 746 reads). Other: 0.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (441 794 reads). Other: 99.7 %.
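For alignments stored in SAM/BAM format, categories like "both mates mapped" and "singleton" can be derived from the FLAG field. The sketch below uses the FLAG bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function and category names are illustrative, and it is an assumption that the archive derives its statistics this way.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag):
    """Classify one alignment record by its SAM FLAG value."""
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"          # mapped, but its mate is not
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


# Example FLAG values: 99 is a mapped read whose mate is also mapped,
# 73 is a mapped read with an unmapped mate, 77 is an unmapped read.
for flag in (99, 73, 77):
    print(flag, classify_read(flag))
```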

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (77 365 879 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with a separation far from the expected one are a signal for this variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.8 % (151 253 720 reads). Other: 2.2 %.
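The separation arithmetic in the example above (500 bp fragments, 100 bp reads, ~300 bp between the mates) can be written out as a small sketch. The function names and the tolerance parameter are hypothetical; aligners use their own criteria for what counts as the expected distance.

```python
def expected_separation(fragment_length, read_length):
    """Unsequenced gap between two mates read from either end of a fragment.

    A negative result means the two reads overlap.
    """
    return fragment_length - 2 * read_length


def within_expected(observed, expected, tolerance):
    """Hypothetical check: is the observed separation close to the expected one?"""
    return abs(observed - expected) <= tolerance


gap = expected_separation(fragment_length=500, read_length=100)
print(gap)  # 300

# Illustrative tolerance of 150 bp; real aligners estimate this from the data.
print(within_expected(320, gap, tolerance=150))
print(within_expected(5000, gap, tolerance=150))
```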

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 10.4 % (16 102 056 reads). Other: 89.6 %.
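The coordinate-based idea described above can be sketched as follows: a pair is flagged as a duplicate if an earlier pair has the same read and mate coordinates. This is a simplified illustration with hypothetical input tuples; production duplicate-marking tools take additional information into account, and it also shows why unrelated pairs that happen to share coordinates get flagged at high depth.

```python
def mark_duplicates(pairs):
    """Flag pairs whose coordinates match those of an earlier pair.

    Each pair is a tuple (chrom, pos, mate_chrom, mate_pos).
    Returns a list of booleans, one per pair, in input order.
    """
    seen = set()
    flags = []
    for pair in pairs:
        flags.append(pair in seen)   # True if these coordinates were seen before
        seen.add(pair)
    return flags


# Hypothetical pairs, for illustration only.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),   # same coordinates as the first pair
    ("1", 100, "1", 450),   # same read start, different mate position
    ("2", 100, "2", 400),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)
```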

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so the distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 130M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%). Mapped fractions range from 99.57% to 99.74%; unmapped fractions range from 0.26% to 0.43%.]