European Genome-Phenome Archive

File Quality

File Information: EGAF00000701738

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis represents the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis represents the number of bases with that coverage.

Data is represented on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
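As a rough illustration of how such a distribution could be tallied, the sketch below counts how many positions share each coverage value. The `depths` list is made-up sample input, and the function name `coverage_histogram` and its `cap` parameter (mirroring the chart's ">1000" bin) are assumptions for this example, not part of the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed `cap + 1`
    (an assumed convention, standing in for the chart's ">1000" bin).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    # Sort by coverage value so the result reads like the chart's x-axis.
    return dict(sorted(hist.items()))

# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
```

Plotting the counts in `hist` on a logarithmic y-axis would give the shape described above: a peak at low coverage and a descending tail.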

[Chart: Base Coverage Distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 100 to 10M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution differs between sequencing technologies, but in all cases we expect it to be skewed towards larger (more confident) scores, typically around 40.
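To make the scale concrete, here is a minimal sketch of the conversion under the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly 1 error in 10,000 calls,
# and a score of 20 to roughly 1 error in 100.
p40 = phred_to_error_prob(40)
p20 = phred_to_error_prob(20)
```

Each step of 10 on the Phred scale is a tenfold change in the error probability, which is why a distribution concentrated near 40 indicates confident base calls.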

[Chart: Base Quality. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0M to 140M).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads against the total number of reads: mapped / (mapped + unmapped).
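The ratio can be sketched from SAM flag values, where bit 0x4 marks a read as unmapped. This is a simplified illustration: the sample flags are invented, and whether the report also filters secondary or supplementary alignments is not stated in the source, so no such filtering is applied here.

```python
UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM flags."""
    flags = list(flags)
    if not flags:
        return 0.0
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / len(flags)

# Hypothetical flags: three mapped reads (99, 147, 73)
# and three unmapped reads (133, 77, 141).
sample_flags = [99, 147, 73, 133, 77, 141]
rate = mapped_rate(sample_flags)
```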

Mapped: 99.2 % (12 156 345 reads). Unmapped: 0.8 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other, which the Proper Pairs chart excludes.

Both mates mapped: 98.8 % (12 117 986 reads). Other: 1.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
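The distinction between a singleton and a read whose mate also mapped can be sketched with the standard SAM flag bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function name and category labels below are chosen for this example and are not taken from the EGA report.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def pairing_status(flag):
    """Classify a read by its own and its mate's mapping state."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"          # mapped, but the mate is not
    return "both_mates_mapped"

# Hypothetical flags covering each category.
statuses = [pairing_status(f) for f in (99, 73, 133, 0)]
```

In this report the two mapped categories add up to the mapped read count: 12 117 986 reads with both mates mapped plus 38 359 singletons gives 12 156 345.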

Singletons: 0.3 % (38 359 reads). Other: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (6 129 961 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the reads can map with a large separation, which is a signal for the variant; such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
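The arithmetic behind the ~300 base pair figure can be written out as a small sketch. The `tolerance` parameter and the function names are hypothetical: aligners typically decide proper pairing from the observed insert size distribution and the read orientation, not from a fixed threshold like this one.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len

def looks_like_proper_pair(observed_gap, fragment_len, read_len, tolerance=150):
    """Rough check: is the observed gap close to the expected one?

    `tolerance` is an assumed value for illustration only.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance

# The example from the text: ~500 bp fragments, 100 bp reads.
gap = expected_inner_distance(500, 100)
near = looks_like_proper_pair(320, 500, 100)
far = looks_like_proper_pair(5000, 500, 100)
```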

Proper pairs: 98.5 % (12 081 236 reads). Other: 1.5 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
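The coordinate-based approach described above can be sketched as grouping read pairs by the positions of both mates. This is a deliberately simplified illustration with an invented record layout; dedicated tools such as samtools markdup or Picard MarkDuplicates apply more refined rules.

```python
from collections import defaultdict

def mark_duplicates(pairs):
    """Return the names of read pairs flagged as duplicates.

    Each item in `pairs` is a tuple (hypothetical layout):
    (name, chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Pairs sharing all coordinates and strands are grouped together;
    the first one seen is kept and the rest are flagged.
    """
    groups = defaultdict(list)
    for name, *key in pairs:
        groups[tuple(key)].append(name)
    duplicates = set()
    for names in groups.values():
        duplicates.update(names[1:])
    return duplicates

# Made-up pairs: r1 and r2 share every coordinate, r3 differs in mate position.
pairs = [
    ("r1", "chr1", 100, "+", "chr1", 400, "-"),
    ("r2", "chr1", 100, "+", "chr1", 400, "-"),
    ("r3", "chr1", 100, "+", "chr1", 450, "-"),
]
dups = mark_duplicates(pairs)
```

Because the check relies only on coordinates, two independent fragments that happen to share both endpoints would also be flagged, which is the false-positive effect at high depth that the text describes.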

Duplicates: 39.5 % (4 842 900 reads). Non-duplicates: 60.5 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (1M to 10M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference onto another, it is given the right chromosome and position but is also classified as unmapped.
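A per-chromosome tally along these lines could be sketched as follows. It assumes each record carries the reference name and the SAM flag, with `*` meaning no chromosome was assigned; the record layout and function name are invented for this example.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def per_chromosome_counts(records):
    """Count mapped and unmapped reads per chromosome.

    `records` is an iterable of (rname, flag) tuples (hypothetical layout).
    An unmapped read placed at its mate's position keeps the mate's
    chromosome name, so it is counted as unmapped on that chromosome.
    Reads with no chromosome ('*') are skipped.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue
        key = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][key] += 1
    return {chrom: dict(c) for chrom, c in counts.items()}

# Made-up records: three mapped reads and one unmapped mate on chromosome 1,
# plus one read with no chromosome at all.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("*", 77)]
counts = per_chromosome_counts(records)
```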

[Chart: Mapped vs Unmapped. X-axis: chromosome (labelled 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Series: mapped, unmapped. Every column is labelled 100% mapped and 0% unmapped.]