European Genome-Phenome Archive

File Quality

File Information: EGAF00003612201

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (1 to >1000); y-axis: number of bases (log scale).]
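As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions fall at each coverage value and pools everything above a cap into one bin, mirroring the ">1000" bin on the chart's x-axis. The names `coverage_histogram`, `log10_counts` and `cap`, and the input format (a plain list of per-position depths), are assumptions made for this example and are not the archive's actual pipeline.

```python
from collections import Counter
import math

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

def log10_counts(hist):
    """Log-scale the counts, as a log y-axis would display them."""
    return {key: math.log10(count) for key, count in hist.items()}

# Hypothetical per-position depths, not taken from the file above.
depths = [1, 1, 1, 1, 2, 2, 3, 1500]
hist = coverage_histogram(depths)
print(hist)
print(log10_counts(hist))
```

Pooling the tail into a single bin keeps the x-axis bounded even when a few positions have extremely high coverage.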

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0M to 900M).]
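To make the score scale concrete, the following minimal sketch converts between Phred scores and error probabilities using the standard Phred definition, Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# Each step of 10 in Q divides the error probability by 10.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10,000 base calls.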

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted but, for more significant variation, the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (9 000 326 reads). Unmapped: 0.2 %.
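The calculation described above, mapped / (mapped + unmapped), can be sketched as follows. The function name and the toy counts are illustrative assumptions, not values taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Toy counts chosen for illustration only.
rate = mapped_rate(998, 2)
print(f"{rate:.1%}")  # prints 99.8%
```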

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance from each other are also included here; the expected-distance criterion is applied separately in the Proper Pairs chart.

Both mates mapped: 99.7 % (8 986 776 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (13 550 reads). Other: 99.8 %.
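One common way to tell these categories apart is through the FLAG field of SAM/BAM records. The sketch below uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). Whether this report derives its numbers in exactly this way is an assumption, and the `classify` function and its category labels are hypothetical.

```python
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # the read itself is unmapped
MATE_UNMAPPED = 0x8   # the read's mate is unmapped

def classify(flag):
    """Assign a read to a mapping category based on its SAM flag."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"      # mapped read whose mate is unmapped
    return "both_mapped"

# Typical flag values: 99 (both mates mapped), 73 (mapped read with an
# unmapped mate), 133 (unmapped read whose mate is mapped).
for flag in (99, 73, 133):
    print(flag, classify(flag))
```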

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (4 507 218 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, reads that map with an unexpected separation are a signal for this variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 98.7 % (8 899 258 reads). Other: 1.3 %.
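The arithmetic behind the ~300 base pair expectation is simply the fragment length minus the two read lengths. A minimal sketch, with a function name chosen for this example:

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

# The example from the text: ~500 bp fragments, 100 bp reads from each end.
print(expected_inner_distance(500, 100))  # prints 300
```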

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 51.6 % (4 648 980 reads). Other: 48.4 %.
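The coordinate-based idea described above can be sketched as follows: a read counts as a duplicate when an earlier read had the same mapping position and the same mate position. This is a deliberately simplified illustration. Real duplicate-marking tools also consider strand, clipping and base qualities, and the tuple layout and function name here are assumptions.

```python
def duplicate_rate(reads):
    """Fraction of reads whose coordinates repeat an earlier read.

    `reads` is an iterable of (chrom, pos, mate_chrom, mate_pos) tuples.
    The first read seen at a given set of coordinates is kept; later
    reads with identical coordinates are counted as duplicates.
    """
    seen = set()
    duplicates = 0
    total = 0
    for key in reads:
        total += 1
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / total if total else 0.0

# Hypothetical coordinates for illustration only.
reads = [
    ("1", 100, "1", 350),
    ("1", 100, "1", 350),   # same position and same mate position
    ("1", 100, "1", 360),   # same position, different mate position
    ("2", 5, "2", 300),
]
print(duplicate_rate(reads))  # prints 0.25
```

The third read shows why the mate position matters: sharing only one coordinate is not enough to be counted as a duplicate in this scheme.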

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (1M to 8M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto another, it will be given the right chromosome and position, but it will also be classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome, stacked columns from 0% to 100%. Values as labelled on the chart:]

Chromosome  Mapped  Unmapped
1           99.85%  0.15%
2           99.85%  0.15%
3           99.86%  0.14%
4           99.86%  0.14%
5           99.85%  0.15%
6           99.85%  0.15%
7           99.85%  0.15%
8           99.86%  0.14%
9           99.84%  0.16%
10          99.85%  0.15%
11          99.85%  0.15%
12          99.85%  0.15%
13          99.87%  0.13%
14          99.85%  0.15%
15          99.82%  0.18%
16          99.85%  0.15%
17          99.84%  0.16%
18          99.89%  0.11%
19          99.85%  0.15%
20          99.82%  0.18%
21          99.84%  0.16%
X           99.85%  0.15%
Y           98.91%  1.09%
M           99.86%  0.14%
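Because an unmapped read can carry its mate's chromosome, a per-chromosome tally only needs each record's reference name and flag. The sketch below assumes records are available as (reference name, SAM flag) pairs and uses the SAM flag bit 0x4 for "read unmapped". The function name and the record layout are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }

# Hypothetical records: on chromosome "1", the read with flag 133 is
# unmapped but carries its mate's chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
rates = per_chromosome_mapped_rate(records)
print(rates)
```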