European Genome-Phenome Archive

File Quality

File Information: EGAF00000147332

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Data is plotted on a log scale to reduce the visual spread between values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. x-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 100 to 100M).]
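As a rough illustration of how such a distribution can be built, the sketch below counts how many positions have each coverage value and pools everything above a cap into one overflow bin, mirroring the ">1000" bin on the chart. The function name `coverage_histogram`, the `cap` parameter, and the toy depths are assumptions made for this example; they are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one overflow bin."""
    hist = Counter()
    for depth in depths:
        # Hypothetical binning rule: exact value up to the cap, one pooled bin beyond it.
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Toy per-position depths (made up for illustration).
depths = [0, 0, 1, 1, 1, 2, 5, 1200, 3000]
hist = coverage_histogram(depths)
for key, count in hist.items():
    print(key, count)
```

Plotting the counts on a log-scaled y-axis, as the chart does, keeps the rare high-coverage bins visible next to the dominant low-coverage bins.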

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error has occurred. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
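The standard Phred definition is Q = -10·log10(P), so a score of 40 corresponds to an error probability of 1 in 10 000. A minimal sketch of the conversion in both directions follows; the function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (P must be > 0)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```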

[Chart: base quality distribution. x-axis: Phred quality score (0 to 45); y-axis: # Bases (0G to 1.3G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 98.6 % (47 855 304 reads). Unmapped: 1.4 %.
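The rate described above is mapped / (mapped + unmapped). A minimal sketch of that calculation follows; the function name and the toy counts are assumptions for illustration, since the page does not report the unmapped read count directly.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total

# Toy counts (hypothetical): 986 mapped and 14 unmapped reads.
rate = mapped_rate(986, 14)
print(f"{rate:.1%}")
```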

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included here; such pairs are not counted in the Proper Pairs chart.

Both mates mapped: 98.2 % (47 649 480 reads). Remainder: 1.8 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.4 % (205 824 reads). Remainder: 99.6 %.
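Assuming the file is in SAM/BAM format (the page does not state this explicitly), the singleton and both-mates-mapped categories can be read off the standard SAM FLAG bits: 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The sketch below is one way to express the two definitions; the function names are chosen for this example and are not the EGA implementation.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def is_singleton(flag):
    """Paired read that is mapped while its mate is unmapped."""
    return bool(flag & PAIRED) and not (flag & UNMAPPED) and bool(flag & MATE_UNMAPPED)

def both_mates_mapped(flag):
    """Paired read where neither the read nor its mate is unmapped."""
    return bool(flag & PAIRED) and not (flag & UNMAPPED) and not (flag & MATE_UNMAPPED)

# 73 = 0x1 + 0x8 + 0x40: paired, mate unmapped, first in pair.
# 99 = 0x1 + 0x2 + 0x20 + 0x40: paired, proper pair, mate on reverse strand, first in pair.
# 133 = 0x1 + 0x4 + 0x80: paired, read itself unmapped, second in pair.
for flag in (73, 99, 133):
    print(flag, is_singleton(flag), both_mates_mapped(flag))
```

The two categories are mutually exclusive, which is consistent with the counts on this page: 47 649 480 reads with both mates mapped plus 205 824 singletons gives the 47 855 304 mapped reads.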

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (24 266 596 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpectedly large separation are a signal for this variant, and those reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).

Proper pairs: 97.4 % (47 255 306 reads). Remainder: 2.6 %.
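The ~300 base pair figure in the example above follows from subtracting both read lengths from the fragment length. A short sketch of that arithmetic follows; the function name is hypothetical and the values are the ones used in the text.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus one read at each end."""
    return fragment_length - 2 * read_length

# ~500 bp fragments with 100 bp reads sequenced from either end.
gap = expected_inner_distance(500, 100)
print(gap)
```

How far a pair may deviate from this expected separation and still count as proper depends on the aligner and its settings, so no threshold is assumed here.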

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 0.7 % (354 414 reads). Non-duplicates: 99.3 %.
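The coordinate-based approach described above can be sketched as follows: a pair is flagged as a duplicate when an earlier pair has the same read and mate coordinates. This is a simplified illustration under stated assumptions, not the method used to produce the figure on this page; the tuple layout, the function name `find_duplicate_pairs`, and the toy records are made up for the example. Real duplicate-marking tools typically also consider strand orientation and choose which copy to keep based on quality.

```python
def find_duplicate_pairs(pairs):
    """Return names of pairs whose (chrom, pos, mate_chrom, mate_pos) was already seen.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos) tuples.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)  # same coordinates as an earlier pair
        else:
            seen.add(key)
    return duplicates

# Toy records (hypothetical): r3 repeats the coordinates of r1.
pairs = [
    ("r1", "1", 1000, "1", 1300),
    ("r2", "1", 1000, "1", 1450),
    ("r3", "1", 1000, "1", 1300),
]
dups = find_duplicate_pairs(pairs)
print(dups)
```

Because only coordinates are compared, two independent fragments that happen to share both endpoints are also flagged, which is the over-counting effect at high depth described above.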

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 60); y-axis: # Reads (5M to 35M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. x-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%. Every column is labelled 100% mapped and 0% unmapped.]