European Genome-Phenome Archive

File Quality

File Information: EGAF00006164491

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The data is plotted on a log scale to compress the wide range of values. A high peak at the start (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that the base call is a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call; a score of 40, for example, corresponds to an error probability of 1 in 10 000. The exact distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
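As a minimal sketch of the relationship between a Phred score and an error probability, assuming the standard Phred definition Q = -10 log10(P), the conversion in both directions could look like this. The function names are illustrative and do not come from any EGA tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q={q}: P(error)={phred_to_error_prob(q):.0e}")
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability, which is why a distribution concentrated near 40 indicates very confident base calls.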

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 16G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
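The calculation described above can be sketched as follows. The function name and the read counts in the usage example are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Return the mapped reads rate in percent: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return 100.0 * mapped / total


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the formula.
    print(f"{mapped_rate(mapped=996_000, unmapped=4_000):.1f} %")
```

Note that the denominator is the total number of reads, not the number of unmapped reads.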

[Chart: mapped reads. Mapped: 99.6 % (149 877 412 reads). Unmapped: 0.4 %.]

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates map at an unexpected distance are also counted here; whether a pair maps at the expected distance is the subject of the Proper Pairs chart.

[Chart: both mates mapped. Both mates mapped: 99.5 % (149 684 856 reads). Remaining reads: 0.5 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

[Chart: singletons. Singletons: 0.1 % (192 556 reads). Remaining reads: 99.9 %.]

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

[Chart: forward strand. Forward strand: 50 % (75 252 524 reads). Remaining reads: 50 %.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs (500 minus 2 x 100). If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpectedly large separation would be a signal for this variant, and those reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
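The worked example above (500 bp fragments, 100 bp reads, ~300 bp separation) can be sketched as a small calculation. This is a simplification under stated assumptions: the function names and the fixed `tolerance` parameter are hypothetical, and real aligners typically estimate the expected distance from the observed insert size distribution and also check read orientation.

```python
def expected_separation(fragment_length: int, read_length: int) -> int:
    """Gap between the two mates when each end of the fragment is read once."""
    return fragment_length - 2 * read_length


def at_expected_distance(observed: int, expected: int, tolerance: int) -> bool:
    """True if the observed separation is within `tolerance` bp of the expected one."""
    return abs(observed - expected) <= tolerance


if __name__ == "__main__":
    expected = expected_separation(fragment_length=500, read_length=100)
    print(expected)  # 300
    # Hypothetical observed separations and tolerance, for illustration only.
    for observed in (310, 5_000):
        print(observed, at_expected_distance(observed, expected, tolerance=100))
```

A pair separated by several kilobases, as in the second hypothetical case, would fall outside the tolerance and would not count as a proper pair under this sketch.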

[Chart: proper pairs. Proper pairs: 97.9 % (147 299 820 reads). Remaining reads: 2.1 %.]

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.
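The coordinate-based approach described above can be illustrated with a minimal sketch. It assumes each read pair is reduced to a tuple of (read chromosome, read position, mate chromosome, mate position); the function names and the example pairs are hypothetical. Production tools such as Picard MarkDuplicates or samtools markdup take more information into account (for example read orientation and base qualities), so this is not a description of how the rate on this page was computed.

```python
from typing import Hashable, Iterable, List


def mark_duplicates(pairs: Iterable[Hashable]) -> List[bool]:
    """Flag every pair whose coordinates were already seen earlier in the input."""
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # the first occurrence is kept as the original
        seen.add(key)
    return flags


def duplicate_rate(flags: List[bool]) -> float:
    """Fraction of pairs flagged as duplicates."""
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    # Hypothetical read pairs: (chrom, pos, mate_chrom, mate_pos).
    pairs = [
        ("1", 100, "1", 400),
        ("1", 100, "1", 400),  # same coordinates as the first pair
        ("2", 50, "2", 350),
    ]
    flags = mark_duplicates(pairs)
    print(flags, duplicate_rate(flags))
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is the over-marking effect at high depth mentioned above.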

[Chart: duplicates. Duplicates: 8.4 % (12 701 645 reads). Remaining reads: 91.6 %.]

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so the distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (10M to 130M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads may still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
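The read categories used throughout this page (mapped, both mates mapped, singleton, forward strand, proper pair, duplicate, and unmapped reads placed next to a mapped mate) correspond to bits of the FLAG field if the file is in SAM/BAM format, which is an assumption here. The sketch below decodes those bits using the values defined in the SAM specification; the `classify` function and its category names are illustrative and are not how EGA is known to compute these statistics.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
REVERSE = 0x10
DUPLICATE = 0x400


def classify(flag: int) -> dict:
    """Map a SAM FLAG value to the read categories described on this page."""
    mapped = not flag & UNMAPPED
    paired = bool(flag & PAIRED)
    mate_mapped = paired and not flag & MATE_UNMAPPED
    return {
        "mapped": mapped,
        "both_mates_mapped": mapped and mate_mapped,
        "singleton": mapped and paired and not mate_mapped,
        "forward_strand": mapped and not flag & REVERSE,
        "proper_pair": mapped and bool(flag & PROPER_PAIR),
        "duplicate": bool(flag & DUPLICATE),
        # Unmapped read that inherits chromosome and POS from its mapped mate.
        "unmapped_with_mapped_mate": (not mapped) and mate_mapped,
    }


if __name__ == "__main__":
    for flag in (99, 73, 133):
        active = [name for name, value in classify(flag).items() if value]
        print(flag, active)
```

Under this reading, a read with flag 133 (paired, unmapped, mate mapped) is the kind of read that appears in the unmapped portion of a chromosome's column.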

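[Chart: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%). Column labels as shown on the x-axis: 1 to 21, X, Y, M. Values are listed in column order.]

Mapped: 99.87%, 99.86%, 99.87%, 99.87%, 99.87%, 99.87%, 99.87%, 99.87%, 99.87%, 99.87%, 99.87%, 99.87%, 99.86%, 99.87%, 99.87%, 99.88%, 99.88%, 99.86%, 99.88%, 99.87%, 99.87%, 99.86%, 99.79%, 99.8%

Unmapped: 0.13%, 0.14%, 0.13%, 0.13%, 0.13%, 0.13%, 0.13%, 0.13%, 0.13%, 0.13%, 0.13%, 0.13%, 0.14%, 0.13%, 0.13%, 0.12%, 0.12%, 0.14%, 0.12%, 0.13%, 0.13%, 0.14%, 0.21%, 0.2%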