European Genome-Phenome Archive

File Quality

File Information: EGAF00003609724

File Data

Base Coverage Distribution

This chart shows the base coverage distribution across the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered, and the y-axis shows the number of bases with that coverage.

The y-axis uses a log scale so that the wide range of counts remains readable. A high peak at the start (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. x-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 1k to 100M).]
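As a minimal sketch of how such a distribution could be tabulated, assume the per-position depths are already available as a list of integers. The function name `coverage_histogram`, the example depths, and the pooling of values above 1000 into one overflow bin (mirroring the chart's `>1000` label) are illustrative assumptions, not the archive's actual method.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`
    (an assumption about how the ">1000" bin on the chart is built).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 5, 1200, 3000]
hist = coverage_histogram(example_depths)
```

Plotting the counts in `hist` against their keys, with a log-scaled y-axis, would give a chart of the kind described above.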

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Distributions differ between sequencing technologies, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. x-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 18G).]
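The relation between a Phred score and an error probability can be sketched as below, using the standard Phred definition Q = -10 log10(P). The function names are illustrative, not part of any particular library.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to the Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 20 to 1 in 100.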

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of all reads: mapped / (mapped + unmapped).

Mapped: 99.9 % (129 915 549 reads); unmapped: 0.1 %
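A minimal sketch of the rate calculation, mapped / (mapped + unmapped). The function name and the example counts are hypothetical and do not come from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=999, unmapped=1)
```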

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other, which are not proper pairs.

Both mates mapped: 99.8 % (129 785 868 reads); other: 0.2 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (129 681 reads); other: 99.9 %
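If the alignments are stored in SAM/BAM format, the singleton definition above can be expressed with the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The following is a sketch under that assumption; the function name and category labels are illustrative, and the tool actually used to build this report is not stated here.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_pair_status(flag):
    """Classify a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        # This read mapped but its mate did not: a singleton.
        return "singleton"
    return "unmapped"
```

For example, a flag of 73 (paired, mate unmapped, first in pair) is classified as a singleton, while 99 (paired, proper pair, mate on reverse strand, first in pair) is classified as both mates mapped.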

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (65 039 203 reads); reverse strand: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the reads can map with an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 99 % (128 714 542 reads); other: 1 %
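The expected separation in the example above is simple arithmetic: the fragment length minus the two read lengths. A minimal sketch, with an illustrative function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates of a fragment sequenced from both ends."""
    return fragment_length - 2 * read_length


# The example from the text: ~500 bp fragments with 100 bp reads.
gap = expected_inner_distance(fragment_length=500, read_length=100)
```

Here `gap` is 300, matching the ~300 base pairs quoted above.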

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.

Duplicates: 42.6 % (55 412 866 reads); non-duplicates: 57.4 %
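The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any specific tool: the record layout (dictionaries with `chrom`, `pos`, `mate_chrom`, `mate_pos`) and the function name are assumptions, and real duplicate markers typically also take read orientation and base qualities into account.

```python
def mark_duplicates(pairs):
    """Flag a pair as a duplicate if an earlier pair has the same coordinates.

    Returns one boolean per input pair; the first pair seen at a given
    set of coordinates is kept as the non-duplicate representative.
    """
    seen = set()
    flags = []
    for pair in pairs:
        key = (pair["chrom"], pair["pos"], pair["mate_chrom"], pair["mate_pos"])
        flags.append(key in seen)
        seen.add(key)
    return flags


# Hypothetical read pairs, for illustration only.
example_pairs = [
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 420},
    {"chrom": "2", "pos": 100, "mate_chrom": "2", "mate_pos": 400},
]
dup_flags = mark_duplicates(example_pairs)
duplicate_rate = sum(dup_flags) / len(dup_flags)
```

Only the second pair is flagged, because it is the only one whose read and mate coordinates both match an earlier pair.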

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 60); y-axis: # Reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. x-axis: chromosome (tick labels 1 to 21, X, Y, M); y-axis: percentage of reads (0% to 100%); series: mapped, unmapped. Mapped fractions range from 99.7% to 99.92%, unmapped fractions from 0.08% to 0.3%.]