European Genome-Phenome Archive

File Quality

File Information: EGAF00005283007

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

Data is shown on a log scale to reduce the spread between the largest and smallest counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: Base coverage distribution. X-axis: Coverage value (up to >1000); y-axis: # Bases (log scale, 100 to 100M).]
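
As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions have each coverage value and pools everything above 1000 into a single bin, mirroring the ">1000" tick on the chart. The function name, the `cap` parameter and the sample depths are hypothetical; how the archive actually computes the chart is not stated here.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Hypothetical per-position depths for a tiny reference (not from this file).
example_depths = [0, 0, 1, 1, 1, 2, 30, 1500]
hist = coverage_histogram(example_depths)
print(hist[1])        # number of positions covered exactly once
print(hist[">1000"])  # number of positions covered more than 1000 times
```

Plotting the counts on a log-scaled y-axis would then give a chart of the kind described above.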

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error has occurred. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: Base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 9G).]
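
To make the relationship between score and error probability concrete, here is a minimal sketch using the standard Phred definition, Q = -10 × log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one error in 10,000 calls, and each additional 10 points lowers the error probability tenfold.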

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped reads: 99.8 % (107 767 838 reads); remainder: 0.2 %
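
The rate calculation described above, mapped / (mapped + unmapped), can be sketched as follows. The function name and the example counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a percentage."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return 100.0 * mapped / total

# Hypothetical counts chosen only to illustrate the formula.
print(round(mapped_rate(998, 2), 1))
```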

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; that criterion is assessed in the Proper Pairs chart.

Both mates mapped: 99.7 % (107 592 624 reads); remainder: 0.3 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (175 214 reads); remainder: 99.8 %
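
One common way to tell singletons apart from fully mapped pairs is to inspect the FLAG field of SAM/BAM alignment records. The sketch below assumes the data are in that format and uses the flag bits defined in the SAM specification; it is not a description of how this report was produced, and the helper names are hypothetical.

```python
# Flag bits from the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def is_singleton(flag):
    """Mapped read from a pair whose mate is unmapped."""
    return bool(flag & PAIRED) and not flag & UNMAPPED and bool(flag & MATE_UNMAPPED)

def both_mates_mapped(flag):
    """Paired read where neither this read nor its mate is unmapped."""
    return bool(flag & PAIRED) and not flag & (UNMAPPED | MATE_UNMAPPED)

# Example flags: 73 = paired + mate unmapped + first in pair,
# 99 = paired + proper pair + mate reverse + first in pair.
for flag in (73, 99):
    print(flag, is_singleton(flag), both_mates_mapped(flag))
```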

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (53 971 534 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads spanning it map with a separation that differs markedly from the expected one. That separation is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.8 % (106 673 202 reads); remainder: 1.2 %
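
The arithmetic behind the expected separation in the example above (a ~500 bp fragment with 100 bp reads from each end) is simply the fragment length minus both read lengths. A minimal sketch, with a hypothetical function name:

```python
def inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both reads.

    A negative result means the two reads overlap.
    """
    return fragment_length - 2 * read_length

print(inner_distance(500, 100))  # the example from the text
```

Aligners typically accept a range around this expected value rather than a single number, since fragment lengths vary within a library.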

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As a result, once the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 9.1 % (9 773 715 reads); remainder: 90.9 %
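
The coordinate-based approach described above can be sketched as grouping read pairs by their own position and their mate's position, then flagging all but one pair per group. This is a deliberately simplified illustration: the record layout and function name are hypothetical, and real duplicate-marking tools also take strand and base quality into account when choosing which read to keep.

```python
from collections import defaultdict

def mark_duplicates(pairs):
    """Return the names of pairs flagged as duplicates.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos).
    The first pair seen at each coordinate combination is kept.
    """
    groups = defaultdict(list)
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        groups[(chrom, pos, mate_chrom, mate_pos)].append(name)
    duplicates = set()
    for names in groups.values():
        duplicates.update(names[1:])  # everything after the first is a duplicate
    return duplicates

# Hypothetical records: "a" and "b" share both coordinates, "c" does not.
example_pairs = [
    ("a", "1", 100, "1", 400),
    ("b", "1", 100, "1", 400),
    ("c", "1", 100, "1", 450),
]
print(mark_duplicates(example_pairs))
```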

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: Mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (10M to 100M).]
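
A distribution like this is often summarised by the share of reads at or above a chosen mapping quality. The sketch below shows one way to do that from a score-to-count histogram. The function name, the threshold and the counts are hypothetical and are not taken from this file.

```python
def fraction_at_or_above(histogram, threshold):
    """Share of reads whose mapping quality is >= threshold.

    `histogram` maps a mapping quality score to a read count.
    """
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    kept = sum(count for score, count in histogram.items() if score >= threshold)
    return kept / total

# Hypothetical histogram: most reads at 60, a few ambiguous reads at 0.
example_hist = {0: 5, 30: 5, 60: 90}
print(fraction_at_or_above(example_hist, 60))
print(fraction_at_or_above(example_hist, 1))
```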

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Reads can also be classified this way when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the correct chromosome and position but is also classified as unmapped.

[Chart: Mapped vs Unmapped reads per chromosome. X-axis: chromosomes 1 to 21, X, Y, M; y-axis: 0% to 100%; series: mapped, unmapped. Mapped is 99.84% (unmapped 0.16%) for every chromosome shown except one, which is 99.86% (unmapped 0.14%).]
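
Because an unmapped read with a mapped mate carries its mate's chromosome, a per-chromosome tally can be built from just the reference name and the FLAG of each record. The sketch below assumes SAM/BAM-style records and uses the unmapped bit (0x4) from the SAM specification; the record layout and function name are hypothetical.

```python
from collections import Counter

UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def tally_by_chromosome(records):
    """Count mapped and unmapped reads per chromosome.

    `records` is an iterable of (chromosome, flag) tuples.
    """
    mapped, unmapped = Counter(), Counter()
    for chromosome, flag in records:
        if flag & UNMAPPED:
            unmapped[chromosome] += 1
        else:
            mapped[chromosome] += 1
    return mapped, unmapped

# Hypothetical records: flag 133 is an unmapped second mate placed on "1".
example_records = [("1", 73), ("1", 133), ("X", 99), ("X", 147)]
mapped, unmapped = tally_by_chromosome(example_records)
print(dict(mapped), dict(unmapped))
```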