European Genome-Phenome Archive

File Quality

File Information: EGAF00000690246

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage value.

The number of bases is shown on a log scale to reduce the apparent variability. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]
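A minimal sketch of how a distribution like this could be computed, assuming per-position depths are already available as a list of integers. The function names, the sample depths and the cap of 1000 (mirroring the chart's ">1000" bin) are illustrative assumptions, not the archive's actual pipeline.

```python
from collections import Counter
import math

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` share a single overflow bin stored under cap + 1.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist

def log10_counts(hist):
    """Return log10 of each non-zero bin count, as used for a log-scale axis."""
    return {cov: math.log10(n) for cov, n in hist.items() if n > 0}

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 0, 1, 1, 2, 30, 30, 31, 1500]
hist = coverage_histogram(depths)
for cov in sorted(hist):
    label = ">1000" if cov == 1001 else str(cov)
    print(label, hist[cov])
```

Plotting the bin counts on a log axis keeps the long, sparse tail of highly covered positions visible next to the much larger counts near the peak.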

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 70G).]
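The relationship between a Phred score and an error probability can be sketched as follows, assuming the standard Phred convention Q = -10 × log10(P). The function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Probability that the base call is wrong, given Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this convention a score of 40 corresponds to an error probability of 1 in 10 000, and each additional 10 points divides the error probability by 10.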

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped: 94.9 % (834 714 563 reads); unmapped: 5.1 %
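The rate calculation described above, mapped / (mapped + unmapped), can be sketched as below. The function name and the example counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, not taken from this file.
rate = mapped_rate(mapped=949, unmapped=51)
print(f"{rate:.1%}")
```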

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads that do not map at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 93.2 % (819 718 636 reads); remainder: 6.8 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls inside the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 1.8 % (14 995 927 reads); remainder: 98.2 %
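A sketch of how reads could be sorted into the categories used in these charts, assuming the alignments are stored in SAM/BAM format and every read is paired. The bit values 0x4 (read unmapped) and 0x8 (mate unmapped) come from the SAM flag field; the function name and category labels are illustrative.

```python
READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM flag bit: the mate of this read is unmapped

def classify(flag):
    """Classify one paired read by its SAM flag."""
    read_mapped = not (flag & READ_UNMAPPED)
    mate_mapped = not (flag & MATE_UNMAPPED)
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"

print(classify(99))   # paired, proper pair, mate reverse, first in pair
print(classify(73))   # paired, mate unmapped, first in pair
print(classify(133))  # paired, read unmapped, second in pair

# Consistency check on the counts shown in this report:
# reads with both mates mapped plus singletons give the mapped reads total.
assert 819_718_636 + 14_995_927 == 834_714_563
```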

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (439 989 927 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 91.3 % (803 690 750 reads); remainder: 8.7 %
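The arithmetic behind the ~300 base pair example can be sketched as below. The function names and the tolerance value are hypothetical; real aligners typically estimate the acceptable range from the observed fragment-size distribution and also check read orientation, which this sketch does not do.

```python
def expected_gap(fragment_len, read_len):
    """Expected distance between the two mates: fragment minus both reads."""
    return fragment_len - 2 * read_len

def maps_at_expected_distance(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance

print(expected_gap(500, 100))
print(maps_at_expected_distance(320, 500, 100, tolerance=100))
print(maps_at_expected_distance(5000, 500, 100, tolerance=100))
```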

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 2.4 % (20 817 886 reads); non-duplicates: 97.6 %
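The coordinate-based idea described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: each pair is reduced to a hypothetical (chromosome, read position, mate position) tuple, the first pair seen at a given key is kept, and later pairs with the same key are flagged. Production tools usually also take strand and orientation into account and choose which copy to keep by quality.

```python
def find_duplicates(pairs):
    """Return indices of pairs whose coordinates repeat an earlier pair.

    pairs: list of (chrom, read_pos, mate_pos) tuples.
    """
    seen = set()
    duplicates = []
    for index, key in enumerate(pairs):
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates

# Hypothetical read pairs, for illustration only.
pairs = [
    ("1", 100, 400),
    ("1", 100, 400),  # same read and mate coordinates: flagged
    ("1", 100, 450),  # same read position, different mate: kept
    ("2", 100, 400),  # different chromosome: kept
    ("1", 100, 400),  # flagged again
]
dups = find_duplicates(pairs)
print(dups, len(dups) / len(pairs))
```

Requiring the mate coordinates to match as well makes chance collisions less likely, but at very high depth independent fragments can still share both coordinates.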

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (100M to 700M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read should have the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together, so if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped and unmapped reads per chromosome, 0% to 100%.]

Chromosome   Mapped    Unmapped
1            98.66%    1.34%
2            98.27%    1.73%
3            98.79%    1.21%
4            98.3%     1.7%
5            98.75%    1.25%
6            98.85%    1.15%
7            98.17%    1.83%
8            98.09%    1.91%
9            98.22%    1.78%
10           97.18%    2.82%
11           98.02%    1.98%
12           98.46%    1.54%
13           98.93%    1.07%
14           98.57%    1.43%
15           98.72%    1.28%
16           97.85%    2.15%
17           98.28%    1.72%
18           97.34%    2.66%
19           97.32%    2.68%
20           98.15%    1.85%
21           97.01%    2.99%
X            97.97%    2.03%
Y            92.01%    7.99%
M            99%       1%
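A sketch of how per-chromosome counts like these could be tallied, assuming SAM-style records in which an unmapped read with a mapped mate carries its mate's reference name. The flag bit 0x4 (read unmapped) is from the SAM format; the function name and the sample records are hypothetical.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def tally_by_chromosome(records):
    """Count mapped and unmapped reads per reference name.

    records: iterable of (rname, flag) tuples.
    Returns {rname: [mapped_count, unmapped_count]}.
    """
    tally = defaultdict(lambda: [0, 0])
    for rname, flag in records:
        slot = 1 if flag & READ_UNMAPPED else 0
        tally[rname][slot] += 1
    return dict(tally)

# Hypothetical records: (reference name, SAM flag).
records = [
    ("1", 99),   # mapped
    ("1", 147),  # mapped
    ("1", 73),   # mapped, mate unmapped
    ("1", 133),  # unmapped, placed at its mate's chromosome
    ("X", 99),   # mapped
]
tally = tally_by_chromosome(records)
for chrom, (mapped, unmapped) in tally.items():
    print(chrom, f"{100 * mapped / (mapped + unmapped):.2f}% mapped")
```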