European Genome-Phenome Archive

File Quality

File Information: EGAF00005800541

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis represents the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis represents the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 2k to 20M).]
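As an illustration of how a distribution like the one above could be tabulated, here is a minimal sketch. It assumes per-position depths are already available as a list of integers; the function name `coverage_histogram`, the `cap` parameter, and the sample depths are hypothetical and are not taken from the archive's pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single bin keyed `cap + 1`,
    mirroring the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the resulting counts on a logarithmic y-axis would give the shape described above: a tall bar at low coverage and a long descending tail.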

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 8G).]
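To make the relationship between score and error probability concrete, the following sketch uses the standard Phred definition, Q = -10 × log10(P). The function names are hypothetical and chosen for illustration.

```python
import math

def phred_to_error_probability(q):
    """Probability that a call is wrong, given its Phred score Q."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

Under this definition, a score of 40 corresponds to an error probability of about 1 in 10 000, and each additional 10 points divides the error probability by 10.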

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped: 97 % (129 523 165 reads); unmapped: 3 %.
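The mapped reads rate described above is a simple proportion, mapped / (mapped + unmapped). A minimal sketch follows; the function name `mapped_rate` and the counts used in the example are hypothetical and chosen only to illustrate the arithmetic.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=970, unmapped=30)
print(f"{rate:.1%}")
```

Note the parentheses around the denominator: without them, the expression would be evaluated as (mapped / mapped) + unmapped.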

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included here; that distinction is the subject of the proper pairs chart.

Both mates mapped: 94.9 % (126 701 102 reads); other: 5.1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 2.2 % (2 822 063 reads); other: 97.8 %.
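The categories used in the charts above (both mates mapped, singleton, unmapped) can be derived from per-read alignment flags. The sketch below assumes the alignments follow the SAM/BAM flag convention (0x1 read paired, 0x4 read unmapped, 0x8 mate unmapped). The function name `classify_read` and the category labels are hypothetical, and this is not necessarily how the archive computes its statistics.

```python
# SAM flag bits (from the SAM format specification).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify_read(flag):
    """Assign a read to a category based on its SAM flag."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        # This read mapped, but its mate did not.
        return "singleton"
    return "both_mates_mapped"

# Example flags: 99 = paired, proper pair, mate reverse, first in pair;
# 73 = paired, mate unmapped, first in pair;
# 133 = paired, read unmapped, second in pair.
for flag in (99, 73, 133):
    print(flag, classify_read(flag))
```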

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (66 734 520 reads); reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with an unexpected separation are a signal for the variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 94.2 % (125 778 804 reads); other: 5.8 %.
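The expected separation in the example above follows from subtracting both read lengths from the fragment length (500 - 2 × 100 = 300). A minimal sketch of that arithmetic is shown below. The function names and the `tolerance` parameter are hypothetical; real aligners typically estimate the acceptable range from the observed insert size distribution rather than from a fixed tolerance.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both are read from the fragment ends."""
    return fragment_length - 2 * read_length

def has_expected_separation(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    return abs(observed - expected) <= tolerance

expected = expected_inner_distance(fragment_length=500, read_length=100)
print(expected)  # 300
print(has_expected_separation(320, expected, tolerance=50))
print(has_expected_separation(5000, expected, tolerance=50))
```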

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 0 % (0 reads); non-duplicates: 100 %.
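The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, assuming each pair is summarised as a tuple of (chromosome, position, mate chromosome, mate position); the function name `mark_duplicates` and the sample data are hypothetical. Production tools also take strand and orientation into account and choose which copy to keep based on quality, which this sketch does not do.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates repeat those of an earlier pair.

    `pairs` is an iterable of hashable tuples:
    (chrom, pos, mate_chrom, mate_pos).
    Returns a list of booleans in input order; the first occurrence of
    each coordinate key is not flagged.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags

# Hypothetical pairs, for illustration only.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates for read and mate
    ("1", 100, "1", 450),  # same read start, different mate position
]
flags = mark_duplicates(pairs)
print(flags)
print(sum(flags) / len(flags))  # duplicate rate
```

The third pair shows why the mate position matters: two reads can start at the same coordinate without coming from the same fragment.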

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads may still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

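[Chart: stacked columns of mapped and unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%.]

Mapped, in column order: 97.94%, 97.99%, 97.87%, 97.93%, 97.98%, 98.04%, 97.73%, 98%, 98.12%, 97.97%, 98.01%, 98.09%, 98.01%, 97.94%, 98.09%, 98.08%, 97.83%, 98.06%, 97.8%, 97.9%, 97.94%, 97.95%, 92.96%, 98.03%.

Unmapped, in column order: 2.06%, 2.01%, 2.13%, 2.07%, 2.02%, 1.96%, 2.27%, 2%, 1.88%, 2.03%, 1.99%, 1.91%, 1.99%, 2.06%, 1.91%, 1.92%, 2.17%, 1.94%, 2.2%, 2.1%, 2.06%, 2.05%, 7.04%, 1.97%.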