European Genome-Phenome Archive

File Quality

File Information

EGAF00003605756

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the large range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: # bases (log scale, 1k to 100M).]
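As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions share each coverage value and pools everything above 1000 into one overflow bin, as the chart's ">1000" label suggests. The function name, the `cap` parameter, and the sample depths are hypothetical; this is not the archive's own pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value.

    Depths above `cap` are pooled under the key `cap + 1`,
    which stands in for the chart's ">1000" bin.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 30, 30, 31, 1500]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting these counts on a logarithmic y-axis gives the shape described above.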

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # bases (0G to 55G).]
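The relationship between a Phred score and the error probability can be made concrete with a short conversion sketch. It uses the standard Phred definition, Q = -10·log10(P); the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly one error in 10,000 base calls,
# and a score of 20 to roughly one error in 100.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

The same conversion applies to mapping quality scores, where P is the probability that a read is placed at the wrong location.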

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (419 171 503 reads). Unmapped: 0.2 %.
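The rate calculation can be sketched in a few lines. The counts passed to `mapped_rate` below are hypothetical round numbers, since the report does not state the unmapped count; the final check uses the read counts shown in this report for mapped reads, both mates mapped, and singletons.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Hypothetical counts chosen only to illustrate the formula.
rate = mapped_rate(mapped=998, unmapped=2)
print(f"{rate:.1%}")

# Counts taken from this report: mapped reads are the sum of reads with
# both mates mapped and singletons.
both_mates_mapped = 418_649_110
singletons = 522_393
mapped_total = both_mates_mapped + singletons
print(mapped_total)
```

The sum reproduces the mapped read count shown for this file, which is a quick consistency check between the charts.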

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; those pairs are excluded from the Proper Pairs chart.

Both mates mapped: 99.7 % (418 649 110 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (522 393 reads). Other: 99.9 %.
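One common way to tell these categories apart is through the bitwise FLAG field of SAM/BAM alignment records. The sketch below assumes the reads are stored in that format, which the report does not state explicitly; the bit values are those defined by the SAM specification, while the function and category names are illustrative.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def classify(flag):
    """Assign a read to one mapping category based on its FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


# Example FLAG values: 99 is a mapped first mate whose mate is also mapped,
# 73 is a mapped first mate whose mate is unmapped, 133 is an unmapped
# second mate.
for flag in (99, 73, 133):
    print(flag, classify(flag))
```

Counting reads per category and dividing by the total gives rates of the kind shown in the charts.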

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (209 909 643 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with an unexpected separation are a signal for this variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 97.3 % (408 348 508 reads). Other: 2.7 %.
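The arithmetic behind the ~300 base pair figure, and a simple distance check, can be sketched as follows. The tolerance value and function names are hypothetical; real aligners estimate the acceptable range from the observed fragment size distribution and also check read orientation, which this sketch ignores.

```python
def inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len


def has_expected_distance(observed_fragment_len, expected=500, tolerance=150):
    """True if the observed fragment length is close to the expected one.

    `expected` and `tolerance` are illustrative values, not settings
    taken from any particular aligner.
    """
    return abs(observed_fragment_len - expected) <= tolerance


gap = inner_distance(fragment_len=500, read_len=100)
print(gap)

# A pair spanning 5000 bp on the reference falls outside the expected range.
print(has_expected_distance(480), has_expected_distance(5000))
```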

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 15.8 % (66 172 667 reads). Non-duplicates: 84.2 %.
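The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any specific tool: production duplicate markers also take strand, clipping, and base quality into account when choosing which read to keep. The record layout and field names are hypothetical.

```python
def mark_duplicates(reads):
    """Return one boolean per read: True if an earlier read has the same
    coordinates for itself and for its mate."""
    seen = set()
    is_duplicate = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["mate_chrom"], read["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


# Hypothetical reads: the second shares all coordinates with the first,
# the third has a different mate position.
reads = [
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1350},
]
flags = mark_duplicates(reads)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is why the rate becomes less informative at high depth.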

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # reads (50M to 350M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Stacked column chart: percentage of mapped and unmapped reads per chromosome (24 columns, ending in X, Y, M; y-axis 0% to 100%). Mapped: 99.85% to 99.89% in every column except the last, which shows 99.63%. Unmapped: 0.11% to 0.15%, and 0.37% in the last column.]