European Genome-Phenome Archive

File Quality

File Information: EGAF00008012525

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage values, and the y-axis shows the number of reference positions (bases) covered that many times.

Data are shown on a log scale so that the wide range of counts stays readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: # bases, log scale (1k to 100M).]
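As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions share each coverage value and pools everything above 1000 into a single overflow bin, mirroring the chart's last x-axis label. The function name, the `cap` parameter, and the sample depths are hypothetical; the page does not describe how the chart is actually computed.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # Pool all very deep positions into one overflow label such as ">1000".
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 30, 30, 31, 1500]
example_hist = coverage_histogram(example_depths)
```

Plotting the resulting counts on a logarithmic y-axis would give the kind of view described above.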

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 80G).]
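The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative and not part of any EGA tooling.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q into the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P into a Phred score Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


p_at_q40 = phred_to_error_prob(40)    # roughly 1 error in 10 000 base calls
q_at_p001 = error_prob_to_phred(0.001)
```

Under this definition, a score of 40 corresponds to an error probability of about 0.0001, which is why a distribution centred near 40 indicates confident base calls.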

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more substantial variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (592 926 580 reads). Unmapped: 0.2 %.
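The ratio defined above, mapped / (mapped + unmapped), can be written as a small helper. The counts in the example are hypothetical and chosen only to keep the arithmetic easy to follow; they are not this file's counts.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 998 mapped reads and 2 unmapped reads.
example_rate = mapped_rate(998, 2)
```

Note that the denominator includes the unmapped reads, so the parentheses in the formula matter.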

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance, which is the criterion assessed in the Proper Pairs chart.

Both mates mapped: 99.7 % (591 993 506 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls inside the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (933 074 reads). Other: 99.8 %.
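One way to separate singletons from reads whose mate is also mapped is to inspect the SAM FLAG bits of each read. The sketch below assumes the statistics are derived from those bits, which this page does not state; the bit values come from the SAM format, while the `classify` function and its category labels are hypothetical.

```python
# SAM FLAG bits used here.
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def classify(flag):
    """Label a read by the mapping status of the read and of its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"      # mapped read whose mate failed to map
    return "both_mapped"


labels = [classify(f) for f in (99, 147, 73, 133)]
```

Under this scheme every mapped paired read is either a singleton or has a mapped mate. The counts reported on this page fit that partition: 591 993 506 reads with both mates mapped plus 933 074 singletons equals the 592 926 580 mapped reads.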

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them should map to the forward strand. Deviations from 50% may indicate problems with the library preparation step.

Forward strand: 50 % (296 984 369 reads). Other: 50 %.
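A minimal sketch of the forward-strand fraction, assuming strand is read from the SAM FLAG reverse-strand bit (0x10) and that only mapped reads are counted. Both assumptions, and the function name, are illustrative rather than taken from this page.

```python
UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
REVERSE = 0x10   # SAM FLAG bit: read maps to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Flags 99 and 163 are forward-strand reads; 147 and 83 are reverse-strand.
example_fraction = forward_fraction([99, 147, 83, 163])
```

In a typical paired-end library each proper pair contributes one forward and one reverse read, which is one reason the fraction tends to sit close to 0.5.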

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpectedly large separation are a signal of that variant and are not considered proper pairs. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%; even if the mapping rate itself is low as a result of bacterial contamination, for example.

Proper pairs: 96.9 % (575 687 776 reads). Other: 3.1 %.
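The ~300 base pair figure in the example follows from subtracting both read lengths from the fragment length. The sketch below reproduces that arithmetic and adds a simple distance check; the `tolerance` parameter and both function names are hypothetical, and real aligners decide what counts as a proper pair by their own criteria.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` bases of the expected gap."""
    return abs(observed - expected) <= tolerance


expected_gap = expected_inner_distance(500, 100)   # 500 - 2 * 100
close_pair = within_expected_distance(320, expected_gap, tolerance=100)
distant_pair = within_expected_distance(5000, expected_gap, tolerance=100)
```

A pair separated by several kilobases, as in the last line, would fail the check, which is the kind of signal a large structural variant can produce.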

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data are analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate therefore depends on the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 11.6 % (68 641 595 reads). Other: 88.4 %.
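The coordinate-based approach described above can be sketched as grouping read pairs by the positions of both mates and counting every pair beyond the first in each group as a duplicate. This is a simplified illustration under that assumption, not the algorithm of any particular tool; real duplicate markers also consider details such as strand and clipping. The tuple layout and function name are hypothetical.

```python
from collections import Counter


def count_duplicate_pairs(pairs):
    """Count pairs that repeat an already-seen (read, mate) coordinate key.

    `pairs` is an iterable of (chrom, pos, mate_chrom, mate_pos) tuples.
    """
    groups = Counter(pairs)
    # In each group, one pair is kept and the rest are treated as duplicates.
    return sum(n - 1 for n in groups.values())


# Hypothetical pairs: the first two share identical coordinates for both mates.
example_pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),
    ("1", 100, "1", 450),
    ("2", 5, "2", 300),
]
duplicate_count = count_duplicate_pairs(example_pairs)
duplicate_rate = duplicate_count / len(example_pairs)
```

The third pair shares a start position with the first two but has a different mate position, so it is not counted. At very high depth, independent fragments can collide on both coordinates, which is the false-positive effect the paragraph above warns about.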

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (50M to 500M).]
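A distribution like this can be summarised by two numbers: the share of reads at score 0 (ambiguous placement) and the share at or above some confidence threshold. The sketch below does that for a histogram of read counts per score. The threshold of 30, the function name, and the example counts are all hypothetical.

```python
def mapq_summary(hist, threshold=30):
    """Return (fraction at score 0, fraction at or above `threshold`).

    `hist` maps a mapping quality score to the number of reads with that score.
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    ambiguous = hist.get(0, 0) / total
    confident = sum(n for q, n in hist.items() if q >= threshold) / total
    return ambiguous, confident


# Hypothetical histogram: most reads at 60, a few at 0 and 20.
example_hist = {0: 5, 20: 5, 60: 90}
ambiguous_share, confident_share = mapq_summary(example_hist)
```

With the standard Phred definition, a score of 60 corresponds to a mis-placement probability of about one in a million, so a large share of reads near 60 is the expected healthy pattern.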

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads may be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is attributed to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given a chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped vs unmapped reads per chromosome (axis labels 1 to 21, X, Y, M). Mapped fractions range from 99.61% to 99.86%; unmapped fractions range from 0.14% to 0.39%.]
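A per-chromosome tally of this kind can be sketched by grouping reads on the chromosome recorded for them and checking the SAM FLAG unmapped bit (0x4). Because an unmapped read with a mapped mate carries its mate's chromosome, it is counted under that chromosome. The record layout `(chrom, flag)` and the function name are hypothetical, and the page does not state how the chart is actually computed.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chrom, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Hypothetical records: flag 133 is an unmapped read placed with its mate (73).
example_records = [
    ("1", 99), ("1", 147), ("1", 73), ("1", 133),
    ("X", 99), ("X", 147),
]
example_rates = per_chromosome_mapped_rate(example_records)
```

Reads that are unmapped and have no chromosome assigned at all would need separate handling, since they cannot be attributed to any column.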