European Genome-Phenome Archive

File Quality

File Information

EGAF00006164926

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value (the number of times a position in the reference is covered), and the y-axis shows the number of bases with that coverage.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the start (low coverage) followed by a descending curve is expected.
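As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions share each coverage value. The function name `coverage_histogram`, the `cap` parameter, and the example depths are assumptions made for illustration; they are not taken from the EGA pipeline.

```python
def coverage_histogram(depths, cap=1000):
    """Return a list where index d holds the number of positions with depth d.

    The final slot (index cap + 1) collects every position deeper than cap,
    mirroring the ">1000" bin on the chart's x-axis.
    """
    counts = [0] * (cap + 2)
    for depth in depths:
        counts[min(depth, cap + 1)] += 1
    return counts


# Hypothetical per-position depths for a tiny reference.
example_depths = [0, 3, 3, 30, 31, 30, 1500, 2000]
hist = coverage_histogram(example_depths)
print(hist[3], hist[30], hist[-1])
```

Plotting `hist` with a logarithmic y-axis would give a chart of the same shape as the one described here.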

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: number of bases (log scale, 1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been called incorrectly, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
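The relationship between score and error probability can be checked with a short sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are illustrative assumptions.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score Q."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an expected error rate of about 1 in 10 000 base calls.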

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0G to 140G). Non-zero bars, in ascending score order: 6 400 521; 4 380 290 129; 7 269 338 131; 152 904 713 459 bases.]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample differs from the reference genome it was aligned to. Small differences from the reference are tolerated, but reads covering more significant variation can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
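The calculation can be sketched from SAM FLAG values, where bit 0x4 marks an unmapped read. This is a minimal sketch under the assumption that every record counts once; how the EGA report treats secondary or supplementary alignments is not stated here, and the function name and example flags are illustrative.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for a collection of SAM FLAG values."""
    flags = list(flags)
    if not flags:
        raise ValueError("no reads supplied")
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / len(flags)


# Hypothetical flags: three mapped reads and one unmapped read.
print(mapped_rate([99, 147, 73, 133]))
```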

Mapped reads: 1 088 121 065 (99.8 %); unmapped: 0.2 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates map successfully to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected distance is what the Proper Pairs chart considers.

Both mates mapped: 1 087 206 690 reads (99.8 %); other: 0.2 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair maps successfully to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
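The distinction between fully mapped pairs and singletons can be expressed with SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The sketch below is illustrative only; the function name and the category labels are assumptions, not the report's own terminology.

```python
PAIRED = 0x1         # SAM FLAG bit: template has multiple segments
UNMAPPED = 0x4       # SAM FLAG bit: this segment is unmapped
MATE_UNMAPPED = 0x8  # SAM FLAG bit: the next segment (mate) is unmapped


def classify_pair_status(flag):
    """Classify one read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"


for flag in (99, 73, 133):
    print(flag, classify_pair_status(flag))
```

Here flag 73 describes a mapped first mate whose partner is unmapped, which is the singleton case described above.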

Singletons: 914 375 reads (0.1 %); other: 99.9 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that library preparation generates DNA from the forward and reverse strands in equal amounts, so after mapping to the reference genome approximately 50% of the reads map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
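A minimal sketch of this fraction, assuming it is computed over mapped reads only and that strand is read from SAM FLAG bit 0x10 (reverse complemented). The function name and example flags are hypothetical.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped
REVERSE = 0x10  # SAM FLAG bit: sequence is reverse complemented


def forward_fraction(flags):
    """Return the fraction of mapped reads aligned to the forward strand."""
    mapped = [flag for flag in flags if not flag & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads supplied")
    forward = sum(1 for flag in mapped if not flag & REVERSE)
    return forward / len(mapped)


# Hypothetical flags: two forward reads, two reverse reads, one unmapped read.
print(forward_fraction([99, 147, 83, 163, 77]))
```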

Forward strand: 544 903 120 reads (50 %); other: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large deletion, the mates flanking it map to the reference with an unusually large separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
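The expected separation in the example above follows from simple arithmetic: the fragment length minus the two read lengths. The sketch below shows that calculation alongside a check of SAM FLAG bit 0x2, which the aligner sets according to its own criteria; the function names are illustrative assumptions.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned according to the aligner


def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both reads start at opposite ends of one fragment."""
    return fragment_length - 2 * read_length


def is_proper_pair(flag):
    """Report whether the aligner marked this read as part of a proper pair."""
    return bool(flag & PROPER_PAIR)


print(expected_inner_distance(500, 100))
print(is_proper_pair(99), is_proper_pair(65))
```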

Proper pairs: 1 067 700 846 reads (98 %); other: 2 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. Analyses of sequencing data assume that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate therefore depends on the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it becomes increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. As a result, once the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
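The coordinate-matching idea can be sketched as follows. This is a deliberate simplification: the record fields (`chrom`, `pos`, `reverse`, `mate_chrom`, `mate_pos`) and the keep-the-first-read rule are assumptions for illustration, and production duplicate-marking tools apply more refined criteria.

```python
def mark_duplicates(reads):
    """Return one boolean per read: True if an earlier read shares its coordinates.

    Two reads are treated as duplicates when they agree on chromosome, position,
    strand, and the chromosome and position of their mates.
    """
    seen = set()
    is_duplicate = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["reverse"],
               read["mate_chrom"], read["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


# Hypothetical reads: the second shares all coordinates with the first.
example_reads = [
    {"chrom": "1", "pos": 1000, "reverse": False, "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 1000, "reverse": False, "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 1000, "reverse": False, "mate_chrom": "1", "mate_pos": 1350},
]
dup_flags = mark_duplicates(example_reads)
duplicate_rate = sum(dup_flags) / len(dup_flags)
print(dup_flags, duplicate_rate)
```

The third read starts at the same position as the first two but its mate does not, so it is not counted as a duplicate.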

Duplicates: 492 169 555 reads (45.2 %); other: 54.8 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (100M to 900M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads may map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
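A per-chromosome breakdown of this kind can be sketched from the RNAME and FLAG fields of each record. The sketch assumes that unmapped reads placed next to a mapped mate carry that mate's chromosome in RNAME, while reads with no placement have RNAME "*" and are skipped; the function name and example records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned, so nothing to attribute
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: mapped / (mapped + unmapped)
            for chrom, (mapped, unmapped) in counts.items()}


# Hypothetical records: chromosome Y has one placed-but-unmapped mate.
example_records = [("1", 99), ("1", 147), ("Y", 73), ("Y", 133), ("*", 77)]
rates = per_chromosome_mapped_rate(example_records)
print(rates)
```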

[Chart: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%). Values in x-axis order:]

Chromosome  Mapped   Unmapped
1           99.92%   0.08%
2           99.9%    0.1%
3           99.92%   0.08%
4           99.92%   0.08%
5           99.92%   0.08%
6           99.92%   0.08%
7           99.92%   0.08%
8           99.92%   0.08%
9           99.91%   0.09%
10          99.92%   0.08%
11          99.91%   0.09%
12          99.92%   0.08%
13          99.92%   0.08%
14          99.92%   0.08%
15          99.91%   0.09%
16          99.91%   0.09%
17          99.91%   0.09%
18          99.92%   0.08%
19          99.89%   0.11%
20          99.91%   0.09%
21          99.91%   0.09%
X           99.92%   0.08%
Y           99.65%   0.35%
M           99.8%    0.2%