European Genome-Phenome Archive

File Quality

File Information: EGAF00008060563

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

Data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
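The distribution described above can be sketched in a few lines of Python. This is only an illustration: the function name `coverage_histogram`, the `cap` parameter and the toy depth values are assumptions made for the example, not part of the EGA pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` share a single overflow bin stored under `cap + 1`,
    mirroring the ">1000" bin on the chart (assumed behaviour).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 30, 31, 30, 1500]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a logarithmic y-axis then gives a chart of the kind described here.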

[Chart: base coverage distribution. X-axis: coverage value (ticks from 100 to >1000). Y-axis: number of bases, log scale (1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. because of a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
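As a worked illustration, the standard Phred definition Q = -10·log10(P) can be converted in both directions as follows. The function names are chosen for this sketch only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000.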

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 140G). Non-zero bins, in order of increasing score: 3 559 079; 5 565 807 657; 8 521 660 590; 152 653 941 692 bases.]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences against the reference are tolerated but, with more significant variation, a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
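The calculation can be written out as a small function. The read counts below are hypothetical round numbers chosen for the example, not values from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 998 mapped reads and 2 unmapped reads.
rate = mapped_rate(998, 2)
print(f"{rate:.1%}")
```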

Mapped: 99.8 % (1 102 522 608 reads). Unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are still counted here; whether the distance is as expected is assessed by the Proper Pairs chart.

Both mates mapped: 99.7 % (1 100 908 766 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
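A minimal sketch of how a read could be sorted into these categories from its SAM FLAG field. The bit values (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) come from the SAM specification; the function name and category labels are assumptions made for the example, and how EGA actually computes the statistic is not stated here.

```python
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def classify(flag):
    """Classify a read by the mapping status of itself and its mate."""
    if not (flag & PAIRED):
        return "unpaired"
    read_mapped = not (flag & UNMAPPED)
    mate_mapped = not (flag & MATE_UNMAPPED)
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"


# Example FLAG values: 99 (mapped, mate mapped), 73 (mapped, mate unmapped),
# 133 (unmapped, mate mapped), 77 (neither mate mapped).
for flag in (99, 73, 133, 77):
    print(flag, classify(flag))
```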

Singletons: 0.1 % (1 613 842 reads). Other: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (552 135 659 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with a large separation can be a signal for the variant, and those reads would not be considered proper pairs. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
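The arithmetic behind the ~300 base pair figure, together with a simple distance check, can be sketched as follows. The fixed `tolerance` of 100 bp is purely an assumption for illustration; aligners typically derive the acceptable range from the observed fragment size distribution, and they also check read orientation, which this sketch ignores.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: the fragment minus both reads."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed_gap, fragment_length, read_length,
                             tolerance=100):
    """True if the observed gap is close to the expected one.

    `tolerance` is a hypothetical fixed window, not a real aligner setting.
    """
    expected = expected_inner_distance(fragment_length, read_length)
    return abs(observed_gap - expected) <= tolerance


print(expected_inner_distance(500, 100))
print(within_expected_distance(320, 500, 100))
print(within_expected_distance(5000, 500, 100))
```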

Proper pairs: 97.7 % (1 079 352 098 reads). Other: 2.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
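The coordinate-based approach described above can be sketched as follows. This is a deliberate simplification under stated assumptions: the dictionary keys (`chrom`, `pos`, `mate_chrom`, `mate_pos`) and the toy pairs are hypothetical, the first pair seen at a given position is kept as the original, and real duplicate-marking tools apply further rules that are not modelled here.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if an earlier pair had the same
    coordinates for both the read and its mate."""
    seen = set()
    is_duplicate = []
    for pair in pairs:
        key = (pair["chrom"], pair["pos"], pair["mate_chrom"], pair["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


# Hypothetical read pairs; the third repeats the coordinates of the first.
pairs = [
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1350},
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1300},
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, f"{duplicate_rate:.1%}")
```

The second pair shares a start coordinate with the first but its mate maps elsewhere, so it is not counted as a duplicate.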

Duplicates: 5.2 % (57 943 056 reads). Other: 94.8 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
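One way to summarise such a distribution is to report the share of reads at score 0 and the share at or above a high threshold. The sketch below assumes a histogram held as a dictionary from score to read count; the function name, the default threshold of 60 and the toy counts are illustrative assumptions, not values from this file.

```python
def summarise_mapq(hist, threshold=60):
    """Summarise a mapping quality histogram {score: read count}."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    at_or_above = sum(n for score, n in hist.items() if score >= threshold)
    return {
        "frac_zero": hist.get(0, 0) / total,
        "frac_at_or_above": at_or_above / total,
    }


# Hypothetical histogram: most reads at 60, a few ambiguous reads at 0.
toy_hist = {0: 5, 30: 5, 60: 90}
summary = summarise_mapq(toy_hist)
print(summary)
```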

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (0.1G to 1G).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mapped mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (labelled 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Mapped fractions range from 99.56% to 99.87%; unmapped fractions range from 0.13% to 0.44%.]