European Genome-Phenome Archive

File Quality

File Information

EGAF00001688990

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the start (low coverage) followed by a descending curve is expected.
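As a minimal sketch of how such a distribution could be tabulated, the snippet below counts how many positions share each coverage value. The function name, the `cap` parameter and the example depths are illustrative assumptions, not part of the EGA pipeline; the overflow bin simply mirrors the ">1000" bin on the chart's x-axis.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    depths: iterable of per-position coverage values (non-negative ints).
    cap:    values above this are pooled into one overflow bin keyed
            cap + 1 (hypothetical choice, standing in for ">1000").
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    # Sort by coverage value so the result reads left to right like the x-axis.
    return dict(sorted(hist.items()))


# Made-up per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2000]
hist = coverage_histogram(example_depths)
```

When plotting counts like these, a logarithmic y-axis keeps both the low-coverage peak and the long tail visible.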

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 2k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
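For reference, the standard Phred definition is Q = -10 log10(P), so a score of 40 corresponds to an error probability of 1 in 10 000. The conversion can be sketched as follows; the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, using P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P, using Q = -10 * log10(P)."""
    return -10 * math.log10(p)


p_q40 = phred_to_error_prob(40)      # about 0.0001
q_p001 = error_prob_to_phred(0.001)  # about 30
```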

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 6.5G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
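The ratio above can be sketched from SAM FLAG values, where bit 0x4 marks a read as unmapped. This is a simplified illustration that assumes one record per read (no secondary or supplementary alignments); the function name and the example flags are made up.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for a list of SAM FLAG integers.

    Assumes each read appears exactly once in `flags`.
    """
    total = len(flags)
    if total == 0:
        return 0.0
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / total


# Illustrative flags: 99, 147 and 73 are mapped; 133, 77 and 141 are unmapped.
example_flags = [99, 147, 73, 133, 77, 141]
rate = mapped_rate(example_flags)
```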

Mapped: 97.5 % (107 771 714 reads). Unmapped: 2.5 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as in the Proper Pairs chart.

Both mates mapped: 97.5 % (107 771 714 reads). Other: 2.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
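In SAM terms, the distinction between "both mates mapped" and "singleton" can be read from the FLAG bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The classifier below is a sketch under that assumption; the function name and category labels are illustrative.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify_read(flag):
    """Classify one read from its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        # Mapped read whose mate failed to map.
        return "singleton"
    return "unmapped"


labels = [classify_read(flag) for flag in (99, 73, 133, 77)]
```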

Singletons: 0 % (0 reads). Other: 100 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (55 291 449 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with a separation that differs markedly from the expected one; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
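The arithmetic behind the ~300 base pair figure is fragment length minus two read lengths. The sketch below works this through and adds a simple distance check; the `tolerance` value is a hypothetical threshold for illustration only, since real aligners estimate the acceptable range from the observed insert size distribution and also check read orientation.

```python
def expected_gap(fragment_len, read_len):
    """Expected unsequenced gap between two mates read from either end."""
    return fragment_len - 2 * read_len


def within_expected_distance(observed_gap, fragment_len, read_len, tolerance=200):
    """Rough distance check; `tolerance` is an assumed, illustrative value."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance


gap = expected_gap(500, 100)                        # 500 - 2 * 100 = 300
ok_pair = within_expected_distance(320, 500, 100)   # close to the expected gap
far_pair = within_expected_distance(5000, 500, 100) # far larger than expected
```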

Proper pairs: 97.5 % (107 771 714 reads). Other: 2.5 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
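The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any particular tool: each pair is reduced to a hypothetical key tuple, and every pair after the first with the same key is flagged. Real duplicate markers additionally choose which copy to keep (for example by base quality) and handle clipping.

```python
def flag_duplicates(pair_keys):
    """Flag pairs whose coordinates repeat an earlier pair.

    pair_keys: list of hashable keys, here assumed to be
               (chrom, read_pos, read_strand, mate_pos, mate_strand).
    Returns a list of booleans, True for each pair seen before.
    """
    seen = set()
    flags = []
    for key in pair_keys:
        flags.append(key in seen)
        seen.add(key)
    return flags


# Made-up pairs: the second repeats the first; the third differs in mate position.
example_pairs = [
    ("1", 100, "+", 350, "-"),
    ("1", 100, "+", 350, "-"),
    ("1", 100, "+", 400, "-"),
    ("2", 100, "+", 350, "-"),
]
dup_flags = flag_duplicates(example_pairs)
dup_rate = sum(dup_flags) / len(dup_flags)
```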

Duplicates: 16.8 % (18 616 954 reads). Non-duplicates: 83.2 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero: reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 240). Y-axis: # Reads (10M to 100M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is similar to the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
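A per-chromosome tally of this kind can be sketched from (reference name, FLAG) pairs, relying on the convention described above that an unmapped read carries its mate's chromosome. The function name and the example records are illustrative assumptions.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def mapped_fraction_per_chromosome(records):
    """Return {chromosome: mapped fraction} from (rname, flag) records.

    An unmapped read placed at its mate's position is counted as
    unmapped on the mate's chromosome.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }


# Made-up records: flag 133 is an unmapped read placed with its mate on "1".
example_records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
fractions = mapped_fraction_per_chromosome(example_records)
```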

[Chart: mapped vs unmapped reads per chromosome (autosomes, X, Y, M). Y-axis: 0% to 100%. Every chromosome shown: mapped 100%, unmapped 0%.]