European Genome-Phenome Archive

File Quality

File Information

EGAF00001561662

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value (the number of times a position in the reference file is covered), and the y-axis shows the number of bases covered at that depth.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (1 to >1000); y-axis: number of bases (log scale, 1k to 100M).]
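As a rough illustration of how a distribution like this can be tabulated, the sketch below counts how many positions share each coverage value. The function name `coverage_distribution`, the `cap` parameter and the toy depths are assumptions made for this example; they are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    depths: iterable of per-position read depths (non-negative ints).
    Depths above `cap` are pooled into one overflow bin keyed cap + 1,
    similar in spirit to the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(sorted(hist.items()))

# Hypothetical toy depths, not values from this file.
toy_depths = [1, 1, 2, 5, 5, 5, 1500]
print(coverage_distribution(toy_depths))
```

Plotting the resulting counts on a logarithmic y-axis would give the kind of descending curve described above.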

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0 to 3.5G).]
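The standard Phred definition is Q = -10 log10(P), so a score of 40 corresponds to an error probability of 1 in 10 000. The sketch below converts in both directions. The function names are chosen for this example, and the Phred+33 ASCII offset used for decoding a quality string is an assumption about the file encoding, not something stated on this page.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Decode an ASCII quality string, ASSUMING Phred+33 encoding."""
    return [ord(ch) - offset for ch in qual]

print(phred_to_error_prob(40))       # roughly 1e-4
print(error_prob_to_phred(0.001))    # roughly 30
print(decode_quality_string("II5"))  # hypothetical quality string
```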

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons plus reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (60 773 986 reads). Unmapped: 0.2 %.
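The calculation can be sketched as follows. The function name `mapped_rate` is chosen for this example. The both-mates-mapped and singleton counts are the ones reported on this page, and their sum matches the mapped-read count shown above. The unmapped count is not reported here, so the value used below is hypothetical.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total

# Counts reported on this page: reads with both mates mapped, and singletons.
both_mates = 60_681_352
singletons = 92_634
mapped = both_mates + singletons
print(mapped)  # 60773986

# HYPOTHETICAL unmapped count, used only to illustrate the formula.
hypothetical_unmapped = 120_000
print(round(100 * mapped_rate(mapped, hypothetical_unmapped), 1))
```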

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included here; the expected-distance criterion is assessed in the Proper Pairs chart.

Both mates mapped: 99.7 % (60 681 352 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls inside the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (92 634 reads). Other: 99.8 %.
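In SAM/BAM files, the information needed to tell singletons from fully mapped pairs is carried in the FLAG field: bit 0x1 marks a paired read, 0x4 an unmapped read and 0x8 an unmapped mate. The sketch below classifies a read from its flag. The function name and category labels are invented for this example and are not the EGA implementation.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # read itself is unmapped
MATE_UNMAPPED = 0x8  # mate is unmapped

def classify_read(flag):
    """Classify a read by its SAM FLAG value.

    Returns one of 'unmapped', 'unpaired_mapped', 'singleton', 'both_mapped'.
    """
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "unpaired_mapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate failed to map
    return "both_mapped"

for flag in (99, 73, 133, 0):
    print(flag, classify_read(flag))
```

Counting the reads in each category and dividing by the total number of reads would give rates comparable to those in the Both Mates Mapped and Singletons charts.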

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (30 435 375 reads). Other: 50 %.
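In SAM/BAM files, a read aligned to the reverse strand has FLAG bit 0x10 set, so a forward-strand fraction can be estimated from the flags alone. The sketch below is one way to do it. It assumes the denominator is the number of mapped reads, which may differ from the denominator used for the chart, and the function name and example flags are invented for illustration.

```python
UNMAPPED = 0x4  # read is unmapped
REVERSE = 0x10  # read is aligned to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads aligned to the forward strand.

    ASSUMPTION: unmapped reads are excluded from the denominator.
    """
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: two forward reads, two reverse reads, one unmapped read.
print(forward_fraction([99, 147, 83, 163, 77]))
```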

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, the mates of pairs spanning it map with an unexpected separation (for example, much farther apart across a large deletion); this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The proper-pair rate is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).

Proper pairs: 88 % (53 561 210 reads). Other: 12 %.
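The expected separation in the example above is the fragment length minus the two read lengths: 500 - 2 x 100 = 300 base pairs. The sketch below works through that arithmetic and adds a simple distance check. The function names and the `tolerance` parameter are hypothetical; aligners decide proper pairing with their own rules, which also take read orientation into account.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both ends of a fragment are read."""
    return fragment_length - 2 * read_length

def within_expected_distance(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is a HYPOTHETICAL threshold chosen for this illustration.
    """
    return abs(observed - expected) <= tolerance

expected = expected_inner_distance(500, 100)
print(expected)  # 300
print(within_expected_distance(320, expected, tolerance=100))
print(within_expected_distance(5000, expected, tolerance=100))
```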

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 26.4 % (16 084 134 reads). Other: 73.6 %.
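The coordinate-based idea described above can be sketched in a few lines: a pair is flagged as a duplicate when an earlier pair has the same coordinates and strands for both mates. This is a deliberately simplified illustration with invented names and toy data; real duplicate-marking tools apply more refined rules, and this is not the method used to produce the figure above.

```python
def mark_duplicates(pairs):
    """Flag pairs whose coordinates were already seen.

    pairs: list of tuples
        (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Returns a list of bools, True where a pair repeats an earlier one.
    """
    seen = set()
    is_duplicate = []
    for key in pairs:
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates (0.0 for empty input)."""
    flags = mark_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0

# Hypothetical toy pairs: the second repeats the first, the third differs.
toy_pairs = [
    ("1", 100, "+", "1", 350, "-"),
    ("1", 100, "+", "1", 350, "-"),
    ("1", 100, "+", "1", 360, "-"),
]
print(mark_duplicates(toy_pairs))
```

Because only coordinates are compared, two independent fragments that happen to start and end at the same positions are also flagged, which is the depth-related overcounting described above.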

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome can plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (5M to 35M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Although the sequenced sample may be female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

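[Chart: stacked columns of mapped vs unmapped reads per chromosome. X-axis: chromosome; y-axis: percentage of reads (0% to 100%).]

Chromosome  Mapped   Unmapped
1           99.88%   0.12%
2           99.87%   0.13%
3           99.87%   0.13%
4           99.86%   0.14%
5           99.67%   0.33%
6           99.75%   0.25%
7           99.85%   0.15%
8           99.86%   0.14%
9           99.88%   0.12%
10          99.87%   0.13%
11          99.88%   0.12%
12          99.85%   0.15%
13          99.86%   0.14%
14          99.88%   0.12%
15          99.8%    0.2%
16          99.85%   0.15%
17          99.87%   0.13%
18          99.88%   0.12%
19          99.83%   0.17%
20          99.87%   0.13%
21          99.9%    0.1%
X           99.87%   0.13%
Y           99.57%   0.43%
M           99.94%   0.06%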
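A per-chromosome tally of this kind can be sketched from the reference name and FLAG of each record. The sketch assumes that an unmapped read with a mapped mate carries its mate's reference name, as described above, and that reads with no reference name ("*") are skipped. The function name and the toy records are invented for illustration.

```python
from collections import defaultdict

UNMAPPED = 0x4  # read is unmapped

def per_chromosome_rates(records):
    """Percent mapped and unmapped per reference sequence.

    records: iterable of (rname, flag) tuples. An unmapped read is counted
    under the rname it carries (ASSUMED to be its mate's chromosome).
    Returns {rname: (mapped_pct, unmapped_pct)}.
    """
    counts = defaultdict(lambda: [0, 0])  # rname -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned; left out of the per-chromosome view
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        rname: (100 * m / (m + u), 100 * u / (m + u))
        for rname, (m, u) in counts.items()
    }

# Hypothetical records: three mapped reads and one unmapped read placed on
# chromosome 1 via its mate, plus one read with no chromosome.
toy_records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("*", 77)]
print(per_chromosome_rates(toy_records))
```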