European Genome-Phenome Archive

File Quality

File Information

EGAF00002194429

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The y-axis uses a log scale to reduce the apparent variability. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 20M).]
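As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value and groups everything above a cap into a single `>1000` bin, as the chart's x-axis does. The function names, the input list of per-position depths, and the `cap` parameter are assumptions made for illustration, not the archive's implementation.

```python
from collections import Counter
import math

def coverage_distribution(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    `depths` is a hypothetical iterable of per-position read depths.
    Returns a dict mapping coverage value (or the string '>cap') to # bases.
    """
    counts = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        counts[key] += 1
    return dict(counts)

def log10_counts(distribution):
    """Log-scale the counts, mirroring the log-scaled y-axis."""
    return {key: math.log10(n) for key, n in distribution.items()}

# Toy input: eight positions with made-up depths.
dist = coverage_distribution([0, 0, 1, 1, 1, 2, 1500, 3000], cap=1000)
```

With real data the per-position depths would come from the alignment file; the toy list here only demonstrates the binning.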

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0 to 1.6G).]
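To make the scale concrete, the following sketch converts between a Phred score and an error probability using the standard Phred definition, Q = -10 log10(P). The function names are illustrative assumptions.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (P must be > 0)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly 1 error in 10,000 base calls.
p40 = phred_to_error_prob(40)
```

Each step of 10 on the Phred scale reduces the error probability by a factor of ten, which is why a distribution concentrated near 40 indicates confident base calls.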

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

[Chart: mapped reads. Mapped: 99.9 % (28 178 897 reads). Unmapped: 0.1 %.]
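The rate described above, mapped / (mapped + unmapped), can be sketched as a small helper. The function name and the example counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts: 999 mapped reads and 1 unmapped read.
rate = mapped_rate(999, 1)
```

Note that the denominator is the total number of reads, so the unmapped count must be included explicitly.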

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance (see the Proper Pairs chart).

[Chart: both mates mapped. Both mates mapped: 99.7 % (28 147 248 reads). Other: 0.3 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

[Chart: singletons. Singletons: 0.1 % (31 649 reads). Other: 99.9 %.]
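One way to tell singletons from fully mapped pairs is to inspect the flag field of each alignment record. The sketch below assumes SAM-style flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are illustrative, and the source does not state how the archive computes this chart.

```python
# SAM flag bits (assumed input format for this sketch).
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify_read(flag):
    """Classify a read by the mapping status of itself and its mate."""
    if not (flag & PAIRED):
        return "unpaired"
    read_mapped = not (flag & READ_UNMAPPED)
    mate_mapped = not (flag & MATE_UNMAPPED)
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"  # this read mapped, its mate did not
    return "unmapped"

# A paired read whose mate is unmapped is a singleton.
label = classify_read(PAIRED | MATE_UNMAPPED)
```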

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

[Chart: forward strand. Forward strand: 50 % (14 109 817 reads). Other: 50 %.]
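A minimal sketch of how a forward-strand fraction could be computed from alignment flags follows. It assumes SAM-style flag bits (0x10 for reverse strand, 0x4 for unmapped) and uses mapped reads as the denominator; both the function name and the choice of denominator are assumptions, since the source does not specify them.

```python
# SAM flag bits (assumed input format for this sketch).
READ_UNMAPPED = 0x4
REVERSE_STRAND = 0x10

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not (f & READ_UNMAPPED)]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not (f & REVERSE_STRAND))
    return forward / len(mapped)

# Toy input: two forward reads, two reverse reads, one unmapped read.
fraction = forward_fraction([0, 16, 0, 16, 4])
```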

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads can map with an unexpected separation; this is a signal for the variant, and the reads would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

[Chart: proper pairs. Proper pairs: 99.2 % (27 980 296 reads). Other: 0.8 %.]
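The ~500 base pair fragment with 100 base pair reads from the description works out as 500 - 2 × 100 = 300 base pairs between the mates. The sketch below encodes that arithmetic and a simple distance check. The function names and the `tolerance` value are hypothetical; real aligners use their own criteria, usually derived from the observed insert size distribution, and also check read orientation.

```python
def expected_inner_distance(fragment_len, read_len):
    """Expected gap between two mates sequenced from either end of a fragment."""
    return fragment_len - 2 * read_len

def separation_is_expected(observed, fragment_len, read_len, tolerance=100):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is an arbitrary illustrative threshold, not an aligner default.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed - expected) <= tolerance

gap = expected_inner_distance(500, 100)
```

A pair separated by several kilobases would fail this check, which is the kind of signal that can point to a structural variant.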

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

[Chart: duplicates. Duplicates: 74.4 % (20 982 774 reads). Other: 25.6 %.]
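The coordinate-based approach described above can be sketched as follows: reads are keyed by their own position and their mate's position, and every read after the first with a given key is marked as a duplicate. This is a deliberately simplified illustration with hypothetical names and input tuples; production tools also consider strand, orientation and base quality when choosing which read to keep.

```python
def mark_duplicates(reads):
    """Return names of reads whose coordinates repeat an earlier read.

    `reads` is a hypothetical list of tuples:
    (name, chrom, pos, mate_chrom, mate_pos).
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in reads:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)  # same read and mate coordinates seen before
        else:
            seen.add(key)
    return duplicates

def duplicate_rate(reads):
    """Fraction of reads marked as duplicates."""
    return len(mark_duplicates(reads)) / len(reads) if reads else 0.0

# Toy input: r2 repeats r1; r3 differs only in its mate position.
toy_reads = [
    ("r1", "1", 100, "1", 400),
    ("r2", "1", 100, "1", 400),
    ("r3", "1", 100, "1", 450),
    ("r4", "2", 100, "2", 400),
]
dups = mark_duplicates(toy_reads)
```

Because the key depends only on coordinates, two independent fragments that happen to share both positions are also marked, which is the false-positive effect that grows with depth.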

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (2M to 26M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

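[Chart: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%). Chromosome labels as shown on the chart axis.]

Chromosome  Mapped   Unmapped
1           99.9%    0.1%
2           99.89%   0.11%
3           99.89%   0.11%
4           99.89%   0.11%
5           99.9%    0.1%
6           99.89%   0.11%
7           99.89%   0.11%
8           99.88%   0.12%
9           99.89%   0.11%
10          99.87%   0.13%
11          99.89%   0.11%
12          99.89%   0.11%
13          99.89%   0.11%
14          99.89%   0.11%
15          99.87%   0.13%
16          99.88%   0.12%
17          99.89%   0.11%
18          99.9%    0.1%
19          99.86%   0.14%
20          99.88%   0.12%
21          99.89%   0.11%
X           99.89%   0.11%
Y           99.91%   0.09%
M           99.79%   0.21%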
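A per-chromosome breakdown like this could be tallied from alignment records as sketched below. The sketch assumes each record carries a chromosome name and a SAM-style flag, and that an unmapped read carries its mate's chromosome, as described above. The function name and input format are hypothetical.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # SAM flag bit (assumed input format for this sketch)

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        index = 1 if flag & READ_UNMAPPED else 0
        counts[chrom][index] += 1
    return {
        chrom: mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }

# Toy input: three reads placed on chromosome 1 (one unmapped), one on X.
rates = per_chromosome_mapped_rate([("1", 1), ("1", 1), ("1", 5), ("X", 1)])
```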