European Genome-Phenome Archive

File Quality

File Information: EGAF00001688482

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The number of bases is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
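As a rough illustration of how such a distribution could be built, the sketch below counts how many positions share each coverage value. The function name `coverage_histogram`, the `cap` parameter, and the toy depth list are assumptions made for this example; they are not taken from the EGA pipeline. Values above the cap are pooled into one bin, similar to the ">1000" tick on the chart.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    `depths` is an iterable of per-position coverage values (hypothetical input).
    The pooled bin is stored under the key `cap + 1`.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(hist)

# Toy per-position depths, invented for illustration only.
depths = [0, 1, 1, 2, 2, 2, 1500]
hist = coverage_histogram(depths)
print(hist)
```

Plotting the counts on a log-scaled y-axis would then give a curve of the kind described above.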

Chart: base coverage distribution. X-axis: coverage value (ticks from 100 to >1000). Y-axis: # bases (log scale, 2k to 100M).

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. through an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
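The relationship between score and error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score to the probability that the call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred score."""
    return -10 * math.log10(p)

# A score of 40 corresponds to a 1 in 10,000 chance of an incorrect call.
p40 = phred_to_error_prob(40)
q_from_p = error_prob_to_phred(0.001)
print(p40, q_from_p)
```

Each step of 10 in the score divides the error probability by ten, which is why a distribution concentrated near 40 indicates confident base calls.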

Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 6.5G).

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
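The calculation can be sketched in a few lines. The function name and the read counts below are invented for illustration and are not the counts of this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts: 978 mapped reads and 22 unmapped reads.
rate = mapped_rate(978, 22)
print(f"{rate:.1%}")
```

Note the parentheses around the denominator: the unmapped reads are added to the mapped reads before dividing.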

Mapped: 97.8 % (105 762 046 reads). Unmapped: 2.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included, as in the Proper Pairs chart.

Both mates mapped: 97.8 % (105 762 046 reads). Remaining: 2.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
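Assuming the alignments are stored in SAM/BAM format (an assumption; the text above does not name the format here), the singleton definition can be expressed with the standard SAM FLAG bits: 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The helper name `is_singleton` is hypothetical.

```python
# Standard SAM FLAG bits used in this sketch.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def is_singleton(flag):
    """True if the read is paired and mapped while its mate is unmapped."""
    paired = bool(flag & PAIRED)
    mapped = not (flag & UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    return paired and mapped and mate_unmapped

print(is_singleton(PAIRED | MATE_UNMAPPED))             # mapped read, unmapped mate
print(is_singleton(PAIRED))                             # both mates mapped
print(is_singleton(PAIRED | UNMAPPED | MATE_UNMAPPED))  # neither mate mapped
```

The singleton rate would then be the number of reads for which this check holds divided by the total number of reads.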

Singletons: 0 % (0 reads). Non-singletons: 100 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (54 059 193 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with an unusually large separation would be a signal for this variant and would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
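The arithmetic behind the expected separation can be written out as a small sketch. The function names and the tolerance value are assumptions for illustration; real aligners derive the acceptable range from the observed insert-size distribution.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

def at_expected_distance(observed_gap, fragment_len, read_len, tolerance=100):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is a hypothetical fixed window chosen for this example.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance

gap = expected_inner_distance(500, 100)
print(gap)
print(at_expected_distance(320, 500, 100))   # close to the expected gap
print(at_expected_distance(5000, 500, 100))  # far larger separation
```

A pair separated by several thousand base pairs fails the check, which is the kind of signal a large structural variant would produce.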

Proper pairs: 97.8 % (105 762 046 reads). Remaining: 2.2 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
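The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration: the tuple layout and the function name are assumptions, and real duplicate-marking tools typically apply further criteria (such as read orientation) that are omitted here.

```python
def mark_duplicates(pairs):
    """Flag a pair as duplicate if an earlier pair has identical coordinates.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples
    (a hypothetical layout chosen for this sketch).
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # duplicate if these coordinates were seen before
        seen.add(key)
    return flags

# Toy alignments, invented for illustration only.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same read and mate coordinates as the first pair
    ("1", 100, "1", 450),  # same read start, different mate position
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)
```

Because the check relies only on coordinates, independent fragments that happen to share both positions are also flagged, which is why the rate becomes harder to interpret at high depth.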

Duplicates: 18.2 % (19 651 576 reads). Non-duplicates: 81.8 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

Chart: mapping quality distribution. X-axis: Phred quality score (0 to 240). Y-axis: # reads (10M to 100M).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not. The unmapped read should then have the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (labelled 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Every column is labelled 100% mapped and 0% unmapped.