European Genome-Phenome Archive

File Quality

File Information

EGAF00002467210

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale so that the wide range of counts stays readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

Chart: base coverage distribution. X-axis: Coverage value (ticks from 100 to >1000). Y-axis: # Bases (log scale, 10k to 100M).
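As a rough illustration of how such a distribution could be tallied, the sketch below counts how many positions share each coverage value and pools everything above 1000 into one bin, mirroring the ">1000" tick on the chart. The function name, the `cap` parameter and the input list are hypothetical; the report does not say how its histogram is actually computed.

```python
from collections import Counter


def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    depths: iterable of per-position coverage values (hypothetical input).
    cap:    values above this are pooled into a single bin stored at cap + 1,
            standing in for the '>1000' bin on the chart.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Illustrative depths only, not taken from the file described here.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_distribution(depths)
print(sorted(hist.items()))  # [(0, 2), (1, 3), (2, 1), (5, 1), (1001, 1)]
```

Plotting the counts on a log-scaled y-axis would then give the shape described above: a high peak at low coverage and a long descending tail.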

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. through a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect a distribution skewed towards larger (more confident) scores, typically around 40.

Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 4G).
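To make the scale concrete, the following minimal sketch converts between Phred scores and error probabilities using the standard Phred definition Q = -10·log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10,000 calls, and a score of 20 to roughly one in 100.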

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but a read carrying more substantial variation can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (44 188 832 reads). Unmapped: 0.2 %.
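The mapped read rate described above is a simple proportion. A minimal sketch, with a hypothetical function name and illustrative counts rather than the counts of this file:

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Illustrative counts only.
rate = mapped_rate(mapped=998, unmapped=2)
print(f"{rate:.1%}")  # 99.8%
```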

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here. Whether a pair maps at the expected distance is covered by the Proper Pairs chart.

Both mates mapped: 99.6 % (44 112 200 reads). Remainder: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (76 632 reads). Remainder: 99.8 %.
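The distinction between "both mates mapped" and "singleton" can be sketched from alignment flags. The example assumes the reads are stored in SAM/BAM format, which the report does not state explicitly; the three bit values are the standard SAM FLAG bits, while the function name and category labels are hypothetical.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def pair_status(flag):
    """Classify one read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate is unmapped
    if mate_mapped:
        return "unmapped_mate_of_singleton"
    return "both_unmapped"


for flag in (99, 73, 133, 77):
    print(flag, pair_status(flag))
# 99 both_mapped
# 73 singleton
# 133 unmapped_mate_of_singleton
# 77 both_unmapped
```

Counting the "singleton" reads and dividing by the total number of reads would give a rate comparable to the one shown in this section, assuming the report uses the same definition.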

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping to the reference genome approximately 50% of the reads will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (22 133 599 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with a large separation are a signal of this variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.6 % (43 652 900 reads). Remainder: 1.4 %.
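The ~300 base pair figure in the example above follows from subtracting both read lengths from the fragment length. The sketch below reproduces that arithmetic and adds a deliberately simple distance check. Both function names and the `tolerance` value are hypothetical; aligners decide what counts as a proper pair with their own criteria, which are not described in this report.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both reads come from one fragment."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed_gap, expected_gap, tolerance):
    """Hypothetical check: is the observed gap close to the expected one?"""
    return abs(observed_gap - expected_gap) <= tolerance


gap = expected_inner_distance(fragment_length=500, read_length=100)
print(gap)                                                  # 300
print(within_expected_distance(320, gap, tolerance=100))    # True
print(within_expected_distance(5000, gap, tolerance=100))   # False
```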

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and both should be considered together. Additionally, as the sequencing depth increases, it becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.

Duplicates: 67.8 % (30 026 068 reads). Non-duplicates: 32.2 %.
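The coordinate-based idea described above can be sketched as follows: a pair is counted as a duplicate if an earlier pair had the same read and mate coordinates. This is a simplified illustration, not the algorithm used to produce this report; the tuple layout and function name are assumptions, and real duplicate-marking tools apply further rules.

```python
def duplicate_rate(pairs):
    """Fraction of pairs whose coordinates repeat those of an earlier pair.

    pairs: iterable of (chrom, pos, strand, mate_chrom, mate_pos) tuples
           (a hypothetical, simplified representation of a read pair).
    """
    seen = set()
    duplicates = 0
    total = 0
    for key in pairs:
        total += 1
        if key in seen:
            duplicates += 1  # same coordinates as an earlier pair
        else:
            seen.add(key)
    return duplicates / total if total else 0.0


# Illustrative pairs only.
pairs = [
    ("1", 1000, "+", "1", 1400),
    ("1", 1000, "+", "1", 1400),  # duplicate of the first pair
    ("1", 2000, "+", "1", 2400),
    ("1", 1000, "+", "1", 1400),  # another duplicate of the first pair
]
print(duplicate_rate(pairs))  # 0.5
```

Because the check relies on coordinates alone, independent fragments that happen to share the same start positions are also counted, which is the depth effect noted in the paragraph above.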

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (5M to 40M).
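One way to summarise such a distribution is to report the share of reads with a score of 0 and the share at or above some confidence threshold. The sketch below does this for a list of mapping quality values; the function name and the threshold of 30 are hypothetical choices, not values used by this report.

```python
from collections import Counter


def mapq_summary(mapqs, confident=30):
    """Share of reads with mapping quality 0 and with quality >= confident."""
    hist = Counter(mapqs)
    total = sum(hist.values())
    if total == 0:
        raise ValueError("no mapping qualities given")
    high = sum(n for q, n in hist.items() if q >= confident)
    return {"ambiguous": hist[0] / total, "confident": high / total}


# Illustrative values only.
summary = mapq_summary([0, 0, 60, 60, 60, 60, 37, 10])
print(summary)  # {'ambiguous': 0.25, 'confident': 0.625}
```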

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is then given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

Chromosome  Mapped   Unmapped
1           99.83%   0.17%
2           99.82%   0.18%
3           99.82%   0.18%
4           99.83%   0.17%
5           99.83%   0.17%
6           99.83%   0.17%
7           99.83%   0.17%
8           99.83%   0.17%
9           99.83%   0.17%
10          99.82%   0.18%
11          99.84%   0.16%
12          99.83%   0.17%
13          99.83%   0.17%
14          99.84%   0.16%
15          99.81%   0.19%
16          99.83%   0.17%
17          99.83%   0.17%
18          99.84%   0.16%
19          99.84%   0.16%
20          99.82%   0.18%
21          99.81%   0.19%
X           99.82%   0.18%
Y           99.44%   0.56%
M           99.84%   0.16%
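A per-chromosome breakdown of this kind could be tallied roughly as below. The sketch assumes SAM-style records in which an unmapped read with a mapped mate carries its mate's chromosome and a read with no chromosome has `*` as its reference name. The `(rname, flag)` tuple layout and the function name are hypothetical.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # standard SAM FLAG bit: this read is unmapped


def per_chromosome_mapped_rate(records):
    """Mapped fraction per chromosome from (rname, flag) tuples."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # read has no chromosome assigned, so it is skipped
        counts[rname][1 if flag & READ_UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# Illustrative records only.
records = [
    ("1", 99), ("1", 147),   # both mates mapped
    ("1", 73), ("1", 133),   # singleton and its unmapped mate
    ("X", 0),                # mapped single-end read
    ("*", 77),               # unmapped read with no chromosome
]
rates = per_chromosome_mapped_rate(records)
print(rates)  # {'1': 0.75, 'X': 1.0}
```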