European Genome-Phenome Archive

File Quality

File Information

EGAF00008376672

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Data is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
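As an illustration of how such a distribution could be tabulated, the sketch below counts how many positions share each coverage value. The function name, the input format (a list of per-position depths) and the pooling of values above 1000 into a single bin are assumptions made for this example, not a description of how the archive computes the chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    depths: iterable of per-position coverage values (non-negative ints).
    cap:    values above this are pooled into one ">cap" bin
            (assumed here to be 1000).
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
depths = [0, 1, 1, 2, 2, 2, 30, 1500]
hist = coverage_histogram(depths)
print(hist[2])        # 3 positions are covered exactly twice
print(hist[">1000"])  # 1 position falls in the pooled bin
```

Plotting the counts on a log axis then gives the shape described above: many positions at low coverage and a long descending tail.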

[Chart: number of bases (y-axis, log scale, 10k to 100M) at each coverage value (x-axis, 0 to >1000).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly called, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
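A short sketch of the conversion between Phred scores and error probabilities, using the standard Phred definition Q = -10·log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition, a score of 40 corresponds to roughly one wrong call in 10 000, and every additional 10 points divides the error probability by ten.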

[Chart: number of bases (y-axis, 0G to 16G) at each Phred quality score (x-axis, 0 to 40).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
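The calculation can be sketched as follows. The function name and the counts in the usage lines are hypothetical and are not taken from this file's report.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total


# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=948, unmapped=52)
print(f"{rate:.1%}")
```

The parentheses matter: the denominator is the total number of reads, not just the unmapped ones.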

Mapped reads: 94.8 % (120 003 465 reads). Remainder: 5.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, as they are in the Proper Pairs chart.

Both mates mapped: 94.8 % (119 990 582 reads). Remainder: 5.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
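If the alignments are stored in SAM/BAM format (an assumption; this page does not state how the statistics are derived), categories such as singleton, both mates mapped, forward strand, proper pair and duplicate can be read from the FLAG bits of each record. The sketch below uses the bit values defined in the SAM specification; the function name and the returned dictionary keys are illustrative.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
REVERSE = 0x10
DUPLICATE = 0x400


def classify(flag):
    """Derive read categories from a SAM FLAG value (illustrative sketch)."""
    mapped = not (flag & UNMAPPED)
    paired = bool(flag & PAIRED)
    mate_mapped = paired and not (flag & MATE_UNMAPPED)
    return {
        "mapped": mapped,
        "both_mates_mapped": mapped and mate_mapped,
        # Singleton: this read maps but its mate does not.
        "singleton": mapped and paired and not mate_mapped,
        "forward": mapped and not (flag & REVERSE),
        "proper_pair": mapped and bool(flag & PROPER_PAIR),
        "duplicate": bool(flag & DUPLICATE),
    }


print(classify(73)["singleton"])  # True: paired, mapped, mate unmapped
print(classify(99)["singleton"])  # False: both mates mapped
```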

Singletons: 0 % (12 883 reads). Remainder: 100 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (63 270 661 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map at an unexpected separation; this signals the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
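The arithmetic behind the expected separation in the fragment example can be sketched as below. The function names and the `tolerance` parameter are hypothetical; real aligners typically estimate the acceptable range from the observed insert-size distribution rather than from a fixed tolerance.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced gap between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len


def at_expected_distance(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is a hypothetical parameter used only in this sketch.
    """
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance


print(expected_gap(500, 100))  # 300, as in the example above
print(at_expected_distance(320, 500, 100, tolerance=50))
print(at_expected_distance(5000, 500, 100, tolerance=50))
```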

Proper pairs: 94.8 % (119 990 582 reads). Remainder: 5.2 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
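A minimal sketch of the coordinate-based approach described above: a read pair is flagged as a duplicate when an earlier pair has the same read and mate coordinates. The input format and function name are assumptions for illustration; production tools also take strand and orientation into account and usually keep the best-quality copy rather than the first one seen.

```python
def mark_duplicates(pairs):
    """Flag pairs whose coordinates repeat an earlier pair.

    pairs: list of (chrom, pos, mate_chrom, mate_pos) tuples.
    Returns a list of booleans, True where the pair is a duplicate.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # first occurrence is not a duplicate
        seen.add(key)
    return flags


# Hypothetical read pairs, for illustration only.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates as the first pair
    ("1", 100, "1", 450),  # same start, different mate position
    ("2", 100, "2", 400),
]
flags = mark_duplicates(pairs)
print(flags)
print(sum(flags) / len(flags))  # duplicate rate for this toy input
```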

Duplicates: 53.3 % (67 462 692 reads). Remainder: 46.7 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.

[Chart: number of reads (y-axis, 10M to 130M) at each Phred quality score (x-axis, 0 to 60).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. An unmapped read can also carry a chromosome when the alignment was performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position, but it is also classified as unmapped.
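A sketch of how per-chromosome percentages could be tallied, assuming SAM-style records in which an unmapped read with a mapped mate carries its mate's chromosome and has the 0x4 FLAG bit set. The input format (chromosome, FLAG) and the function name are illustrative, and this is not necessarily how the chart on this page is produced.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped


def per_chromosome_rates(records):
    """Return {chrom: (mapped_pct, unmapped_pct)}.

    records: iterable of (chrom, flag) tuples. Reads with no chromosome
    assigned ("*") are skipped, since they belong to no column.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (100 * m / (m + u), 100 * u / (m + u))
        for chrom, (m, u) in counts.items()
    }


# Hypothetical records: three mapped reads and one unmapped read that
# inherited chromosome "1" from its mapped mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("*", 77)]
print(per_chromosome_rates(records))
```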

[Chart: percentage of mapped and unmapped reads (y-axis, 0% to 100%) per chromosome (x-axis: 1 to 21, X, Y, M). Mapped reads range from 99.98% to 100% and unmapped reads from 0% to 0.02% across all columns.]