European Genome-Phenome Archive

File Quality

File Information: EGAF00003611488

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. The expected shape is a high peak at the beginning (low coverage) followed by a descending curve.
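As a minimal sketch of how such a distribution could be tabulated, the following counts how many positions have each coverage value. The function name, the input format (a list of per-position depths) and the overflow bin are illustrative assumptions, not the archive's actual pipeline; the cap of 1000 simply mirrors the ">1000" bin on the chart's x-axis.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`,
    which plays the role of a ">1000" bin.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    # Sort by coverage value so the result reads like the x-axis.
    return dict(sorted(hist.items()))


# Hypothetical per-position depths for a tiny reference.
example = coverage_histogram([0, 1, 1, 2, 2, 2, 1500], cap=1000)
```

Here `example` maps each coverage value to the number of bases observed at that coverage, with the single very deep position falling in the overflow bin.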

[Chart: base coverage distribution. X-axis: coverage value (ticks from 100 to >1000). Y-axis: # Bases (log scale, 1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
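The relationship between score and error probability can be sketched as follows, using the standard Phred definition Q = -10 × log10(P). The function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to a 1 in 10 000 chance of a wrong base call.
p_q40 = phred_to_error_prob(40)
# An error probability of 1 in 1 000 corresponds to a score of about 30.
q_p001 = error_prob_to_phred(0.001)
```

Each step of 10 in the score reduces the error probability by a factor of ten, which is why a distribution concentrated around 40 indicates very confident base calls.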

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 2G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore the mapped reads rate is not expected to reach 100%, although it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
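The calculation described above is a simple proportion. A minimal sketch, with a hypothetical function name and hypothetical read counts:

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 996 mapped reads and 4 unmapped reads.
rate = mapped_rate(996, 4)
```

With these counts `rate` is 0.996, i.e. a mapped reads rate of 99.6 %.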

Mapped: 99.6 % (39 679 999 reads). Unmapped: 0.4 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; that criterion is applied in the Proper Pairs chart.

Both mates mapped: 99.3 % (39 544 812 reads). Remainder: 0.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
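In SAM/BAM files these categories can be read off the FLAG field. The sketch below uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function and category names are illustrative, and this is not necessarily how the archive computes its figures.

```python
PAIRED = 0x1          # read is part of a pair
READ_UNMAPPED = 0x4   # this read is unmapped
MATE_UNMAPPED = 0x8   # the read's mate is unmapped


def classify_read(flag):
    """Classify one alignment record by its SAM FLAG value."""
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"            # mapped, but its mate is not
    if flag & PAIRED:
        return "both_mates_mapped"
    return "unpaired_mapped"


# 99 = paired, proper pair, mate on reverse strand, first in pair.
# 73 = paired, mate unmapped, first in pair.
# 133 = paired, read unmapped, second in pair.
labels = [classify_read(f) for f in (99, 73, 133)]
```

Counting the labels over all records in a file gives the "both mates mapped" and "singleton" figures as fractions of the total.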

Singletons: 0.3 % (135 187 reads). Remainder: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
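A minimal sketch of this fraction, computed from SAM FLAG values (0x10 marks a read mapped to the reverse strand, 0x4 an unmapped read). The function name is illustrative, and the choice of mapped reads as the denominator is an assumption.

```python
READ_UNMAPPED = 0x4   # this read is unmapped
REVERSE_STRAND = 0x10  # read is mapped to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads that lie on the forward strand."""
    mapped = [f for f in flags if not f & READ_UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE_STRAND)
    return forward / len(mapped)


# Hypothetical flags: 99 and 163 are forward, 147 and 83 are reverse,
# 77 is unmapped and therefore ignored.
fraction = forward_fraction([99, 147, 83, 163, 77])
```

In this example two of the four mapped reads are on the forward strand, so `fraction` is 0.5, matching the expected balance.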

Forward strand: 50 % (19 910 642 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
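The distance arithmetic in the example above (500 - 2 × 100 = 300) can be sketched as follows. The function names and the fixed `tolerance` parameter are hypothetical; real aligners typically estimate the acceptable range from the observed fragment size distribution and also check read orientation.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two reads when each end of a fragment is sequenced."""
    return fragment_len - 2 * read_len


def within_expected_distance(observed_inner, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_inner - expected) <= tolerance


# ~500 bp fragments with 100 bp reads give an expected gap of ~300 bp.
gap = expected_inner_distance(500, 100)
# A 320 bp gap is close to expectation; a 5 000 bp gap is not.
close = within_expected_distance(320, 500, 100, tolerance=50)
far = within_expected_distance(5000, 500, 100, tolerance=50)
```

A pair like the second one, mapping much further apart than the library allows, is the kind that would be excluded from the proper pairs count.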

Proper pairs: 98.7 % (39 322 772 reads). Remainder: 1.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
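The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: the tuple layout `(chrom, pos, mate_chrom, mate_pos)` and the function name are assumptions, and production tools also consider strand, clipping and base qualities.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates were already seen.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples.
    The first pair at a given set of coordinates is kept as the original;
    later pairs with identical coordinates are marked as duplicates.
    """
    seen = set()
    is_duplicate = []
    for key in pairs:
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


# Hypothetical pairs: the second repeats the first exactly; the third
# shares the read position but its mate maps elsewhere.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),
    ("1", 100, "1", 450),
    ("2", 100, "2", 400),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
```

Only the second pair is flagged, giving a duplicate rate of 0.25. The sketch also shows why very deep sequencing inflates the rate: independent fragments that happen to share coordinates are indistinguishable from true PCR duplicates under this test.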

Duplicates: 81.1 % (32 288 990 reads). Remainder: 18.9 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.
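One way to quantify how strongly such a distribution is skewed is to compute the share of reads at or above a chosen score. A minimal sketch, in which the function name and the histogram values are hypothetical:

```python
def fraction_at_or_above(mapq_counts, threshold):
    """Fraction of reads whose mapping quality is >= threshold.

    `mapq_counts` maps a mapping quality score to a number of reads.
    """
    total = sum(mapq_counts.values())
    if total == 0:
        raise ValueError("empty histogram")
    high = sum(n for q, n in mapq_counts.items() if q >= threshold)
    return high / total


# Hypothetical histogram: most reads at 60, a few ambiguous reads at 0.
mapq_counts = {0: 50, 30: 50, 60: 900}
share_q60 = fraction_at_or_above(mapq_counts, 60)
share_q0 = mapq_counts[0] / sum(mapq_counts.values())
```

In this example 90 % of reads have the top score and 5 % have a score of 0, the pattern described above for a well-mapped sample with some reads from repetitive regions.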

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (5M to 35M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads can also arise when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
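A per-chromosome breakdown of this kind can be sketched by tallying records on the chromosome they carry, using SAM FLAG bit 0x4 to tell unmapped reads apart. The record layout `(chrom, flag)` and the function name are illustrative assumptions.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # this read is unmapped


def per_chromosome_rates(records):
    """Return {chrom: (mapped %, unmapped %)} from (chrom, flag) records.

    An unmapped read placed at its mate's position carries the mate's
    chromosome, so it is counted as unmapped on that chromosome.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & READ_UNMAPPED else 0] += 1
    return {
        chrom: (100 * mapped / (mapped + unmapped),
                100 * unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }


# Hypothetical records: flag 133 is an unmapped read stored on chromosome 1
# next to its mapped mate (flag 73).
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("Y", 99)]
rates = per_chromosome_rates(records)
```

Here chromosome 1 comes out as 75 % mapped and 25 % unmapped, and chromosome Y as 100 % mapped.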

Chromosome  Mapped  Unmapped
1           99.66%  0.34%
2           99.65%  0.35%
3           99.66%  0.34%
4           99.66%  0.34%
5           99.66%  0.34%
6           99.66%  0.34%
7           99.65%  0.35%
8           99.66%  0.34%
9           99.66%  0.34%
10          99.63%  0.37%
11          99.67%  0.33%
12          99.67%  0.33%
13          99.64%  0.36%
14          99.65%  0.35%
15          99.64%  0.36%
16          99.67%  0.33%
17          99.67%  0.33%
18          99.65%  0.35%
19          99.69%  0.31%
20          99.67%  0.33%
21          99.64%  0.36%
X           99.67%  0.33%
Y           98.59%  1.41%
M           99.75%  0.25%