European Genome-Phenome Archive

File Quality

File Information

EGAF00003611510

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The number of bases is plotted on a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: number of bases, log scale (100 to 10M).]
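A minimal sketch of how such a distribution could be tallied from per-position depths. The function name, the `cap` parameter and the sample depths are illustrative assumptions, not part of the EGA pipeline; depths above the cap are pooled into one overflow bin, similar to the ">1000" bin on the chart.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single overflow bin
    keyed `cap + 1` (an assumed convention for this sketch).
    """
    hist = Counter()
    for d in depths:
        hist[min(d, cap + 1)] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 0, 1, 1, 2, 5, 5, 1500, 2300]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log axis then keeps both the large low-coverage peak and the sparse high-coverage tail visible.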

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 1.1G).]
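As a worked illustration of the Phred scale, the sketch below converts between a score and an error probability using the standard definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

On this scale a score of 40 corresponds to an error probability of 1 in 10 000, and each additional 10 points reduces the error probability tenfold.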

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped: 99.6 % (20 332 446 reads); unmapped: 0.4 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance from each other are also counted here; whether a pair maps at the expected distance is what the Proper Pairs chart reports.

Both mates mapped: 99.3 % (20 257 742 reads); remainder: 0.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.4 % (74 704 reads); remainder: 99.6 %.
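The categories described so far (unmapped, singleton, both mates mapped) can be derived from the SAM FLAG field. The sketch below is one way to do this, assuming the standard SAM flag bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The `classify` function and the sample flags are hypothetical, and secondary or supplementary alignments are not treated specially here.

```python
from collections import Counter

# Standard SAM flag bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify(flag):
    """Assign a read to one mapping category based on its SAM flag."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

# Hypothetical flags: two reads of a fully mapped pair (99, 147),
# a singleton (73) with its unmapped mate (133), and a read from
# a pair in which neither mate mapped (77).
flags = [99, 147, 73, 133, 77]
counts = Counter(classify(f) for f in flags)

mapped = sum(n for category, n in counts.items() if category != "unmapped")
mapped_rate = mapped / sum(counts.values())
print(counts, mapped_rate)
```

The counts in this report are consistent with that breakdown: 20 257 742 reads with both mates mapped plus 74 704 singletons gives the 20 332 446 mapped reads shown in the Mapped Reads section.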

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (10 204 552 reads); remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with a separation that differs markedly from the expected one; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%; even if the mapping rate itself is low as a result of bacterial contamination, for example.

Proper pairs: 98.4 % (20 085 784 reads); remainder: 1.6 %.
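The expected separation in the example above is simple arithmetic: the fragment length minus the two read lengths. The sketch below reproduces it and adds a hypothetical tolerance check. Real aligners decide whether a pair is proper using their own criteria, so `separation_as_expected` and its `tolerance` parameter are assumptions made for illustration.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both lie at the ends of one fragment."""
    return fragment_length - 2 * read_length

def separation_as_expected(observed, expected, tolerance):
    """Hypothetical check: is the observed gap within `tolerance` of the expected one?"""
    return abs(observed - expected) <= tolerance

# Values from the example in the text: ~500 bp fragments, 100 bp reads.
expected = expected_inner_distance(500, 100)
print(expected)
print(separation_as_expected(320, expected, tolerance=50))
print(separation_as_expected(5000, expected, tolerance=50))
```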

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 55.2 % (11 265 298 reads); remainder: 44.8 %.
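The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: the record layout (name, chromosome, position, strand, mate chromosome, mate position) and the function name are assumptions, and production duplicate markers apply more refined rules.

```python
def find_duplicates(reads):
    """Return names of reads whose coordinates and mate coordinates
    match those of an earlier read in the list.

    Each read is a tuple:
    (name, chrom, pos, strand, mate_chrom, mate_pos)
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, strand, mate_chrom, mate_pos in reads:
        key = (chrom, pos, strand, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical reads, for illustration only.
reads = [
    ("r1", "1", 100, "+", "1", 400),
    ("r2", "1", 100, "+", "1", 400),  # same coordinates as r1
    ("r3", "1", 100, "+", "1", 450),  # mate position differs
    ("r4", "2", 100, "+", "2", 400),  # different chromosome
]
dups = find_duplicates(reads)
print(dups, len(dups) / len(reads))
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is the depth-related effect the text describes.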

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (up to 18M).]
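One way to summarise such a distribution is to report the share of reads at a score of 0 (ambiguous placement) and the share at or above a high threshold. The sketch below does this for a histogram held as a dictionary; the function name, the threshold and the counts are illustrative assumptions and are not taken from this file's chart.

```python
def mapq_summary(hist, high=60):
    """Fractions of reads with mapping quality 0 and with quality >= `high`.

    `hist` maps a mapping quality score to a read count.
    """
    total = sum(hist.values())
    return {
        "frac_mapq0": hist.get(0, 0) / total,
        "frac_high": sum(n for q, n in hist.items() if q >= high) / total,
    }

# Hypothetical histogram, for illustration only.
hist = {0: 50, 30: 50, 60: 900}
summary = mapq_summary(hist)
print(summary)
```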

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read should then have the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%).]

Chromosome  Mapped   Unmapped
1           99.64%   0.36%
2           99.62%   0.38%
3           99.63%   0.37%
4           99.62%   0.38%
5           99.64%   0.36%
6           99.64%   0.36%
7           99.62%   0.38%
8           99.63%   0.37%
9           99.63%   0.37%
10          99.61%   0.39%
11          99.64%   0.36%
12          99.65%   0.35%
13          99.64%   0.36%
14          99.65%   0.35%
15          99.61%   0.39%
16          99.65%   0.35%
17          99.64%   0.36%
18          99.64%   0.36%
19          99.67%   0.33%
20          99.61%   0.39%
21          99.64%   0.36%
X           99.63%   0.37%
Y           99.1%    0.9%
M           99.73%   0.27%
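A per-chromosome breakdown like this can be tallied from the reference name and FLAG of each record. The sketch below assumes records are available as (RNAME, FLAG) pairs and uses the standard SAM flag bit 0x4 for "read unmapped"; the function name and sample records are hypothetical. An unmapped read that carries its mate's chromosome is counted as unmapped on that chromosome, while reads with no chromosome ("*") are skipped.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM flag bit: read unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped / (mapped + unmapped)}.

    `records` is an iterable of (rname, flag) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned, cannot be placed in a column
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records: a mapped pair, a singleton with its unmapped
# mate placed on the same chromosome, and a read with no chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
print(rates)
```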