European Genome-Phenome Archive

File Quality

File Information

EGAF00003612298

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
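As a rough illustration of how such a distribution could be tallied, the sketch below counts how many positions have each coverage value and pools everything above a cap into one overflow bin. The function name, the cap of 1000 and the sample depths are assumptions made for this example, not the archive's actual pipeline.

```python
from collections import Counter

def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed cap + 1,
    similar to a ">1000" category on a chart (the cap is an assumption).
    """
    return Counter(min(d, cap + 1) for d in depths)

# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2300]
hist = coverage_distribution(depths)

for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the resulting counts on a logarithmic y-axis keeps both the large low-coverage peak and the sparse high-coverage tail visible.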

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
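To make the relationship between score and error probability concrete, here is a minimal sketch using the standard Phred definition, Q = -10·log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10*log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q={q}: P(error)={phred_to_error_prob(q):g}")
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000.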

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 2.2G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
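The calculation described above can be sketched as follows. The function name and the read counts are hypothetical and serve only to illustrate the ratio.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, not taken from this file.
rate = mapped_rate(mapped=9_970, unmapped=30)
print(f"{rate:.1%}")  # prints 99.7%
```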

Mapped: 99.7 % (42 226 294 reads). Unmapped: 0.3 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates map at an unexpected distance from each other are still counted here; the expected distance is what the Proper Pairs chart evaluates.

Both mates mapped: 99.4 % (42 112 830 reads). Other: 0.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
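In SAM/BAM files the mapping state of a read and its mate is recorded in the FLAG field (bit 0x1: read is paired, 0x4: read unmapped, 0x8: mate unmapped). The sketch below classifies a read from its flag along the lines described above. The category names are made up for this example, and whether the archive derives its charts exactly this way is an assumption.

```python
# SAM FLAG bits, as defined in the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify(flag):
    """Classify a read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate is unmapped
    if mate_mapped:
        return "unmapped_mate_of_singleton"
    return "pair_unmapped"

for flag in (99, 73, 133, 77):
    print(flag, classify(flag))
```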

Singletons: 0.3 % (113 464 reads). Other: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (21 183 910 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates will map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
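The arithmetic in the fragment example can be sketched as below: the expected gap between the mates is the fragment length minus the two read lengths. The fixed `tolerance` is a hypothetical simplification introduced for this example; real aligners typically judge pairs against an insert-size distribution estimated from the data, and also check orientation.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced bases between the two mates of one fragment."""
    return fragment_len - 2 * read_len

def has_expected_distance(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap.

    `tolerance` is an illustrative parameter, not an aligner setting.
    """
    expected = expected_gap(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance

print(expected_gap(500, 100))  # prints 300
print(has_expected_distance(320, 500, 100, tolerance=100))
print(has_expected_distance(5000, 500, 100, tolerance=100))
```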

Proper pairs: 98.6 % (41 771 858 reads). Other: 1.4 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
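The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration with a hypothetical record layout; production duplicate-marking tools apply further refinements that are not modelled here.

```python
def mark_duplicates(pairs):
    """Return names of pairs whose coordinates repeat an earlier pair.

    Each record is (name, chrom, pos, strand, mate_chrom, mate_pos,
    mate_strand); this layout is an assumption made for the example.
    The first pair seen at a given set of coordinates is kept.
    """
    seen = set()
    duplicates = []
    for name, *coords in pairs:
        key = tuple(coords)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical pairs: r2 shares all coordinates with r1.
pairs = [
    ("r1", "chr1", 100, "+", "chr1", 400, "-"),
    ("r2", "chr1", 100, "+", "chr1", 400, "-"),
    ("r3", "chr1", 100, "+", "chr1", 450, "-"),
    ("r4", "chr2", 100, "+", "chr2", 400, "-"),
]
print(mark_duplicates(pairs))  # prints ['r2']
```

Note that r3 is not flagged: its own position matches r1, but its mate maps elsewhere, so the pair as a whole is distinct.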

Chart values: 89.9 % (38 101 941 reads) and 10.1 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (5M to 35M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
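A per-chromosome tally of this kind can be sketched as below. Each record pairs the chromosome a read is assigned to with whether the read is flagged as unmapped; the record layout and the sample data are hypothetical.

```python
from collections import defaultdict

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chrom, is_unmapped) records.

    An unmapped read still counts towards the chromosome it is assigned
    to, e.g. the chromosome of its mapped mate.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, is_unmapped in records:
        counts[chrom][1 if is_unmapped else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records: chromosome 1 has three mapped reads and one unmapped.
records = [("1", False), ("1", False), ("1", False), ("1", True),
           ("X", False), ("X", True)]
rates = per_chromosome_mapped_rate(records)
for chrom, rate in rates.items():
    print(chrom, f"{rate:.0%}")
```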

[Chart: stacked columns of mapped vs unmapped reads per chromosome. X-axis labels as extracted: 1 to 21, X, Y, M; y-axis: 0% to 100%. The 24 columns show mapped rates between 99.69% and 99.77% and unmapped rates between 0.23% and 0.31%.]