European Genome-Phenome Archive

File Quality

File Information: EGAF00002467225

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
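As a rough illustration of how such a distribution can be derived, the sketch below counts how many positions share each coverage value and pools everything above a cap into one overflow bin, similar to the ">1000" bin on the chart. The function names, the cap handling, and the input depths are assumptions made for this example; they are not taken from the archive's own pipeline.

```python
from collections import Counter
import math


def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single overflow bin keyed
    by cap + 1 (an illustrative stand-in for the '>1000' bin).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))


def log10_counts(hist):
    """Log-transform the per-bin counts, as a log-scaled axis would."""
    return {coverage: math.log10(n) for coverage, n in hist.items()}


# Hypothetical per-position depths, for illustration only.
depths = [1, 1, 1, 2, 2, 3, 1500, 2000]
hist = coverage_distribution(depths)
print(hist)  # {1: 3, 2: 2, 3: 1, 1001: 2}
```

In practice the per-position depths would come from an alignment file rather than a hand-written list.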

[Chart: base coverage distribution. X-axis: coverage value (100 to >1000). Y-axis: # bases (log scale, 10k to 200M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error has occurred. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
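The relationship between score and error probability can be made concrete with a short sketch. It uses the standard Phred definition, Q = -10 log10(P), and assumes the common Phred+33 ASCII encoding for quality strings; the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for a given error probability (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_phred33(quality_string):
    """Decode a quality string, assuming Phred+33 ASCII encoding."""
    return [ord(ch) - 33 for ch in quality_string]


print(phred_to_error_prob(40))  # roughly 1 error in 10,000 calls
print(decode_phred33("I5!"))    # [40, 20, 0]
```

A score of 40 therefore corresponds to an error probability of about 0.0001, and a score of 20 to about 0.01.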

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 4G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but for more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
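The calculation is simple, but the parentheses matter: the denominator is the total number of reads. A minimal sketch follows; the mapped count is the one reported for this file, while the unmapped count is a hypothetical value chosen only so the example runs.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# 38_190_441 is the mapped count shown in this report;
# 76_500 is a hypothetical unmapped count used for illustration.
rate = mapped_rate(38_190_441, 76_500)
print(f"{rate:.1%}")  # 99.8%
```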

Mapped: 99.8 % (38 190 441 reads). Unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; the expected distance is assessed separately in the Proper Pairs chart.

Both mates mapped: 99.7 % (38 127 464 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
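The distinction between unmapped reads, singletons, and reads with both mates mapped can be read from the SAM FLAG field. The sketch below uses three standard flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The category names are illustrative, and secondary or supplementary alignments are ignored for simplicity, so this is not necessarily how the report's own figures were computed.

```python
from collections import Counter

PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify(flag):
    """Assign a read to a mapping category based on its SAM flag."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


def summarise(flags):
    """Count reads per category."""
    return Counter(classify(f) for f in flags)


# Hypothetical flags: two mapped mates, one singleton, its unmapped mate.
print(summarise([99, 147, 73, 133]))
```

Here flags 99 and 147 describe a pair with both mates mapped, 73 is a mapped read whose mate is unmapped, and 133 is that unmapped mate.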

Singletons: 0.2 % (62 977 reads). Other: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (19 128 114 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with a separation that differs markedly from this expectation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
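The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. The sketch below works through it and adds a simple distance check; the tolerance value and function names are assumptions for illustration. Real aligners decide proper-pair status themselves, typically from the observed fragment-size distribution and the read orientation, and record it in the SAM flag (bit 0x2).

```python
def inner_distance(fragment_length, read_length):
    """Gap between the two mates when both are read from the fragment ends."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed_fragment_length, expected=500, tolerance=150):
    """Illustrative check only: is the observed fragment length near the expected one?

    `tolerance` is a hypothetical value, not a setting from any aligner.
    """
    return abs(observed_fragment_length - expected) <= tolerance


print(inner_distance(500, 100))        # 300
print(within_expected_distance(520))   # True
print(within_expected_distance(5000))  # False
```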

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.7 % (37 750 338 reads). Other: 1.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
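The coordinate-based idea described above can be sketched as follows: pairs are grouped by the positions of both mates, and every pair after the first in a group is reported as a duplicate. This is a deliberately simplified illustration with a hypothetical record layout; production duplicate-marking tools take more information into account.

```python
def find_duplicates(pairs):
    """Return names of read pairs whose coordinates repeat an earlier pair.

    `pairs` is a list of (name, chrom, pos, mate_chrom, mate_pos) tuples,
    a hypothetical layout used only for this example.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)  # same coordinates as an earlier pair
        else:
            seen.add(key)
    return duplicates


pairs = [
    ("r1", "1", 1000, "1", 1400),
    ("r2", "1", 2000, "1", 2400),
    ("r3", "1", 1000, "1", 1400),  # same coordinates as r1
]
print(find_duplicates(pairs))  # ['r3']
```

The sketch also shows why very deep sequencing inflates the rate: two independent fragments that happen to share both coordinates are indistinguishable from a true PCR duplicate under this rule.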

Duplicates: 74 % (28 293 021 reads). Other: 26 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (5M to 35M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
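Because an unmapped read can still carry a chromosome name, a per-chromosome tally only needs the reference name and the unmapped flag bit (0x4) of each record. The sketch below assumes records are given as (reference name, flag) pairs and skips reads with no reference name ("*"); the function name and input layout are illustrative.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: this read is unmapped


def per_chromosome_mapped_fraction(records):
    """Fraction of mapped reads per chromosome.

    `records` is an iterable of (rname, flag) pairs. Reads whose rname
    is "*" have no chromosome assigned and are left out of the tally.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# Hypothetical records: flag 133 is an unmapped read placed at its mate's position.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("Y", 0), ("*", 77)]
print(per_chromosome_mapped_fraction(records))  # {'1': 0.75, 'Y': 1.0}
```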

[Chart: mapped vs unmapped reads per chromosome (stacked columns). X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped fractions range from 99.59% to 99.85%; unmapped fractions range from 0.15% to 0.41%.]