European Genome-Phenome Archive

File Quality

File Information: EGAF00003612748

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered, and the y-axis shows the number of bases with that coverage.

The data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
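As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value. The function name, the input format (a plain list of per-position depths) and the `cap` parameter are assumptions made for this example; they are not taken from the EGA pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value.

    `depths` is a hypothetical list with one coverage value per reference
    position. Values above `cap` are pooled under the key `cap + 1`,
    mirroring a final ">1000" bin on the chart.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(hist)


if __name__ == "__main__":
    example = [0, 0, 1, 3, 3, 3, 1500]
    for value, n_bases in sorted(coverage_histogram(example).items()):
        print(value, n_bases)
```

Plotting the resulting counts with a logarithmic y-axis would give a curve of the kind described above.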

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: number of bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect a distribution skewed towards larger (more confident) scores, typically around 40.
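The relationship between a Phred score and an error probability can be checked with a few lines of Python. This is a minimal sketch using the standard Phred definition, Q = -10 × log10(P); the function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10,000, and a score of 20 to 1 in 100.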

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0 to 1.2G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
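The calculation can be written out as a small function. This is an illustrative sketch only; the function name and the example counts are hypothetical and not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    rate = mapped_rate(mapped=998, unmapped=2)
    print(f"{rate:.1%}")
```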

Mapped: 99.8 % (13 478 818 reads). Unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected distance is what the Proper Pairs chart assesses.

Both mates mapped: 99.6 % (13 453 976 reads). Other: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
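If the reads are stored in SAM/BAM format, these categories can be read from each alignment's FLAG field. The sketch below uses the standard SAM flag bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). Whether EGA derives its statistics this way is an assumption; the function name and category labels are invented for this example.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify_read(flag):
    """Classify one alignment record by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate did not map
    return "both_mates_mapped"


if __name__ == "__main__":
    for flag in (99, 73, 133):
        print(flag, classify_read(flag))
```

Counting the "singleton" and "both_mates_mapped" labels over all records and dividing by the total number of reads would give rates comparable to those charted here.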

Singletons: 0.2 % (24 842 reads). Other: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (6 752 357 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion or deletion, the reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
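The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. The sketch below spells this out and adds a simple distance check; the function names and the `tolerance` parameter are hypothetical, and real aligners decide proper pairing with their own, more elaborate criteria (including read orientation).

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length


def has_expected_separation(observed_gap, fragment_length, read_length, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is an illustrative parameter, not a value from the source.
    """
    expected = expected_inner_distance(fragment_length, read_length)
    return abs(observed_gap - expected) <= tolerance


if __name__ == "__main__":
    print(expected_inner_distance(500, 100))
    print(has_expected_separation(320, 500, 100, tolerance=50))
    print(has_expected_separation(5000, 500, 100, tolerance=50))
```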

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.7 % (13 333 606 reads). Other: 1.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
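The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: each pair is reduced to a hypothetical `(chromosome, position, mate_position)` tuple, the first pair seen at a given set of coordinates is kept, and later ones are flagged. Production duplicate markers consider more than this, for example read orientation.

```python
def flag_duplicates(pairs):
    """Return one boolean per pair: True if an earlier pair has the same key.

    `pairs` is a list of (chromosome, position, mate_position) tuples,
    a simplified stand-in for real alignment records.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates (0.0 for empty input)."""
    flags = flag_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    example = [
        ("1", 100, 400),
        ("1", 100, 400),  # same coordinates as the first pair
        ("1", 100, 450),  # same start, different mate position
        ("2", 100, 400),  # different chromosome
        ("1", 100, 400),  # third copy of the first pair
    ]
    print(flag_duplicates(example))
    print(duplicate_rate(example))
```

The sketch also shows why deep sequencing inflates the rate: two independent fragments that happen to share coordinates are indistinguishable from true PCR duplicates under this test.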

Duplicates: 55.5 % (7 495 979 reads). Other: 44.5 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (1M to 12M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (x-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%). Mapped: 99.77% to 99.83% per chromosome; unmapped: 0.17% to 0.23%.]