European Genome-Phenome Archive

File Quality

File Information: EGAF00004855907

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered, and the y-axis shows the number of bases with that coverage.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: number of bases (log scale, 1k to 100M).]
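
As a rough illustration of how a distribution like this could be tabulated, the sketch below counts how many positions share each coverage value and folds everything above a cap into a single ">1000" bin, mirroring the last tick of the chart. The function name, the input list of per-position depths and the `cap` parameter are assumptions made for this example and are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_distribution(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # Hypothetical binning rule: one overflow bin labelled ">cap".
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Hypothetical per-position depths for a tiny reference.
example = coverage_distribution([0, 1, 1, 2, 2, 2, 1500])
```

Plotting the resulting counts on a log-scaled y-axis would give a curve of the same kind as the one described above.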

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
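
To make the scale concrete, the following minimal sketch converts between a Phred score and an error probability using the standard Phred definition Q = -10 × log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)     # 1 error in 10 000 calls
q30 = error_prob_to_phred(0.001)  # 1 error in 1 000 calls
```

Under this definition a score of 40 corresponds to an error probability of 0.0001, and each additional 10 points divides the error probability by ten.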

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 20G).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.7 % (192 171 045 reads). Unmapped: 0.3 %.
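
The rate calculation described above can be sketched as follows. The function name and the read counts in the example are hypothetical; the page does not report the absolute number of unmapped reads.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that were mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total

# Hypothetical counts chosen only to illustrate the formula.
rate = mapped_rate(mapped=997, unmapped=3)
```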

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates map, but not at the expected distance, are also counted here, unlike in the Proper Pairs chart.

Both mates mapped: 99.5 % (191 764 760 reads). Other reads: 0.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. The sample is unlikely to contain many such structural variants, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (406 285 reads). Other reads: 99.8 %.
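
The counts reported on this page are consistent with mapped reads being the sum of singletons and reads with both mates mapped. The short check below assumes that the three figures are read counts for the same file; the variable names are illustrative.

```python
# Counts as reported in the Mapped Reads, Both Mates Mapped and Singletons charts.
mapped_reads = 192_171_045
both_mates_mapped = 191_764_760
singletons = 406_285

# Mapped reads = reads whose mate is also mapped + reads whose mate is unmapped.
total = both_mates_mapped + singletons
```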

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (96 328 735 reads). Other reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the reads will map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 98.2 % (189 137 016 reads). Other reads: 1.8 %.
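
The arithmetic behind the ~300 base pair figure above is the fragment length minus the two read lengths. The sketch below shows that calculation together with a deliberately simplified distance check. The `tolerance` parameter and both function names are assumptions for illustration; real aligners typically estimate the expected distance from the data and also check read orientation.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced bases between two mates read inward from both fragment ends."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_gap, fragment_len, read_len, tolerance=100):
    """Simplified distance test; `tolerance` is a hypothetical threshold."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance

gap = expected_gap(fragment_len=500, read_len=100)
```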

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 4.6 % (8 780 683 reads). Other reads: 95.4 %.
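
The coordinate-based approach described above can be sketched in a few lines. This is a simplified illustration, not the algorithm of any particular tool: the tuple layout and function name are assumptions, and production duplicate markers apply further rules, for example when deciding which copy to keep.

```python
def mark_duplicates(reads):
    """Flag reads whose own and mate coordinates repeat an earlier read.

    reads: iterable of (name, chrom, pos, strand, mate_chrom, mate_pos).
    Returns the set of read names flagged as duplicates; the first read
    seen for each coordinate key is kept as the representative.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, strand, mate_chrom, mate_pos in reads:
        key = (chrom, pos, strand, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical reads: r2 repeats r1 exactly, r3 differs in its mate position.
reads = [
    ("r1", "1", 100, "+", "1", 400),
    ("r2", "1", 100, "+", "1", 400),
    ("r3", "1", 100, "+", "1", 450),
]
dups = mark_duplicates(reads)
```

The example also shows why very deep sequencing inflates the rate: two independent fragments that happen to share both coordinates would be flagged in exactly the same way as a true PCR duplicate.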

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (20M to 160M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample comes from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a position can also arise when alignment is performed with bwa, because it concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome  Mapped   Unmapped
1           99.78%   0.22%
2           99.77%   0.23%
3           99.78%   0.22%
4           99.78%   0.22%
5           99.78%   0.22%
6           99.78%   0.22%
7           99.78%   0.22%
8           99.78%   0.22%
9           99.78%   0.22%
10          99.78%   0.22%
11          99.78%   0.22%
12          99.78%   0.22%
13          99.77%   0.23%
14          99.78%   0.22%
15          99.78%   0.22%
16          99.81%   0.19%
17          99.79%   0.21%
18          99.77%   0.23%
19          99.81%   0.19%
20          99.77%   0.23%
21          99.78%   0.22%
X           99.78%   0.22%
Y           99.87%   0.13%
M           99.7%    0.3%
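
A per-chromosome breakdown of this kind could be computed along the following lines. The sketch assumes alignment records reduced to (reference name, SAM flag) pairs and uses the SAM flag bit 0x4, which marks a read as unmapped; the function name and the example records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped

def per_chromosome_rates(records):
    """records: iterable of (rname, flag).

    Returns {rname: (mapped_fraction, unmapped_fraction)}. An unmapped read
    is counted under the chromosome recorded for it, which for a pair with
    one aligned mate is the chromosome of that mate.
    """
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for rname, flag in records:
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        rname: (mapped / (mapped + unmapped), unmapped / (mapped + unmapped))
        for rname, (mapped, unmapped) in counts.items()
    }

# Hypothetical records: flags 99, 147 and 73 are mapped reads, 133 is unmapped.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_rates(records)
```

Reads from pairs in which neither mate aligned carry no chromosome and would need to be excluded or reported separately before applying a calculation like this.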