European Genome-Phenome Archive

File Quality

File Information: EGAF00006164760

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The data are plotted on a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]
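
As a rough illustration of how a distribution like this can be tabulated, the sketch below counts how many positions have each coverage value and collapses everything above a cap into a single ">1000" bin, mirroring the last tick on the chart's x-axis. The function name `coverage_histogram`, the `cap` parameter, and the toy depth list are assumptions made for illustration; they are not taken from the EGA pipeline.

```python
from collections import Counter
import math

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for d in depths:
        key = d if d <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Toy per-position depths (hypothetical data, not taken from this report).
depths = [0, 0, 1, 1, 1, 30, 30, 31, 1500, 2000]
hist = coverage_histogram(depths)

# A log10 transform compresses the wide range of counts, as a log-scale axis does.
log_counts = {k: math.log10(v) for k, v in hist.items()}
```

Capping the x-axis keeps a handful of extremely deep positions from stretching the plot, at the cost of hiding how deep they actually are.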

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 40G).]
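
To make the scale concrete, here is a minimal sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10·log10(P). The function names are chosen here for illustration only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

# A score of 40 means roughly a 1 in 10,000 chance that the call is wrong.
p40 = phred_to_error_prob(40)
```

Each additional 10 points on the scale corresponds to a tenfold drop in the error probability, so Q=30 is one error in 1,000 and Q=20 is one error in 100.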

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (323 098 608 reads). Unmapped: 0.2 %.
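
The calculation described above can be sketched as follows. The function name and the example counts are hypothetical and chosen only to show the arithmetic; they are not the counts behind this report.

```python
def mapping_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts: 998 mapped reads and 2 unmapped reads.
rate = mapping_rate(998, 2)
```

Note that the denominator is the sum of both counts; reading the formula as mapped/mapped + unmapped without the parentheses would give a different, meaningless value.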

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included here; that criterion is covered by the Proper Pairs chart.

Both mates mapped: 99.6 % (322 720 210 reads). Remainder: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (378 398 reads). Remainder: 99.9 %.
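
In SAM/BAM files, the distinction between singletons and pairs with both mates mapped is usually read from the FLAG field. The sketch below assumes the standard SAM flag bits (0x1 read paired, 0x4 read unmapped, 0x8 mate unmapped); the `classify` function and its category labels are hypothetical names used for illustration, and the report's own counting method may differ.

```python
PAIRED = 0x1         # read is part of a pair
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Assign a read to a mapping category based on its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate is unmapped
    return "both_mates_mapped"

# Example FLAG values: 99 is a mapped read with a mapped mate,
# 73 is a mapped read whose mate is unmapped, 133 is the unmapped mate.
labels = [classify(f) for f in (99, 73, 133)]
```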

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (161 946 267 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads will map with an unexpected separation; this is a signal for the variant, and the reads would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.

Proper pairs: 97.8 % (316 896 944 reads). Remainder: 2.2 %.
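
The arithmetic behind the expected separation can be sketched as below: the gap between the two reads is the fragment length minus the two read lengths. The `tolerance` parameter is a hypothetical simplification; real aligners typically estimate the acceptable range from the observed insert size distribution rather than from a fixed threshold.

```python
def expected_gap(fragment_len, read_len):
    """Bases between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

def looks_proper(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bases of the expectation."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance

# 500 bp fragments with 100 bp reads from each end leave a 300 bp gap.
gap = expected_gap(500, 100)
```

A pair whose mates land several kilobases apart would fail this check, which is the kind of signal the text above associates with a structural variant.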

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 15.9 % (51 388 589 reads). Remainder: 84.1 %.
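
The coordinate-matching idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm of any particular tool: the tuple layout and the function name `find_duplicates` are assumptions, and it keeps the first pair seen at each position, whereas production tools also take details such as read orientation into account.

```python
def find_duplicates(pairs):
    """Return names of pairs whose read and mate coordinates were seen before.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos).
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)  # same coordinates for read and mate
        else:
            seen.add(key)
    return duplicates

# Toy pairs (hypothetical): "b" repeats the coordinates of "a".
pairs = [
    ("a", "1", 100, "1", 400),
    ("b", "1", 100, "1", 400),
    ("c", "1", 100, "1", 450),
    ("d", "2", 100, "2", 400),
]
dups = find_duplicates(pairs)
duplicate_rate = len(dups) / len(pairs)
```

Because the key is built only from coordinates, two independent fragments that happen to share start and end positions are indistinguishable from a true PCR duplicate, which is why the rate becomes less informative at very high depth.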

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (20M to 280M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it will be given the right chromosome and position, but it will also be classified as unmapped.

Chromosome  Mapped   Unmapped
1           99.88%   0.12%
2           99.86%   0.14%
3           99.89%   0.11%
4           99.89%   0.11%
5           99.89%   0.11%
6           99.89%   0.11%
7           99.89%   0.11%
8           99.89%   0.11%
9           99.88%   0.12%
10          99.88%   0.12%
11          99.89%   0.11%
12          99.89%   0.11%
13          99.9%    0.1%
14          99.89%   0.11%
15          99.88%   0.12%
16          99.88%   0.12%
17          99.87%   0.13%
18          99.89%   0.11%
19          99.86%   0.14%
20          99.87%   0.13%
21          99.88%   0.12%
X           99.89%   0.11%
Y           99.65%   0.35%
M           99.64%   0.36%
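
A per-chromosome breakdown of this kind can be sketched as follows, assuming each record carries a reference name and a SAM FLAG value, and that an unmapped read with a mapped mate carries its mate's reference name as described above. The record layout and the function name are hypothetical; only the 0x4 "read unmapped" flag bit comes from the SAM format.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def mapped_fraction_by_chrom(records):
    """Return {chromosome: fraction of its reads that are mapped}.

    `records` is an iterable of (chrom, flag) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        index = 1 if flag & READ_UNMAPPED else 0
        counts[chrom][index] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Toy records (hypothetical): the read with flag 133 is unmapped but is
# listed under chromosome "1" because its mate (flag 73) mapped there.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
fractions = mapped_fraction_by_chrom(records)
```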