European Genome-Phenome Archive

File Quality

File Information: EGAF00003613914

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The number of bases is shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
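As an illustration of how such a distribution can be built, the sketch below counts how many positions have each coverage value and pools everything above a cap into one overflow bin, similar to the ">1000" bin on the chart. The function name, the `cap` parameter and the input depths are hypothetical; this is not the archive's own implementation.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin, keyed as cap + 1.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    # Sort by coverage value so the result reads like the chart's x-axis.
    return dict(sorted(hist.items()))


# Hypothetical per-position depths for a tiny reference.
example_depths = [1, 1, 1, 2, 2, 5, 1200, 3000]
print(coverage_histogram(example_depths))
```

Plotting the resulting counts with a logarithmic y-axis gives the shape described above.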

[Chart: base coverage distribution. X-axis: coverage value (1 to >1000). Y-axis: number of bases, log scale (100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error has occurred. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
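A small sketch of the conversion between Phred scores and error probabilities, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Each step of 10 in the score divides the error probability by 10.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one wrong call in 10 000.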

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0 to 1.3G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
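The rate calculation can be sketched as follows. Note that the denominator is the sum of mapped and unmapped reads. The function name and the read counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 998 mapped reads and 2 unmapped reads.
rate = mapped_rate(998, 2)
print(f"{rate:.1%}")
```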

Mapped: 99.8 % (13 526 490 reads). Unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; such pairs are excluded from the Proper Pairs chart.

Both mates mapped: 99.7 % (13 510 902 reads). Remainder: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
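The distinction between both-mates-mapped reads, singletons and unmapped reads can be sketched as below. This assumes the alignments are stored in SAM/BAM format, where FLAG bit 0x1 marks a paired read, 0x4 an unmapped read and 0x8 an unmapped mate. The page does not state how the archive computes these categories, so the function name and category labels are illustrative.

```python
from collections import Counter

# SAM FLAG bits (from the SAM format specification).
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify(flag):
    """Classify one read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"


# Hypothetical FLAG values: 99 and 147 form a fully mapped pair,
# 73 is a mapped read whose mate (133) is unmapped.
flags = [99, 147, 73, 133]
print(Counter(classify(f) for f in flags))
```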

Singletons: 0.1 % (15 588 reads). Remainder: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (6 774 231 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads spanning it map at an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
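The arithmetic in the example above can be sketched as follows. The expected gap between the two reads is the fragment length minus both read lengths. The `tolerance` parameter is a hypothetical stand-in for whatever distance criterion an aligner applies (in practice aligners usually estimate this from the data), and read orientation is not checked here.

```python
def expected_gap(fragment_len, read_len):
    """Expected distance between the two mates of one fragment."""
    return fragment_len - 2 * read_len


def at_expected_distance(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance


# Values from the text: ~500 bp fragments with 100 bp reads at each end.
print(expected_gap(500, 100))

# Hypothetical observed gaps with a hypothetical tolerance of 150 bp.
print(at_expected_distance(320, 500, 100, tolerance=150))
print(at_expected_distance(5000, 500, 100, tolerance=150))
```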

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.2 % (13 304 028 reads). Remainder: 1.8 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
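The coordinate-based approach described above can be sketched as follows: a read pair is flagged as a duplicate when an earlier pair has the same coordinates for both the read and its mate. This is a deliberately simplified illustration; the tuple layout and function name are hypothetical, and real duplicate-marking tools apply further criteria when deciding which copy to keep.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates were already seen.

    Each pair is a tuple:
    (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    The first pair seen at a given set of coordinates is kept;
    later pairs with identical coordinates are flagged as duplicates.
    """
    seen = set()
    is_duplicate = []
    for pair in pairs:
        is_duplicate.append(pair in seen)
        seen.add(pair)
    return is_duplicate


# Hypothetical pairs: the first and third share all coordinates.
pairs = [
    ("1", 1000, "+", "1", 1300, "-"),
    ("1", 1000, "+", "1", 1350, "-"),
    ("1", 1000, "+", "1", 1300, "-"),
]
flags = mark_duplicates(pairs)
print(flags, sum(flags) / len(flags))
```

The second pair starts at the same position as the first but its mate maps elsewhere, so it is not flagged.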

Duplicates: 50.1 % (6 789 954 reads). Remainder: 49.9 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (up to 12M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given a chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome   Mapped    Unmapped
1            99.89%    0.11%
2            99.87%    0.13%
3            99.88%    0.12%
4            99.87%    0.13%
5            99.89%    0.11%
6            99.89%    0.11%
7            99.88%    0.12%
8            99.89%    0.11%
9            99.88%    0.12%
10           99.89%    0.11%
11           99.89%    0.11%
12           99.88%    0.12%
13           99.89%    0.11%
14           99.88%    0.12%
15           99.87%    0.13%
16           99.89%    0.11%
17           99.88%    0.12%
18           99.89%    0.11%
19           99.9%     0.1%
20           99.89%    0.11%
21           99.87%    0.13%
X            99.89%    0.11%
Y            99.54%    0.46%
M            99.8%     0.2%