European Genome-Phenome Archive

File Quality

File Information: EGAF00001767676

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
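As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value. The function name `coverage_distribution` and the input list of per-position depths are hypothetical; the source does not say how the archive computes the chart.

```python
from collections import Counter

def coverage_distribution(depths):
    """Return sorted (coverage value, number of bases) pairs.

    `depths` is a hypothetical list holding the read depth observed
    at each position of the reference.
    """
    counts = Counter(depths)
    return sorted(counts.items())

# Toy input: seven positions with depths between 0 and 5.
example_depths = [0, 0, 1, 1, 1, 2, 5]
for coverage, n_bases in coverage_distribution(example_depths):
    print(coverage, n_bases)
```

Plotting the second column on a log axis would give a chart of the kind described here.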

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases (log scale, 1 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
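The relationship between score and error probability can be sketched in a few lines. This uses the standard Phred definition, Q = -10 × log10(P); the function names are illustrative only.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

Under this definition, a score of 40 corresponds to roughly one wrong call in 10,000 bases, and a score of 20 to one in 100.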

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 16G).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but a read carrying more significant variation can fail to be placed. The mapping rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
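The calculation described above amounts to a single division. The sketch below spells it out; the function name and the example counts are hypothetical and are not taken from this file.

```python
def mapping_rate(mapped: int, unmapped: int) -> float:
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapping_rate(mapped=990, unmapped=10)
print(f"{rate:.1%}")
```

The parentheses matter: without them, mapped/mapped+unmapped would be read as 1 + unmapped.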

Mapped: 99.9 % (146 415 540 reads). Unmapped: 0.1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included here; the expected-distance criterion is the subject of the Proper Pairs chart.

Both mates mapped: 99.9 % (146 351 174 reads). Remainder: 0.1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
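For alignments stored in SAM/BAM format, the distinction between singletons and pairs with both mates mapped can be read from the FLAG field. The sketch below assumes that format (the source does not state how the archive derives these counts) and uses the FLAG bits defined in the SAM specification; the function name `classify` is illustrative.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag: int) -> str:
    """Classify a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate is unmapped
    return "both_mates_mapped"

for flag in (99, 73, 133):
    print(flag, classify(flag))
```

Flag 73 (paired, mate unmapped, first in pair) is a typical singleton, while flag 133 is the unmapped mate of such a read.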

Singletons: 0 % (64 366 reads). Remainder: 100 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (73 253 967 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates map with a separation that differs markedly from the expected one; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
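The arithmetic in the example above (500 bp fragments, 100 bp reads, ~300 bp separation) can be written out as a short sketch. The function names and the `tolerance` parameter are hypothetical; real aligners estimate the expected distance from the data and apply their own criteria.

```python
def expected_separation(fragment_len: int, read_len: int) -> int:
    """Unsequenced gap between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len

def within_expected(observed: int, expected: int, tolerance: int) -> bool:
    """True if the observed separation is close to the expected one.

    `tolerance` is an illustrative parameter, not a value from the source.
    """
    return abs(observed - expected) <= tolerance

gap = expected_separation(fragment_len=500, read_len=100)
print(gap)
print(within_expected(observed=320, expected=gap, tolerance=100))
print(within_expected(observed=5000, expected=gap, tolerance=100))
```

A negative result from `expected_separation` would mean the two reads overlap in the middle of the fragment.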

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).

Proper pairs: 97.1 % (142 289 984 reads). Remainder: 2.9 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
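The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the method used by any particular tool: the record layout and the function name `find_duplicates` are hypothetical, and production duplicate markers take further information into account.

```python
def find_duplicates(pairs):
    """Return names of read pairs whose coordinates were already seen.

    Each item in `pairs` is a hypothetical tuple:
    (name, chrom, pos, mate_chrom, mate_pos).
    The first pair seen at a set of coordinates is kept; later pairs
    at the same coordinates are reported as duplicates.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates

example_pairs = [
    ("pair1", "1", 1000, "1", 1300),
    ("pair2", "1", 1000, "1", 1300),  # same coordinates as pair1
    ("pair3", "1", 1000, "1", 1450),  # same start, different mate position
]
print(find_duplicates(example_pairs))
```

Because the key is purely positional, two independent fragments that happen to share coordinates would also be reported, which is the false-positive effect at high depth mentioned above.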

Duplicates: 19.2 % (28 087 119 reads). Non-duplicates: 80.8 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (up to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
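A per-chromosome tally of this kind could be sketched as below, assuming SAM-style records in which an unmapped read still carries a reference name and is marked with FLAG bit 0x4 (as defined in the SAM specification). The record layout and the function name are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_mapped_fraction(records):
    """Return {chromosome: fraction of its reads that are mapped}.

    Each item in `records` is a hypothetical (reference name, FLAG) tuple.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }

example_records = [
    ("1", 99), ("1", 147),  # a pair with both mates mapped
    ("1", 73), ("1", 133),  # a singleton and its unmapped mate
    ("X", 99),
]
print(per_chromosome_mapped_fraction(example_records))
```

In the toy input, the unmapped mate (flag 133) is counted against chromosome 1 because it inherits its mate's reference name.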

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Mapped fractions range from 99.59% to 100%; unmapped fractions range from 0% to 0.41%.]