European Genome-Phenome Archive

File Quality

File Information: EGAF00000488558

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale so that the wide range of counts stays readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.
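As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value and prints the log10 of each count. The per-position depths are made-up values, and `coverage_histogram` is a hypothetical helper, not part of any EGA tooling.

```python
from collections import Counter
import math

def coverage_histogram(depths):
    """Map each coverage value to the number of positions with that coverage."""
    return Counter(depths)

# Hypothetical per-position depths (illustrative only, not taken from this file).
depths = [0, 0, 0, 0, 1, 1, 1, 2, 2, 5]
hist = coverage_histogram(depths)

for coverage in sorted(hist):
    n_bases = hist[coverage]
    # log10 compresses the wide range of counts, as a log-scaled axis does.
    print(coverage, n_bases, round(math.log10(n_bases), 2))
```

With these toy depths, the low coverage values have the largest counts, which is the shape the chart is expected to show.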

[Chart: base coverage distribution. X-axis: coverage value (0 to 550). Y-axis: number of bases (log scale, 1 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
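To make the scale concrete, here is a minimal sketch that converts between Phred scores and error probabilities using the standard Phred definition Q = -10 log10(P). The function names are illustrative, not taken from any particular library.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

# Each step of 10 in Q divides the error probability by 10.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10 000 base calls.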

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0M to 80M).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
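The ratio mapped / (mapped + unmapped) can be sketched as follows, assuming each read is represented by its integer SAM flag, where bit 0x4 marks an unmapped read. `mapped_rate` is a hypothetical helper, the example flags are invented, and this is not necessarily how the report itself is computed.

```python
UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for a sequence of SAM flag values."""
    flags = list(flags)
    mapped = sum(1 for f in flags if not f & UNMAPPED)
    # Raises ZeroDivisionError on empty input; callers should guard for that.
    return mapped / len(flags)

# Hypothetical flags: three mapped reads and one unmapped read.
example_flags = [99, 147, 0, 4]
print(f"{mapped_rate(example_flags):.1%}")
```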

Mapped reads: 97.6 % (5 749 492 reads). Unmapped: 2.4 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart excludes them.

Both mates mapped: 96.5 % (5 680 142 reads). Remainder: 3.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
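The distinction between a singleton and a read whose mate is also mapped can be read off the flag bits defined by the SAM format. The following is a minimal sketch, assuming alignments are available as integer SAM flags; `classify` is a hypothetical helper and the example flags are illustrative.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def classify(flag):
    """Classify one read by the mapping state of the read and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"

# Illustrative flags: a mapped read with a mapped mate, a singleton,
# and the singleton's unmapped mate.
for flag in (99, 73, 133):
    print(flag, classify(flag))
```

In this report the two mapped categories add up to the mapped-read count: 5 680 142 + 69 350 = 5 749 492.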

Singletons: 1.2 % (69 350 reads). Remainder: 98.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them should map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (2 944 200 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from each end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant, and they would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%; even if the mapping rate itself is low as a result of bacterial contamination, for example.
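The ~300 base pair figure in the example above is simple arithmetic: the fragment length minus the two read lengths. The sketch below shows that calculation together with a deliberately simplified distance check. The tolerance is an arbitrary illustrative value and both function names are hypothetical; real aligners apply their own criteria for flagging proper pairs.

```python
def expected_gap(fragment_len, read_len):
    """Bases left unsequenced between two mates read from each end of a fragment."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_len, expected_len, tolerance):
    """True if an observed fragment length is within +/- tolerance of expectation."""
    return abs(observed_len - expected_len) <= tolerance

print(expected_gap(500, 100))                    # 500 - 2 * 100
print(within_expected_distance(520, 500, 100))   # close to the expected length
print(within_expected_distance(5000, 500, 100))  # far from the expected length
```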

Proper pairs: 90.7 % (5 339 286 reads). Remainder: 9.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it also becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.
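The coordinate-based idea described above can be sketched as follows. This is a simplified illustration that keys only on chromosome, read position and mate position; it is not the algorithm of any specific tool, and real duplicate-marking tools typically consider more information, such as strand. `mark_duplicates` and the example pairs are hypothetical.

```python
def mark_duplicates(pairs):
    """Return indices of pairs that repeat an earlier (chrom, pos, mate_pos) key."""
    seen = set()
    duplicates = []
    for i, (chrom, pos, mate_pos) in enumerate(pairs):
        key = (chrom, pos, mate_pos)
        if key in seen:
            duplicates.append(i)  # same coordinates as an earlier pair
        else:
            seen.add(key)
    return duplicates

# Hypothetical read pairs as (chromosome, read position, mate position).
pairs = [
    ("1", 100, 350),
    ("1", 100, 350),  # same coordinates as the first pair
    ("1", 100, 360),  # same start, different mate position
    ("2", 100, 350),  # different chromosome
    ("1", 100, 350),  # same coordinates as the first pair again
]
dups = mark_duplicates(pairs)
print(dups, len(dups) / len(pairs))
```

Note that this check cannot tell a true PCR duplicate from two independent fragments that happen to share coordinates, which is why the rate becomes harder to interpret at high depth.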

Duplicates: 9.9 % (580 835 reads). Remainder: 90.1 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (0.5M to 4M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads may still be placed on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
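A per-chromosome tally of this kind can be sketched as below, assuming each record is a (reference name, SAM flag) tuple and that an unmapped mate carries the reference name of its mapped mate, as described above. `per_chromosome_counts` and the example records are hypothetical.

```python
from collections import Counter

UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped

def per_chromosome_counts(records):
    """Count mapped and unmapped reads per reference name.

    records: iterable of (rname, flag) tuples.
    """
    counts = {}
    for rname, flag in records:
        bucket = counts.setdefault(rname, Counter())
        bucket["unmapped" if flag & UNMAPPED else "mapped"] += 1
    return counts

# Hypothetical records: a fully mapped pair on chromosome 1, and on X a
# singleton (flag 73) plus its unmapped mate (flag 133) placed with it.
records = [("1", 99), ("1", 147), ("X", 73), ("X", 133)]
counts = per_chromosome_counts(records)

for rname, bucket in counts.items():
    total = bucket["mapped"] + bucket["unmapped"]
    print(rname, f"{bucket['mapped'] / total:.0%} mapped")
```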

[Chart: mapped vs unmapped reads per chromosome, stacked columns. X-axis: chromosome (labels 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Every column shows 100% mapped and 0% unmapped.]