European Genome-Phenome Archive

File Quality

File Information: EGAF00001307748

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Data is shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
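As a rough illustration of how such a distribution could be built, the sketch below counts how many positions have each coverage value and puts everything above a cap into one overflow bin, like the ">1000" bin on the chart. The function name, the `cap` parameter and the sample depths are hypothetical; this is not the archive's actual pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one overflow bin."""
    hist = Counter()
    for depth in depths:
        # cap + 1 is used as the key of the overflow (">cap") bin
        hist[min(depth, cap + 1)] += 1
    return hist

# Made-up per-position depths, for illustration only
depths = [0, 1, 1, 2, 5, 1200, 3000]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis would then give the shape described above.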

[Chart: Base Coverage Distribution. X-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
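For reference, the standard Phred definition is Q = -10 log10(P). The short sketch below converts in both directions; the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of about 1 in 10,000.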

[Chart: Base Quality. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 8G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
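The calculation described above amounts to the following minimal sketch. The function name and the read counts are made up for illustration.

```python
def mapping_rate(mapped, unmapped):
    """Proportion of mapped reads out of all reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, not taken from this file
rate = mapping_rate(mapped=950, unmapped=50)
print(f"{rate:.1%}")
```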

[Chart: Mapped Reads. 100 % mapped (85 459 280 reads), 0 % unmapped.]

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are still counted here; the Proper Pairs chart excludes them.

[Chart: Both Mates Mapped. 99.9 % with both mates mapped (85 432 558 reads), 0.1 % otherwise.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
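In SAM/BAM files, the pairing status of a read can be read from its FLAG field. The sketch below uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) to separate singletons from reads with both mates mapped. That this report derives its numbers from these flags is an assumption, and the function name and labels are illustrative only.

```python
# Standard SAM FLAG bits
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def pair_status(flag):
    """Classify a read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate is unmapped
    return "both_mapped"

for flag in (99, 73, 133):
    print(flag, pair_status(flag))
```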

[Chart: Singletons. 0 % singletons (26 722 reads), 100 % otherwise.]

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
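One way to compute such a fraction from SAM FLAG values is sketched below, using the standard bits 0x4 (read unmapped) and 0x10 (read mapped to the reverse strand). The function name and the sample flags are hypothetical, and the report's own calculation may differ.

```python
UNMAPPED = 0x4   # standard SAM bit: read unmapped
REVERSE = 0x10   # standard SAM bit: read mapped to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Made-up FLAG values: two forward reads, two reverse reads, one unmapped read
flags = [99, 147, 83, 163, 4]
print(forward_fraction(flags))
```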

[Chart: Forward Strand. 50 % forward (42 746 564 reads), 50 % otherwise.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from each end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates map at an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
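The arithmetic behind the ~500 base pair example is simply the fragment length minus the two read lengths. A minimal sketch, with a hypothetical function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when a read is sequenced from each end of a fragment."""
    return fragment_length - 2 * read_length

# The example from the text: ~500 bp fragments, 100 bp reads
print(expected_inner_distance(500, 100))
```

In practice fragment lengths vary, so aligners accept a range of separations around the expected value rather than a single number.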

[Chart: Proper Pairs. 97.6 % proper pairs (83 455 748 reads), 2.4 % otherwise.]

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
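The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration, not the method of any particular duplicate-marking tool: real tools also take strand, clipping and base qualities into account. The record layout and all names are hypothetical.

```python
def find_duplicates(pairs):
    """Return names of pairs whose read and mate coordinates were already seen.

    Each pair is (name, chrom, pos, mate_chrom, mate_pos).
    The first pair seen at a given set of coordinates is kept as the original.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates

# Made-up pairs: r2 shares both coordinates with r1, r3 differs in its mate position
pairs = [
    ("r1", "1", 1000, "1", 1300),
    ("r2", "1", 1000, "1", 1300),
    ("r3", "1", 1000, "1", 1350),
    ("r4", "2", 500, "2", 800),
]
dups = find_duplicates(pairs)
print(dups, len(dups) / len(pairs))
```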

[Chart: Duplicates. 6 % duplicates (5 165 632 reads), 94 % otherwise.]

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
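A distribution like this can be summarised by two numbers: the share of reads with a score of 0 and the share at or above some threshold. The sketch below assumes the histogram is available as a mapping from score to read count. The threshold of 30, the function name and the counts are arbitrary choices for illustration.

```python
def mapq_summary(hist, threshold=30):
    """Return (fraction of reads with MAPQ 0, fraction with MAPQ >= threshold)."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    zero = hist.get(0, 0) / total
    high = sum(count for score, count in hist.items() if score >= threshold) / total
    return zero, high

# Hypothetical histogram: mapping quality score -> number of reads
hist = {0: 5, 10: 5, 40: 10, 60: 80}
zero, high = mapq_summary(hist)
print(f"MAPQ 0: {zero:.0%}, MAPQ >= 30: {high:.0%}")
```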

[Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (10M to 70M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
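Because an unmapped read can still carry a chromosome name, mapped and unmapped reads can be tallied per chromosome. The sketch below does this from (chromosome, FLAG) pairs, using the standard SAM bit 0x4 for unmapped reads. The function name and the records are hypothetical, and this is not necessarily how the chart was produced.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM bit: read unmapped

def per_chromosome_rates(records):
    """Return {chromosome: fraction of its reads that are mapped}."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: mapped / (mapped + unmapped)
        for chrom, (mapped, unmapped) in counts.items()
    }

# Made-up records; flag 133 is an unmapped read placed at its mate's position
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("X", 16)]
rates = per_chromosome_rates(records)
print(rates)
```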

[Chart: Mapped vs Unmapped, per chromosome (1 to 21, X, Y, M). Mapped: 99.95% to 99.98%; unmapped: 0.02% to 0.05%. Y-axis: 0% to 100%.]