European Genome-Phenome Archive

File Quality

File Information

EGAF00004839152

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. The expected shape is a high peak at the start (low coverage) followed by a descending curve.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000). Y-axis: number of bases (log scale, 1k to 100M).]
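
The distribution described above can be sketched as a simple histogram over per-position depths. This is a minimal illustration, not the archive's actual pipeline: the function name `coverage_histogram`, the `cap` parameter, and the sample depths are all hypothetical. Depths above the cap are pooled into one overflow bin, mirroring the ">1000" bin on the chart.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many bases have each coverage value.

    Depths above `cap` are pooled into a single overflow bin keyed `cap + 1`
    (an assumption made here to mimic a ">1000" bin).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist

# Hypothetical per-position depths for a tiny reference (not data from this file).
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2000]
hist = coverage_histogram(depths)

for coverage in sorted(hist):
    label = f">{1000}" if coverage > 1000 else str(coverage)
    print(label, hist[coverage])
```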

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 14G).]
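
To make the score scale concrete, the following sketch converts between Phred scores and error probabilities, assuming the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given Phred score q (P = 10^(-q/10))."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for error probability p (Q = -10 * log10(p))."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: 1 error in {round(1 / phred_to_error_prob(q))} calls")
```

Under this definition, a score of 40 corresponds to one expected error in 10 000 base calls.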

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample differs from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more substantial variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 138 652 768 reads (98.5 %). Unmapped: 1.5 %.
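
As a small worked example of the rate calculation, the sketch below computes mapped / (mapped + unmapped) as a percentage. The function name and the input counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mapped_rate(mapped, unmapped):
    """Percentage of reads that mapped: mapped / (mapped + unmapped) * 100."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return 100.0 * mapped / total

# Hypothetical counts, not taken from this file.
print(f"{mapped_rate(985, 15):.1f} %")
```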

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this figure also includes pairs whose mates do not map at the expected distance from each other; such pairs are the ones excluded from the Proper Pairs chart.

Both mates mapped: 138 352 726 reads (98.3 %). Remaining reads: 1.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 300 042 reads (0.2 %). Remaining reads: 99.8 %.
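
The mapped, both-mates-mapped and singleton categories can be sketched from the bitwise FLAG field of SAM/BAM records. The bit values below (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) are the standard SAM flag bits; the `classify` function and its category names are illustrative assumptions, not a description of how this report was generated.

```python
# Standard SAM flag bits.
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the read's mate is unmapped

def classify(flag):
    """Assign a read to one category based on its SAM flag (illustrative names)."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

# Hypothetical flags: 99 = paired, proper pair, mate reverse, first in pair;
# 73 = paired, mate unmapped, first in pair; 133 = paired, unmapped, second in pair.
for flag in (99, 73, 133):
    print(flag, classify(flag))
```

Under this scheme every mapped read of a paired-end library is either a singleton or has a mapped mate, which is consistent with the counts reported on this page: 138 352 726 + 300 042 = 138 652 768 mapped reads.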

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that library preparation generates DNA from the forward and reverse strands in equal amounts, so after mapping to the reference genome approximately 50% of the reads will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 70 349 734 reads (50 %). Remaining reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 135 541 284 reads (96.3 %). Remaining reads: 3.7 %.
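
The arithmetic in the example above (500 bp fragments, 100 bp reads, ~300 bp between the mates) can be written out as a short sketch. The function names and the `tolerance` parameter are hypothetical; in practice aligners typically estimate the acceptable range from the data rather than using a fixed tolerance, and they also check read orientation, which is not modelled here.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len

def has_expected_separation(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance

print(expected_gap(500, 100))                         # gap for the example in the text
print(has_expected_separation(320, 500, 100, 50))     # close to the expected gap
print(has_expected_separation(5000, 500, 100, 50))    # far from the expected gap
```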

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 42 400 572 reads (30.1 %). Remaining reads: 69.9 %.
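
The coordinate-based approach described above can be sketched as follows. This is a simplified illustration under stated assumptions: the tuple layout, the `find_duplicates` name, and the example pairs are hypothetical, and the first pair seen at a given set of coordinates is kept, whereas real duplicate-marking tools usually apply additional criteria (such as base quality) to decide which copy to keep.

```python
def find_duplicates(pairs):
    """Return names of read pairs whose coordinates repeat an earlier pair.

    Each pair is (name, chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Two pairs count as duplicates when everything except the name is identical.
    """
    seen = set()
    duplicates = []
    for name, *coords in pairs:
        key = tuple(coords)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical read pairs, not data from this file.
pairs = [
    ("a", "chr1", 100, "+", "chr1", 400, "-"),
    ("b", "chr1", 100, "+", "chr1", 400, "-"),  # same coordinates as "a"
    ("c", "chr1", 100, "+", "chr1", 450, "-"),  # mate position differs
    ("d", "chr2", 100, "+", "chr2", 400, "-"),  # different chromosome
]
dups = find_duplicates(pairs)
print(dups, f"{100 * len(dups) / len(pairs):.1f} %")
```

The sketch also shows why the rate depends on depth: the more pairs are sampled over the same reference, the more often two independent fragments share identical coordinates by chance.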

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero: reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is attributed to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

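[Chart: stacked columns of mapped and unmapped reads per chromosome. Y-axis: 0% to 100%. Values listed in the order of the chart's x-axis labels.]

Chromosome  Mapped   Unmapped
1           99.78%   0.22%
2           99.77%   0.23%
3           99.78%   0.22%
4           99.78%   0.22%
5           99.78%   0.22%
6           99.78%   0.22%
7           99.78%   0.22%
8           99.79%   0.21%
9           99.78%   0.22%
10          99.78%   0.22%
11          99.78%   0.22%
12          99.78%   0.22%
13          99.78%   0.22%
14          99.78%   0.22%
15          99.77%   0.23%
16          99.8%    0.2%
17          99.78%   0.22%
18          99.78%   0.22%
19          99.8%    0.2%
20          99.78%   0.22%
21          99.78%   0.22%
X           99.78%   0.22%
Y           99.85%   0.15%
M           99.76%   0.24%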