European Genome-Phenome Archive

File Quality

File Information: EGAF00008176814

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage values; the y-axis shows the number of reference positions (bases) covered at each value.

The data are shown on a log scale to compress the wide range of values. A high peak at the low-coverage end, followed by a descending curve, is expected.
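As a minimal sketch of how such a distribution could be tabulated, the Python snippet below counts how many positions share each coverage value, given a list of per-position depths. The function name `coverage_histogram` and the `cap` parameter (which mirrors the chart's final ">1000" bin) are illustrative assumptions, not part of the archive's own tooling.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # Everything above the cap is pooled into a single overflow bin (cap + 1).
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)

# Toy per-position depths (hypothetical data, not taken from this file).
example = coverage_histogram([0, 1, 1, 30, 30, 30, 2500])
```

Plotting the resulting counts with a logarithmic y-axis would give a chart of the kind described above.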

[Chart: base coverage distribution. X-axis: Coverage value (up to >1000); y-axis: # Bases (log scale, 1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case the distribution should be skewed towards larger (more confident) scores, typically around 40.
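To make the relationship between score and error probability concrete, here is a small Python sketch using the standard Phred definition Q = -10 log10(P). The function names are illustrative choices, not taken from any particular library.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, using Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (requires 0 < P <= 1)."""
    return -10 * math.log10(p)
```

Under this definition, a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.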

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 120G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated but, with more significant variation, a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
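The calculation described above can be sketched in a few lines of Python. The function name and the toy counts are hypothetical and are not derived from this file.

```python
def mapping_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts: 998 mapped reads and 2 unmapped reads.
rate = mapping_rate(998, 2)
```

Note that the denominator includes the unmapped reads, so the expression is not mapped/mapped + unmapped read left to right.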

[Chart: mapped reads. Mapped: 99.8 % (894 447 532 reads); unmapped: 0.2 %.]

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance from each other are also included here, unlike in the Proper Pairs chart.

[Chart: both mates mapped. Both mates mapped: 99.6 % (892 953 462 reads); remainder: 0.4 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
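In SAM/BAM files, the mapping state of a read and of its mate is encoded in the FLAG field. The Python sketch below classifies a read from its flag, assuming the standard SAM bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function name and the category labels are illustrative, and the archive's own statistics may be computed differently.

```python
PAIRED = 0x1         # read is part of a pair
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify_pair_state(flag):
    """Return 'unpaired', 'both_mapped', 'singleton' or 'unmapped' for a SAM flag."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate failed to map
    return "unmapped"
```

For example, flag 73 (paired, mate unmapped, first in pair) is classified as a singleton, while flag 99 is classified as both mates mapped.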

[Chart: singletons. Singletons: 0.2 % (1 494 070 reads); remainder: 99.8 %.]

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

[Chart: forward strand. Forward strand: 50 % (448 114 825 reads); remainder: 50 %.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation are a signal for this variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%; even if the mapping rate itself is low as a result of bacterial contamination, for example.
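The arithmetic behind the fragment example above can be sketched as follows. Both function names are hypothetical, and the fixed `tolerance` is a simplifying assumption made for illustration; it is not how any specific aligner decides whether a pair is proper.

```python
def expected_inner_gap(fragment_len, read_len):
    """Unsequenced gap between two mates read inwards from both fragment ends."""
    return fragment_len - 2 * read_len

def is_expected_distance(observed_fragment_len, expected_fragment_len, tolerance):
    """True if the observed fragment length is within `tolerance` of the expected one."""
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance

# A 500 bp fragment with 100 bp reads from either end leaves a 300 bp gap.
gap = expected_inner_gap(500, 100)
```

A pair whose mates map several kilobases apart would fail the distance check in this sketch, which is the kind of signal the text associates with a structural variant.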

[Chart: proper pairs. Proper pairs: 97.8 % (876 381 806 reads); remainder: 2.2 %.]

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
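The coordinate-based approach described above can be illustrated with a deliberately simplified Python sketch. The tuple layout and the function name are assumptions made for this example. Keeping the first pair seen at each coordinate key is also a simplification: it ignores strand and base quality, which real duplicate-marking tools may take into account.

```python
def mark_duplicates(pairs):
    """Return names of pairs whose read and mate coordinates were already seen.

    pairs: iterable of (name, chrom, pos, mate_chrom, mate_pos) tuples.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)  # same coordinates as an earlier pair
        else:
            seen.add(key)
    return duplicates

# Toy input (hypothetical): pair "b" shares both coordinates with pair "a".
toy_pairs = [
    ("a", "1", 1000, "1", 1300),
    ("b", "1", 1000, "1", 1300),
    ("c", "1", 1000, "1", 1450),
]
toy_duplicates = mark_duplicates(toy_pairs)
```

Pair "c" starts at the same position as "a" but its mate maps elsewhere, so it is not flagged. At high depth, truly independent fragments increasingly share both coordinates by chance, which is the over-marking effect the text describes.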

[Chart: duplicates. Duplicates: 15.1 % (135 129 207 reads); remainder: 84.9 %.]

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (100M to 800M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together, so a read that hangs off the end of one reference sequence onto the next is given a chromosome and position but is also classified as unmapped.
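A per-chromosome tally of this kind can be sketched in Python as below. It assumes each record is a `(chromosome, flag)` pair in which unmapped reads already carry their mate's chromosome, and it uses the standard SAM bit 0x4 for "read unmapped". The function name and the record layout are illustrative assumptions.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def per_chromosome_mapped_fraction(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & READ_UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}

# Toy records (hypothetical): one unmapped read placed on chromosome 1 via its mate.
toy_fractions = per_chromosome_mapped_fraction(
    [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
)
```

In the toy input, the read with flag 133 is unmapped but carries chromosome "1" from its mate, so it counts towards that chromosome's unmapped share.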

[Chart: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%). Mapped fractions range from 99.6% to 99.84% and unmapped fractions from 0.16% to 0.4%.]