European Genome-Phenome Archive

File Quality

File Information: EGAF00000644146

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of positions (bases) with that coverage.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases (log scale, 10 to 100M).]
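As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value and pools everything above a cap into one bin, mirroring the ">1000" bin on the chart. The function name `coverage_histogram`, the `cap` parameter, and the sample depths are hypothetical and are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # cap + 1 is used as the key for the pooled ">cap" bin (an assumption).
        key = depth if depth <= cap else cap + 1
        hist[key] += 1
    return hist

# Hypothetical per-position depths for a tiny 8-base reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
```

Plotting the counts in `hist` on a log-scaled y-axis would give a chart of the same shape as the one described above.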
00594396997593791796893391597892895294289989794694488188788991489988296284483087280184578276480675774180681876170273377077075570774568069368771664969269669766866761963964366259960055856358558055755456154751151349252949451447247750046548042141949349045238342840442040745844243845444244038837740034438440137739440139737237835032333035332431228229531731732331426932529626730331325623327124426226124225328125427725824322526326225825624328128026024326522124925424923423527025127122922423526023618823220924122422622224822120423620123322922022722223624022722822020123721721218517217319522121618419418417118319320617518221418218716616917218819617416916718216620116818818221717817418717816316617717416514815114914417416115015215215415713013717312113814414516013915314512513211912114613413512211411813011913013512313212212611713614313112511111813310812614013512513012712513413113114213012812012012112012211412610813011414110511511710512611610810811711696108104931079392928210585889895919576979011479849889789691859885817586948177798410783768792957596869873857971807778688469834666648386737772668394618376677367728163647460717179577478776781747265677365696278697253605547624858524050443952534452424849505050485238415545373546384640404357343244424236354539234025253833433043223243373835283732444044254246323135323934404033472537284132423224202733303735372636393643313229283135323227412935312329442535312832311617262521252826242731232435392724262023182133333523282919292524352331232224171910171720172217232318182216182427291819213131232529252924292012192620242218182322182115251616181726211818102424192217171520182517241211202015241020181614181920231517151613191623171314231391510151014916151561413152012102316168622156 056100200300400500600700800900>1000Coverage value101001k10k100k1M10M100M# Bases

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly called, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong, so Q = 40 corresponds to P = 0.0001. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but we expect it to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 50). Y-axis: number of bases (0M to 400M).]
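The relationship between a Phred score and an error probability can be checked with a few lines of Python. This is a minimal sketch using the standard Phred definition, Q = -10 log10(P); the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)

p_q40 = phred_to_error_prob(40)     # about 1 error in 10,000 calls
q_p001 = error_prob_to_phred(0.001)  # about Q30
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability.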

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.2 % (33 077 667 reads). Unmapped: 0.8 %.
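The rate calculation described above can be sketched as follows. The function name and the example counts are hypothetical; the counts are chosen only so that the result matches the order of magnitude shown in the chart.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, not taken from this file.
rate = mapped_rate(mapped=992, unmapped=8)
```

Note the parentheses around the denominator: dividing by `mapped` first and then adding `unmapped` would give a different, meaningless number.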

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included here, in contrast to the Proper Pairs chart.

Both mates mapped: 98.9 % (32 974 026 reads). Other reads: 1.1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (103 641 reads). Other reads: 99.7 %.
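In SAM/BAM files, the categories "both mates mapped" and "singleton" can be derived from the bitwise FLAG field of each read. The sketch below uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are illustrative, and whether the EGA report uses exactly this rule is an assumption.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify_read(flag):
    """Classify one read from its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "unpaired_mapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"

# FLAG 73 = paired + mate unmapped + first in pair.
kind_73 = classify_read(73)
# FLAG 99 = paired + proper pair + mate reverse + first in pair.
kind_99 = classify_read(99)
```

Counting the reads in each category and dividing by the total number of reads gives fractions of the kind shown in the two charts above.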

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (16 668 374 reads). Other reads: 50 %.
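Strand is also recorded in the SAM FLAG field: bit 0x10 is set when a read maps to the reverse strand. The following sketch computes the forward fraction over mapped reads. The function name is hypothetical, and the choice of denominator (mapped reads rather than all reads) is an assumption that may differ from the report.

```python
UNMAPPED = 0x4  # this read is unmapped
REVERSE = 0x10  # read maps to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads whose FLAG does not have the reverse bit set."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Hypothetical FLAG values: two forward, two reverse, one unmapped.
fraction = forward_fraction([0, 16, 99, 147, 4])
```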

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from each end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 98.6 % (32 886 518 reads). Other reads: 1.4 %.
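The distance arithmetic in the example above (500 - 2 × 100 = 300) can be written out as a small sketch. The function names and the fixed `tolerance` are hypothetical; real aligners typically estimate the acceptable range from the observed fragment size distribution and also check read orientation, which is ignored here.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced bases between the two mates of one fragment."""
    return fragment_len - 2 * read_len

def distance_looks_proper(observed_fragment_len, expected_fragment_len, tolerance):
    """True if the observed fragment length is within `tolerance` of the expected one."""
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance

gap = expected_gap(fragment_len=500, read_len=100)
near = distance_looks_proper(520, 500, tolerance=100)
far = distance_looks_proper(5000, 500, tolerance=100)
```

In an aligned file, the aligner's own decision is usually available directly as FLAG bit 0x2 ("each segment properly aligned").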

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 4 % (1 320 805 reads). Non-duplicates: 96 %.
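The coordinate-based approach described above can be sketched as follows: pairs with identical coordinates for both mates are grouped, and every pair beyond the first in a group counts as a duplicate. This is a simplified illustration with hypothetical names and data; production tools additionally consider details such as strand and read clipping.

```python
from collections import Counter

def duplicate_rate(pairs):
    """pairs: iterable of (chrom, pos, mate_chrom, mate_pos), one tuple per read pair."""
    counts = Counter(pairs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Keep one representative per coordinate group; the rest are duplicates.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / total

# Hypothetical pairs: three share the same coordinates, two are unique.
pairs = [("1", 100, "1", 400)] * 3 + [("1", 200, "1", 500), ("2", 50, "2", 350)]
rate = duplicate_rate(pairs)
```

Because the rule relies only on coordinates, independent fragments that happen to start and end at the same positions are also counted, which is why the rate rises with depth.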

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (2M to 26M).]
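One way to summarise such a distribution is to report the share of reads at score 0 (ambiguous placement) and the share at or above a chosen threshold. The sketch below assumes the histogram is available as a mapping from score to read count; the function name, the threshold of 30, and the counts are hypothetical.

```python
def mapq_summary(hist, threshold=30):
    """hist: dict mapping a mapping quality score to a read count."""
    total = sum(hist.values())
    if total == 0:
        return {"ambiguous": 0.0, "confident": 0.0}
    ambiguous = hist.get(0, 0) / total
    confident = sum(n for q, n in hist.items() if q >= threshold) / total
    return {"ambiguous": ambiguous, "confident": confident}

# Hypothetical histogram, not the values from this file.
summary = mapq_summary({0: 10, 20: 5, 60: 85})
```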

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given a chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped vs unmapped reads per chromosome. X-axis: chromosome (1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Mapped percentages range from 96.17% to 99.83%.]
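A per-chromosome tally of this kind can be sketched from the reference name (RNAME) and FLAG of each record. Unmapped reads that inherit their mate's chromosome are counted as unmapped for that chromosome, while reads with no chromosome at all (RNAME `*`) are skipped. The function name and the records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # this read is unmapped

def per_chromosome_mapped_rate(records):
    """records: iterable of (rname, flag). Returns {chromosome: mapped fraction}."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped] per chromosome
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned, so it cannot be placed in a column
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}

# Hypothetical records: FLAG 133 is an unmapped read placed at its mate's position.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("Y", 0), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```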