European Genome-Phenome Archive

File Quality

File Information

EGAF00000163521

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

Data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
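As an illustration of how such a distribution can be tabulated, the following minimal sketch counts how many positions share each coverage value. The function name, the input list of per-position depths, and the choice to pool everything above 1000 into one overflow bin (mirroring the ">1000" label on the chart) are assumptions made for this example, not a description of the archive's pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed by cap + 1
    (an assumption that mirrors the ">1000" label on the chart).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the resulting counts with a logarithmic y-axis gives the shape described above.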

[Chart: base coverage distribution. X-axis: coverage value (ticks from 100 to 900, then >1000); y-axis: # bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong; for example, Q = 30 corresponds to P = 0.001. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
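The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q={q}: P(error) = {phred_to_error_prob(q):.4g}")
```

Under this definition a score of 40 corresponds to roughly one expected error in 10 000 base calls.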

[Chart: base quality distribution. X-axis: Phred quality score (0 to 45); y-axis: # bases (0G to 9G).]

Mapped Reads

Number of reads in the sample that successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
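The rate calculation described above can be written out as a short sketch. The function name and the read counts are hypothetical and are used only to show the arithmetic.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total

# Hypothetical counts chosen only to illustrate the formula.
rate = mapped_rate(mapped=944, unmapped=56)
print(f"{rate:.1%}")
```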

Mapped: 94.4 % (399 553 994 reads). Unmapped: 5.6 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads that do not map at the expected distance from their mate are also included here (compare the Proper Pairs chart).

Both mates mapped: 93.5 % (395 592 854 reads). Other: 6.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
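The distinction between pairs with both mates mapped and singletons can be sketched from alignment flags. This example assumes SAM/BAM-style FLAG bits as defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the source does not say how the report derives these categories, so the function and category names are illustrative.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def pair_status(flag):
    """Classify one read by its own and its mate's mapping state."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # this read maps, its mate does not
    return "unmapped"

for flag in (99, 73, 133):
    print(flag, pair_status(flag))
```

Flag 99 describes a mapped read with a mapped mate, 73 a mapped read whose mate is unmapped, and 133 the unmapped mate itself.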

Singletons: 1 % (3 961 140 reads). Other: 99 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
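A minimal sketch of the strand fraction, assuming SAM-style FLAG values where bit 0x10 marks a read aligned to the reverse strand and bit 0x4 marks an unmapped read. Using mapped reads as the denominator is an assumption of this example; the report may compute the percentage over all reads.

```python
UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
REVERSE = 0x10   # SAM FLAG bit: read aligned to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads aligned to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: two forward reads, two reverse reads, one unmapped read.
print(forward_fraction([0, 16, 0, 16, 4]))
```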

Forward strand: 50 % (211 528 084 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from each end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
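The arithmetic in the ~500 base pair example can be checked with a small sketch. The function names and the tolerance value are hypothetical; real aligners estimate the acceptable range from the observed fragment-size distribution and also check read orientation, which this example ignores.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced bases between the two mates of one fragment."""
    return fragment_len - 2 * read_len

def distance_is_expected(observed_fragment_len, expected_fragment_len, tolerance):
    """True if the observed fragment length is within the tolerance.

    Only the distance criterion is checked; orientation is not.
    """
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance

print(expected_gap(500, 100))                 # the ~300 bp separation from the text
print(distance_is_expected(520, 500, 100))    # close to the expected size
print(distance_is_expected(5000, 500, 100))   # far larger than expected
```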

Proper pairs: 92.8 % (392 422 740 reads). Other: 7.2 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
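The coordinate-based detection described above can be sketched as follows. This is a deliberately simplified illustration: the tuple layout and function name are assumptions, the first pair seen at each position is kept, and strand and base quality are ignored, whereas production duplicate-marking tools take them into account.

```python
def find_duplicates(pairs):
    """Return the names of pairs whose coordinates repeat an earlier pair.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos).
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical read pairs: "b" repeats the coordinates of "a".
pairs = [
    ("a", "1", 100, "1", 400),
    ("b", "1", 100, "1", 400),
    ("c", "1", 100, "1", 450),
]
dups = find_duplicates(pairs)
print(sorted(dups), f"duplicate rate = {len(dups) / len(pairs):.1%}")
```

Pair "c" shares a start position with "a" but its mate maps elsewhere, so it is not flagged.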

Duplicates: 1 % (4 096 674 reads). Other: 99 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred-scaled quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.
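One way to summarise such a distribution is to report the share of reads at score 0 and the share at or above a high score. The sketch below assumes the histogram is available as a mapping from score to read count; the function name, the threshold of 60, and the counts are hypothetical.

```python
def mapq_summary(hist, high=60):
    """Summarise a {mapping quality: read count} histogram.

    Returns the fraction of reads with score 0 (ambiguous placement)
    and the fraction with a score of at least `high`.
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    ambiguous = hist.get(0, 0) / total
    confident = sum(n for q, n in hist.items() if q >= high) / total
    return {"ambiguous": ambiguous, "confident": confident}

# Hypothetical counts, for illustration only.
summary = mapq_summary({0: 10, 30: 10, 60: 80})
print(summary)
```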

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # reads (ticks from 50M to 300M).]

Mapped vs Unmapped

Stacked column chart showing mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, broken down by chromosome. Although the sequenced sample may be from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
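The per-chromosome tally can be sketched as follows, assuming each alignment record provides a reference name and a SAM-style FLAG, with "*" as the reference name of reads that have no assigned chromosome. The record layout and function name are assumptions for this example.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped

def per_chromosome_counts(records):
    """Tally mapped and unmapped reads per chromosome.

    `records` is an iterable of (reference_name, flag). An unmapped read
    placed at its mate's position carries the mate's reference name, so it
    is counted as unmapped on that chromosome.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned; not shown per chromosome
        key = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][key] += 1
    return dict(counts)

# Hypothetical records: a mapped pair on 1, a singleton and its mate on X.
records = [("1", 99), ("1", 147), ("X", 73), ("X", 133), ("*", 77)]
counts = per_chromosome_counts(records)
for chrom, c in counts.items():
    total = c["mapped"] + c["unmapped"]
    print(chrom, f"mapped {c['mapped'] / total:.0%}")
```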

[Chart: stacked columns of mapped vs unmapped reads per chromosome (x-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%). Every column shows 100% mapped and 0% unmapped.]