European Genome-Phenome Archive

File Quality

File Information

EGAF00004839196

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of bases (positions in the reference file) covered that many times.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.
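As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value and folds everything above 1000 into a single bin, as the chart's x-axis does. The function name and the input (a plain list of per-position depths) are assumptions made for illustration, not the archive's actual pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return dict(hist)

# Hypothetical per-position depths for a tiny reference.
example_depths = [1, 1, 2, 3, 3, 3, 1500]
print(coverage_histogram(example_depths))
```

Plotting the resulting counts on a logarithmic y-axis keeps both the large low-coverage counts and the small high-coverage tail visible.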

[Chart: base coverage distribution. X-axis: Coverage value (1 to >1000); y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.
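To make the scale concrete, here is a minimal sketch of the conversion between Phred scores and error probabilities, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that the call is wrong for a given Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly 1 error in 10,000 calls.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```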

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 16G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
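The calculation can be sketched as below. The function name and the example counts are hypothetical and chosen only to show the arithmetic.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total

# Hypothetical counts: 994 mapped reads and 6 unmapped reads.
print(f"{mapped_rate(994, 6):.1%}")
```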

Mapped: 99.4 % (148 678 635 reads). Unmapped: 0.6 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are still counted here; the expected-distance criterion is applied in the Proper Pairs chart.

Both mates mapped: 99.3 % (148 524 012 reads). Other reads: 0.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
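For alignments stored in SAM/BAM format, the distinction between "both mates mapped" and "singleton" can be read from the FLAG field. The sketch below uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are assumptions made for illustration.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"

# Example FLAG values: 99 (mapped pair), 73 (mapped, mate unmapped),
# 133 (unmapped read whose mate is mapped).
for flag in (99, 73, 133):
    print(flag, classify(flag))
```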

Singletons: 0.1 % (154 623 reads). Other reads: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
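In SAM/BAM terms, a mapped read is on the forward strand when the reverse-strand bit (0x10) of its FLAG is not set. A minimal sketch of the fraction, assuming the input is simply a list of FLAG values (the function name is hypothetical):

```python
UNMAPPED = 0x4   # read is unmapped
REVERSE = 0x10   # read maps to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Two forward reads (99, 163) and two reverse reads (147, 83).
print(forward_fraction([99, 147, 83, 163]))
```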

Forward strand: 50 % (74 794 317 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
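The distance arithmetic in the example above can be written out as a short sketch. The function names and the tolerance value are assumptions made for illustration; aligners usually derive the acceptable range from the observed fragment size distribution rather than a fixed tolerance.

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced distance between two mates read from either end of a fragment."""
    return fragment_length - 2 * read_length

def within_expected_distance(observed_gap, fragment_length, read_length, tolerance):
    """True if the observed gap is within `tolerance` bases of the expected gap."""
    return abs(observed_gap - expected_gap(fragment_length, read_length)) <= tolerance

# 500 bp fragments with 100 bp reads leave a gap of about 300 bp.
print(expected_gap(500, 100))
# A pair separated by 5,000 bp is far outside a hypothetical 150 bp tolerance.
print(within_expected_distance(5000, 500, 100, tolerance=150))
```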

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.2 % (145 381 590 reads). Other reads: 2.8 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
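The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: the record layout (name, chromosome, position, mate chromosome, mate position) is an assumption, and real duplicate markers also take strand and base quality into account when choosing which read to keep.

```python
def mark_duplicates(reads):
    """Return the names of reads whose coordinates and mate coordinates were already seen.

    `reads` is an iterable of (name, chrom, pos, mate_chrom, mate_pos) tuples.
    The first read at each coordinate combination is kept; later ones are flagged.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, mate_chrom, mate_pos in reads:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical reads: r2 shares both coordinates with r1, r3 differs in its mate position.
reads = [
    ("r1", "chr1", 100, "chr1", 400),
    ("r2", "chr1", 100, "chr1", 400),
    ("r3", "chr1", 100, "chr1", 450),
]
print(mark_duplicates(reads))
```

Because the key is purely positional, two independent fragments that happen to share both coordinates would also be flagged, which is the false-positive effect at high depth mentioned above.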

Duplicates: 21.7 % (32 410 121 reads). Non-duplicates: 78.3 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect to see this distribution heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
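One simple way to summarise such a distribution is to report the share of reads at score 0 and the share at or above a high threshold. The sketch below assumes the histogram is available as a mapping from score to read count; the function name, the threshold default, and the example counts are hypothetical.

```python
def mapq_summary(hist, high=60):
    """Return (fraction of reads with score 0, fraction with score >= high)."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    zero_fraction = hist.get(0, 0) / total
    high_fraction = sum(count for score, count in hist.items() if score >= high) / total
    return zero_fraction, high_fraction

# Hypothetical histogram: most reads at 60, a few ambiguous reads at 0.
zero, high = mapq_summary({0: 5, 30: 5, 60: 90})
print(f"score 0: {zero:.0%}, score >= 60: {high:.0%}")
```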

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (10M to 130M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
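A per-chromosome tally of this kind can be sketched from two SAM fields per read: the reference name (RNAME) and the FLAG, where bit 0x4 marks the read as unmapped. The function name and the (chromosome, flag) input layout are assumptions made for illustration.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        index = 1 if flag & UNMAPPED else 0
        counts[chrom][index] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}

# Hypothetical records: the read with flag 133 is unmapped but carries its
# mate's chromosome, so it is counted as unmapped on chromosome "1".
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
print(per_chromosome_mapped_rate(records))
```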

[Chart: mapped vs unmapped reads per chromosome, stacked columns. X-axis: chromosome (autosomes, X, Y, M); y-axis: percentage of reads (0% to 100%). Mapped ranges from 99.89% to 99.92% and unmapped from 0.08% to 0.11% across chromosomes.]