European Genome-Phenome Archive

File Quality

File Information

EGAF00002194614

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value (the number of times a position in the reference is covered), and the y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the start (low coverage) followed by a descending curve is expected.
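As a rough illustration of how such a distribution could be tabulated (not necessarily how this report was produced), the sketch below counts positions per coverage value from a hypothetical list of per-position depths, for example one parsed from the output of a tool such as `samtools depth`. Everything above 1000 is pooled into a single bin, mirroring the ">1000" tick on the chart. The names `coverage_distribution`, `depths`, and `cap` are assumptions made for this example.

```python
from collections import Counter


def coverage_distribution(depths, cap=1000):
    """Return {coverage value: number of positions}, sorted by coverage.

    Depths greater than `cap` are pooled into one overflow bin keyed by cap + 1.
    """
    counts = Counter(min(depth, cap + 1) for depth in depths)
    return dict(sorted(counts.items()))


# Hypothetical per-position depths for eight reference positions.
depths = [0, 1, 1, 2, 2, 2, 5, 1500]
dist = coverage_distribution(depths)

for coverage, n_bases in dist.items():
    label = f">{1000}" if coverage > 1000 else str(coverage)
    print(label, n_bases)
```

Plotting `dist` with a logarithmic y-axis would give a chart of the kind described above.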

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. because of a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
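To make the scale concrete, here is a minimal sketch of the conversion in both directions, using the standard Phred definition Q = -10 × log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and every additional 10 points divides the error probability by 10.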

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 18G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons plus reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs from the reference genome it was aligned to. Small differences from the reference are tolerated, but reads covering more significant variation can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
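The arithmetic can be sketched as follows. The first part adds the "both mates mapped" and "singleton" counts reported on this page, which together give the mapped-read count. The second part applies the rate formula to small illustrative counts, because the page does not list the unmapped count; those toy numbers and the function name `mapping_rate` are assumptions.

```python
def mapping_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads")
    return mapped / total


# Counts reported on this page.
both_mates_mapped = 163_605_760
singletons = 379_240
mapped = both_mates_mapped + singletons
print(mapped)  # 163985000

# Illustrative (hypothetical) counts: 996 mapped and 4 unmapped reads.
print(f"{mapping_rate(996, 4):.1%}")  # 99.6%
```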

Mapped: 99.6 % (163 985 000 reads); unmapped: 0.4 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart treats that case separately.

Both mates mapped: 99.4 % (163 605 760 reads); remainder: 0.6 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls inside the inserted sequence and so cannot be mapped to the reference. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
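In SAM/BAM files these categories can be read from the FLAG field. The sketch below is one way to classify a read, assuming the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are made up for this example, and the report's own counting rules may differ.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify_read(flag):
    """Classify one read by its SAM flag."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # this read mapped, its mate did not
    return "both_mapped"


# Example flags: 99 (mapped pair), 73 (mapped, mate unmapped), 133 (unmapped).
for flag in (99, 73, 133):
    print(flag, classify_read(flag))
```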

Singletons: 0.2 % (379 240 reads); remainder: 99.8 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
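A minimal sketch of this calculation, assuming strand is taken from the standard SAM flag bit 0x10 (read mapped to the reverse strand) and that unmapped reads (0x4) are excluded from the denominator. Whether the report uses exactly this denominator is an assumption, as is the function name.

```python
UNMAPPED = 0x4  # the read itself is unmapped
REVERSE = 0x10  # the read maps to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Two forward reads (99, 163), two reverse reads (147, 83), one unmapped (133).
print(forward_fraction([99, 147, 83, 163, 133]))  # 0.5
```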

Forward strand: 50 % (82 325 207 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates can map at an unexpected separation. Such a pair is a signal for the variant and is not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).
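The distance arithmetic in the example above can be written out as a short sketch. The fixed `tolerance` is a simplifying assumption: aligners generally estimate the acceptable range from the observed fragment-length distribution rather than from a hard-coded value, and they also check orientation, which is omitted here. All names are hypothetical.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both are read from the fragment ends."""
    return fragment_length - 2 * read_length


def at_expected_distance(observed_fragment_length, expected_fragment_length, tolerance):
    """True if the observed fragment length is within tolerance of the expected one."""
    return abs(observed_fragment_length - expected_fragment_length) <= tolerance


print(expected_inner_distance(500, 100))     # 300
print(at_expected_distance(520, 500, 100))   # True
print(at_expected_distance(5000, 500, 100))  # False
```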

Proper pairs: 97.8 % (161 110 422 reads); remainder: 2.2 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
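The coordinate-based idea described above can be sketched in a few lines. This is a deliberately simplified illustration with a hypothetical tuple layout `(chrom, pos, mate_chrom, mate_pos)`; production duplicate-marking tools typically also take orientation and library into account and choose which copy to keep by quality.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if an earlier pair had the same coordinates."""
    seen = set()
    is_duplicate = []
    for pair in pairs:
        is_duplicate.append(pair in seen)
        seen.add(pair)
    return is_duplicate


# Hypothetical read pairs as (chrom, pos, mate_chrom, mate_pos).
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates as the first pair
    ("1", 100, "1", 450),  # same start, different mate position
    ("2", 100, "2", 400),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)
```

The same logic also shows why the rate rises with depth: the more pairs are sampled, the more often two independent fragments share identical coordinates by chance.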

Duplicates: 7.3 % (12 050 222 reads); remainder: 92.7 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
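One way to summarise such a distribution is to ask what share of reads falls at or above a chosen score and what share sits at zero. The sketch below does this for a small hypothetical histogram; the counts and function names are illustrative and are not taken from this file.

```python
def fraction_at_least(histogram, threshold):
    """Share of reads whose mapping quality is >= threshold."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    return sum(n for q, n in histogram.items() if q >= threshold) / total


def fraction_ambiguous(histogram):
    """Share of reads with mapping quality 0 (could map equally well elsewhere)."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    return histogram.get(0, 0) / total


# Hypothetical histogram: {mapping quality: number of reads}.
histogram = {0: 5, 30: 5, 60: 90}
print(fraction_at_least(histogram, 60))  # 0.9
print(fraction_ambiguous(histogram))     # 0.05
```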

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (up to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is attributed to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome, stacked columns. X-axis: chromosome (autosomes, X, Y, M). Y-axis: 0% to 100%. Mapped fractions range from 99.66% to 99.84% per chromosome; unmapped fractions range from 0.16% to 0.34%.]