European Genome-Phenome Archive

File Quality

File Information: EGAF00002319501

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
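
As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions have each coverage value and pools everything above a cap into a single bin, similar to the ">1000" bin on the chart. The function name, the `cap` parameter and the sample depths are made up for this example and are not taken from the archive's pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin (cap + 1)."""
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Made-up per-position depths, for illustration only.
depths = [0, 1, 1, 2, 5, 5, 5, 1200, 3000]
hist = coverage_histogram(depths)
```

Plotting the counts in `hist` on a logarithmic y-axis would give a chart of the kind described above.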

[Chart: base coverage distribution. X-axis: Coverage value (100 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect a distribution skewed towards larger (more confident) scores, typically around 40.
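
To make the relationship concrete, here is a minimal sketch that converts between a Phred score and an error probability using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that the call is wrong, given Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)    # about 1 error in 10 000 calls
q30 = error_prob_to_phred(0.001) # about 30
```

Under this definition, a score of 40 corresponds to an error probability of about 0.0001, and each step of 10 in the score changes the error probability by a factor of ten.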

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 14G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated but, with more significant variation, a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
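
The calculation can be sketched as follows. The function name and the read counts are hypothetical and serve only to show the arithmetic.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=996, unmapped=4)
```

With these made-up counts, 996 of 1000 reads are mapped, so the rate is 0.996, i.e. 99.6%.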

Mapped: 99.6 % (141 349 645 reads). Unmapped: 0.4 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here, whereas the Proper Pairs chart excludes them.

Both mates mapped: 99.2 % (140 898 736 reads). Other: 0.8 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
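
In SAM/BAM files, the information needed to tell singletons from fully mapped pairs is carried in the FLAG field. The sketch below uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function name and the category labels are assumptions for this example, and the archive's own pipeline may classify reads differently.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def pair_status(flag):
    """Classify one read by its SAM FLAG (labels are illustrative)."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"          # this read mapped, its mate did not
    return "both_mates_mapped"

statuses = [pair_status(f) for f in (99, 73, 133, 0)]
```

Here flag 73 (paired, mate unmapped, first in pair) is a singleton, while flag 133 marks its unmapped mate.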

Singletons: 0.3 % (450 909 reads). Other: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (70 983 752 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
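
The expected separation in the example above (500 base pair fragments, 100 base pair reads) follows from simple arithmetic, sketched below. The function names and the `tolerance` parameter are hypothetical. Real aligners typically estimate the acceptable range from the observed insert size distribution rather than from a fixed tolerance.

```python
def inner_distance(fragment_length, read_length):
    """Gap between the two reads of a pair: fragment minus both reads."""
    return fragment_length - 2 * read_length

def has_expected_separation(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    return abs(observed - expected) <= tolerance

expected = inner_distance(fragment_length=500, read_length=100)  # 300
ok = has_expected_separation(320, expected, tolerance=50)
too_far = has_expected_separation(5000, expected, tolerance=50)
```

A pair separated by 320 base pairs passes this illustrative check, while one separated by 5000 base pairs does not and would hint at a structural variant.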

Proper pairs: 98 % (139 119 484 reads). Other: 2 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
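
The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the method of any particular tool: the tuple layout and function name are assumptions, and it keeps the first read seen for each key, whereas real duplicate-marking tools usually apply additional criteria to choose which read to keep.

```python
def mark_duplicates(reads):
    """Return names of reads sharing coordinates with an earlier read.

    Each read is (name, chrom, pos, strand, mate_chrom, mate_pos);
    this layout is hypothetical and chosen for the example.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, strand, mate_chrom, mate_pos in reads:
        key = (chrom, pos, strand, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)   # same read and mate coordinates as before
        else:
            seen.add(key)
    return duplicates

# Made-up reads, for illustration only.
reads = [
    ("r1", "1", 1000, "+", "1", 1300),
    ("r2", "1", 1000, "+", "1", 1350),  # same start, different mate position
    ("r3", "1", 1000, "+", "1", 1300),  # identical to r1: flagged
]
dups = mark_duplicates(reads)
duplicate_rate = len(dups) / len(reads)
```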

Duplicates: 8.4 % (11 904 860 reads). Other: 91.6 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. Unmapped reads can also arise when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
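
A per-chromosome tally along these lines can be sketched as below, assuming each record is reduced to its reference name and SAM FLAG. The record layout and function name are assumptions for this example. The unmapped bit (0x4) is the standard SAM flag, and "*" is the standard SAM value for a read with no reference name.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: read unmapped

def per_chromosome_mapped_rate(records):
    """Map each chromosome to its fraction of mapped reads.

    `records` yields (rname, flag) pairs; reads with rname "*" have no
    chromosome assigned and are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}

# Made-up records, for illustration only.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```

In this made-up input, the read with flag 133 is unmapped but carries its mate's chromosome, so it counts against chromosome 1.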

[Chart: stacked columns of mapped and unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped fractions range from 99.6% to 99.79%; unmapped fractions range from 0.21% to 0.4%.]