European Genome-Phenome Archive

File Quality

File Information

EGAF00004921728

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis represents the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis represents the number of bases with that coverage.

The number of bases is shown on a log scale to compress the large range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]
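The kind of histogram behind such a chart can be sketched as follows. This is a minimal illustration, not the archive's own pipeline: the function names, the toy `depths` list and the `cap` parameter (mirroring the ">1000" bin on the chart) are assumptions.

```python
import math
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Coverage values above `cap` are pooled into a single bin keyed `cap + 1`,
    standing in for the ">1000" bin on the chart (assumed behaviour).
    """
    hist = Counter(min(depth, cap + 1) for depth in depths)
    return dict(sorted(hist.items()))


def log10_counts(hist):
    """Log-transform the counts, as a log-scale y-axis would display them."""
    return {coverage: math.log10(count) for coverage, count in hist.items()}


# Toy per-position depths (hypothetical data).
depths = [0, 0, 0, 1, 1, 2, 5, 5, 1500]
hist = coverage_histogram(depths)
print(hist)
```

In practice the per-position depths would come from a depth-reporting tool run on the alignment file; here they are hard-coded to keep the sketch self-contained.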

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 18G). Non-zero bars, in order of increasing score: 658 793; 1 045 006 476; 1 464 768 306; 18 409 835 755 bases.]
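The relationship between a Phred score and an error probability can be checked with a short sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

A score of 40 therefore corresponds to roughly one expected error in 10 000 base calls, and a score of 20 to one in 100.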

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped reads: 99.7 % (138 133 113 reads). Remainder: 0.3 %.
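The rate calculation described above can be written out explicitly to make the grouping of terms unambiguous. This is a minimal sketch; the function name and the toy counts are assumptions, not values from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Toy counts (hypothetical): 997 mapped reads and 3 unmapped reads.
print(f"{mapped_rate(997, 3):.1%}")  # prints "99.7%"
```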

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; the Proper Pairs chart is restricted to pairs that do.

Both mates mapped: 99.5 % (137 888 762 reads). Remainder: 0.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (244 351 reads). Remainder: 99.8 %.
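The distinction between reads with both mates mapped and singletons can be illustrated with the SAM flag bits for pairing (0x1), read unmapped (0x4) and mate unmapped (0x8). This is a hedged sketch of one way to classify reads; it is an assumption that the report's statistics are derived in this way, and the function name and category labels are illustrative.

```python
# SAM flag bits, as defined in the SAM format specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag):
    """Classify a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"


# Flag 99: paired, mapped, mate mapped. Flag 73: paired, mapped, mate unmapped.
# Flag 133: paired, read itself unmapped.
for flag in (99, 73, 133):
    print(flag, classify_read(flag))
```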

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so that after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (69 272 415 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates around it map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 97.3 % (134 850 950 reads). Remainder: 2.7 %.
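The separation arithmetic in the example above (a ~500 base pair fragment with 100 base pair reads at each end) can be written as a small sketch. The function names and the `tolerance` parameter are hypothetical; real aligners estimate the acceptable range from the observed insert size distribution rather than from a fixed tolerance.

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced bases between the two mates of one fragment."""
    return fragment_length - 2 * read_length


def within_expected_gap(observed_gap, fragment_length, read_length, tolerance):
    """True if the observed gap is within `tolerance` bases of the expected gap."""
    return abs(observed_gap - expected_gap(fragment_length, read_length)) <= tolerance


print(expected_gap(500, 100))  # prints 300
print(within_expected_gap(310, 500, 100, tolerance=50))   # prints True
print(within_expected_gap(5000, 500, 100, tolerance=50))  # prints False
```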

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As a result, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 16.3 % (22 530 285 reads). Remainder: 83.7 %.
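The coordinate-based detection described above can be sketched in a few lines. This is a deliberately simplified illustration, not the method used by any particular tool: production duplicate markers also account for strand, clipping and base qualities when choosing which copy to keep. The tuple layout and function names are assumptions.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates were already seen.

    Each pair is a tuple (chrom, pos, mate_chrom, mate_pos). The first pair
    at a given set of coordinates is kept; later ones are flagged True.
    """
    seen = set()
    flags = []
    for pair in pairs:
        flags.append(pair in seen)
        seen.add(pair)
    return flags


def duplicate_rate(flags):
    """Fraction of pairs flagged as duplicates."""
    return sum(flags) / len(flags) if flags else 0.0


# Toy pairs (hypothetical): the second one repeats the first.
pairs = [("1", 1000, "1", 1400), ("1", 1000, "1", 1400), ("2", 500, "2", 910)]
flags = mark_duplicates(pairs)
print(flags)  # prints [False, True, False]
```

Because the key is built only from coordinates, two independent fragments that happen to start and end at the same positions are also flagged, which is the false-positive effect at high depth noted above.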

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is given the same chromosome and POS as its mate. An unmapped read can also carry a position when the alignment was performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

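[Chart: mapped vs unmapped reads per chromosome. X-axis labels as extracted: 1 to 21, X, Y, M. Y-axis: percentage of reads (0% to 100%).]

Mapped, in axis order: 99.83%, 99.77%, 99.84%, 99.84%, 99.84%, 99.84%, 99.83%, 99.83%, 99.82%, 99.82%, 99.82%, 99.83%, 99.84%, 99.83%, 99.82%, 99.82%, 99.81%, 99.83%, 99.79%, 99.81%, 99.82%, 99.83%, 99.85%, 99.55%

Unmapped, in axis order: 0.17%, 0.23%, 0.16%, 0.16%, 0.16%, 0.16%, 0.17%, 0.17%, 0.18%, 0.18%, 0.18%, 0.17%, 0.16%, 0.17%, 0.18%, 0.18%, 0.19%, 0.17%, 0.21%, 0.19%, 0.18%, 0.17%, 0.15%, 0.45%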
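A per-chromosome breakdown of this kind can be sketched by tallying reads by their chromosome field and the SAM "read unmapped" flag bit (0x4). This is an illustration under assumptions: the `(chromosome, flag)` record layout and the function name are hypothetical, and it relies on unmapped reads carrying their mate's chromosome as described above.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def per_chromosome_rates(records):
    """Return {chromosome: (mapped_fraction, unmapped_fraction)}.

    `records` is an iterable of (chromosome, flag) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    rates = {}
    for chrom, (mapped, unmapped) in counts.items():
        total = mapped + unmapped
        rates[chrom] = (mapped / total, unmapped / total)
    return rates


# Toy records (hypothetical): three mapped reads and one unmapped read on chromosome 1.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133)]
print(per_chromosome_rates(records))  # prints {'1': (0.75, 0.25)}
```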