European Genome-Phenome Archive

File Quality

File Information: EGAF00004856845

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the start (low coverage) followed by a descending curve is expected.

[Chart: number of bases (# Bases, log scale from 1k to 100M) by coverage value (up to >1000).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: number of bases (# Bases, 0G to 12G) by Phred quality score (0 to 40).]
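As a worked illustration of the Phred scale, the sketch below converts between a score Q and an error probability P using the standard definition Q = -10·log10(P). The function names are illustrative and are not part of any EGA tooling.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Each step of 10 in Q divides the error probability by 10.
    for q in (10, 20, 30, 40):
        print(f"Q={q}: P={phred_to_error_prob(q):.4g}")
```

On this scale a score of 40 corresponds to an error probability of 1 in 10 000, which is why a distribution concentrated near 40 indicates confident base calls.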

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted but, for more significant variation, the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.1 % (116 185 100 reads). Unmapped: 0.9 %.
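The rate calculation can be sketched as follows. The mapped count is built from the two read counts reported in the Both Mates Mapped and Singletons charts on this page. The unmapped count is not reported here, so the value used below is a hypothetical figure chosen only to illustrate the arithmetic.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Read counts reported on this page.
both_mates_mapped = 116_125_684
singletons = 59_416
mapped = both_mates_mapped + singletons  # total mapped reads

# Hypothetical unmapped count (assumption, not reported on the page).
unmapped = 1_055_000

if __name__ == "__main__":
    print(f"mapped reads: {mapped}")
    print(f"mapped rate:  {mapped_rate(mapped, unmapped):.1%}")
```

Note that the sum of the two reported counts equals the 116 185 100 mapped reads shown in this chart.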

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected-distance criterion is applied only in the Proper Pairs chart.

Both mates mapped: 99.1 % (116 125 684 reads). Remainder: 0.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (59 416 reads). Remainder: 99.9 %.
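The categories above can be illustrated with the FLAG field of a SAM/BAM alignment record. This is a minimal sketch that assumes the statistics are derived from the standard SAM flag bits; the page does not state how EGA computes them, and the `classify` function and its category names are illustrative.

```python
# Flag bits defined by the SAM format specification.
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def classify(flag: int) -> str:
    """Classify one alignment record by its own and its mate's mapping status."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        # Mapped read whose mate failed to map.
        return "singleton"
    return "unmapped"


if __name__ == "__main__":
    for flag in (99, 73, 133):
        print(flag, classify(flag))
```

Flag 99 (paired, proper pair, mate on the reverse strand, first in pair) is a read with both mates mapped, flag 73 (paired, mate unmapped, first in pair) is a singleton, and flag 133 (paired, unmapped, second in pair) is the unmapped mate of such a pair.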

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so that after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (58 618 459 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates map with a separation that differs markedly from the expected one. Such a separation is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.

Proper pairs: 97.1 % (113 799 974 reads). Remainder: 2.9 %.
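The distance arithmetic in the description can be written out as a short sketch. The separation between the mates is the fragment length minus the two read lengths. The `tolerance` parameter and the function names are hypothetical; real aligners estimate the acceptable range from the observed fragment-size distribution and also check read orientation.

```python
def expected_separation(fragment_len: int, read_len: int) -> int:
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len


def separation_is_expected(observed: int, expected: int, tolerance: int) -> bool:
    """True if the observed gap is within +/- tolerance of the expected gap."""
    return abs(observed - expected) <= tolerance


if __name__ == "__main__":
    gap = expected_separation(fragment_len=500, read_len=100)
    print(f"expected separation: {gap} bp")
    # Hypothetical observed gaps, checked with a hypothetical 150 bp tolerance.
    for observed in (280, 320, 5_000):
        ok = separation_is_expected(observed, gap, tolerance=150)
        print(f"observed {observed} bp: within expected range = {ok}")
```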

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate, and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 13.1 % (15 340 917 reads). Remainder: 86.9 %.
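The coordinate-matching idea described above can be sketched as follows. This is a simplified illustration, not the algorithm used to produce this report: production tools typically also take strand and orientation into account and choose which copy to keep by base quality. The `Pair` type and the example records are hypothetical.

```python
from typing import Iterable, NamedTuple, Set


class Pair(NamedTuple):
    name: str
    chrom: str
    pos: int         # mapping position of the read
    mate_chrom: str
    mate_pos: int    # mapping position of its mate


def mark_duplicates(pairs: Iterable[Pair]) -> Set[str]:
    """Return names of pairs whose read and mate coordinates were already seen."""
    seen = set()
    duplicates = set()
    for p in pairs:
        key = (p.chrom, p.pos, p.mate_chrom, p.mate_pos)
        if key in seen:
            duplicates.add(p.name)  # every copy after the first is a duplicate
        else:
            seen.add(key)
    return duplicates


# Hypothetical example records.
example = [
    Pair("a", "1", 100, "1", 400),
    Pair("b", "1", 100, "1", 400),  # same coordinates as "a"
    Pair("c", "1", 100, "1", 450),  # same start, different mate position
    Pair("d", "2", 100, "2", 400),  # different chromosome
]

if __name__ == "__main__":
    dups = mark_duplicates(example)
    print(sorted(dups), f"duplicate rate: {len(dups) / len(example):.0%}")
```

Because the key depends only on coordinates, two independent fragments that happen to start and end at the same positions are also flagged, which is the false-positive effect at high depth that the description mentions.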

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: number of reads (# Reads, 10M to 100M) by Phred mapping quality score (0 to 60).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: stacked columns of mapped and unmapped reads per chromosome (x-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%). The mapped fraction ranges from 99.82% to 99.96% and the unmapped fraction from 0.04% to 0.18%.]