European Genome-Phenome Archive

File Quality

File Information: EGAF00002240166

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale to compress its wide range. A high peak at the beginning (low coverage) followed by a descending curve is expected.
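As a rough illustration of how such a distribution can be tallied, the sketch below counts how many bases have each coverage value and pools everything above a cap into a single ">1000" bin, as the chart's last bin does. The function name `coverage_histogram` and the sample depths are hypothetical and are not taken from this file.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count bases per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist


# Hypothetical per-base depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 30, 1500]
hist = coverage_histogram(depths)
for value, n_bases in hist.items():
    print(value, n_bases)
```

Plotting `n_bases` on a log axis then gives a chart of the kind described above.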

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that the base call is a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
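The relationship between score and error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def phred_to_prob(q):
    """Convert a Phred score to the corresponding error probability."""
    return 10 ** (-q / 10)


# A score of 40 corresponds to roughly 1 error in 10,000 base calls.
print(phred_to_prob(40))
# An error probability of 1 in 1,000 corresponds to a score of about 30.
print(prob_to_phred(0.001))
```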

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 11G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation the read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
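The rate calculation can be written out as a small sketch. The read counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads")
    return mapped / total


# Hypothetical counts: 997 mapped reads and 3 unmapped reads.
rate = mapped_rate(997, 3)
print(f"{100 * rate:.1f} %")
```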

Mapped: 99.7 % (113 738 953 reads). Unmapped: 0.3 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance, which the Proper Pairs chart excludes.

Both mates mapped: 99.4 % (113 459 208 reads). Remainder: 0.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
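In SAM/BAM files these categories can be derived from the FLAG field, whose bits 0x1 (read paired), 0x4 (read unmapped) and 0x8 (mate unmapped) are defined in the SAM specification. The following is a minimal sketch of such a classification. The category labels are hypothetical, and this is not necessarily how the statistics on this page were computed.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify(flag):
    """Classify a read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"
    if mate_mapped:
        return "unmapped_mate_of_singleton"
    return "both_unmapped"


# Example FLAG values: 99 = 0x1 + 0x2 + 0x20 + 0x40, 73 = 0x1 + 0x8 + 0x40.
for flag in (99, 73, 133, 77):
    print(flag, classify(flag))
```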

Singletons: 0.2 % (279 745 reads). Remainder: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
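Strand is recorded in the SAM FLAG field: bit 0x10 marks a read aligned to the reverse strand, so a mapped read without that bit is on the forward strand. A minimal sketch of the fraction, assuming a plain list of FLAG values as input (the function name and sample flags are illustrative):

```python
UNMAPPED = 0x4  # the read is unmapped
REVERSE = 0x10  # the read is aligned to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Hypothetical flags: two forward reads, two reverse reads, one unmapped read.
print(forward_fraction([99, 147, 0, 16, 4]))
```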

Forward strand: 50 % (57 048 681 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from each end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads may map with a much larger or smaller separation than expected; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
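The expected separation in the fragment example above is simply the fragment length minus the two read lengths. A small sketch of that arithmetic, with an illustrative function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both read lengths.

    A negative result means the two reads overlap.
    """
    return fragment_length - 2 * read_length


# The example from the text: ~500 bp fragments, 100 bp reads from each end.
print(expected_inner_distance(500, 100))  # 300
```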

Proper pairs: 98.4 % (112 291 758 reads). Remainder: 1.6 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it becomes increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
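The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration: the record layout (dictionaries with `chrom`, `pos`, `strand`, `mate_chrom` and `mate_pos` keys) is an assumption made for the example, and real duplicate-marking tools take more information into account.

```python
def mark_duplicates(reads):
    """Return one boolean per read: True if an earlier read shares its key.

    The key is the read's own coordinate and strand plus its mate's coordinate.
    """
    seen = set()
    is_duplicate = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"],
               read["mate_chrom"], read["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


# Hypothetical reads: the second has the same coordinates as the first.
reads = [
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 420},
]
flags = mark_duplicates(reads)
print(flags, sum(flags) / len(flags))
```

Note that the third read starts at the same position as the first but is not marked, because its mate maps elsewhere.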

Duplicates: 3.1 % (3 577 926 reads). Remainder: 96.9 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
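One way to summarise such a distribution is to report the share of reads with a score of 0 and the share at or above some cut-off. The sketch below assumes the histogram is available as a mapping from score to read count. The cut-off of 30 and the sample counts are arbitrary choices for illustration, not values used by this report.

```python
def mapq_summary(hist, threshold=30):
    """Return (fraction with MAPQ 0, fraction with MAPQ >= threshold)."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    zero = hist.get(0, 0) / total
    high = sum(n for q, n in hist.items() if q >= threshold) / total
    return zero, high


# Hypothetical histogram: score -> number of reads.
hist = {0: 5, 20: 5, 60: 90}
zero, high = mapq_summary(hist)
print(f"MAPQ 0: {zero:.2%}, MAPQ >= 30: {high:.2%}")
```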

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (10M to 100M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
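A per-chromosome tally of this kind can be sketched as below, assuming each record is a (chromosome, FLAG) pair in which an unmapped read carries its mate's chromosome, as described above. FLAG bit 0x4 (read unmapped) is from the SAM specification. The function name and sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # the read is unmapped


def per_chromosome_mapped_pct(records):
    """Return {chromosome: percentage of its reads that are mapped}."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: 100.0 * m / (m + u) for c, (m, u) in counts.items()}


# Hypothetical records; flag 133 is an unmapped read placed with its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
print(per_chromosome_mapped_pct(records))
```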

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (autosomes, X, Y, M). Y-axis: percentage of reads (0% to 100%). Mapped fractions range from 99.67% to 99.82%; unmapped fractions range from 0.18% to 0.33%.]