European Genome-Phenome Archive

File Quality

File Information: EGAF00005775083

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data is shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: number of bases (log scale, 10 to 100M).]
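
As a rough illustration of how such a distribution can be built, the sketch below counts how many positions have each coverage value and then takes the log of the counts. It assumes the per-position depths are already available as a list of integers; the function names and the cap of 1000 (chosen to mirror the ">1000" bin on the chart) are illustrative and not taken from the EGA pipeline.

```python
from collections import Counter
import math


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin (cap + 1)."""
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


def log10_counts(hist):
    """Return log10 of each bin count, ordered by coverage value."""
    return {cov: math.log10(n) for cov, n in sorted(hist.items())}


# Toy depths, not taken from this file.
depths = [0, 0, 0, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
logged = log10_counts(hist)
```

Plotting `logged` against the coverage value would give a curve of the shape described above: a high first bin followed by a descending tail.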

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0G to 1.8G).]
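
The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The helper names are illustrative, and the decoding of a quality string assumes the common Phred+33 ASCII encoding, which may not match every input file.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode an ASCII quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]


scores = decode_quality_string("II5!")
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 20 to 1 in 100.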

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.9 % (31 194 719 reads); unmapped: 0.1 %
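
A minimal sketch of the mapped / (mapped + unmapped) calculation, assuming the input is a list of SAM FLAG values, one per read. Bit 0x4 is the SAM "segment unmapped" flag; the function name and the example flags are illustrative only.

```python
UNMAPPED = 0x4  # SAM flag bit: segment unmapped


def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for a list of SAM FLAG values."""
    total = len(flags)
    if total == 0:
        return 0.0
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / total


# Toy flags: three mapped reads (99, 147, 73) and one unmapped read (133).
rate = mapped_rate([99, 147, 73, 133])
```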

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates map at an unexpected distance from each other; see the Proper Pairs chart for the distance criterion.

Both mates mapped: 99.8 % (31 170 074 reads); remainder: 0.2 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (24 645 reads); remainder: 99.9 %
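
The distinction between singletons and reads with both mates mapped can be sketched from SAM FLAG bits: 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The category labels and the function name below are illustrative choices, not the terms used by any particular tool.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify_read(flag):
    """Classify a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"


# Toy flags: 73 is a mapped read whose mate is unmapped; 99 has both mates mapped.
labels = [classify_read(f) for f in (73, 99, 133, 0)]
```

With this classification, the mapped reads are the singletons plus the reads with both mates mapped, which is consistent with the counts reported on this page (24 645 + 31 170 074 = 31 194 719).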

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (15 609 979 reads); remainder: 50 %
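
One way to compute this fraction, sketched under the assumption that strand is read from the SAM FLAG: bit 0x10 marks a read aligned to the reverse strand, so a mapped read without that bit is on the forward strand. The function name and example flags are illustrative.

```python
UNMAPPED = 0x4  # SAM flag bit: segment unmapped
REVERSE = 0x10  # SAM flag bit: aligned to the reverse strand


def forward_fraction(flags):
    """Return the fraction of mapped reads aligned to the forward strand."""
    mapped = [flag for flag in flags if not flag & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for flag in mapped if not flag & REVERSE)
    return forward / len(mapped)


# Toy flags: 99 and 163 are forward, 147 and 83 are reverse, 133 is unmapped.
fraction = forward_fraction([99, 147, 83, 163, 133])
```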

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.6 % (30 773 928 reads); remainder: 1.4 %
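
The arithmetic in the example above, and a simple distance check, can be sketched as follows. In practice the aligner decides which pairs are proper and records this in the SAM FLAG (bit 0x2); the tolerance parameter and both function names here are hypothetical and only illustrate the idea.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced gap between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len


def within_expected_distance(template_len, expected_fragment_len, tolerance):
    """Check whether an observed template length is close to the expected one.

    `template_len` may be negative (as TLEN is for the rightmost mate),
    so its absolute value is used.
    """
    return abs(abs(template_len) - expected_fragment_len) <= tolerance


gap = expected_gap(500, 100)  # 500 bp fragment, 100 bp reads
ok = within_expected_distance(-510, 500, tolerance=50)
too_far = within_expected_distance(5000, 500, tolerance=50)
```

A pair spanning a large deletion in the sample would show a template length far above the expected value, as in the last call.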

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 12.9 % (4 029 066 reads); remainder: 87.1 %
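
The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration: each read is a dictionary with hypothetical keys (`chrom`, `pos`, `mate_chrom`, `mate_pos`), the first read seen at a given key is kept, and later ones are marked as duplicates. Real duplicate-marking tools also take orientation, clipping and base quality into account.

```python
def mark_duplicates(reads):
    """Return one boolean per read: True if an earlier read had the same
    coordinates for both the read and its mate."""
    seen = set()
    is_duplicate = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["mate_chrom"], read["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


# Toy reads: the second shares all coordinates with the first.
reads = [
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1350},
]
dups = mark_duplicates(reads)
duplicate_rate = sum(dups) / len(dups)
```

The third read starts at the same position as the first but its mate does not, so it is not marked, which is why the mate coordinates are part of the key.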

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so the distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (2M to 28M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped and unmapped reads per chromosome (x-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%). Mapped: 99.9% to 99.93% per chromosome; unmapped: 0.07% to 0.1%.]
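
A minimal sketch of the per-chromosome calculation, assuming each record is a pair of the SAM RNAME and FLAG fields. Unmapped reads that carry their mate's chromosome are counted under that chromosome, as described above; reads with no chromosome (RNAME "*") are skipped. The function name and example records are illustrative.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: segment unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped / (mapped + unmapped)} from (rname, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # read has no chromosome assigned
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# Toy records: chromosome 1 has three mapped reads and one unmapped mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
```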