European Genome-Phenome Archive

File Quality

File Information: EGAF00004856996

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
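As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value, given a list of per-position depths. The function name `coverage_histogram` and the cap of 1000 (chosen to mirror the chart's ">1000" bin) are assumptions made for this example, not a description of how the archive computes the chart.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # Hypothetical binning: everything above the cap goes into ">cap".
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return dict(hist)

# Illustrative depths only, not taken from the file described here.
example_depths = [1, 1, 2, 3, 3, 3, 1500]
print(coverage_histogram(example_depths))
```

Plotting the counts on a log-scaled y-axis would then give a curve of the kind described above.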

[Chart: base coverage distribution. X-axis: coverage value (1 to >1000). Y-axis: # bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
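The relationship between score and probability can be sketched in a few lines of Python, using the standard Phred definition Q = -10 log10(P). The function names are hypothetical and chosen only for this example.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P into a Phred score Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an expected error rate of 1 in 10,000 base calls, and a score of 20 to 1 in 100.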

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 14G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
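The calculation can be written out as a small Python sketch. The function name `mapped_rate` and the counts in the usage line are illustrative assumptions, not values from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Illustrative counts only.
print(f"{mapped_rate(982, 18):.1%}")
```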

[Chart: mapped reads 98.2 % (138 177 161 reads); unmapped 1.8 %.]

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected distance is the criterion applied in the Proper Pairs chart.

[Chart: both mates mapped 98 % (137 985 592 reads); remainder 2 %.]

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
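Assuming the reads are stored in SAM/BAM format, the pair status described above can be read from the FLAG field: bit 0x1 marks a paired read, 0x4 an unmapped read, and 0x8 an unmapped mate. The sketch below classifies a single flag value; the function name and the category labels are hypothetical, and whether the archive counts reads in exactly this way is an assumption.

```python
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate is unmapped

def classify_pair_status(flag):
    """Classify a SAM FLAG value by the mapping status of the read and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate is unmapped
    return "unmapped"

for flag in (99, 73, 133):
    print(flag, classify_pair_status(flag))
```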

[Chart: singletons 0.1 % (191 569 reads); remainder 99.9 %.]

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
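For SAM/BAM data, strand is encoded in FLAG bit 0x10 (set when the read maps to the reverse strand). A minimal sketch of the fraction, assuming it is taken over mapped reads only (an assumption, since the exact denominator is not stated here), could look like this; the function name is hypothetical.

```python
UNMAPPED = 0x4   # this read is unmapped
REVERSE = 0x10   # read maps to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads whose FLAG does not have the reverse-strand bit set."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Illustrative flags only: two forward reads (0, 99) and two reverse reads (16, 147).
print(forward_fraction([0, 16, 99, 147]))
```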

[Chart: forward strand 50 % (70 380 956 reads); remainder 50 %.]

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with an unexpected separation can be a signal for the variant, and those reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
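The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. The sketch below spells this out and adds a simple distance check; both function names and the `tolerance` parameter are hypothetical, and real aligners decide what counts as "expected" from the observed insert size distribution rather than a fixed tolerance.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when each end is read for `read_length` bases."""
    return fragment_length - 2 * read_length

def separation_is_expected(observed, expected, tolerance):
    """True if the observed separation is within `tolerance` of the expected one."""
    return abs(observed - expected) <= tolerance

expected = expected_inner_distance(500, 100)
print(expected)
print(separation_is_expected(320, expected, tolerance=50))
print(separation_is_expected(5000, expected, tolerance=50))
```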

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

[Chart: proper pairs 96.6 % (135 912 918 reads); remainder 3.4 %.]

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
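The coordinate-based idea can be sketched as follows: reads are grouped by their own position and their mate's position, and every read after the first in a group is marked as a duplicate. This is a deliberately simplified illustration with hypothetical field names; production duplicate-marking tools apply additional rules, such as which read of a group to keep.

```python
def mark_duplicates(reads):
    """Return one boolean per read: True if an earlier read has the same key."""
    seen = set()
    flags = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"],
               read["mate_chrom"], read["mate_pos"])
        flags.append(key in seen)
        seen.add(key)
    return flags

def duplicate_rate(reads):
    """Fraction of reads marked as duplicates."""
    flags = mark_duplicates(reads)
    return sum(flags) / len(flags) if flags else 0.0

# Illustrative reads only; the second read repeats the first one's coordinates.
reads = [
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 420},
    {"chrom": "2", "pos": 100, "strand": "-", "mate_chrom": "2", "mate_pos": 400},
]
print(mark_duplicates(reads))
print(duplicate_rate(reads))
```

Because the key is purely positional, two independent fragments that happen to share coordinates would also be marked, which is the depth-related effect described above.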

[Chart: duplicates 25.1 % (35 370 151 reads); remainder 74.9 %.]

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
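A quick way to check the two features mentioned above (most reads at a high score, some at zero) is to summarise a list of mapping qualities. The sketch below is an assumption-laden illustration: the function name, the output keys and the default threshold of 60 are chosen for this example only.

```python
from collections import Counter

def mapq_summary(mapqs, high=60):
    """Return the fraction of reads with MAPQ 0 and with MAPQ >= `high`."""
    if not mapqs:
        raise ValueError("no mapping qualities given")
    counts = Counter(mapqs)
    total = len(mapqs)
    n_high = sum(n for q, n in counts.items() if q >= high)
    return {"fraction_mapq0": counts[0] / total,
            "fraction_high": n_high / total}

# Illustrative values only.
print(mapq_summary([0, 60, 60, 60, 30]))
```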

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read then has the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given a chromosome and position but is also classified as unmapped.
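Given that unmapped reads can carry a chromosome in this way, the per-chromosome percentages can be sketched by tallying reads by chromosome and by the unmapped bit (0x4) of the SAM FLAG field. The function name and the `(chromosome, flag)` record layout are assumptions for this example.

```python
from collections import defaultdict

UNMAPPED = 0x4  # this read is unmapped

def per_chromosome_rates(records):
    """Map each chromosome to (mapped_fraction, unmapped_fraction)."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {chrom: (m / (m + u), u / (m + u))
            for chrom, (m, u) in counts.items()}

# Illustrative records only: (chromosome, FLAG).
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
print(per_chromosome_rates(records))
```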

[Chart: mapped vs unmapped reads per chromosome (x-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%). Mapped: 99.75% to 99.87% per chromosome; unmapped: 0.13% to 0.25%.]