European Genome-Phenome Archive

File Quality

File Information: EGAF00004840732

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The number of bases is plotted on a log scale to compress its wide range. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]
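As a rough illustration of how a distribution like this can be built, the sketch below counts how many positions have each coverage value and pools everything above 1000 into a single bin, mirroring the ">1000" label on the chart. The function name, the `cap` parameter, and the toy input are assumptions made for this example; the archive's own pipeline is not described in this report.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin (cap + 1)."""
    hist = Counter()
    for depth in depths:
        bin_key = depth if depth <= cap else cap + 1
        hist[bin_key] += 1
    return dict(hist)

# Hypothetical per-position depths, not taken from this file.
depths = [0, 0, 3, 3, 3, 12, 1500]
hist = coverage_histogram(depths)
```

Here the key `1001` stands for the ">1000" bin, so a handful of extremely deep positions do not stretch the x-axis.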

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. because of a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 18G).]
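To make the scale concrete, here is a minimal sketch of the standard Phred relation, Q = -10 log10(P), and its inverse. The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)     # about 1 error in 10,000 calls
q30 = error_prob_to_phred(0.001)  # about 30
```

Each step of 10 in the score corresponds to a tenfold drop in error probability, which is why a distribution centred near 40 indicates very confident base calls.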

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99 % (158 479 229 reads). Unmapped: 1 %.
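The ratio described above can be sketched as follows. The read counts in the example are made up for illustration, since the report gives only the mapped count and the rounded percentages.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that were mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, not taken from this file.
rate = mapped_rate(mapped=990, unmapped=10)
```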

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here, in contrast to the Proper Pairs chart.

Both mates mapped: 98.9 % (158 350 406 reads). Remainder: 1.1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (128 823 reads). Remainder: 99.9 %.
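In SAM/BAM files these categories can be read from each record's FLAG field. The sketch below uses the flag bits defined in the SAM specification. It assumes, without confirmation from this report, that the charts are derived from those bits, and the category labels are invented for the example.

```python
# Flag bits from the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped

def classify(flag):
    """Assign a read to one category based on its SAM flag (labels are illustrative)."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

# Flag 99 = paired, proper pair, mate reverse, first in pair.
# Flag 73 = paired, mate unmapped, first in pair.
# Flag 133 = paired, unmapped, second in pair.
labels = [classify(f) for f in (99, 73, 133)]
```

Under this scheme the mapped reads are the singletons plus the reads with both mates mapped, which agrees with the counts in this report: 128 823 + 158 350 406 = 158 479 229.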

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (80 076 577 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with an unexpectedly large separation are a signal for the variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 97.1 % (155 437 172 reads). Remainder: 2.9 %.
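The separation quoted in the example above follows from simple arithmetic: the fragment length minus the two read lengths. A minimal sketch, with a function name chosen for this illustration:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both reads have the same length.

    A negative result means the two reads overlap.
    """
    return fragment_length - 2 * read_length

gap = expected_inner_distance(fragment_length=500, read_length=100)
```

Aligners generally estimate the expected distance from the data as a distribution, so a pair is judged against a range of plausible separations and not a single number.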

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 16 % (25 618 535 reads). Remainder: 84 %.
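The coordinate-based approach described above can be sketched in a few lines. This is a deliberately simplified illustration and not the method of any particular tool: production duplicate markers also take strand, clipping, and read quality into account when choosing which copy to keep. The dictionary keys and the sample records are hypothetical.

```python
def mark_duplicates(pairs):
    """Flag a pair as a duplicate if an earlier pair has the same coordinates for both mates."""
    seen = set()
    flags = []
    for pair in pairs:
        key = (pair["chrom"], pair["pos"], pair["mate_chrom"], pair["mate_pos"])
        flags.append(key in seen)  # True only for the second and later copies
        seen.add(key)
    return flags

# Hypothetical read pairs, not taken from this file.
pairs = [
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1350},
]
flags = mark_duplicates(pairs)
```

The third pair shares a start position with the first two but its mate maps elsewhere, so it is not flagged. This also shows why very deep data inflates the rate: independent fragments are more likely to share both coordinates by chance.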

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (20M to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as in the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (labelled 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Series: mapped, unmapped. Mapped fractions range from 99.88% to 99.94%; unmapped fractions range from 0.06% to 0.12%.]