European Genome-Phenome Archive

File Quality

File Information: EGAF00002337624

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage across the reference file. The x-axis shows the coverage value (the number of times a position in the reference file is covered), and the y-axis shows the number of bases with that coverage.

The counts are plotted on a log scale so that the wide range of values stays readable. A high peak at the start (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (1 to >1000). Y-axis: # bases (log scale, 1k to 100M).]
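
As a rough illustration of how such a distribution could be tabulated (this is not EGA's actual pipeline), the sketch below counts how many positions have each coverage value and pools everything above a cap into one overflow bin, like the chart's >1000 category. The function name `coverage_histogram`, the `cap` parameter, and the toy depths are assumptions made for illustration.

```python
from collections import Counter
import math


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist


# Toy per-position depths (hypothetical, not taken from the file above).
depths = [1, 1, 1, 2, 2, 3, 1500]
hist = coverage_histogram(depths)

for value, n_bases in hist.items():
    # log10 of the count is what a log-scaled y-axis would display.
    print(value, n_bases, round(math.log10(n_bases), 2))
```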

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 8G).]
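
To make the scale concrete, here is a minimal sketch of the conversion between Phred scores and error probabilities, using the standard Phred definition Q = -10 log10(P). The function names are illustrative, not part of any EGA tool.

```python
import math


def phred_to_error_prob(q):
    """Probability that the call is wrong for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition, a score of 40 corresponds to an error probability of 1 in 10 000, and every 10 points lowers the error probability tenfold.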

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated but, with more significant variation, a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 98.9 % (94 390 360 reads). Unmapped: 1.1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; such pairs are excluded from the Proper Pairs chart.

Both mates mapped: 98.2 % (93 786 180 reads). Remaining reads: 1.8 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.6 % (604 180 reads). Remaining reads: 99.4 %.
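
The categories described so far (mapped, both mates mapped, singleton) can be read off the FLAG field of a SAM/BAM record. The following is a minimal sketch, not the code behind this report: the bit values come from the SAM format, while the `classify` function, its labels, and the toy FLAG values are assumptions for illustration. Secondary and supplementary alignments are ignored here.

```python
from collections import Counter

# Bits from the SAM FLAG field.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the other mate is unmapped


def classify(flag):
    """Label one read as unmapped, singleton, both_mapped, or unpaired_mapped."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "unpaired_mapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"


# Toy FLAG values (hypothetical): two mates of a mapped pair, then a
# singleton and its unmapped mate.
flags = [99, 147, 73, 133]
counts = Counter(classify(f) for f in flags)

mapped = counts["both_mapped"] + counts["singleton"] + counts["unpaired_mapped"]
mapped_rate = mapped / (mapped + counts["unmapped"])
print(counts, mapped_rate)
```

The figures reported on this page are consistent with this breakdown: 93 786 180 reads with both mates mapped plus 604 180 singletons gives the 94 390 360 mapped reads.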

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may indicate problems with the library preparation step.

Forward strand: 50 % (47 733 265 reads). Remaining reads: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs (500 - 2 × 100). If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it map at an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).

Proper pairs: 93.2 % (88 940 174 reads). Remaining reads: 6.8 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads are sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads map to the same location and are marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 9.8 % (9 355 491 reads). Remaining reads: 90.2 %.
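
The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm used to produce the figure on this page: the `mark_duplicates` function, the tuple layout, and the toy pairs are all assumptions, and production duplicate markers apply more detailed rules.

```python
def mark_duplicates(pairs):
    """Flag every pair whose coordinates repeat those of an earlier pair.

    Each pair is a tuple (chrom, pos, strand, mate_chrom, mate_pos).
    """
    seen = set()
    is_dup = []
    for key in pairs:
        is_dup.append(key in seen)
        seen.add(key)
    return is_dup


pairs = [
    ("1", 1000, "+", "1", 1300),
    ("1", 1000, "+", "1", 1300),  # same coordinates as the first pair
    ("1", 1000, "+", "1", 1450),  # same start, different mate position
    ("2", 500, "-", "2", 200),
]
is_dup = mark_duplicates(pairs)
duplicate_rate = sum(is_dup) / len(is_dup)
print(is_dup, duplicate_rate)
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is why the rate becomes less informative at high depth.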

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (10M to 80M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is then given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome  Mapped   Unmapped
1           99.4%    0.6%
2           99.39%   0.61%
3           99.4%    0.6%
4           99.41%   0.59%
5           99.41%   0.59%
6           99.41%   0.59%
7           99.41%   0.59%
8           99.41%   0.59%
9           99.4%    0.6%
10          99.39%   0.61%
11          99.4%    0.6%
12          99.41%   0.59%
13          99.41%   0.59%
14          99.41%   0.59%
15          99.39%   0.61%
16          99.45%   0.55%
17          99.42%   0.58%
18          99.39%   0.61%
19          99.47%   0.53%
20          99.37%   0.63%
21          99.42%   0.58%
X           99.42%   0.58%
Y           99.52%   0.48%
M           99.3%    0.7%
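
A per-chromosome tally of this kind could be sketched as below, assuming uncompressed SAM text as input. It relies on the SAM convention that an unmapped read (FLAG bit 0x4) may still carry a reference name and position, typically those of its mapped mate. The function name and the toy records are hypothetical, and this is not the code that produced the chart.

```python
from collections import defaultdict


def per_chromosome_counts(sam_lines):
    """Count mapped and unmapped reads for each reference name (RNAME)."""
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if rname == "*":
            continue  # no chromosome assigned, cannot be placed in the chart
        status = "unmapped" if flag & 0x4 else "mapped"
        counts[rname][status] += 1
    return counts


# Toy records (hypothetical): a singleton on chromosome 1 with its unmapped
# mate placed at the same position, and a fully mapped read on chromosome X.
sam_lines = [
    "@HD\tVN:1.6",
    "r1\t73\t1\t100\t60\t4M\t=\t100\t0\tACGT\tIIII",
    "r1\t133\t1\t100\t0\t*\t=\t100\t0\tACGT\tIIII",
    "r2\t0\tX\t500\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
counts = per_chromosome_counts(sam_lines)
for chrom, c in counts.items():
    total = c["mapped"] + c["unmapped"]
    print(chrom, f"{100 * c['mapped'] / total:.2f}% mapped")
```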