European Genome-Phenome Archive

File Quality

File Information: EGAF00001843032

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
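As an illustration of what the chart summarises, here is a minimal sketch that builds a coverage histogram from per-position depths. The function name, the input list and the `cap` parameter are assumptions made for this example (the cap mirrors the ">1000" bin on the chart's axis); this is not EGA's actual implementation.

```python
from collections import Counter

def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into a single overflow bin keyed by cap + 1,
    similar to a ">1000" bin on a chart axis (assumed behaviour).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_distribution(depths)
for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis then gives a curve of the shape described above.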

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.
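The relationship between score and error probability can be checked with a short sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10,000 base calls.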

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 8G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
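The calculation is simple enough to sketch directly. The function name is an assumption for this example; the read count used below is the mapped total reported for this file.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads among all reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Counts from this report: every read is mapped.
rate = mapped_rate(450_426_171, 0)
print(f"{rate:.1%}")
```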

Mapped: 450 426 171 reads (100 %). Unmapped: 0 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as in the proper pairs chart.

Both mates mapped: 162 805 390 reads (36.1 %). Remainder: 63.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome, but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
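In SAM/BAM files these categories can be read from the FLAG field. The sketch below uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function and category names are assumptions for this example, and how this report actually derives its counts is not stated in the source.

```python
# Flag bits from the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify(flag):
    """Classify a read by its SAM FLAG into one of four illustrative categories."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "unpaired"
    if flag & MATE_UNMAPPED:
        return "singleton"      # this read mapped, its mate did not
    return "both_mapped"

for flag in (99, 73, 77, 0):
    print(flag, classify(flag))
```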

Singletons: 287 620 781 reads (63.9 %). Remainder: 36.1 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
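As a rough sketch, the forward-strand fraction can be computed from SAM FLAG values, where bit 0x10 marks a read aligned to the reverse strand and bit 0x4 marks an unmapped read (both from the SAM specification). Restricting the denominator to mapped reads is an assumption made here, as is the function name.

```python
REVERSE = 0x10   # SAM flag bit: read aligned to the reverse strand
UNMAPPED = 0x4   # SAM flag bit: read is unmapped

def forward_fraction(flags):
    """Fraction of mapped reads aligned to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: three forward reads, one reverse read, one unmapped read.
print(forward_fraction([0, 0, 0, 16, 4]))
```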

Forward strand: 369 023 476 reads (81.9 %). Remainder: 18.1 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads spanning it map with an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
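The distance arithmetic in the example above (500 bp fragments, 100 bp reads, ~300 bp between the mates) can be sketched as follows. The function names and the `tolerance` parameter are hypothetical; real aligners estimate the acceptable range from the observed insert-size distribution.

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced distance between two mates read from either end of a fragment."""
    return fragment_length - 2 * read_length

def within_expected_distance(observed_fragment, expected_fragment, tolerance):
    """True if the observed fragment length is close enough to the expected one."""
    return abs(observed_fragment - expected_fragment) <= tolerance

print(expected_gap(500, 100))                      # gap between the two reads
print(within_expected_distance(520, 500, 100))     # small deviation
print(within_expected_distance(5000, 500, 100))    # e.g. mates spanning a structural variant
```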

Proper pairs: 162 805 390 reads (100 %). Remainder: 0 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
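A minimal sketch of the coordinate-based approach described above: a pair is flagged as a duplicate if an earlier pair has the same chromosome, read position and mate position. The tuple layout and function name are assumptions for this example; production tools also consider strand, library and base quality when choosing which copy to keep.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if an identical pair was seen earlier.

    Each pair is a (chromosome, read_position, mate_position) tuple.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags

# Hypothetical mapped pairs; the second is a copy of the first.
pairs = [("1", 100, 300), ("1", 100, 300), ("1", 100, 350), ("2", 100, 300)]
dup_flags = mark_duplicates(pairs)
duplicate_rate = sum(dup_flags) / len(pairs)
print(dup_flags, duplicate_rate)
```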

Duplicates: 0 reads (0 %). Remainder: 100 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.
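To illustrate how such a distribution might be summarised, the sketch below converts a mapping quality to a wrong-location probability using the standard Phred definition, and computes the share of reads at or above a chosen score. The histogram values, function names and threshold are made up for this example.

```python
def mapq_to_wrong_prob(mapq):
    """Probability that a read is placed at the wrong location, given its MAPQ."""
    return 10 ** (-mapq / 10)

def fraction_at_least(mapq_counts, threshold):
    """Share of reads whose mapping quality is >= threshold.

    mapq_counts maps each mapping quality score to a number of reads.
    """
    total = sum(mapq_counts.values())
    if total == 0:
        raise ValueError("empty histogram")
    good = sum(n for q, n in mapq_counts.items() if q >= threshold)
    return good / total

# Hypothetical histogram: most reads well mapped, some ambiguous (MAPQ 0).
mapq_counts = {0: 10, 20: 10, 60: 80}
print(fraction_at_least(mapq_counts, 30))
print(mapq_to_wrong_prob(60))
```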

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of reads (20M to 280M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Every chromosome shown: 100% mapped, 0% unmapped.]