European Genome-Phenome Archive

File Quality

File Information

EGAF00003608789

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Data is shown on a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.
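As a minimal sketch of how such a distribution could be tallied, assume the per-position depths are already available as a list of integers (for instance, parsed from the output of a depth tool). The function name `coverage_histogram`, the `cap` parameter, and the sample depths are illustrative assumptions, not part of the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    All depths above `cap` share a single overflow bin stored under cap + 1,
    mirroring the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
depths = [0, 1, 1, 2, 2, 2, 30, 1500, 2000]
hist = coverage_histogram(depths)
for coverage in sorted(hist):
    label = ">1000" if coverage > 1000 else str(coverage)
    print(label, hist[coverage])
```

Plotting the counts on a logarithmic y-axis keeps the long, sparsely populated high-coverage tail visible next to the much larger low-coverage peak.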

Chart: Base Coverage Distribution. X-axis: Coverage value (up to >1000). Y-axis: # Bases (log scale, 100 to 10M).

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect a distribution skewed towards larger (more confident) scores, typically around 40.
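The relationship between a Phred score and its error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Probability that the call is wrong, given Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of about 1 in 10 000, and each additional 10 points reduces the error probability tenfold.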

Chart: Base Quality. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 16G).

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
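A minimal sketch of this calculation, assuming each read is represented only by its SAM FLAG integer, where bit 0x4 marks an unmapped read. The function name `mapped_rate` and the sample flags are illustrative and do not describe how EGA computes the figure.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM FLAG values."""
    flags = list(flags)
    if not flags:
        return 0.0
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / len(flags)

# Hypothetical flags: three mapped reads and one unmapped read.
print(mapped_rate([99, 147, 73, 133]))
```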

99.9 % mapped (120 289 365 reads); 0.1 % unmapped

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; such pairs are not counted as proper pairs in the Proper Pairs chart.

99.8 % both mates mapped (120 194 022 reads); 0.2 % other

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
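The distinction between a singleton and a read whose mate is also mapped can be sketched from the SAM FLAG bits for "paired" (0x1), "read unmapped" (0x4) and "mate unmapped" (0x8). The function name `classify_read` and its category labels are illustrative assumptions.

```python
PAIRED = 0x1         # read comes from paired-end sequencing
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def classify_read(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"      # mapped read whose mate failed to map
    return "both_mapped"        # mapped read whose mate is also mapped

for flag in (99, 73, 133):
    print(flag, classify_read(flag))
```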

0.1 % singletons (95 343 reads); 99.9 % other

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
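As a rough sketch, the forward-strand fraction can be derived from the SAM FLAG bit 0x10, which is set when a read maps to the reverse strand. Restricting the calculation to mapped reads is an assumption made here; the function name `forward_fraction` is illustrative.

```python
UNMAPPED = 0x4   # the read itself is unmapped
REVERSE = 0x10   # the read maps to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [flag for flag in flags if not flag & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for flag in mapped if not flag & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: two forward-strand and two reverse-strand reads.
print(forward_fraction([99, 147, 83, 163]))
```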

50 % forward strand (60 211 976 reads); 50 % other

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant, and such reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
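The separation arithmetic in this example can be sketched as follows. The `tolerance` threshold is a made-up illustration: real aligners decide what counts as a proper pair from the insert size distribution they observe, and they also check read orientation, which this sketch ignores.

```python
def expected_gap(fragment_len, read_len):
    """Expected distance between the two mates: fragment minus both reads."""
    return fragment_len - 2 * read_len

def looks_proper(observed_gap, fragment_len, read_len, tolerance=0.5):
    """True if the observed gap is within tolerance * fragment_len of the expected gap.

    The tolerance rule is a hypothetical stand-in for an aligner's own criteria.
    """
    expected = expected_gap(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance * fragment_len

print(expected_gap(500, 100))        # 500 - 2 * 100 = 300
print(looks_proper(320, 500, 100))   # close to the expected ~300 bp
print(looks_proper(5000, 500, 100))  # far larger than expected
```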

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

99 % proper pairs (119 268 496 reads); 1 % other

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
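The coordinate-based idea described above can be sketched like this, assuming each read pair has been reduced to a tuple of chromosome, read start, mate start and strand. The tuple layout and the function name `duplicate_rate` are assumptions for illustration; real duplicate-marking tools apply additional criteria.

```python
from collections import Counter

def duplicate_rate(pairs):
    """Fraction of pairs that repeat an already-seen coordinate signature.

    pairs: iterable of (chrom, read_start, mate_start, strand) tuples.
    The first pair with a given signature is kept; later ones count as duplicates.
    """
    seen = Counter(pairs)
    total = sum(seen.values())
    if total == 0:
        return 0.0
    duplicates = sum(count - 1 for count in seen.values())
    return duplicates / total

# Hypothetical pairs: the first signature occurs three times.
pairs = [
    ("1", 100, 400, "+"),
    ("1", 100, 400, "+"),
    ("1", 100, 400, "+"),
    ("1", 150, 450, "+"),
    ("2", 100, 400, "-"),
]
print(duplicate_rate(pairs))
```

Because the signature is purely positional, two independent fragments that happen to share coordinates are also counted, which is why the rate rises with depth even without PCR artefacts.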

72.5 % duplicates (87 286 372 reads); 27.5 % other

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.

Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (10M to 110M).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
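A per-chromosome tally of this kind could be sketched as follows, assuming each read is reduced to its reference name and SAM FLAG value. Reads with no reference name (`*` in SAM) are skipped because they cannot be assigned to a chromosome. The function name `per_chromosome_mapped_rate` and the sample records are illustrative.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chrom, flag) records.

    An unmapped read that carries its mate's chromosome is counted as
    unmapped on that chromosome; reads with chrom "*" are ignored.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {chrom: mapped / (mapped + unmapped)
            for chrom, (mapped, unmapped) in counts.items()}

# Hypothetical records: chromosome 1 has one unmapped read placed with its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("*", 77)]
print(per_chromosome_mapped_rate(records))
```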

Chart: Mapped vs Unmapped. X-axis: chromosome (labels 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%), stacked as mapped and unmapped. Across chromosomes, the mapped fraction ranges from 99.77% to 99.93% and the unmapped fraction from 0.07% to 0.23%.