European Genome-Phenome Archive

File Quality

File Information: EGAF00005283771

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.
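
As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value. The function name `coverage_histogram`, the input list of per-position depths, and the overflow bin for values above 1000 are assumptions made for this example; they are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one overflow bin."""
    hist = Counter()
    for depth in depths:
        # Hypothetical convention: the key cap + 1 stands for the ">cap" bin.
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Made-up per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting these counts on a log-scaled y-axis gives the kind of curve described above.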

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: # bases (log scale, 100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
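
The relationship between score and error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition, a score of 40 corresponds to an error probability of about 1 in 10 000.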

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # bases (0G to 11G).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and pairs with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
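
A minimal sketch of this calculation is shown below. The function name and the read counts are hypothetical and serve only to make the grouping of the denominator explicit.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=950, unmapped=50)
print(f"{rate:.1%}")
```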

Mapped reads: 100 % (119 548 175 reads); remainder: 0 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart, which counts only pairs mapped at the expected distance.

Both mates mapped: 100 % (119 545 006 reads); remainder: 0 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
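
Assuming the alignments are stored in SAM/BAM format (an assumption; the report itself does not state the format), the categories above can be told apart from the FLAG field of each read. The sketch below uses the standard SAM flag bits 0x1 (read paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function name and category labels are made up for this example.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify(flag):
    """Classify a read by its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        # This read mapped, its mate did not.
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "unpaired_mapped"

# Example FLAG values: 99 = 0x1 + 0x2 + 0x20 + 0x40, 73 = 0x1 + 0x8 + 0x40.
for flag in (99, 73, 77, 0):
    print(flag, classify(flag))
```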

Singletons: 0 % (3 169 reads); remainder: 100 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (59 775 687 reads); remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping at an unexpected separation would be a signal for this variant and would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
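
The arithmetic in the fragment-length example can be written out as a small sketch. The function name is hypothetical, and the calculation assumes both reads have the same length.

```python
def expected_separation(fragment_len, read_len):
    """Expected gap between the two mates: fragment length minus both reads.

    A negative result means the two reads overlap.
    """
    return fragment_len - 2 * read_len

# Values from the example in the text: 500 bp fragments, 100 bp reads.
print(expected_separation(500, 100))
```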

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 99.2 % (118 610 150 reads); remainder: 0.8 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
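
The coordinate-based detection described above can be sketched as follows. This is a deliberately simplified illustration, not the method used to produce this report: the function name and the tuple layout are assumptions, and real duplicate-marking tools apply more refined criteria.

```python
def duplicate_rate(pairs):
    """Fraction of pairs whose coordinates repeat an earlier pair.

    `pairs` is an iterable of (chrom, pos, mate_chrom, mate_pos) tuples
    (a hypothetical layout chosen for this sketch).
    """
    seen = set()
    duplicates = 0
    total = 0
    for key in pairs:
        total += 1
        if key in seen:
            # Same read and mate coordinates as an earlier pair.
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / total if total else 0.0

# Made-up coordinates, for illustration only.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # repeats the first pair
    ("1", 100, "1", 450),  # same start, different mate position
    ("2", 5, "2", 300),
]
print(duplicate_rate(pairs))
```

Because only coordinates are compared, two independent fragments that happen to share both positions would also be counted, which is the depth effect noted above.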

Duplicates: 10.6 % (12 679 207 reads); remainder: 89.4 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # reads (10M to 110M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads may still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
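
A per-chromosome tally of this kind can be sketched as below, assuming SAM-style records in which each read carries a reference name and a FLAG, with bit 0x4 marking an unmapped read and `*` meaning no reference name. The function name and the input layout are assumptions for this example.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: read unmapped

def per_chromosome_counts(records):
    """Tally mapped and unmapped reads per chromosome.

    `records` is an iterable of (rname, flag) tuples (hypothetical layout).
    Reads with no reference name ('*') are skipped.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue
        status = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][status] += 1
    return counts

# Made-up records: the unmapped read on "1" inherits its mate's chromosome.
records = [("1", 73), ("1", 133), ("1", 99), ("X", 147), ("*", 77)]
counts = per_chromosome_counts(records)
for chrom, tally in counts.items():
    total = tally["mapped"] + tally["unmapped"]
    print(chrom, f"{tally['mapped'] / total:.2%} mapped")
```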

[Chart: mapped vs unmapped reads per chromosome (stacked columns). X-axis: chromosome; y-axis: percentage of reads (0% to 100%); series: mapped, unmapped. All chromosomes shown are 100% mapped except one, which is 99.98% mapped and 0.02% unmapped.]