European Genome-Phenome Archive

File Quality

File Information: EGAF00005283827

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

Data is represented on a log scale to minimise the variability. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
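As a minimal sketch of the relationship between a Phred score and an error probability, the conversion can be written in both directions using the standard definition Q = -10·log10(P). The function names below are illustrative and are not part of any EGA tooling.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q into the probability P that the call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) into a Phred score Q."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # A score of 40 corresponds to roughly 1 error in 10,000 calls.
    print(phred_to_error_prob(40))
    print(error_prob_to_phred(0.001))
```

The same conversion applies to mapping quality scores, where P is the probability that a read is placed at the wrong location.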

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 12G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted but, for more significant variation, the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
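The rate calculation can be sketched as follows. The function name and the read counts in the usage example are hypothetical and are not taken from this report.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


if __name__ == "__main__":
    # Hypothetical counts: 950 mapped and 50 unmapped reads.
    rate = mapped_rate(950, 50)
    print(f"{rate:.1%}")
```

Note the parentheses around the denominator: without them, the expression would evaluate as (mapped / mapped) + unmapped.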

Mapped reads: 138 058 522 (100 %); remainder: 0 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also counted here; the expected-distance criterion is covered by the Proper Pairs chart.

Both mates mapped: 138 033 872 reads (100 %); remainder: 0 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
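The distinction between singletons and reads with both mates mapped can be sketched from SAM flag bits. This assumes the input is SAM/BAM and uses the standard flag bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function and category names are illustrative and are not necessarily how this report computes its figures.

```python
# Standard SAM flag bits
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify(flag: int) -> str:
    """Classify a read by its SAM flag into a mapping category."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


if __name__ == "__main__":
    # 73 = paired + mate unmapped + first in pair
    # 99 = paired + proper pair + mate reverse + first in pair
    for flag in (73, 99, 77):
        print(flag, classify(flag))
```

Under this scheme, the mapped read count is the sum of the singleton and both-mates-mapped counts (plus any unpaired mapped reads).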

Singletons: 24 650 reads (0 %); remainder: 100 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 69 041 620 reads (50 %); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. In particular, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
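The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. A minimal sketch, with an illustrative function name:

```python
def expected_inner_distance(fragment_len: int, read_len: int) -> int:
    """Expected gap between two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len


if __name__ == "__main__":
    # The example from the text: ~500 bp fragments with 100 bp reads.
    print(expected_inner_distance(500, 100))  # 300
```

A negative result would indicate that the two reads overlap, which happens when the fragment is shorter than twice the read length.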

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.

Proper pairs: 136 736 850 reads (99 %); remainder: 1 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
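The coordinate-based approach can be sketched as below. This is a simplified illustration, not the algorithm of any particular tool: the `Read` type and its fields are hypothetical, the first read seen at a given set of coordinates is kept, and every later read with the same coordinates is reported as a duplicate. Real tools typically apply further rules, such as choosing which read to keep by base quality.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal record for one mapped read of a pair."""
    name: str
    chrom: str
    pos: int
    mate_chrom: str
    mate_pos: int
    reverse: bool


def find_duplicates(reads: Iterable[Read]) -> List[str]:
    """Return names of reads whose own and mate coordinates were already seen."""
    seen = set()
    duplicates = []
    for read in reads:
        key = (read.chrom, read.pos, read.mate_chrom, read.mate_pos, read.reverse)
        if key in seen:
            duplicates.append(read.name)
        else:
            seen.add(key)
    return duplicates


if __name__ == "__main__":
    reads = [
        Read("r1", "1", 100, "1", 400, False),
        Read("r2", "1", 100, "1", 400, False),  # same coordinates as r1
        Read("r3", "1", 100, "1", 450, False),  # mate maps elsewhere
    ]
    print(find_duplicates(reads))  # ['r2']
```

Because the key uses coordinates only, two independent fragments that happen to share the same coordinates are also flagged, which is the depth-dependent over-counting described above.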

Duplicates: 16 738 281 reads (12.1 %); remainder: 87.9 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero: reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (10M to 130M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down per chromosome. Although the sequenced sample may be from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto another, it will be given the right chromosome and position, but it will also be classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (axis labels 1 to 21, X, Y, M). Mapped: 99.98 % to 99.99 % per chromosome; unmapped: 0.01 % to 0.02 %.]