European Genome-Phenome Archive

File Quality

File Information

EGAF00002395368

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage value.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]
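The distribution described above can be sketched in a few lines. This is a minimal illustration, not the archive's own pipeline: `coverage_histogram` and the `depths` list are hypothetical names, and the only detail taken from the chart is the overflow bin labelled ">1000".

```python
from collections import Counter

OVERFLOW = 1000  # the chart groups all larger coverage values into a ">1000" bin


def coverage_histogram(depths):
    """Count how many positions have each coverage value.

    `depths` is a hypothetical iterable with one integer read depth
    per reference position.
    """
    hist = Counter()
    for depth in depths:
        # Values above the overflow threshold share a single bin.
        key = depth if depth <= OVERFLOW else OVERFLOW + 1
        hist[key] += 1
    return hist


# Made-up depths for eight positions, for illustration only.
depths = [0, 0, 1, 30, 30, 31, 1500, 2000]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = f">{OVERFLOW}" if value > OVERFLOW else str(value)
    print(label, hist[value])
```

Plotting `hist` with a logarithmic y-axis would give a chart of the same kind as the one on this page.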

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 90G).]
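As a worked illustration of the Phred scale, the sketch below converts between scores and error probabilities using the standard definition Q = -10 log10(P). The function names are hypothetical and chosen only for this example.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a call is wrong, given its Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

Under this definition, a score of 40 corresponds to one expected error in 10 000 calls.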

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated but, with more significant variation, a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.4 % (919 449 362 reads). Unmapped: 0.6 %.
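The rate calculation can be written out explicitly. This is a minimal sketch: `mapped_rate` is a hypothetical helper, and the counts in the example are made up, since the page reports the mapped count but not the unmapped count.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Made-up counts, for illustration only.
rate = mapped_rate(mapped=994, unmapped=6)
print(f"{rate:.1%}")
```

Note the parentheses in the denominator: the mapped count is divided by the total number of reads, not by the mapped count alone.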

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here (compare the Proper Pairs chart).

Both mates mapped: 99.1 % (917 166 630 reads). Other: 0.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (2 282 732 reads). Other: 99.8 %.
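One way to separate singletons from fully mapped pairs is to inspect the SAM flag of each read. The sketch below assumes SAM-style alignments and uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). Whether the archive computes its statistics this way is not stated on this page, and `classify` is a hypothetical helper.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify(flag: int) -> str:
    """Assign a read to a mapping category based on its SAM flag."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate failed to map
    return "both_mates_mapped"


# Commonly seen flag values, used here as examples.
for flag in (99, 147, 73, 133):
    print(flag, classify(flag))
```

The counts reported on this page are consistent with this partition: 917 166 630 reads with both mates mapped plus 2 282 732 singletons gives the 919 449 362 mapped reads.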

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (462 558 937 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map at an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).

Proper pairs: 97 % (897 658 470 reads). Other: 3 %.
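The separation in the 500 bp fragment example follows from simple arithmetic, sketched below. The helper names and the tolerance value are hypothetical. Real aligners typically estimate the acceptable range from the observed fragment-size distribution rather than using a fixed tolerance.

```python
def expected_separation(fragment_length: int, read_length: int) -> int:
    """Gap between the two mates: the fragment minus one read at each end."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed: int, expected: int, tolerance: int) -> bool:
    """True if the observed separation is close enough to the expected one.

    `tolerance` is an illustrative fixed threshold, not a value from this page.
    """
    return abs(observed - expected) <= tolerance


expected = expected_separation(fragment_length=500, read_length=100)
print(expected)  # 300

print(within_expected_distance(320, expected, tolerance=100))
print(within_expected_distance(5000, expected, tolerance=100))
```

A pair separated by several kilobases, as in the last call, would fail the distance check and so would not count as a proper pair.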

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 6 % (55 655 775 reads). Other: 94 %.
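The coordinate-based approach described above can be sketched as grouping read pairs by the positions of both mates. This is a simplified illustration, not the algorithm of any particular tool: the tuple layout and `count_duplicates` are hypothetical, and real duplicate markers also handle details such as clipping and choosing which copy to keep.

```python
from collections import Counter


def count_duplicates(pairs):
    """Count pairs that repeat the coordinates of an earlier pair.

    Each element of `pairs` is a hypothetical tuple:
    (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    The first pair at a given set of coordinates is kept as the
    original; every further copy counts as a duplicate.
    """
    seen = Counter(pairs)
    return sum(n - 1 for n in seen.values())


# Made-up pairs, for illustration only.
pairs = [
    ("1", 1000, "+", "1", 1300, "-"),
    ("1", 1000, "+", "1", 1300, "-"),  # same coordinates: duplicate
    ("1", 1000, "+", "1", 1300, "-"),  # same coordinates: duplicate
    ("1", 1000, "+", "1", 1350, "-"),  # mate differs: not a duplicate
    ("2", 500, "-", "2", 200, "+"),
]

duplicates = count_duplicates(pairs)
print(duplicates, duplicates / len(pairs))
```

Because the test is purely positional, two independent fragments that happen to share coordinates are also counted, which is why the rate becomes less informative at high depth.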

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 70). Y-axis: # reads (100M to 800M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per reference sequence (x-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%). Mapped fractions range from 99.49% to 99.77%; unmapped fractions range from 0.23% to 0.51%.]
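The per-chromosome breakdown can be sketched as a tally keyed on the reference name of each record. The sketch assumes SAM-style records in which an unmapped read with a mapped mate carries its mate's reference name, as described above. `per_chromosome_rates` and the `(rname, flag)` tuples are hypothetical simplifications.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def per_chromosome_rates(records):
    """Return the mapped fraction for each reference sequence.

    `records` is a hypothetical iterable of (rname, flag) tuples.
    Records with no reference name ("*") are skipped, since they
    cannot be assigned to any chromosome.
    """
    counts = defaultdict(lambda: [0, 0])  # rname -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {rname: m / (m + u) for rname, (m, u) in counts.items()}


# Made-up records, for illustration only.
records = [
    ("1", 99), ("1", 147),  # a fully mapped pair
    ("1", 73), ("1", 133),  # a singleton and its unmapped mate
    ("X", 99),
    ("*", 77),              # unmapped, no reference assigned
]

for rname, rate in per_chromosome_rates(records).items():
    print(rname, f"{rate:.2%}")
```

In this toy input, the unmapped mate with flag 133 is charged to chromosome 1 because it inherits its mate's reference name.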