European Genome-Phenome Archive

File Quality

File Information: EGAF00008201863

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases at each coverage value.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 2k to 20M).]
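As a rough illustration of how a distribution like this could be tabulated, the sketch below counts positions per coverage value. The function name, the `cap` parameter, and the toy depths are assumptions made for this example; they are not taken from the EGA pipeline or from this file.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    depths: iterable of per-position coverage values (non-negative ints).
    Values above `cap` are pooled into one overflow bin (key cap + 1),
    similar to the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Toy per-position depths (made-up values, not from this file).
toy_depths = [0, 0, 1, 30, 30, 31, 1500]
hist = coverage_histogram(toy_depths)

for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis would then give a chart of the same shape as the one described above.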

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
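To make the scale concrete, here is a minimal sketch of the conversion in both directions, using the standard Phred definition Q = -10 × log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)


# A score of 40 corresponds to an error probability of about 1 in 10,000.
print(phred_to_error_prob(40))
# An error probability of 1 in 1,000 corresponds to a score of about 30.
print(error_prob_to_phred(0.001))
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability.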

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 220G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
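The calculation can be sketched as follows. The function name and the read counts are illustrative assumptions; in particular the counts are not the ones from this file.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads among all reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=999_000, unmapped=1_000)
print(f"{rate:.1%}")
```

The parentheses around the denominator matter: dividing `mapped` by `mapped` first and then adding `unmapped` would give a meaningless number.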

Mapped: 99.9 % (1 562 131 785 reads). Unmapped: 0.1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are still counted here; whether the distance is as expected is what the Proper Pairs chart reports.

Both mates mapped: 99.8 % (1 560 487 488 reads). Other: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
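Assuming the alignments are stored in SAM/BAM format (an assumption; this page does not state how the statistic is computed), a singleton can be recognised from the standard SAM flag bits: the read is paired (0x1), is itself mapped (0x4 not set), and its mate is unmapped (0x8 set). The helper name below is made up for this sketch.

```python
# Standard SAM flag bits used here.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def is_singleton(flag):
    """True if the read is paired and mapped while its mate is unmapped."""
    return (
        bool(flag & PAIRED)
        and not (flag & UNMAPPED)
        and bool(flag & MATE_UNMAPPED)
    )


# 73 = 0x1 + 0x8 + 0x40: paired, mate unmapped, first in pair.
print(is_singleton(73))
# 99 = 0x1 + 0x2 + 0x20 + 0x40: both mates mapped, so not a singleton.
print(is_singleton(99))
```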

Singletons: 0.1 % (1 644 297 reads). Other: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping to the reference genome approximately 50% of the reads will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (782 077 834 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates may map at an unexpected separation; such reads are a signal for the variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
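The arithmetic in the example above (a ~500 bp fragment minus two 100 bp reads leaves a ~300 bp gap) can be sketched as below. Both function names and the fixed `tolerance` are hypothetical simplifications introduced for illustration; how the expected distance is actually judged depends on the aligner used.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len


def gap_is_expected(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is a made-up fixed cutoff used only for this sketch.
    """
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance


print(expected_gap(500, 100))
print(gap_is_expected(320, 500, 100, tolerance=100))
print(gap_is_expected(5000, 500, 100, tolerance=100))
```

A pair failing this kind of check would still count towards Both Mates Mapped, but not towards Proper Pairs.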

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.9 % (1 531 368 196 reads). Other: 2.1 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.
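A minimal sketch of the coordinate-based idea described above: pairs are keyed by the coordinates of the read and of its mate, and any pair whose key has already been seen is reported as a duplicate. The tuple layout and function name are assumptions for this example; real duplicate-marking tools also take other information into account, which is left out here.

```python
def find_duplicate_pairs(pairs):
    """Return names of pairs whose coordinates repeat an earlier pair.

    pairs: iterable of (name, chrom, pos, mate_chrom, mate_pos) tuples.
    The first pair seen at a given set of coordinates is kept; later
    pairs with identical coordinates are reported as duplicates.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Toy pairs (made-up coordinates, not from this file).
toy_pairs = [
    ("pair1", "1", 100, "1", 400),
    ("pair2", "1", 100, "1", 400),  # same coordinates as pair1
    ("pair3", "1", 100, "1", 450),  # same start, different mate position
]
dups = find_duplicate_pairs(toy_pairs)
print(dups)
```

The sketch also shows why deep sequencing inflates the rate: two independent fragments that happen to share both coordinates are indistinguishable from a true PCR duplicate under this rule.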

Duplicates: 6.7 % (104 780 206 reads). Non-duplicates: 93.3 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (0.2G to 1.4G).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome, stacked columns. X-axis: chromosomes (labelled 1 to 21, X, Y, M). Y-axis: 0% to 100%. Mapped fractions range from 99.68% to 99.91% across columns; unmapped fractions range from 0.09% to 0.32%.]