European Genome-Phenome Archive

File Quality

File Information: EGAF00008236631

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (up to >1000). Y-axis: # Bases (log scale, 2k to 20M).]
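As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value and folds everything above 1000 into a single overflow bin, mirroring the ">1000" label on the chart. The function name, the `cap` parameter, and the sample depths are hypothetical; this is not the archive's own pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    `depths` is an iterable of per-position coverage values (assumed input).
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 30, 31, 30, 1500]
example_hist = coverage_histogram(example_depths)
```

Plotting the counts on a log-scaled y-axis is what keeps both the multi-million peak and the long thin tail visible in the same chart.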

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 160G). Non-zero bins, in order of increasing score: 30 072 901; 7 610 474 894; 10 953 520 043; 177 795 278 258 bases.]
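To make the scale concrete, here is a minimal sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (P must be > 0)."""
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.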

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (1 298 361 959 reads). Unmapped: 0.2 %.
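The rate described above is a simple proportion. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Hypothetical counts, for illustration only.
example_rate = mapped_rate(mapped=998, unmapped=2)
```

Note that the denominator is the sum of both counts; reading the formula as mapped / mapped + unmapped without parentheses would give a different, meaningless number.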

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected distance is only taken into account in the Proper Pairs chart.

Both mates mapped: 99.7 % (1 296 729 138 reads). Other: 0.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (1 632 821 reads). Other: 99.9 %.
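As a quick consistency check, and assuming the counts on this page are all defined over the same set of reads, the mapped reads should split into reads whose mate also mapped plus singletons. The variable names below are illustrative; the three numbers are the counts reported on this page.

```python
# Counts reported on this page.
mapped_reads = 1_298_361_959
both_mates_mapped = 1_296_729_138
reported_singletons = 1_632_821

# Assumption: mapped = both mates mapped + singletons.
derived_singletons = mapped_reads - both_mates_mapped
counts_agree = derived_singletons == reported_singletons
```

The derived value matches the reported singleton count exactly, which supports reading the three charts as views of the same read set.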

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (650 295 848 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the mates may map with a separation far from the expected one; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98 % (1 274 684 176 reads). Other: 2 %.
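The arithmetic behind the ~300 base pair figure in the example above is the fragment length minus the two read lengths. A small sketch, with hypothetical function names and a hypothetical tolerance parameter (aligners use their own criteria for what counts as "expected"):

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates when both reads come from the fragment ends."""
    return fragment_len - 2 * read_len


def within_expected_distance(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is an illustrative parameter, not a value from this report.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance
```

With 500 bp fragments and 100 bp reads the expected gap is 300 bp; a pair observed several kilobases apart would fall outside any sensible tolerance and would not count as a proper pair.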

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 3.8 % (49 503 100 reads). Other: 96.2 %.
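The coordinate-based rule described above can be sketched as follows. This is a deliberately simplified illustration, not the method used to produce this report: the tuple layout and function name are assumptions, the first read seen in each group is kept as the original, and real duplicate-marking tools apply more refined rules.

```python
def mark_duplicates(pairs):
    """Return the names of reads flagged as duplicates.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos)
    tuples (a hypothetical layout). A read is flagged when an earlier read
    has the same coordinate and its mate has the same coordinate too.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical reads, for illustration only.
example_pairs = [
    ("r1", "1", 1000, "1", 1400),
    ("r2", "1", 1000, "1", 1400),  # same coordinates as r1
    ("r3", "1", 1000, "1", 1450),  # same start, different mate position
    ("r4", "2", 1000, "2", 1400),  # different chromosome
]
example_duplicates = mark_duplicates(example_pairs)
```

The sketch also shows why very deep sequencing inflates the rate: with enough reads, independent fragments start to share both coordinates by chance and are indistinguishable from true PCR copies under this rule.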

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (0.1G to 1.1G).]
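One way to summarise such a distribution is the fraction of reads at or above a chosen mapping quality. The sketch below assumes the histogram is available as a mapping from score to read count; the function name and the toy counts are hypothetical and are not taken from the chart.

```python
def fraction_at_or_above(hist, threshold):
    """Fraction of reads whose mapping quality is >= `threshold`.

    `hist` maps a mapping quality score to a read count (assumed layout).
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    kept = sum(count for score, count in hist.items() if score >= threshold)
    return kept / total


# Toy histogram, for illustration only.
toy_hist = {0: 5, 30: 5, 60: 90}
toy_high_quality = fraction_at_or_above(toy_hist, 60)
toy_unique = fraction_at_or_above(toy_hist, 1)
```

A threshold of 1 separates reads with a score of 0 (equally good placements elsewhere) from the rest, while a threshold near the top of the scale isolates the confidently placed reads.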

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is then given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.

Chromosome  Mapped   Unmapped
1           99.88%   0.12%
2           99.86%   0.14%
3           99.89%   0.11%
4           99.88%   0.12%
5           99.88%   0.12%
6           99.88%   0.12%
7           99.88%   0.12%
8           99.88%   0.12%
9           99.87%   0.13%
10          99.88%   0.12%
11          99.88%   0.12%
12          99.88%   0.12%
13          99.89%   0.11%
14          99.88%   0.12%
15          99.87%   0.13%
16          99.87%   0.13%
17          99.87%   0.13%
18          99.88%   0.12%
19          99.86%   0.14%
20          99.87%   0.13%
21          99.85%   0.15%
X           99.87%   0.13%
Y           99.84%   0.16%
M           99.69%   0.31%
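A per-chromosome tally of this kind can be sketched from alignment records. The example assumes each record is reduced to a (flag, reference name) pair, uses the SAM convention that flag bit 0x4 marks an unmapped read and that "*" means no reference name, and relies on unmapped reads carrying their mate's chromosome as described above. The function name and the sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def per_chromosome_rates(records):
    """Return {chromosome: (mapped_fraction, unmapped_fraction)}.

    `records` is an iterable of (flag, rname) pairs (assumed layout).
    Reads without a reference name ("*") are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for flag, rname in records:
        if rname == "*":
            continue
        index = 1 if flag & UNMAPPED else 0
        counts[rname][index] += 1
    return {
        chrom: (mapped / (mapped + unmapped), unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }


# Hypothetical records: three mapped reads and one unmapped read that was
# placed on chromosome 1 alongside its mapped mate.
example_records = [(99, "1"), (147, "1"), (73, "1"), (133, "1"), (77, "*")]
example_rates = per_chromosome_rates(example_records)
```

Reads where both mates are unmapped have no reference name and so cannot be attributed to any chromosome, which is why they are left out of the per-chromosome columns in this sketch.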