European Genome-Phenome Archive

File Quality

File Information: EGAF00001767567

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
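As a rough illustration of how such a distribution could be tabulated, the Python sketch below counts how many positions have each coverage value and folds everything above a cap into a single overflow bin, similar to the `>1000` bin on the chart's x-axis. The function name `coverage_histogram`, the `cap` parameter and the sample depths are hypothetical and are not taken from the EGA pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    `depths` is an iterable of per-position read depths (illustrative input).
    The overflow bin is keyed by the string f">{cap}".
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return dict(hist)


# Toy example: six reference positions with made-up depths.
example = coverage_histogram([0, 0, 1, 3, 3, 2500])
print(example)  # {0: 2, 1: 1, 3: 2, '>1000': 1}
```

Plotting the counts on a log-scaled y-axis would then give a curve of the kind described above.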

[Chart: base coverage distribution. x-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 10k to 200M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
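To make the relationship concrete, here is a minimal Python sketch of the standard Phred conversion, Q = -10 log10(P), in both directions. The function names are hypothetical. Under this definition a score of 40 corresponds to an error probability of 1 in 10,000.

```python
import math


def phred_to_error_prob(q):
    """Return P, the probability that the call is wrong, for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Return the Phred score Q = -10 * log10(P) for error probability p."""
    return -10 * math.log10(p)


# Print the error probability for a few common scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```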

[Chart: base quality distribution. x-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 16G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
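The calculation can be sketched in a few lines of Python. The point worth spelling out is the grouping: the denominator is the sum of mapped and unmapped reads. The function name and the counts below are made up for illustration.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Made-up counts for illustration only.
rate = mapped_rate(mapped=950, unmapped=50)
print(f"{rate:.1%}")  # 95.0%
```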

Mapped: 100 % (147 009 748 reads); unmapped: 0 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads that do not map at the expected distance are also included here; whether mates map at the expected distance is the subject of the Proper Pairs chart.

Both mates mapped: 99.9 % (146 972 748 reads); remainder: 0.1 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
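In SAM/BAM files these categories can be read off the FLAG field. The sketch below uses three flag bits from the SAM format (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) to separate singletons from reads whose mate also mapped. The function name and category labels are hypothetical, and this is not necessarily how the charts on this page were computed.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # this read's mate is unmapped


def classify_read(flag):
    """Classify one read from its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


print(classify_read(73))   # singleton
print(classify_read(99))   # both_mates_mapped
print(classify_read(133))  # unmapped
```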

Singletons: 0 % (37 000 reads); remainder: 100 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (73 530 744 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates can map at an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).
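The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. The Python sketch below shows it, together with a simple distance check. The function names and the `tolerance` parameter are hypothetical; aligners typically derive the acceptable range from the observed insert-size distribution, and this sketch ignores read orientation.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed_gap, fragment_length, read_length, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap."""
    expected = expected_inner_distance(fragment_length, read_length)
    return abs(observed_gap - expected) <= tolerance


print(expected_inner_distance(500, 100))              # 300
print(within_expected_distance(320, 500, 100, 50))    # True
print(within_expected_distance(5000, 500, 100, 50))   # False
```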

Proper pairs: 97 % (142 633 708 reads); remainder: 3 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
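The coordinate-matching idea can be sketched as follows: a pair is flagged as a duplicate when an earlier pair had the same read and mate coordinates. This is a simplified illustration with a hypothetical function name and made-up coordinates. Real duplicate-marking tools typically also take strand into account and choose which copy to keep by quality; this sketch does neither.

```python
def mark_duplicates(pairs):
    """Return a list of booleans, True where a pair repeats earlier coordinates.

    Each pair is a tuple (chrom, pos, mate_chrom, mate_pos).
    The first pair seen at a given set of coordinates is not flagged.
    """
    seen = set()
    flags = []
    for pair in pairs:
        flags.append(pair in seen)
        seen.add(pair)
    return flags


pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates as the first pair
    ("1", 100, "1", 450),  # mate maps elsewhere, so not a duplicate
]
flags = mark_duplicates(pairs)
print(flags)  # [False, True, False]
```

The duplicate rate is then the number of flagged pairs divided by the total, here 1 of 3.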

Duplicates: 14.8 % (21 809 443 reads); remainder: 85.2 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero: reads originating from repetitive elements in the genome will plausibly map to multiple locations.
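One way to summarise such a distribution is to report the share of reads at score 0 and the share at or above some cut-off. The Python sketch below does this for a histogram given as a dictionary. The function name, the default `threshold` of 30 and the counts are hypothetical choices for illustration, not values used by EGA.

```python
def mapq_summary(hist, threshold=30):
    """Summarise a {MAPQ: read_count} histogram.

    Returns (fraction with MAPQ 0, fraction with MAPQ >= threshold).
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    zero = hist.get(0, 0) / total
    high = sum(n for q, n in hist.items() if q >= threshold) / total
    return zero, high


# Made-up counts for illustration only.
zero, high = mapq_summary({0: 5, 20: 5, 60: 90})
print(zero, high)  # 0.05 0.9
```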

[Chart: mapping quality distribution. x-axis: Phred quality score (0 to 60); y-axis: # Reads (10M to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
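A per-chromosome tally of this kind can be sketched in Python from (RNAME, FLAG) pairs, using the SAM unmapped bit (0x4). An unmapped read that carries its mate's chromosome is counted under that chromosome. The function name and the records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def per_chromosome_counts(records):
    """Tally mapped and unmapped reads per reference name.

    `records` is an iterable of (rname, flag) tuples. An unmapped read that
    carries its mate's RNAME is counted under that chromosome.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        status = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][status] += 1
    return dict(counts)


# Made-up records: two mapped mates on chromosome 1, and on X a mapped
# read (flag 73) whose unmapped mate (flag 133) shares its chromosome.
records = [("1", 99), ("1", 147), ("X", 73), ("X", 133)]
print(per_chromosome_counts(records))
# {'1': {'mapped': 2, 'unmapped': 0}, 'X': {'mapped': 1, 'unmapped': 1}}
```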

[Chart: mapped vs unmapped reads per chromosome. x-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%. Mapped fractions range from 99.83% to 99.98%; unmapped fractions range from 0.02% to 0.17%.]