European Genome-Phenome Archive

File Quality

File Information: EGAF00007836613

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered, and the y-axis shows the number of bases with that coverage.

The data are plotted on a log scale so that the wide range of values is easier to compare. The expected shape is a high peak at the start (low coverage) followed by a descending curve.
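As an illustration of how such a distribution could be tabulated, here is a minimal sketch. It assumes per-position depths are already available as a list of integers; the function name `coverage_histogram`, the `cap` parameter, and the example depths are hypothetical and are not taken from the EGA pipeline or from this file.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed by cap + 1,
    similar to the ">1000" bin on the chart's x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(hist)

# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 30, 30, 31, 1500, 2000]
hist = coverage_histogram(example_depths)
```

Plotting the counts on a logarithmic y-axis would then keep both the large and the small bins readable.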

[Chart: Base Coverage Distribution. X-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect a distribution skewed towards larger (more confident) scores, typically around 40.
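Under the standard Phred definition, Q = -10 log10(P), a score of 40 corresponds to an error probability of 1 in 10 000. The following sketch shows the conversion in both directions; the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (P must be > 0)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)     # about 1e-4
q30 = error_prob_to_phred(0.001)  # about 30
```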

[Chart: Base Quality. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 160G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
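The calculation can be sketched as follows. The function name and the example counts are hypothetical; they are not the counts for this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=998, unmapped=2)
```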

Mapped reads: 1 182 489 715 (99.8 %). Unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that read pairs whose mates do not map at the expected distance are also included here (compare the Proper Pairs chart).

Both mates mapped: 1 181 737 390 reads (99.8 %). Other: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
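In SAM/BAM files, the mapping state of a read and its mate is recorded in the FLAG field, so a singleton can be recognised from three bits. The sketch below assumes the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); whether this report is computed in exactly this way is an assumption, and the function name is illustrative.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def is_singleton(flag):
    """True for a mapped read from a pair whose mate is unmapped."""
    paired = bool(flag & PAIRED)
    read_mapped = not (flag & READ_UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    return paired and read_mapped and mate_unmapped

# 73 = paired + mate unmapped + first in pair: a singleton.
# 99 = paired + proper pair + mate reverse + first in pair: not a singleton.
examples = {73: is_singleton(73), 99: is_singleton(99)}
```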

Singletons: 752 325 reads (0.1 %). Other: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
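A minimal sketch of this fraction, assuming strand is read from the standard SAM FLAG bit 0x10 (read on the reverse strand) and that unmapped reads (bit 0x4) are excluded. The function name and the example flags are hypothetical.

```python
READ_UNMAPPED = 0x4
READ_REVERSE = 0x10

def forward_fraction(flags):
    """Fraction of mapped reads whose reverse-strand bit is not set."""
    mapped = [f for f in flags if not f & READ_UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & READ_REVERSE)
    return forward / len(mapped)

# Hypothetical flags: 99 and 163 are forward, 147 and 83 are reverse,
# and 4 is an unmapped read that is ignored.
fraction = forward_fraction([99, 147, 83, 163, 4])
```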

Forward strand: 592 190 559 reads (50 %). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates can map at an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
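The arithmetic in the 500 base pair example can be written out as a short sketch. The function names and the tolerance value are hypothetical; real aligners typically estimate the acceptable range from the observed fragment size distribution and also check read orientation.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_gap, fragment_len, read_len, tolerance=100):
    """Rough check that a pair's gap is close to the expected gap.

    `tolerance` is an illustrative value, not a setting of any particular aligner.
    """
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance

gap = expected_gap(500, 100)                                    # 500 - 2 * 100
close_pair = within_expected_distance(320, 500, 100)
distant_pair = within_expected_distance(5000, 500, 100)
```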

Proper pairs: 1 148 682 720 reads (97 %). Other: 3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
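The coordinate-based idea described above can be sketched as follows. This is a simplified illustration, not the algorithm of any specific tool: each pair is reduced to a hypothetical key of (chromosome, position, strand) for the read and for its mate, and any pair whose key has already been seen is flagged. Real duplicate-marking tools apply further rules.

```python
def mark_duplicates(pair_keys):
    """Flag pairs whose read and mate coordinates repeat an earlier pair.

    `pair_keys` is a list of hashable tuples such as
    (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Returns a list of booleans, True where the pair is a duplicate.
    """
    seen = set()
    flags = []
    for key in pair_keys:
        flags.append(key in seen)
        seen.add(key)
    return flags

# Hypothetical pairs: the second repeats the first; the third differs in mate position.
pairs = [
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 450, "-"),
]
dup_flags = mark_duplicates(pairs)
duplicate_rate = sum(dup_flags) / len(dup_flags)
```

At high depth, distinct fragments can share the same key by chance, which is why the rate becomes harder to interpret as depth grows.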

Duplicates: 562 482 768 reads (47.5 %). Non-duplicates: 52.5 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (0.1G to 1G).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down per chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read should have the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: Mapped vs Unmapped. X-axis: chromosome; y-axis: percentage of reads (0% to 100%); series: mapped, unmapped. Mapped fractions shown range from 99.81% to 99.94% (unmapped 0.06% to 0.19%).]