European Genome-Phenome Archive

File Quality

File Information: EGAF00007988918

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

Data is represented on a log scale so that the wide range of counts remains readable. The expected shape is a high peak at the beginning (low coverage) followed by a descending curve.
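As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value. The function name, the example depths and the single overflow bin for values above 1000 (mirroring the ">1000" label on the chart axis) are assumptions made for illustration, not the archive's actual pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        # Coverage values above the cap are pooled under a ">cap" label.
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 30, 30, 31, 1500]
hist = coverage_histogram(depths)
for value, n_bases in hist.items():
    print(value, n_bases)
```

Plotting the counts on a logarithmic y-axis would then give a chart of the kind described above.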

Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 2k to 100M).

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
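The relationship between a Phred score and an error probability can be made concrete with a small sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to the Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Each step of 10 in Q corresponds to a tenfold drop in error probability.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10,000 base calls.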

Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 140G). Four bins are non-zero; in order of increasing score they hold 1 655 594, 5 349 314 287, 8 090 916 494 and 151 362 584 059 bases.

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.9 % (1 090 022 400 reads). Unmapped: 0.1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the distance criterion is applied only in the Proper Pairs chart.

Both mates mapped: 99.8 % (1 089 013 384 reads). Remaining: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
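The distinction between mapped pairs and singletons can be sketched in code. The example assumes the alignments are stored in SAM/BAM format (the page does not state this explicitly) and uses the standard SAM FLAG bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function name, category labels and example flags are hypothetical.

```python
from collections import Counter

# Standard SAM FLAG bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify(flag):
    """Assign a read to one category based on its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"  # this read mapped, its mate did not
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


# Hypothetical FLAG values: two properly paired mates, one singleton,
# and the singleton's unmapped mate.
flags = [99, 147, 73, 133]
counts = Counter(classify(f) for f in flags)
mapped = counts["both_mates_mapped"] + counts["singleton"] + counts["mapped_unpaired"]
mapped_rate = mapped / len(flags)
singleton_rate = counts["singleton"] / len(flags)
print(counts, mapped_rate, singleton_rate)
```

Here the rates are taken over all reads; whether a given report divides by all reads or only by mapped reads is a detail this sketch does not settle.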

Singletons: 0.1 % (1 009 016 reads). Remaining: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (545 710 167 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, mates that map with an unexpected separation are a signal for that variant, and they are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
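The expected separation in the example above follows from simple arithmetic: the fragment length minus the two read lengths. The sketch below reproduces it; the function names and the tolerance used to decide whether an observed gap is "expected" are hypothetical, since real aligners estimate the acceptable range from the data.

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced gap between two mates read from either end of a fragment."""
    return fragment_length - 2 * read_length


def looks_proper(observed_gap, fragment_length, read_length, tolerance=150):
    """Hypothetical check: is the observed gap within `tolerance` bp of the expected one?"""
    return abs(observed_gap - expected_gap(fragment_length, read_length)) <= tolerance


gap = expected_gap(500, 100)
print(gap)
print(looks_proper(320, 500, 100))
print(looks_proper(5000, 500, 100))
```

A pair separated by several kilobases would fail such a check, which is the kind of signal that points to a structural variant.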

Proper pairs: 98.1 % (1 070 310 404 reads). Remaining: 1.9 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that independent reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
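The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any particular tool: each pair is reduced to a hypothetical tuple (chromosome, position, mate chromosome, mate position), and any pair whose tuple has already been seen is flagged as a duplicate. Real duplicate markers also consider details such as read orientation and which copy to keep.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if an identical coordinate tuple was seen earlier."""
    seen = set()
    is_duplicate = []
    for pair in pairs:
        is_duplicate.append(pair in seen)
        seen.add(pair)
    return is_duplicate


# Hypothetical pairs: (chromosome, position, mate chromosome, mate position).
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates for both mates: flagged
    ("1", 100, "1", 450),  # mate maps elsewhere: not flagged
    ("2", 100, "2", 400),
]
dup_flags = mark_duplicates(pairs)
duplicate_rate = sum(dup_flags) / len(pairs)
print(dup_flags, duplicate_rate)
```

The sketch also shows why very deep sequencing inflates the rate: independent fragments that happen to share both coordinates are indistinguishable from true PCR copies under this criterion.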

Duplicates: 9 % (98 445 852 reads). Non-duplicates: 91 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome can plausibly map to multiple locations.

Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (100M to 900M). The largest bins are score 60 (978 379 964 reads) and score 0 (62 789 738 reads).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.
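A per-chromosome breakdown of this kind can be sketched as below, assuming SAM-style records in which an unmapped read carries the chromosome of its mapped mate and has FLAG bit 0x4 set. The function name and the example records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: read unmapped


def per_chromosome_mapped_percent(records):
    """records: iterable of (chromosome, flag). Return {chromosome: percent mapped}."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: 100.0 * m / (m + u) for c, (m, u) in counts.items()}


# Hypothetical records; the read with flag 133 is unmapped but placed on
# chromosome "1" because its mate (flag 73) maps there.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_mapped_percent(records)
print(rates)
```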

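Chart: stacked columns of mapped and unmapped reads per chromosome. Y-axis: 0% to 100%. Values are listed in x-axis order.

Chromosome: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, X, Y, M
Mapped: 99.91%, 99.9%, 99.92%, 99.92%, 99.92%, 99.91%, 99.91%, 99.91%, 99.91%, 99.91%, 99.91%, 99.91%, 99.92%, 99.91%, 99.91%, 99.9%, 99.9%, 99.92%, 99.89%, 99.9%, 99.88%, 99.9%, 99.88%, 99.53%
Unmapped: 0.09%, 0.1%, 0.08%, 0.08%, 0.08%, 0.09%, 0.09%, 0.09%, 0.09%, 0.09%, 0.09%, 0.09%, 0.08%, 0.09%, 0.09%, 0.1%, 0.1%, 0.08%, 0.11%, 0.1%, 0.12%, 0.1%, 0.12%, 0.47%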