European Genome-Phenome Archive

File Quality

File Information

EGAF00004718540

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis gives the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis gives the number of bases with that coverage.

The data are shown on a log scale to compress the wide range of values. A high peak at the start (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000); y-axis: number of bases (log scale).]
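As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions share each coverage value and pools everything above a cap into one ">cap" bin, as the chart's axis suggests. The function name, the `cap` parameter, and the input format (one depth value per reference position) are assumptions made for this example; they are not taken from the EGA pipeline.

```python
from collections import Counter
from typing import Dict, Iterable


def coverage_histogram(depths: Iterable[int], cap: int = 1000) -> Dict[int, int]:
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled under the key `cap + 1`,
    which stands for the ">cap" bin (hypothetical convention).
    """
    hist: Counter = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(hist)


# Example: six positions, two of them covered more than 1000 times.
histogram = coverage_histogram([0, 0, 1, 5, 2000, 1500])
```

Plotting the counts on a log-scaled y-axis would then give the kind of curve described above.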

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but we expect it to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0G to 13G).]
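To make the scale concrete, here is a minimal sketch of the conversion in both directions, using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly one wrong call in 10 000 bases.
p_at_q40 = phred_to_error_prob(40)
```

Under this definition each step of 10 in the score is a tenfold drop in error probability, so scores of 20, 30, and 40 correspond to about 1 error in 100, 1 000, and 10 000 bases.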

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 82.5 % (136 204 857 reads); unmapped: 17.5 %
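The rate described above can be reproduced with a few lines. This is only a sketch: the function name is made up for the example, and the unmapped count is derived under the assumption that the 165 130 044 reads reported in the Forward Strand section are the total number of reads in the file.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Mapped count from this report; total is an ASSUMPTION (see lead-in).
mapped_reads = 136_204_857
assumed_total_reads = 165_130_044
unmapped_reads = assumed_total_reads - mapped_reads

rate_percent = round(100 * mapped_rate(mapped_reads, unmapped_reads), 1)
```

With these inputs the result rounds to the 82.5 % shown in the chart, which is below the >90% the report describes as usual.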

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; that distance criterion is what the Proper Pairs chart applies.

Both mates mapped: 0 % (0 reads); other: 100 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 100 % (136 204 857 reads); non-singletons: 0 %
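In SAM/BAM files the definitions above map onto bits of the FLAG field: 0x1 (read is paired), 0x4 (read unmapped), and 0x8 (mate unmapped). The sketch below classifies a single FLAG value on that basis. The function names are illustrative, and how the EGA report itself counts reads for these charts is not stated in the source.

```python
# FLAG bits as defined in the SAM format specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def is_singleton(flag: int) -> bool:
    """True if the read is paired and mapped while its mate is unmapped."""
    return bool(flag & PAIRED) and not flag & UNMAPPED and bool(flag & MATE_UNMAPPED)


def both_mates_mapped(flag: int) -> bool:
    """True if the read is paired and both it and its mate are mapped."""
    return bool(flag & PAIRED) and not flag & UNMAPPED and not flag & MATE_UNMAPPED


# FLAG 73 = paired + mate unmapped + first in pair: a singleton.
# FLAG 99 = paired + proper pair + mate reverse + first in pair: both mapped.
example_singleton = is_singleton(73)
example_both = both_mates_mapped(99)
```

Note that under this strict definition an unpaired read (FLAG bit 0x1 unset) is neither a singleton nor part of a mapped pair.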

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 100 % (165 130 044 reads); reverse strand: 0 %
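Strand is recorded in the SAM FLAG field as bit 0x10 (read reverse strand). A minimal sketch of the fraction described above, assuming the input is simply a list of FLAG values and that unmapped reads (bit 0x4) are excluded from the denominator; the report does not say which denominator it uses.

```python
from typing import Iterable

# FLAG bits as defined in the SAM format specification.
UNMAPPED = 0x4
REVERSE = 0x10


def forward_fraction(flags: Iterable[int]) -> float:
    """Fraction of mapped reads whose reverse-strand bit is not set."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Two forward reads, one reverse read, one unmapped read (ignored).
fraction = forward_fraction([0, 16, 0, 4])
```

A value far from 0.5, such as the 100 % reported for this file, is the kind of deviation the paragraph above says may warrant a closer look.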

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads that map with an unexpectedly large separation would be a signal for that variant and would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%; even if the mapping rate itself is low as a result of bacterial contamination, for example.

Proper pairs: 0 % (0 reads); other: 100 %
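The ~300 base pair figure in the example above is simply the fragment length minus the two read lengths. A small sketch of that arithmetic follows; the function name is hypothetical, and real aligners estimate the expected distance from the data rather than from nominal library sizes.

```python
def expected_inner_distance(fragment_length: int, read_length: int) -> int:
    """Gap between two mates read inward from both ends of a fragment."""
    if fragment_length < 2 * read_length:
        raise ValueError("mates would overlap; no gap between them")
    return fragment_length - 2 * read_length


# The example from the text: 500 bp fragments, 100 bp reads.
gap = expected_inner_distance(500, 100)
```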

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 12 % (19 789 000 reads); non-duplicates: 88 %
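The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any particular duplicate-marking tool: the `PairedRead` record and its fields are assumptions for the example, and real tools also consider details such as clipping and base quality when choosing which copy to keep.

```python
from typing import Iterable, NamedTuple, Set


class PairedRead(NamedTuple):
    """Hypothetical minimal record: where a read and its mate map."""
    chrom: str
    pos: int
    strand: str
    mate_chrom: str
    mate_pos: int


def duplicate_rate(reads: Iterable[PairedRead]) -> float:
    """Fraction of reads whose coordinates repeat an earlier read's."""
    seen: Set[PairedRead] = set()
    duplicates = 0
    total = 0
    for read in reads:
        total += 1
        if read in seen:
            duplicates += 1  # same position, strand, and mate position
        else:
            seen.add(read)
    return duplicates / total if total else 0.0


reads = [
    PairedRead("1", 100, "+", "1", 400),
    PairedRead("1", 100, "+", "1", 400),  # duplicate of the first read
    PairedRead("1", 100, "+", "1", 450),  # same start, different mate
    PairedRead("2", 100, "+", "2", 400),
]
rate = duplicate_rate(reads)
```

The sketch also shows why deep coverage inflates the rate: the more reads are drawn, the more often two independent fragments share the same coordinates by chance.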

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 240); y-axis: number of reads (50M to 400M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (1 to 21, X, Y, M). Y-axis: 0% to 100%. Mapped: 100% for every chromosome; unmapped: 0%.]
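A per-chromosome breakdown like this one can be tallied from the chromosome name and FLAG of each record, since an unmapped read placed with its mate keeps the mate's chromosome but has FLAG bit 0x4 set. The sketch below assumes the input is a sequence of `(chromosome, flag)` tuples, a format chosen for this example only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

UNMAPPED = 0x4  # SAM FLAG bit: read unmapped


def mapped_fraction_by_chrom(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return, per chromosome, the fraction of its reads that are mapped."""
    # counts[chrom] = [mapped, unmapped]
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# FLAG 69 = paired + unmapped + first in pair, placed on its mate's chromosome.
fractions = mapped_fraction_by_chrom([("1", 0), ("1", 69), ("X", 16)])
```

Reads with no chromosome assigned at all would not appear in such a tally, which is one way a per-chromosome chart can show 100% mapped while the overall mapped rate is lower.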