European Genome-Phenome Archive

File Quality

File Information

EGAF00001688492

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) covered that many times.

The data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
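As a rough illustration of how such a distribution could be tallied, the sketch below counts how many positions share each coverage value and pools everything above a cap into one overflow bin, like the ">1000" bin on the chart. The function name, the `cap` parameter and the sample depths are assumptions made for this example; this is not necessarily how the archive computes its chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin stored
    under the key cap + 1 (the ">1000" bin when cap is 1000).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis would then give the shape described above: a tall bar at low coverage and a descending tail.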

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 2k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
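To make the relationship concrete, here is a minimal sketch of the conversion in both directions, using the standard Phred scale Q = -10 log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Probability that the call is wrong for a given Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for a given error probability (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

On this scale a score of 40 corresponds to an error probability of 1 in 10 000, and every 10 additional points divide the error probability by ten.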

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 6.5G). Non-zero bars, in order of increasing score: 11 002 728; 361 582 907; 25 773 046; 308 139 116; 960 014 439; 6 911 979 964.]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
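The calculation can be sketched as follows. The function name and the read counts are hypothetical and only illustrate the ratio of mapped reads to all reads.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts, not taken from this file.
rate = mapped_rate(mapped=970, unmapped=30)
print(f"{rate:.0%}")
```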

Mapped: 97 % (110 932 918 reads); unmapped: 3 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 97 % (110 932 918 reads); other: 3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
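In alignment files these categories are typically read off the SAM FLAG field. The sketch below uses the SAM flag bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function name, the category labels and the example flag values are assumptions made for illustration.

```python
# Bit values from the SAM format's FLAG field.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify(flag):
    """Label a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"


# Hypothetical flag values for four reads.
for flag in (99, 73, 133, 0):
    print(flag, classify(flag))
```

Counting the `singleton` and `both_mapped` labels over all reads, divided by the total number of reads, would give rates comparable to the ones charted in these sections.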

Singletons: 0 % (0 reads); other: 100 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (57 189 948 reads); reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the reads would map with an unexpected separation; this would be a signal for the variant, and the reads would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
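The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. The sketch below spells that out; the function names and the tolerance value are hypothetical, and real aligners decide what counts as "expected" from the observed insert-size distribution rather than from a fixed tolerance.

```python
def expected_gap(fragment_length, read_length):
    """Unsequenced distance between two mates read from either end."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed_gap, fragment_length, read_length,
                             tolerance=100):
    """True if the observed gap is close to the expected one.

    `tolerance` is an illustrative assumption, not a standard value.
    """
    expected = expected_gap(fragment_length, read_length)
    return abs(observed_gap - expected) <= tolerance


print(expected_gap(500, 100))                     # 500 - 2 * 100
print(within_expected_distance(320, 500, 100))    # close to expected
print(within_expected_distance(5000, 500, 100))   # far larger than expected
```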

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 97 % (110 932 918 reads); other: 3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
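The coordinate-based idea can be sketched as follows: a pair is flagged as a duplicate if an earlier pair has the same coordinates for both the read and its mate. The tuple layout, the function name and the sample pairs are assumptions for this example. Real duplicate-marking tools apply further rules (for instance, which copy to keep), so this is only a simplified picture.

```python
def mark_duplicates(pairs):
    """Flag pairs whose read and mate coordinates were already seen.

    Each pair is a tuple:
    (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Returns a list of booleans, True for every repeat of an
    earlier signature.
    """
    seen = set()
    flags = []
    for signature in pairs:
        flags.append(signature in seen)
        seen.add(signature)
    return flags


# Hypothetical pairs: the second repeats the first exactly.
pairs = [
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 400, "-"),
    ("1", 100, "+", "1", 450, "-"),
    ("2", 100, "+", "2", 400, "-"),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)
```

The third pair shares its start with the first but has a different mate position, so it is not flagged. This is why the mate coordinates matter in the comparison.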

Duplicates: 17.1 % (19 512 012 reads); non-duplicates: 82.9 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 240); y-axis: # Reads (10M to 100M). Non-zero bars, in order of increasing score: 8 230 662; 3 705 740; 7 397 052; 105 478 348.]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for every chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
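A per-chromosome tally along these lines could look like the sketch below. It assumes each record is a (reference name, FLAG) pair taken from an alignment file, uses the SAM flag bit 0x4 for "read unmapped", and skips reads with no reference name ("*"). The record layout, function name and sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def per_chromosome_counts(records):
    """Count mapped and unmapped reads per reference name.

    `records` yields (rname, flag) tuples. An unmapped read that
    carries its mate's chromosome is counted on that chromosome.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned, cannot be placed in a column
        key = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][key] += 1
    return counts


# Hypothetical records: chromosome X has one singleton pair.
records = [("1", 99), ("1", 147), ("X", 73), ("X", 133), ("*", 77)]
counts = per_chromosome_counts(records)
for chrom, c in counts.items():
    total = c["mapped"] + c["unmapped"]
    print(chrom, f"{c['mapped'] / total:.0%} mapped")
```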

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (1 to 21, X, Y, M); y-axis: 0% to 100%. Every column shown is 100% mapped and 0% unmapped.]