European Genome-Phenome Archive

File Quality

File Information: EGAF00002336720

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis gives the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis gives the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
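
As an illustration of how such a distribution can be tabulated, the sketch below counts how many positions fall at each coverage value and pools everything above a cap into one bin, similar to the ">1000" bin on the chart. The function name `coverage_histogram`, the `cap` parameter and the input list of per-position depths are assumptions made for this example; they do not describe how the archive computes its chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value.

    `depths` is an iterable of per-position coverage values (hypothetical input).
    Values above `cap` are pooled under the key `cap + 1`, standing in for ">cap".
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else cap + 1
        hist[key] += 1
    return hist


# Toy input: seven reference positions with made-up depths.
depths = [0, 1, 1, 2, 2, 2, 1500]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{1000}" if value == 1001 else str(value)
    print(label, hist[value])
```

Plotting the counts on a logarithmic y-axis keeps both the large low-coverage peak and the long thin tail visible on the same chart.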

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
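
To make the scale concrete, here is a minimal sketch of the conversion between Phred scores and error probabilities, using the standard Phred definition Q = -10 × log10(P). The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (P must be greater than 0)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10 000 base calls, and each step of 10 changes the error probability by a factor of ten.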

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 80G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
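
The rate described above can be sketched as follows. The function name and the counts in the example call are made up for illustration and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 991 mapped reads and 9 unmapped reads.
rate = mapped_rate(991, 9)
print(f"{rate:.1%}")
```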

Mapped: 99.1 % (847 752 789 reads). Unmapped: 0.9 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; that criterion is applied only in the Proper Pairs chart.

Both mates mapped: 98.6 % (843 723 108 reads). Other: 1.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
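
As a sketch of how a singleton can be recognised in practice, the example below assumes the alignments are stored in SAM/BAM format and inspects the FLAG bits defined by the SAM specification (0x1 read paired, 0x4 read unmapped, 0x8 mate unmapped). The function name `is_singleton` is chosen for this example, and whether the archive derives its statistic this way is an assumption.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def is_singleton(flag):
    """True if the read is paired and mapped while its mate is unmapped."""
    return (
        bool(flag & PAIRED)
        and not flag & UNMAPPED
        and bool(flag & MATE_UNMAPPED)
    )


# 73 = 0x1 + 0x8 + 0x40: paired, mate unmapped, first in pair.
# 133 = 0x1 + 0x4 + 0x80: paired, read unmapped, second in pair.
# 99 = 0x1 + 0x2 + 0x20 + 0x40: properly paired, both mates mapped.
for flag in (73, 133, 99):
    print(flag, is_singleton(flag))
```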

Singletons: 0.5 % (4 029 681 reads). Other: 99.5 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (427 779 286 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads that map with an unexpectedly large separation are a signal for the variant and are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
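
The arithmetic behind the ~300 base pair figure can be sketched as below. The function name is hypothetical, and real aligners typically estimate the expected distance from the data rather than from nominal library values.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when each read starts at one end of the fragment."""
    return fragment_length - 2 * read_length


# The example from the text: ~500 bp fragments, 100 bp reads from either end.
gap = expected_inner_distance(500, 100)
print(gap)  # 300
```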

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 91.6 % (783 809 254 reads). Other: 8.4 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
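
The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration: the tuple layout and the function name `find_duplicates` are assumptions for this example, and real duplicate-marking tools apply more refined rules than keeping the first read seen.

```python
def find_duplicates(reads):
    """Return indices of reads whose coordinates repeat an earlier read.

    Each read is a tuple (chrom, pos, strand, mate_chrom, mate_pos).
    The first read seen at a given set of coordinates is kept; later ones
    with identical coordinates are reported as duplicates.
    """
    seen = set()
    duplicates = []
    for index, read in enumerate(reads):
        if read in seen:
            duplicates.append(index)
        else:
            seen.add(read)
    return duplicates


# Toy data: reads 1 and 4 share every coordinate with read 0.
reads = [
    ("1", 100, "+", "1", 350),
    ("1", 100, "+", "1", 350),
    ("1", 100, "+", "1", 400),
    ("2", 100, "+", "2", 350),
    ("1", 100, "+", "1", 350),
]
dups = find_duplicates(reads)
print(dups, len(dups) / len(reads))
```

Because the check relies only on coordinates, independent fragments that happen to start and end at the same positions are also flagged, which is why the rate becomes less informative at very high depth.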

Duplicates: 30.7 % (262 566 334 reads). Other: 69.3 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
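
One way to summarise such a distribution is the share of reads at or above a chosen mapping quality. The sketch below assumes the distribution is available as a dictionary from score to read count; the function name, the threshold and the counts are made up for illustration.

```python
def fraction_at_or_above(hist, threshold):
    """Share of reads whose mapping quality is at least `threshold`.

    `hist` maps a mapping quality score to the number of reads with that score.
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty distribution")
    return sum(count for score, count in hist.items() if score >= threshold) / total


# Hypothetical distribution: most reads at 60, a few ambiguous reads at 0.
hist = {0: 5, 30: 15, 60: 80}
print(fraction_at_or_above(hist, 60))
print(fraction_at_or_above(hist, 1))
```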

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (0 to 800M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome, stacked columns. X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped fractions range from 99.52% to 99.64%; unmapped fractions range from 0.36% to 0.48%.]