European Genome-Phenome Archive

File Quality

File Information: EGAF00000643268

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis represents the coverage value, and the y-axis represents the number of positions (bases) in the reference file that are covered that many times.

The data are plotted on a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: Base Coverage Distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred at that base. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
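As a small illustration of the Phred scale, the sketch below converts between a score and an error probability using the standard definition Q = -10 log10(P). The function names are illustrative and are not part of any EGA tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print the error probability for a few common scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4%}")
```

On this scale a score of 40 corresponds to an expected error rate of 1 in 10 000 base calls.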

[Chart: Base Quality. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 1.1G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but a read carrying more significant variation can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
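The ratio can be written as a short function. This is only a sketch: the counts in the usage line are made-up illustrative numbers, not values from this file.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts, for illustration only.
print(f"{mapped_rate(990, 10):.1%}")
```

The parentheses in the denominator matter: without them the expression would read as (mapped / mapped) + unmapped.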

Mapped reads: 99 % (57 454 593 reads); unmapped: 1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance from each other are also counted here; whether a pair maps at the expected distance is the subject of the Proper Pairs chart.

Both mates mapped: 98.7 % (57 277 252 reads); other: 1.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
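One way to tell these categories apart is by the FLAG field of each alignment record. The sketch below uses the bit values defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The source does not state how these charts are actually computed, so the `classify` function and its category names are assumptions made for illustration.

```python
from collections import Counter

# FLAG bits as defined in the SAM specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify(flag: int) -> str:
    """Assign a read to a mapping category based on its SAM FLAG."""
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"  # this read maps, its mate does not
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


# Example flags: 99 and 147 form a mapped pair, 73 is a singleton,
# 133 is the unmapped mate of that singleton.
counts = Counter(classify(f) for f in (99, 147, 73, 133))
print(counts)
```

Under this scheme the mapped reads total is the sum of the singleton and both-mates-mapped categories (plus any unpaired mapped reads).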

Singletons: 0.3 % (177 341 reads); other: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (29 007 133 reads); other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant, and they would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
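The expected separation in the fragment example above is simple arithmetic: the fragment length minus the two read lengths. A minimal sketch, with a function name chosen here for illustration:

```python
def expected_inner_distance(fragment_len: int, read_len: int) -> int:
    """Gap expected between the two mates of a fragment, in base pairs.

    A negative result means the two reads overlap.
    """
    return fragment_len - 2 * read_len


# The example from the text: ~500 bp fragments with 100 bp reads.
print(expected_inner_distance(500, 100))  # 300
```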

Proper pairs: 98.6 % (57 173 546 reads); other: 1.4 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
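The coordinate-matching idea described above can be sketched as follows. This is a deliberately simplified illustration, not the method used to produce this report: the `Pair` record and `find_duplicates` function are hypothetical, and real duplicate-marking tools apply more elaborate rules (for example, taking strand and clipping into account).

```python
from typing import Iterable, NamedTuple, Set


class Pair(NamedTuple):
    """Hypothetical minimal record for one mapped read pair."""
    name: str
    chrom: str
    pos: int
    mate_chrom: str
    mate_pos: int


def find_duplicates(pairs: Iterable[Pair]) -> Set[str]:
    """Return names of pairs whose coordinates repeat an earlier pair."""
    seen = set()
    duplicates = set()
    for p in pairs:
        key = (p.chrom, p.pos, p.mate_chrom, p.mate_pos)
        if key in seen:
            duplicates.add(p.name)  # same coordinates as an earlier pair
        else:
            seen.add(key)  # first pair at these coordinates is kept
    return duplicates


pairs = [
    Pair("r1", "1", 1000, "1", 1300),
    Pair("r2", "1", 1000, "1", 1300),  # same coordinates as r1
    Pair("r3", "1", 1000, "1", 1350),  # mate position differs
]
print(find_duplicates(pairs))
```

Because only coordinates are compared, independent fragments that happen to share both positions are also flagged, which is why the rate becomes less informative at very high depth.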

Duplicates: 4.5 % (2 605 330 reads); other: 95.5 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (5M to 45M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample comes from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
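A per-chromosome tally along these lines could look like the sketch below. It assumes each record is reduced to its reference name and FLAG, and it relies on the SAM convention that an unmapped read with a mapped mate carries its mate's reference name. The function name and input shape are illustrative assumptions, not the report's actual implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

READ_UNMAPPED = 0x4  # FLAG bit from the SAM specification


def mapped_fraction_by_chromosome(
    records: Iterable[Tuple[str, int]],
) -> Dict[str, float]:
    """Compute mapped / (mapped + unmapped) for each reference name."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # read has no chromosome assigned at all
        index = 1 if flag & READ_UNMAPPED else 0
        counts[rname][index] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# Hypothetical records as (reference name, FLAG) tuples.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("Y", 0), ("*", 77)]
print(mapped_fraction_by_chromosome(records))
```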

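[Chart: Mapped vs Unmapped. X-axis: chromosome. Y-axis: percentage of reads (0% to 100%), stacked as mapped and unmapped.]

Chromosome  Mapped   Unmapped
1           99.66%   0.34%
2           99.71%   0.29%
3           99.85%   0.15%
4           99.37%   0.63%
5           99.79%   0.21%
6           99.86%   0.14%
7           99.65%   0.35%
8           99.56%   0.44%
9           99.43%   0.57%
10          99.68%   0.32%
11          99.79%   0.21%
12          99.87%   0.13%
13          99.85%   0.15%
14          99.79%   0.21%
15          99.62%   0.38%
16          99.56%   0.44%
17          99.59%   0.41%
18          99.7%    0.3%
19          99.83%   0.17%
20          99.79%   0.21%
21          99.68%   0.32%
X           99.67%   0.33%
Y           95.42%   4.58%
M           99.66%   0.34%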