European Genome-Phenome Archive

File Quality

File Information: EGAF00000644639

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The number of bases is shown on a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.
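As a minimal sketch of how such a distribution could be tallied, the snippet below counts how many positions have each coverage value. The input list `depths` and the function name `coverage_histogram` are hypothetical, and the cap of 1000 is an assumption that mirrors the ">1000" bin on the chart's x-axis; this is not the archive's own pipeline.

```python
from collections import Counter

CAP = 1000  # assumed cap: coverage values above it share one ">1000" bin


def coverage_histogram(depths, cap=CAP):
    """Count how many positions have each coverage value.

    `depths` is an iterable of per-position read depths (hypothetical input).
    Values above `cap` are pooled under the key `cap + 1`.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Toy per-position depths, invented for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 5, 1200, 3000]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = f">{CAP}" if value > CAP else str(value)
    print(label, hist[value])
```

When plotted, the counts would go on a logarithmic y-axis so that the rare high-coverage bins remain visible next to the large low-coverage peak.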

Chart: Base Coverage Distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 100 to 100M).

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
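The relationship between a Phred score and an error probability can be sketched in a few lines, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to one expected error in 10 000 base calls, and each step of 10 in the score divides the error probability by 10.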

Chart: Base Quality. X-axis: Phred quality score (0 to 45). Y-axis: # bases (0G to 1G).

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
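The rate calculation can be sketched as follows. The mapped count is the one shown for this file; the unmapped count is a hypothetical value chosen only so that the example lands near the reported 99.3 %, since the page does not state it.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


mapped = 53_568_116  # mapped reads shown for this file
unmapped = 377_000   # hypothetical, not stated on the page

print(f"{100 * mapped_rate(mapped, unmapped):.1f} %")
```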

Mapped: 99.3 % (53 568 116 reads). Unmapped: 0.7 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; that criterion is assessed in the Proper Pairs chart.

Both mates mapped: 99.1 % (53 451 000 reads). Remainder: 0.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
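For alignments stored in SAM/BAM, these categories can be read from the FLAG field. The sketch below uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are illustrative and are not taken from the archive's pipeline.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify(flag):
    """Classify one read from its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "mapped_unpaired"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"


for flag in (99, 147, 73, 133):
    print(flag, classify(flag))
```

The figures on this page are consistent with that split: 53 451 000 reads with both mates mapped plus 117 116 singletons gives the 53 568 116 mapped reads reported above.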

Singletons: 0.2 % (117 116 reads). Remainder: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
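A minimal sketch of the strand tally, assuming SAM FLAG values as input: bit 0x10 marks a read aligned to the reverse strand and bit 0x4 marks an unmapped read (both from the SAM specification). Using mapped reads as the denominator is an assumption, as is the function name.

```python
UNMAPPED = 0x4  # this read is unmapped
REVERSE = 0x10  # read aligned to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Toy flags: two forward reads, two reverse reads, one unmapped read.
print(forward_fraction([0, 16, 99, 147, 4]))
```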

Forward strand: 50 % (26 962 934 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates can map at an unexpected separation. This is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.
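The fragment arithmetic in this example can be written out as a short sketch. The function names and the `tolerance` parameter are hypothetical; in practice the accepted range depends on the aligner and on the library.

```python
def expected_gap(fragment_len, read_len):
    """Expected unsequenced gap between two mates, in base pairs.

    A negative result would mean the two reads overlap.
    """
    return fragment_len - 2 * read_len


def at_expected_distance(observed_gap, fragment_len, read_len, tolerance=100):
    """True if the observed gap is within `tolerance` bp of the expectation."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance


print(expected_gap(500, 100))                 # 500 - 2 * 100
print(at_expected_distance(320, 500, 100))    # close to the expected gap
print(at_expected_distance(5000, 500, 100))   # far from the expected gap
```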

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.7 % (53 207 134 reads). Remainder: 1.3 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
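The coordinate-matching idea described here can be illustrated with a deliberately simplified sketch. The record layout (name, chromosome, position, strand, mate chromosome, mate position) and the function name are assumptions made for the example; production duplicate-marking tools apply more careful rules than this.

```python
def find_duplicates(reads):
    """Return names of reads whose coordinates repeat an earlier read.

    `reads` is a list of tuples:
    (name, chrom, pos, strand, mate_chrom, mate_pos).
    The first read seen at each coordinate key is kept; later ones
    with the same key are reported as duplicates.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, strand, mate_chrom, mate_pos in reads:
        key = (chrom, pos, strand, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Toy records, invented for illustration only.
reads = [
    ("r1", "1", 1000, "+", "1", 1400),
    ("r2", "1", 1000, "+", "1", 1450),  # same start, different mate position
    ("r3", "1", 1000, "+", "1", 1400),  # same key as r1
    ("r4", "2", 1000, "+", "2", 1400),
]
print(find_duplicates(reads))
```

Requiring the mate coordinates to match as well is what keeps r2 from being flagged, even though it starts at the same position as r1.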

Duplicates: 4 % (2 154 962 reads). Remainder: 96 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (5M to 40M).

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
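A per-chromosome tally along these lines can be sketched from two SAM fields, the reference name (RNAME) and the FLAG. Bit 0x4 marks an unmapped read and "*" is the SAM placeholder for a read with no reference assigned; the input layout and function name are assumptions made for the example.

```python
from collections import defaultdict

UNMAPPED = 0x4  # this read is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: fraction of its reads that are mapped}.

    `records` is an iterable of (rname, flag) pairs. Reads with no
    chromosome assigned (rname "*") are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# Toy records: the read with flag 133 is unmapped but placed with its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
print(per_chromosome_mapped_rate(records))
```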

Chart: Mapped vs Unmapped, stacked columns per chromosome (x-axis labels 1 to 21, X, Y, M). Y-axis: 0% to 100%. Every column shows 100% mapped and 0% unmapped.