European Genome-Phenome Archive

File Quality

File Information: EGAF00006164700

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value (the number of times a position in the reference file is covered), and the y-axis shows the number of bases with that coverage.

Data is shown on a log scale to reduce the visual spread of the values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
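
As a rough illustration of how such a distribution can be built, the sketch below counts how many positions share each coverage value and takes the log of the counts. The function name, the cap of 1000 (mirroring the ">1000" bin on the chart axis) and the sample depths are assumptions made for this example, not the archive's actual pipeline.

```python
import math
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one overflow bin."""
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500, 2000]
hist = coverage_histogram(depths)

# Log-scaled counts, as they would be placed on a log y-axis.
log_counts = {cov: math.log10(n) for cov, n in hist.items()}
```

Here the key `cap + 1` stands in for the overflow bin, so both depths above 1000 are counted together.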

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: number of bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. through an error in the sequencing. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
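
The conversion between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 × log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)    # about 1 error in 10 000 base calls
q30 = error_prob_to_phred(0.001)  # about 30
```

A score of 40 therefore corresponds to roughly one wrong call in ten thousand bases.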

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0G to 45G). Non-zero bin counts, in the order shown: 1 315 152; 1 437 648 578; 2 416 483 120; 48 119 684 518.]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
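
A minimal sketch of this calculation, with a hypothetical function name and made-up counts:

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=999, unmapped=1)
```

With these counts the rate is 0.999, i.e. 99.9%.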

Mapped reads: 343 753 610 (99.9 %); remaining reads: 0.1 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart covers that distinction.

Both mates mapped: 343 500 992 reads (99.8 %); remaining reads: 0.2 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
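
In SAM/BAM files this distinction can be read from the FLAG field. The sketch below assumes paired-end records and uses the SAM flag bits 0x4 (read unmapped) and 0x8 (mate unmapped); the function name and category labels are illustrative only.

```python
READ_UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM FLAG bit: the mate of this read is unmapped

def classify_read(flag):
    """Classify a paired-end read by its own and its mate's mapping status."""
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"

# Example FLAG values: 99 (both mates mapped), 73 (mate unmapped), 133 (read unmapped).
labels = [classify_read(f) for f in (99, 73, 133)]
```

The singleton rate would then be the share of reads labelled "singleton" among all reads.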

Singletons: 252 618 reads (0.1 %); remaining reads: 99.9 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
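
One way to compute this fraction from SAM FLAG values is sketched below. It assumes the strand is taken from bit 0x10 (read reverse strand) and that unmapped reads (bit 0x4) are excluded; the function name and the example flags are hypothetical.

```python
READ_UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped
READ_REVERSE = 0x10  # SAM FLAG bit: read maps to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads on the forward strand, or None if no read is mapped."""
    mapped = [f for f in flags if not f & READ_UNMAPPED]
    if not mapped:
        return None
    forward = sum(1 for f in mapped if not f & READ_REVERSE)
    return forward / len(mapped)

# Example FLAGs: 99 and 163 are forward, 147 and 83 are reverse, 77 is unmapped.
fraction = forward_fraction([99, 147, 83, 163, 77])
```

Two of the four mapped reads are on the forward strand, which gives the expected 0.5.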

Forward strand: 172 103 084 reads (50 %); remaining reads: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads may map with an unexpected separation; this is a signal for the variant, and the reads would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.
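
The arithmetic behind the ~300 base pair figure, together with a check of the SAM proper-pair flag bit (0x2), can be sketched as follows. The function names are illustrative; how an aligner decides to set the proper-pair bit depends on the aligner and is not modelled here.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned according to the aligner

def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when a read is taken from each end of a fragment."""
    return fragment_length - 2 * read_length

def is_proper_pair(flag):
    """True if the aligner marked this read as part of a proper pair."""
    return bool(flag & PROPER_PAIR)

gap = expected_inner_distance(fragment_length=500, read_length=100)
```

For a 500 bp fragment and 100 bp reads the gap is 500 - 2 × 100 = 300 bp.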

Proper pairs: 332 513 028 reads (96.6 %); remaining reads: 3.4 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
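
The coordinate-based idea described above can be sketched as below. This is a simplified illustration, not the method of any particular tool: real duplicate-marking software applies more refined criteria. The tuple layout, function name and reads are all hypothetical.

```python
def mark_duplicates(reads):
    """Return names of reads whose coordinates and mate coordinates were already seen.

    Each read is a tuple: (name, chrom, pos, strand, mate_chrom, mate_pos).
    The first read seen at a given key is kept; later ones count as duplicates.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, strand, mate_chrom, mate_pos in reads:
        key = (chrom, pos, strand, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical reads, for illustration only.
reads = [
    ("r1", "chr1", 100, "+", "chr1", 400),
    ("r2", "chr1", 100, "+", "chr1", 400),  # same coordinates as r1
    ("r3", "chr1", 100, "+", "chr1", 450),  # mate maps elsewhere, so not a duplicate
    ("r4", "chr1", 100, "+", "chr1", 400),  # same coordinates as r1
]
duplicates = mark_duplicates(reads)
duplicate_rate = len(duplicates) / len(reads)
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is the over-counting at high depth mentioned above.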

Duplicates: 83 814 666 reads (24.4 %); remaining reads: 75.6 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
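
To summarise such a distribution, one option is to report the share of reads at or above a mapping quality threshold and the share at zero. The sketch below assumes a histogram given as a dictionary from score to read count; the function name, the threshold of 30 and the counts are made up for illustration.

```python
def fraction_at_or_above(histogram, threshold):
    """Share of reads whose mapping quality is at least `threshold`."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    kept = sum(n for score, n in histogram.items() if score >= threshold)
    return kept / total

# Hypothetical mapping quality histogram: score -> number of reads.
mapq_hist = {0: 5, 20: 10, 60: 85}
well_mapped = fraction_at_or_above(mapq_hist, 30)
ambiguous = mapq_hist.get(0, 0) / sum(mapq_hist.values())
```

In this made-up histogram 85% of reads pass the threshold and 5% have a score of 0.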

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (20M to 300M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Although the sequenced sample may come from a female, reads can still be placed on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read should have the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off one reference sequence onto another, it will be given the right chromosome and position, but it will also be classified as unmapped.
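
A per-chromosome tally along these lines might look like the sketch below. It assumes each record is a (chromosome, FLAG) pair in which an unmapped read already carries its mate's chromosome, and it uses the SAM flag bit 0x4 (read unmapped); the function name and records are hypothetical.

```python
from collections import defaultdict

READ_UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_mapped_rate(records):
    """Fraction of mapped reads per chromosome from (chromosome, flag) records."""
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for chrom, flag in records:
        status = "unmapped" if flag & READ_UNMAPPED else "mapped"
        counts[chrom][status] += 1
    return {
        chrom: c["mapped"] / (c["mapped"] + c["unmapped"])
        for chrom, c in counts.items()
    }

# Hypothetical records: the read with flag 133 is unmapped but placed on its mate's chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
rates = per_chromosome_mapped_rate(records)
```

In this example chromosome 1 has three mapped reads out of four, and chromosome X has all reads mapped.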

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome; y-axis: percentage of reads (0% to 100%); series: mapped, unmapped. Mapped values range from 99.79% to 99.93%, unmapped values from 0.07% to 0.21%.]