European Genome-Phenome Archive

File Quality

File Information: EGAF00003612715

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference is covered. The y-axis shows the number of bases (positions) with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
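As a minimal sketch of how such a distribution could be tallied, the snippet below counts how many positions have each coverage value. The function name, the input list of per-position depths, and the overflow bin at 1000 (mirroring the chart's ">1000" tick) are illustrative assumptions, not the archive's actual pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed by cap + 1,
    mirroring a ">1000" bin (the cap value is an assumption).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)


# Hypothetical per-position depths for a tiny reference.
print(coverage_histogram([0, 0, 0, 1, 1, 2, 1500]))
# {0: 3, 1: 2, 2: 1, 1001: 1}
```

Plotting the counts on a logarithmic y-axis keeps the low-coverage peak and the long tail visible in the same chart.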

[Chart: base coverage distribution. X-axis: coverage value (ticks from 100 to >1000). Y-axis: number of bases, log scale (100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
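To make the scale concrete, here is a small sketch that converts between Phred scores and error probabilities using the standard Phred definition, Q = -10 log10(P). The function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Return the error probability P for a Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Return the Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


# Each step of 10 in Q lowers the error probability tenfold.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10 000 base calls.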

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 1G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
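The calculation can be sketched as follows. The function name and the example counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 998 mapped reads and 2 unmapped reads.
print(mapped_rate(998, 2))  # 0.998
```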

Mapped: 99.8 % (18 692 393 reads). Unmapped: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here, unlike in the Proper Pairs chart.

Both mates mapped: 99.6 % (18 653 244 reads). Remainder: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
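The categories above can be sketched in code, assuming the alignments are stored in SAM/BAM format, where the FLAG field uses bit 0x1 for a paired read, 0x4 for an unmapped read and 0x8 for an unmapped mate. The function name and category labels are illustrative, not the archive's own.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify_pair_status(flag):
    """Classify one read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate is unmapped
    return "unmapped"


print(classify_pair_status(99))   # both_mates_mapped
print(classify_pair_status(73))   # singleton
print(classify_pair_status(133))  # unmapped
```

Counting the reads that fall into each label and dividing by the total would give rates comparable to the charts in this report.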

Singletons: 0.2 % (39 149 reads). Remainder: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (9 366 655 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion or deletion, the mates map with an unexpected separation, which is a signal for the variant, and such pairs are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.
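The arithmetic in the example above can be written out as a short sketch. The function names and the tolerance parameter are hypothetical; real aligners typically derive the accepted range from the observed insert-size distribution rather than from a fixed tolerance.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length


def has_expected_distance(observed_gap, fragment_length, read_length, tolerance):
    """Check whether an observed gap is within `tolerance` of the expectation.

    `tolerance` is an illustrative parameter, not a value from the source.
    """
    expected = expected_inner_distance(fragment_length, read_length)
    return abs(observed_gap - expected) <= tolerance


print(expected_inner_distance(500, 100))            # 300
print(has_expected_distance(320, 500, 100, 50))     # True
print(has_expected_distance(5000, 500, 100, 50))    # False
```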

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.8 % (18 511 144 reads). Remainder: 1.2 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
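The coordinate-based idea described above can be sketched as follows. This is a simplified illustration, not the method used to produce this report: the tuple layout and function name are assumptions, and production duplicate-marking tools apply further criteria.

```python
def mark_duplicates(pairs):
    """Flag read pairs whose coordinates match an earlier pair.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos, strand) tuples
    (a hypothetical layout). The first pair seen at a given key is kept as
    the original; later pairs with the same key are flagged as duplicates.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


pairs = [
    ("1", 100, "1", 400, "+"),
    ("1", 100, "1", 400, "+"),  # same coordinates as the first pair
    ("1", 100, "1", 450, "+"),  # same start, different mate position
    ("2", 100, "2", 400, "+"),
]
flags = mark_duplicates(pairs)
print(flags)                    # [False, True, False, False]
print(sum(flags) / len(flags))  # 0.25
```

The sketch also shows why depth matters: the more pairs are sampled, the more likely two independent fragments are to share the same key by chance.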

Duplicates: 66.4 % (12 430 421 reads). Remainder: 33.6 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (ticks from 2M to 16M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads may still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together, so a read that hangs off the end of one reference sequence onto the next is given a chromosome and position but is also classified as unmapped.
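A per-chromosome tally along these lines might look like the sketch below. It assumes SAM-style records reduced to (RNAME, FLAG) tuples, with FLAG bit 0x4 marking an unmapped read and "*" marking a read with no assigned reference. The function name and input layout are illustrative.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def per_chromosome_rates(records):
    """Return {chrom: (mapped_fraction, unmapped_fraction)}.

    `records` is an iterable of (rname, flag) tuples. An unmapped read that
    carries its mate's chromosome is counted under that chromosome; reads
    with rname "*" have no chromosome and are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (mapped / (mapped + unmapped), unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }


# Hypothetical records: three mapped reads and one unmapped read on "1".
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("*", 77)]
print(per_chromosome_rates(records))
# {'1': (0.75, 0.25), 'X': (1.0, 0.0)}
```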

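[Chart: stacked columns of mapped and unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%.]

Mapped, in plotted order: 99.79%, 99.77%, 99.79%, 99.78%, 99.79%, 99.8%, 99.78%, 99.79%, 99.81%, 99.77%, 99.8%, 99.8%, 99.77%, 99.79%, 99.77%, 99.8%, 99.79%, 99.79%, 99.84%, 99.8%, 99.76%, 99.78%, 99.8%, 99.83%

Unmapped, in plotted order: 0.21%, 0.23%, 0.21%, 0.22%, 0.21%, 0.2%, 0.22%, 0.21%, 0.19%, 0.23%, 0.2%, 0.2%, 0.23%, 0.21%, 0.23%, 0.2%, 0.21%, 0.21%, 0.16%, 0.2%, 0.24%, 0.22%, 0.2%, 0.17%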