European Genome-Phenome Archive

File Quality

File Information

EGAF00004837549

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to minimise the variability. A high peak at the start (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but we expect it to be skewed towards larger (more confident) scores, typically around 40.
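As a rough illustration of the Phred relationship, the sketch below converts between a score and an error probability using the standard definition Q = -10 log10(P). The function names are hypothetical and are not part of any EGA tooling.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a call is wrong, given its Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (requires 0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


# Each step of 10 in Q corresponds to a tenfold drop in error probability.
for q in (10, 20, 30, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):g}")
```

Under this definition a score of 40 corresponds to roughly one expected error in 10 000 base calls.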

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 11G).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
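The ratio described above can be sketched as follows. The function name and the read counts in the example are illustrative assumptions, not values taken from this file.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts chosen only to illustrate the formula.
print(f"{mapped_rate(mapped=992, unmapped=8):.1%}")
```

Note that the denominator is the total read count, so the parentheses matter: mapped / mapped + unmapped would evaluate to something else entirely.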

Mapped: 99.2 % (112 680 627 reads). Unmapped: 0.8 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are still counted here; the Proper Pairs chart applies the stricter distance criterion.

Both mates mapped: 99 % (112 470 516 reads). Other: 1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
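In SAM/BAM files these categories can be read off the FLAG field. The sketch below uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). How EGA computes its own statistics is not stated in this report, so the `classify` function and its category names are assumptions for illustration only.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify(flag: int) -> str:
    """Assign a read to one of the categories discussed above (hypothetical names)."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"  # mapped, but its mate is not
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


# FLAG 99: paired, both mapped. FLAG 73: mapped with unmapped mate. FLAG 133: unmapped.
for flag in (99, 73, 133):
    print(flag, classify(flag))
```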

Singletons: 0.2 % (210 111 reads). Other: 99.8 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (56 811 243 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant (e.g. a large insertion), reads spanning it map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).
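The arithmetic behind the ~500 bp fragment example can be sketched as below. The `tolerance` parameter is a hypothetical stand-in: real aligners derive the acceptable range from the observed insert size distribution rather than from a fixed window.

```python
def expected_inner_distance(fragment_length: int, read_length: int) -> int:
    """Gap between the two mates: the fragment minus one read at each end."""
    return fragment_length - 2 * read_length


def separation_as_expected(observed_gap: int, fragment_length: int,
                           read_length: int, tolerance: int = 100) -> bool:
    """True if the observed gap is within `tolerance` bp of the expected one.

    `tolerance` is an illustrative assumption, not an aligner default.
    """
    expected = expected_inner_distance(fragment_length, read_length)
    return abs(observed_gap - expected) <= tolerance


print(expected_inner_distance(500, 100))       # 500 - 2 * 100
print(separation_as_expected(320, 500, 100))   # close to the expected gap
print(separation_as_expected(5000, 500, 100))  # far from it: candidate variant
```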

Proper pairs: 95.2 % (108 131 806 reads). Other: 4.8 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting it. Additionally, as the sequencing depth increases, it also becomes increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
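The coordinate-based approach described above can be sketched in a few lines. This is a simplified illustration, not the algorithm of any particular tool: the `Pair` type and `find_duplicates` function are hypothetical, and real duplicate markers also consider strand, clipping, and base qualities when choosing which copy to keep.

```python
from typing import Iterable, List, NamedTuple


class Pair(NamedTuple):
    """Minimal, hypothetical view of an aligned read pair."""
    name: str
    chrom: str
    pos: int
    mate_chrom: str
    mate_pos: int


def find_duplicates(pairs: Iterable[Pair]) -> List[str]:
    """Names of pairs whose coordinates and mate coordinates were already seen."""
    seen = set()
    duplicates = []
    for pair in pairs:
        key = (pair.chrom, pair.pos, pair.mate_chrom, pair.mate_pos)
        if key in seen:
            duplicates.append(pair.name)  # every copy after the first
        else:
            seen.add(key)
    return duplicates


pairs = [
    Pair("a", "1", 100, "1", 400),
    Pair("b", "1", 100, "1", 400),  # same coordinates as "a"
    Pair("c", "1", 100, "1", 420),  # same start, different mate position
]
print(find_duplicates(pairs))
```

Because only coordinates are compared, two independent fragments that happen to share both positions are also flagged, which is the false-positive effect at high depth mentioned above.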

Duplicates: 10.2 % (11 589 031 reads). Other: 89.8 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred-scaled scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (10M to 100M). Largest bars: 103 392 183 reads at score 60 and 6 544 350 reads at score 0.]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.
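A per-chromosome tally of this kind can be sketched as follows, assuming each record is reduced to its reference name and SAM FLAG. The `per_chromosome_mapped_rate` function and the example records are hypothetical; reads with no reference name (`*`) are skipped because they cannot be attributed to a chromosome.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

UNMAPPED = 0x4  # standard SAM FLAG bit: this read is unmapped


def per_chromosome_mapped_rate(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Mapped fraction per chromosome from (reference name, FLAG) tuples."""
    mapped = Counter()
    unmapped = Counter()
    for chrom, flag in records:
        if chrom == "*":
            continue  # no chromosome assigned, e.g. both mates unmapped
        if flag & UNMAPPED:
            unmapped[chrom] += 1  # placed at its mate's chromosome and POS
        else:
            mapped[chrom] += 1
    chroms = mapped.keys() | unmapped.keys()
    return {c: mapped[c] / (mapped[c] + unmapped[c]) for c in chroms}


# Hypothetical records for illustration only.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("*", 77)]
print(per_chromosome_mapped_rate(records))
```

Reads with no chromosome assignment drop out of the per-chromosome denominators, which is one plausible reason a per-chromosome mapped rate can sit above the overall mapped rate.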

[Chart: stacked columns of mapped and unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome  Mapped   Unmapped
1           99.81%   0.19%
2           99.79%   0.21%
3           99.81%   0.19%
4           99.81%   0.19%
5           99.81%   0.19%
6           99.81%   0.19%
7           99.82%   0.18%
8           99.82%   0.18%
9           99.82%   0.18%
10          99.82%   0.18%
11          99.81%   0.19%
12          99.82%   0.18%
13          99.81%   0.19%
14          99.82%   0.18%
15          99.82%   0.18%
16          99.83%   0.17%
17          99.83%   0.17%
18          99.81%   0.19%
19          99.85%   0.15%
20          99.82%   0.18%
21          99.82%   0.18%
X           99.81%   0.19%
Y           99.85%   0.15%
M           99.78%   0.22%