European Genome-Phenome Archive

File Quality

File Information: EGAF00002492830

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]
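To make the chart's construction concrete, here is a minimal sketch of how such a distribution could be tallied from per-position depths. The function name `coverage_histogram`, the `cap` parameter and the input list are assumptions for illustration; the archive's own pipeline is not described in this report.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed by cap + 1,
    mirroring the ">1000" bin on the chart's x-axis (an assumption about
    how that bin is built).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)


# Hypothetical per-position depths, not taken from the file above.
example = coverage_histogram([0, 0, 1, 1, 1, 2, 1500, 3000])
```

Plotting the counts on a log-scaled y-axis then gives the expected shape: a tall peak at low coverage and a long descending tail.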

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 7G).]
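The Phred relationship Q = -10 log10(P) can be checked with a short sketch. The function names are illustrative and not part of any EGA tool.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of about 1 in 10 000, and a score of 20 to about 1 in 100.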

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

99.4 % mapped (76 521 544 reads); 0.6 % unmapped
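The rate formula mapped / (mapped + unmapped) is simple enough to sketch directly. The function name and the example counts are hypothetical; the report does not state the unmapped read count for this file.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts chosen for illustration only.
rate = mapped_rate(mapped=994, unmapped=6)
```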

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; such pairs are the subject of the Proper Pairs chart.

99.1 % with both mates mapped (76 293 440 reads); 0.9 % other

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

0.3 % singletons (228 104 reads); 99.7 % other
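The distinction between "both mates mapped" and "singleton" can be sketched using the FLAG bits defined in the SAM format specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function name and category labels are assumptions for illustration, and this is not necessarily how the archive computes its figures.

```python
# FLAG bits from the SAM format specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_pair_status(flag):
    """Classify one alignment record by the mapping state of the read and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"  # this read mapped, its mate did not
    return "unmapped"
```

For example, a record with FLAG 73 (paired, mate unmapped, first in pair) is a singleton, while FLAG 99 describes a read whose mate is also mapped.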

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

50 % forward strand (38 492 834 reads); 50 % other
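As a rough sketch, the forward-strand fraction can be derived from SAM FLAG values, where bit 0x10 marks a read aligned to the reverse strand and bit 0x4 marks an unmapped read. Using mapped reads as the denominator is an assumption here; the report does not state which denominator it uses.

```python
UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
REVERSE = 0x10   # SAM FLAG bit: read aligned to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads aligned to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Hypothetical flags: two forward, two reverse, one unmapped (ignored).
fraction = forward_fraction([0, 16, 0, 16, 4])
```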

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

96.4 % proper pairs (74 247 204 reads); 3.6 % other
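The ~300 base pair figure in the example above follows from subtracting both read lengths from the fragment length. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when a read is taken from each end of a fragment."""
    return fragment_length - 2 * read_length


# The worked example from the text: ~500 bp fragments, 100 bp reads.
gap = expected_inner_distance(fragment_length=500, read_length=100)
```

A negative result would mean the two reads overlap, which happens when fragments are shorter than twice the read length.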

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

3.2 % duplicates (2 492 799 reads); 96.8 % other
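The coordinate-based approach described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: the tuple layout and function name are assumptions, and production duplicate markers apply further criteria.

```python
def find_duplicates(reads):
    """Return names of reads whose coordinates repeat an earlier read's.

    `reads` is an iterable of (name, chrom, pos, mate_chrom, mate_pos)
    tuples (a hypothetical layout). A read counts as a duplicate when both
    its own position and its mate's position match a read seen before.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in reads:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical reads: r2 repeats r1's coordinates; r3 shares only the start.
dups = find_duplicates([
    ("r1", "1", 1000, "1", 1300),
    ("r2", "1", 1000, "1", 1300),
    ("r3", "1", 1000, "1", 1450),
])
```

The sketch also shows why high depth inflates the rate: with enough reads, two independent fragments can share both coordinates by chance and still be flagged.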

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (5M to 65M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome, stacked columns. X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Across the columns, the mapped fraction ranges from 99.37% to 99.73% and the unmapped fraction from 0.27% to 0.63%.]
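A per-chromosome breakdown of this kind could be tallied as in the sketch below, which relies on the convention described above: an unmapped read carries its mate's chromosome. The record layout and function name are assumptions for illustration; bit 0x4 is the SAM FLAG for an unmapped read.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) records.

    Unmapped reads count towards the chromosome they were assigned,
    which for a singleton's mate is the chromosome of the mapped read.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Hypothetical records: three mapped and one unmapped read on "1", one on "X".
rates = per_chromosome_mapped_rate(
    [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
)
```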