European Genome-Phenome Archive

File Quality

File Information: EGAF00002234580

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference is covered. The y-axis shows the number of bases with that coverage.

The y-axis uses a log scale so that very large and very small counts can be read on the same chart. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases (log scale, 1k to 100M).]
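
As a rough illustration of how a distribution like the one described above could be tabulated, the sketch below counts how many positions have each coverage value and pools everything above a cap into one overflow bin, mirroring the chart's ">1000" bucket. The function name, the input list and the `cap` parameter are assumptions made for this example and do not describe the archive's actual pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    `depths` is a hypothetical iterable of per-position coverage values.
    The overflow bin is stored under the key cap + 1.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist

# Toy input: two uncovered positions, three at 3x, two above the cap.
hist = coverage_histogram([0, 0, 3, 3, 3, 1500, 2000], cap=1000)
```

Plotting the resulting counts on a log-scaled y-axis gives the kind of chart described in this section.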

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly called, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 16G).]
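
To make the relationship between score and error probability concrete, here is a minimal sketch using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)     # about 1 error in 10,000 base calls
q30 = error_prob_to_phred(0.001)  # about Q30
```

Under this definition a score of 40 corresponds to roughly one wrong call in 10,000 bases, which is why a distribution concentrated near 40 indicates confident base calls.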

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. It is therefore not expected that the mapped read rate will reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (153 579 714 reads). Unmapped: 0.2 %.
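
The rate described above, mapped / (mapped + unmapped), can be sketched as follows. The function name and the toy counts are hypothetical. The page does not report the absolute number of unmapped reads, so none is assumed here.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical toy counts: 998 mapped reads and 2 unmapped reads.
rate = mapped_rate(998, 2)
```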

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads that do not map at the expected distance are also included here (compare the Proper Pairs chart).

Both mates mapped: 99.5 % (153 241 354 reads). Remainder: 0.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (338 360 reads). Remainder: 99.8 %.
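
The distinction between singletons and reads with both mates mapped can be sketched from alignment flags. This example assumes the alignments are stored in SAM/BAM format, which the page does not state. The bit values 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped) are those defined for the SAM FLAG field, and the function name and category labels are made up for illustration.

```python
PAIRED = 0x1         # read is part of a pair
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def classify_read(flag):
    """Classify one read by its SAM FLAG value (hypothetical labels)."""
    if flag & READ_UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

labels = [classify_read(f) for f in (73, 99, 77, 0)]
```

Counting the labels over all reads and dividing by the total would give rates comparable to the ones reported in the Both Mates Mapped and Singletons sections.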

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them should map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (76 979 879 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the mates will map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.1 % (151 069 252 reads). Remainder: 1.9 %.
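
The arithmetic behind the expected separation (a ~500 bp fragment with a 100 bp read from each end leaves ~300 bp between the reads) can be written as a small sketch. The function names and the `tolerance` parameter are assumptions for illustration. The check covers distance only and ignores read orientation, so it is a simplification of how a proper pair is actually decided.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between two mates: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length

def has_expected_separation(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap."""
    return abs(observed - expected) <= tolerance

gap = expected_inner_distance(500, 100)
ok = has_expected_separation(320, gap, tolerance=50)      # close to expected
far = has_expected_separation(5000, gap, tolerance=50)    # far too wide
```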

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.

Duplicates: 9 % (13 909 251 reads). Remainder: 91 %.
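
The coordinate-based approach described above can be sketched as follows: a pair is flagged as a duplicate when an earlier pair has the same coordinates for both mates. The tuple layout and function name are hypothetical, and this is a deliberate simplification and not the method of any particular tool.

```python
def mark_duplicates(pairs):
    """Flag pairs whose (chrom, pos, mate_chrom, mate_pos) was already seen.

    `pairs` is a hypothetical list of 4-tuples. The first pair seen at a
    given set of coordinates is kept; later ones are flagged as duplicates.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags

pairs = [
    ("1", 1000, "1", 1300),
    ("1", 1000, "1", 1300),  # same coordinates for both mates: duplicate
    ("1", 1000, "1", 1450),  # same start, different mate position: kept
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
```

The sketch also shows why very deep sequencing inflates the rate: independent fragments that happen to share both coordinates cannot be told apart from true PCR duplicates.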

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 130M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads may still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (axis labels 1 to 21, X, Y, M). Y-axis: 0% to 100%. Mapped fractions range from 99.69% to 99.86%; unmapped fractions range from 0.14% to 0.31%.]