European Genome-Phenome Archive

File Quality

File Information

EGAF00002391400

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis represents the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis represents the number of bases with that coverage value.

The number of bases is shown on a log scale so that both very large and very small counts remain readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.
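As an illustration of what the chart counts, the following is a minimal sketch and not the archive's actual pipeline. It assumes per-position depths are already available as a list of integers. The function name `coverage_histogram` and the `cap` parameter (mirroring the ">1000" bin on the chart axis) are hypothetical.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single overflow bin stored
    under the key cap + 1 (an assumption made for this sketch).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Toy per-position depths, invented for the example.
depths = [0, 1, 1, 2, 2, 2, 35, 1500]
hist = coverage_histogram(depths)
```

Plotting `hist` with coverage on the x-axis and a logarithmic y-axis would give a chart of the same kind as the one described above.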

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect a distribution skewed towards larger (more confident) scores, typically around 40.
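The relationship between score and error probability can be sketched as follows, using the standard Phred definition Q = -10·log10(P). The function names are hypothetical and chosen only for this example.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to the Phred score Q = -10*log10(P)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly one expected error in 10,000 calls.
p40 = phred_to_error_prob(40)
q_for_1_in_1000 = error_prob_to_phred(0.001)
```

Each step of 10 in the score corresponds to a tenfold change in the error probability.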

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 16G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but for more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads against the total number of reads: mapped / (mapped + unmapped).
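The calculation mapped / (mapped + unmapped) can be sketched from SAM FLAG values, where bit 0x4 marks a read as unmapped. This is a minimal sketch under the assumption that one FLAG integer per read is available. How the archive itself counts reads (for example, its handling of secondary or supplementary alignments) is not stated in the source, so the function `mapped_rate` is illustrative only.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM FLAGs."""
    flags = list(flags)
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    unmapped = len(flags) - mapped
    total = mapped + unmapped
    return mapped / total if total else 0.0

# Toy FLAG values, invented for the example:
# 99 and 147 are a mapped pair, 73 is mapped with an unmapped mate (133),
# 77 and 141 are a pair in which neither mate mapped.
flags = [99, 147, 73, 133, 77, 141]
rate = mapped_rate(flags)
```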

Mapped: 99.5 % (178 351 045); unmapped: 0.5 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other (the criterion used in the Proper Pairs chart).

Both mates mapped: 99.1 % (177 654 970); remainder: 0.9 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
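The distinction between singletons and fully mapped pairs can be sketched with SAM FLAG bits: 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). This is a minimal sketch. The category labels and the function `classify_read` are hypothetical and not taken from the archive's tooling.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def classify_read(flag):
    """Classify a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate did not map
    return "both_mapped"

# Toy FLAG values, invented for the example.
labels = [classify_read(f) for f in (99, 147, 73, 133)]
```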

Singletons: 0.4 % (696 075); remainder: 99.6 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (89 652 182); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, read pairs mapping with an unexpectedly large separation would be a signal for this variant, and those reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
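The arithmetic behind the ~300 base pair figure can be sketched as below. In practice aligners decide what counts as a proper pair themselves, typically from the observed insert-size distribution and read orientation, and record the result in the SAM FLAG bit 0x2. The `tolerance` parameter and both function names here are hypothetical and only illustrate the idea.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len

def looks_proper(observed_gap, fragment_len, read_len, tolerance=150):
    """Check whether an observed gap is close to the expected one.

    `tolerance` is an arbitrary value chosen for this sketch.
    """
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance

# Values from the text: ~500 bp fragments with 100 bp reads from each end.
gap = expected_inner_distance(500, 100)
```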

Proper pairs: 95.3 % (170 913 360); remainder: 4.7 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
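The coordinate-based detection described above can be sketched as follows. This is a deliberately simplified illustration: each pair is assumed to be a tuple of a read name plus the chromosome, position and strand of both mates, and the first pair seen at a set of coordinates is kept while later ones are reported as duplicates. Real duplicate-marking tools apply further rules (for example, which copy to keep), which are omitted here. The function `find_duplicates` is hypothetical.

```python
def find_duplicates(pairs):
    """Return names of pairs whose coordinates repeat an earlier pair.

    Each pair is (name, chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    """
    seen = set()
    duplicates = []
    for name, *coords in pairs:
        key = tuple(coords)
        if key in seen:
            duplicates.append(name)  # same coordinates for read and mate
        else:
            seen.add(key)
    return duplicates

# Toy pairs, invented for the example.
pairs = [
    ("r1", "1", 1000, "+", "1", 1300, "-"),
    ("r2", "1", 1000, "+", "1", 1300, "-"),  # same as r1: duplicate
    ("r3", "1", 1000, "+", "1", 1450, "-"),  # mate differs: not a duplicate
    ("r4", "2", 500, "+", "2", 800, "-"),
]
duplicates = find_duplicates(pairs)
duplicate_rate = len(duplicates) / len(pairs)
```

Because the key includes the mate's coordinates, two pairs that merely start at the same position are not flagged unless their mates coincide as well.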

Duplicates: 14.9 % (26 692 899); remainder: 85.1 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (20M to 160M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
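The per-chromosome breakdown can be sketched as below, assuming each record is reduced to its reference name and SAM FLAG. An unmapped read that carries its mate's chromosome is counted as unmapped on that chromosome, and reads with no reference name ("*") are skipped. The function `per_chromosome_mapped_rate` and the input layout are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chrom, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue  # no chromosome assigned, cannot be placed in a column
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Toy records, invented for the example.
records = [
    ("1", 99), ("1", 147),  # mapped pair on chromosome 1
    ("1", 73), ("1", 133),  # singleton and its unmapped mate, both placed on 1
    ("X", 0),               # mapped read on X
    ("*", 77),              # unmapped read with no chromosome
]
rates = per_chromosome_mapped_rate(records)
```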

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (axis labels 1 to 21, X, Y, M); y-axis: percentage of reads (0% to 100%). Mapped fractions range from 99.41% to 99.71%; unmapped fractions range from 0.29% to 0.59%.]