European Genome-Phenome Archive

File Quality

File Information: EGAF00004839880

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (up to >1000); y-axis: # bases, log scale (1k to 100M).]
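
As an illustration of how such a distribution can be tallied, the following is a minimal sketch that counts how many positions have each coverage value. The function name, the `cap` parameter and the sample depths are hypothetical; pooling values above 1000 into one bin is an assumption made to mirror the ">1000" label on the chart, not a description of how the archive computes it.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed as cap + 1
    (an assumption that mirrors the ">1000" bin on the chart).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))

# Hypothetical per-position depths for a tiny reference (illustration only).
depths = [1, 1, 2, 30, 30, 30, 31, 1500, 2000]
hist = coverage_histogram(depths)
for value, n_bases in hist.items():
    label = f">{value - 1}" if value > 1000 else str(value)
    print(label, n_bases)
```

Plotting the resulting counts with a logarithmic y-axis would give a curve of the kind described above.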

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # bases (0G to 16G).]
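
To make the relationship between a Phred score and an error probability concrete, here is a small sketch using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p, with 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly one wrong call in 10 000.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability.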

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 154 003 447 reads (99.6 %). Unmapped: 0.4 %.
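
The mapped rate can be sketched from SAM FLAG values, where bit 0x4 marks a read as unmapped. This is a minimal illustration, not the archive's pipeline: the function name and the example flags are hypothetical, and it assumes secondary and supplementary alignments have already been filtered out.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM FLAG values."""
    flags = list(flags)
    if not flags:
        raise ValueError("no reads given")
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / len(flags)

# Hypothetical flags: three mapped reads (99, 147, 73) and one unmapped (133).
example_flags = [99, 147, 73, 133]
print(f"{mapped_rate(example_flags):.1%}")
```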

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the Proper Pairs chart excludes them.

Both mates mapped: 153 853 266 reads (99.5 %). Other: 0.5 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 150 181 reads (0.1 %). Other: 99.9 %.
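
The distinction between a singleton and a read whose mate is also mapped can be expressed with three SAM FLAG bits. The sketch below is an assumption about how one might classify reads, with hypothetical category names; it is not taken from the archive's own code.

```python
PAIRED = 0x1         # read comes from paired-end sequencing
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def classify_pairing(flag):
    """Classify one SAM FLAG value by the mapping status of the read and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate did not map
    return "both_mates_mapped"

# Hypothetical flags for illustration.
for flag in (99, 73, 133, 0):
    print(flag, classify_pairing(flag))
```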

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 77 339 969 reads (50 %). Other: 50 %.
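
A forward-strand fraction can be sketched from SAM FLAG bit 0x10, which is set when a read aligns to the reverse strand. Restricting the denominator to mapped reads is an assumption of this example, as are the function name and sample flags.

```python
UNMAPPED = 0x4   # the read itself is unmapped
REVERSE = 0x10   # the read aligns to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [flag for flag in flags if not flag & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads given")
    forward = sum(1 for flag in mapped if not flag & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: 99 and 163 are forward, 147 and 83 reverse, 133 unmapped.
print(forward_fraction([99, 147, 83, 163, 133]))
```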

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant, and these reads would not be considered proper pairs. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 150 566 440 reads (97.3 %). Other: 2.7 %.
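
The arithmetic behind the expected separation can be written out as a short sketch. The distance check is a simplification under stated assumptions: the `expected` and `tolerance` values are hypothetical, and real aligners decide what counts as a proper pair from their own insert-size model and read orientation.

```python
def expected_inner_distance(fragment_length, read_length):
    """Unsequenced gap between two mates read inward from both fragment ends.

    A negative result means the two reads overlap.
    """
    return fragment_length - 2 * read_length

def has_expected_distance(observed_fragment_length, expected=500, tolerance=150):
    """Hypothetical check: is the observed fragment length close to the expected one?"""
    return abs(observed_fragment_length - expected) <= tolerance

# ~500 bp fragments with 100 bp reads from either end leave a ~300 bp gap.
print(expected_inner_distance(500, 100))
print(has_expected_distance(520), has_expected_distance(5000))
```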

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data are analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, so it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and both should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.

Duplicates: 24 359 179 reads (15.7 %). Non-duplicates: 84.3 %.
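
The coordinate-based idea described above can be sketched as follows: reads sharing the same position, strand and mate position are treated as one group, and every read after the first in a group is flagged. This is a deliberately simplified illustration with a hypothetical read representation; real duplicate-marking tools apply more elaborate rules.

```python
def mark_duplicates(reads):
    """Flag reads whose (chrom, pos, strand, mate_chrom, mate_pos) was already seen.

    `reads` is a list of tuples in that field order (a hypothetical layout).
    The first read at each coordinate set is kept; later ones are flagged.
    """
    seen = set()
    is_duplicate = []
    for read in reads:
        is_duplicate.append(read in seen)
        seen.add(read)
    return is_duplicate

# Hypothetical reads: the third has the same coordinates as the first.
reads = [
    ("1", 100, "+", "1", 400),
    ("1", 100, "+", "1", 420),
    ("1", 100, "+", "1", 400),
    ("2", 5000, "-", "2", 4700),
]
dup_flags = mark_duplicates(reads)
duplicate_rate = sum(dup_flags) / len(reads)
print(dup_flags, duplicate_rate)
```

Note that the second read shares a start position with the first but not a mate position, so it is not flagged.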

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # reads (up to 130M).]
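
One way to summarise such a distribution is to report the share of reads with a score of 0 and the share at or above some threshold. The sketch below assumes a histogram given as a mapping from score to read count; the function name, the threshold of 30 and the example counts are hypothetical.

```python
def summarise_mapq(hist, threshold=30):
    """Return (fraction with score 0, fraction with score >= threshold).

    `hist` maps a mapping quality score to the number of reads with that score.
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    zero = hist.get(0, 0) / total
    high = sum(n for score, n in hist.items() if score >= threshold) / total
    return zero, high

# Hypothetical histogram heavily skewed towards 60, with a few ambiguous reads.
example_hist = {0: 5, 20: 5, 60: 90}
zero_fraction, high_fraction = summarise_mapq(example_hist)
print(zero_fraction, high_fraction)
```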

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped and unmapped reads per chromosome. X-axis: chromosome; y-axis: percentage of reads (0% to 100%). The mapped fraction ranges from 99.88% to 99.93% across the chromosomes shown, and the unmapped fraction from 0.07% to 0.12%.]
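
A per-chromosome tally of this kind can be sketched by grouping reads on their reference name and testing SAM FLAG bit 0x4. The record layout (reference name, flag) and the example data are hypothetical; the sketch relies on unmapped reads carrying their mate's chromosome, as described above.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (reference_name, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}

# Hypothetical records: flag 133 is an unmapped read placed at its mate's chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
rates = per_chromosome_mapped_rate(records)
for chrom, rate in rates.items():
    print(chrom, f"{rate:.0%} mapped", f"{1 - rate:.0%} unmapped")
```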