European Genome-Phenome Archive

File Quality

File Information

EGAF00005283848

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are shown on a logarithmic scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: number of bases (log scale, 100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
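As an illustration of the Phred relationship, the following minimal sketch converts between a score and an error probability using the standard definition Q = -10 log10(P). The function names are hypothetical and chosen only for this example.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P into a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Print the error probability for a few commonly quoted scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4g}")
```

Under this definition, a score of 40 corresponds to roughly one expected error in 10 000 base calls.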

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0 to 9G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
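The calculation described above can be sketched as follows. The function name and the read counts are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("cannot compute a rate with zero reads")
    return mapped / total


# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=998, unmapped=2)
print(f"Mapped: {rate:.1%}, unmapped: {1 - rate:.1%}")
```

The parentheses around the denominator matter: without them the expression would be evaluated as (mapped / mapped) + unmapped.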

Mapped: 99.8 % (108 011 849 reads); unmapped: 0.2 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; such pairs are treated separately in the Proper Pairs chart.

Both mates mapped: 99.7 % (107 835 978 reads); other: 0.3 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
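A minimal sketch of how a read could be classified as a singleton, assuming the alignments are stored in SAM/BAM format and using the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). This page does not state how its statistics are actually computed, so the function and category names below are illustrative only.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify(flag: int) -> str:
    """Classify one alignment record by the mapping status of the read and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"  # this read mapped, its mate did not
    return "unmapped"


# 99 = paired, proper pair, mate reverse, first in pair
# 73 = paired, mate unmapped, first in pair
# 133 = paired, read unmapped, second in pair
for flag in (99, 73, 133):
    print(flag, classify(flag))
```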

Singletons: 0.2 % (175 871 reads); other: 99.8 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (54 093 898 reads); reverse strand: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants (e.g. a large insertion or deletion), the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
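The fragment-length arithmetic in the example above can be sketched as follows. The function names and the tolerance value are assumptions made for illustration; real aligners typically estimate the expected distance from the observed insert-size distribution rather than using a fixed threshold.

```python
def expected_gap(fragment_len: int, read_len: int) -> int:
    """Unsequenced gap between two mates read inward from both fragment ends."""
    return fragment_len - 2 * read_len


def is_expected_distance(observed_gap: int, fragment_len: int,
                         read_len: int, tolerance: int = 100) -> bool:
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is a hypothetical cut-off used only in this sketch.
    """
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance


# ~500 bp fragments with 100 bp reads from each end leave a ~300 bp gap.
print(expected_gap(500, 100))
print(is_expected_distance(320, 500, 100))   # close to the expected gap
print(is_expected_distance(5000, 500, 100))  # far larger than expected
```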

Proper pairs: 98.8 % (106 877 286 reads); other: 1.2 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. Consequently, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
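The coordinate-based approach described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method used to produce this report: the `Pair` record and `find_duplicates` function are hypothetical, and production duplicate-marking tools apply more refined criteria (for example, choosing which copy to keep by quality).

```python
from typing import Iterable, List, NamedTuple


class Pair(NamedTuple):
    """Hypothetical minimal record for one mapped read pair."""
    name: str
    chrom: str
    pos: int        # mapped position of the read
    mate_pos: int   # mapped position of its mate
    reverse: bool   # True if the read maps to the reverse strand


def find_duplicates(pairs: Iterable[Pair]) -> List[str]:
    """Return names of pairs whose coordinates repeat an earlier pair.

    The first pair seen at a given set of coordinates is kept; later ones
    with the same chromosome, positions, and strand are reported.
    """
    seen = set()
    duplicates = []
    for p in pairs:
        key = (p.chrom, p.pos, p.mate_pos, p.reverse)
        if key in seen:
            duplicates.append(p.name)
        else:
            seen.add(key)
    return duplicates


pairs = [
    Pair("a", "1", 100, 400, False),
    Pair("b", "1", 100, 400, False),  # same coordinates as "a"
    Pair("c", "1", 100, 450, False),  # mate maps elsewhere
    Pair("d", "2", 100, 400, False),  # different chromosome
]
dups = find_duplicates(pairs)
print(dups, f"duplicate rate = {len(dups) / len(pairs):.0%}")
```

Because the key depends only on coordinates, two independent fragments that happen to share both positions are also flagged, which is the false-positive effect at high depth mentioned above.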

Duplicates: 8.3 % (8 926 051 reads); non-duplicates: 91.7 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (10M to 100M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
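A minimal sketch of how per-chromosome mapped and unmapped fractions could be tallied, assuming SAM-style records in which an unmapped read that has been placed with its mate still carries the mate's reference name, and bit 0x4 of the FLAG marks the read as unmapped. The function name and the example records are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

READ_UNMAPPED = 0x4  # standard SAM FLAG bit: the read itself is unmapped


def per_chromosome_rates(
    records: Iterable[Tuple[str, int]]
) -> Dict[str, Tuple[float, float]]:
    """Map each chromosome to (mapped fraction, unmapped fraction).

    `records` holds (reference name, FLAG) tuples. Reads with no reference
    name ("*") cannot be assigned to a chromosome and are skipped.
    """
    mapped: Counter = Counter()
    unmapped: Counter = Counter()
    for rname, flag in records:
        if rname == "*":
            continue
        if flag & READ_UNMAPPED:
            unmapped[rname] += 1
        else:
            mapped[rname] += 1
    rates = {}
    for chrom in mapped.keys() | unmapped.keys():
        total = mapped[chrom] + unmapped[chrom]
        rates[chrom] = (mapped[chrom] / total, unmapped[chrom] / total)
    return rates


# Hypothetical records: three mapped reads and one placed-but-unmapped read
# on chromosome 1, one mapped read on X, and one unplaced unmapped read.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
rates = per_chromosome_rates(records)
for chrom in sorted(rates):
    print(chrom, rates[chrom])
```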

[Chart: mapped vs unmapped reads per chromosome (autosomes, X, Y, M). Y-axis: 0% to 100%. Mapped: 99.84% to 99.87% per chromosome; unmapped: 0.13% to 0.16%.]