European Genome-Phenome Archive

File Quality

File Information: EGAF00000642652

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The data is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
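As a minimal sketch of how such a distribution could be tallied, the snippet below counts positions per coverage value from a list of per-position depths. The function name `coverage_histogram`, the input list, and the cap of 1000 (chosen to mirror the ">1000" bin on the chart axis) are assumptions for illustration, not the archive's actual pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed as cap + 1.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist

# Hypothetical per-position depths, not taken from the file above.
example_depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2000]
hist = coverage_histogram(example_depths)
print(hist[1])     # positions covered exactly once
print(hist[1001])  # positions in the overflow (>1000) bin
```

Plotting the counts on a log-scaled y-axis would then give a curve of the shape described above.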

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: number of bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
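The relationship between score and error probability can be sketched in a few lines, using the standard Phred definition Q = -10 × log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly 1 error in 10 000 calls.
print(phred_to_error_prob(40))
print(error_prob_to_phred(0.001))
```

Under this definition every 10 points on the score axis corresponds to a tenfold drop in error probability.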

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0 to 4G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted but, for more significant variation, the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
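The calculation can be written out as a short sketch. The function name and the counts used below are hypothetical and are not the values for this file.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total

# Hypothetical counts for illustration only.
rate = mapped_rate(mapped=991, unmapped=9)
print(f"{rate:.1%}")
```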

Mapped reads: 99.1 % (99 143 079 reads); unmapped: 0.9 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates map at an unexpected distance from each other; such pairs are not counted as proper pairs (see the Proper Pairs chart).

Both mates mapped: 98.8 % (98 841 212 reads); other: 1.2 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
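For alignments stored in SAM/BAM format, the distinction between singletons and fully mapped pairs can be read from the FLAG field. The sketch below uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are illustrative, and this is not necessarily how the archive computes the statistic.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped

def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"

# Example FLAG values: 99 (mapped, mate mapped), 73 (mapped, mate unmapped),
# 133 (unmapped read of a pair).
for flag in (99, 73, 133):
    print(flag, classify(flag))
```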

Singletons: 0.3 % (301 867 reads); other: 99.7 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (50 022 782 reads); other: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. In particular, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads that map with an unexpected separation can be a signal of such a variant, and these reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.
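The arithmetic in the example above (500 bp fragments, 100 bp reads, ~300 bp separation) can be sketched as follows. The function names and the `tolerance` parameter are assumptions for illustration; real aligners typically estimate the acceptable range from the observed insert size distribution.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when each is read inward from a fragment end.

    A negative result would mean the two mates overlap.
    """
    return fragment_length - 2 * read_length

def within_expected_distance(observed, expected, tolerance):
    """True if the observed separation is within +/- tolerance of the expected one."""
    return abs(observed - expected) <= tolerance

expected = expected_inner_distance(500, 100)
print(expected)  # 300
print(within_expected_distance(320, expected, tolerance=50))
print(within_expected_distance(5000, expected, tolerance=50))
```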

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.5 % (98 590 424 reads); other: 1.5 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
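The coordinate-based idea described above can be illustrated with a deliberately simplified sketch: a pair is flagged when an earlier pair has the same read and mate coordinates. The tuple layout and function name are assumptions; production duplicate-marking tools also account for details such as read orientation and clipping, which are omitted here.

```python
def mark_duplicates(pairs):
    """Flag pairs whose coordinates repeat an earlier pair.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples.
    Returns a list of booleans, True where the pair is a repeat.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags

# Hypothetical read pairs: the second repeats the first exactly,
# the third shares only the first mate's position.
pairs = [("1", 100, "1", 400), ("1", 100, "1", 400), ("1", 100, "1", 450)]
flags = mark_duplicates(pairs)
print(flags)
print(sum(flags) / len(flags))  # duplicate rate in this toy input
```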

Duplicates: 5.1 % (5 079 728 reads); other: 94.9 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (10M to 80M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
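A per-chromosome tally of this kind can be sketched as below, assuming each record is reduced to its reference name and SAM FLAG value, with an unmapped read carrying its mate's chromosome as described above. The record layout and function name are illustrative assumptions; 0x4 is the "read unmapped" bit from the SAM specification.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_rates(records):
    """Return {chrom: (mapped %, unmapped %)} from (chrom, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (100 * mapped / (mapped + unmapped),
                100 * unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }

# Hypothetical records: three mapped reads and one unmapped read placed on
# chromosome 1 via its mate, plus one mapped read on X.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0)]
rates = per_chromosome_rates(records)
print(rates)
```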

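[Stacked column chart: mapped vs unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%.]
Mapped, in chart order: 99.73%, 99.75%, 99.78%, 99.56%, 99.78%, 99.83%, 99.56%, 99.72%, 99.4%, 99.57%, 99.81%, 99.87%, 99.82%, 99.85%, 99.43%, 99.44%, 99.7%, 99.7%, 99.79%, 99.83%, 99.65%, 99.69%, 98.39%, 99.53%
Unmapped, in chart order: 0.27%, 0.25%, 0.22%, 0.44%, 0.22%, 0.17%, 0.44%, 0.28%, 0.6%, 0.43%, 0.19%, 0.13%, 0.18%, 0.15%, 0.57%, 0.56%, 0.3%, 0.3%, 0.21%, 0.17%, 0.35%, 0.31%, 1.61%, 0.47%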