European Genome-Phenome Archive

File Quality

File Information

EGAF00001407510

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

Data is shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
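As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value and pools everything above 1000 into one overflow bin, as the chart's last x-axis tick (">1000") suggests. The function name, the `cap` parameter and the sample depths are hypothetical; the report does not say how its histogram is computed.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for d in depths:
        key = d if d <= cap else f">{cap}"
        hist[key] += 1
    return hist

# Hypothetical per-position depths for a tiny reference (not taken from this file)
depths = [0, 0, 1, 1, 1, 30, 30, 1500]
hist = coverage_histogram(depths)
for value, n_bases in hist.items():
    print(value, n_bases)
```

Plotting `n_bases` on a log axis would then keep both the very common and the very rare coverage values visible.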

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error has occurred. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
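The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(P). The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10 000 base calls.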

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 90G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons plus reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
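A minimal sketch of this calculation is given below. The both-mates and singleton counts are the ones reported on this page, and they add up to the mapped-read count shown for this chart; the mapped and unmapped counts passed to `mapped_rate` in the last line are made-up values for illustration, since the report does not list the unmapped count.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total

# Counts reported on this page
both_mates_mapped = 841_619_432
singletons = 14_272_971
mapped = both_mates_mapped + singletons
print(mapped)  # 855892403, the mapped-read count reported for this chart

# Hypothetical counts, chosen only to illustrate the formula
print(mapped_rate(956, 44))
```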

Mapped: 95.6 % (855 892 403 reads); unmapped: 4.4 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; only the Proper Pairs chart restricts the count to pairs mapped at the expected distance.

Both mates mapped: 94 % (841 619 432 reads); remainder: 6 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
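The categories above can be sketched in terms of SAM FLAG bits, assuming the reads come from a SAM/BAM alignment (the report does not state how it derives its counts). The bit values 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped) are those defined by the SAM format; the function and category names are hypothetical.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify one read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"

for flag in (99, 73, 133, 77):
    print(flag, classify(flag))
```

Here flag 73 (paired, mate unmapped, first in pair) is the singleton case, and flag 133 marks its unmapped mate.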

Singletons: 1.7 % (14 272 971 reads); remainder: 98.3 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (447 803 200 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from each end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, mates that map with an unexpected separation would be a signal for this variant, and they would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.
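The arithmetic in this example is simply the fragment length minus the two read lengths. A small sketch, with a hypothetical function name:

```python
def expected_inner_distance(fragment_len, read_len):
    """Expected gap between the two mates of a fragment, in base pairs."""
    return fragment_len - 2 * read_len

# The example from the text: ~500 bp fragments, 100 bp reads from each end
print(expected_inner_distance(500, 100))  # 300
```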

Proper pairs: 92.7 % (830 613 758 reads); remainder: 7.3 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
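The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration, not the behaviour of any particular duplicate-marking tool: pairs are reduced to a hypothetical (chromosome, read start, mate start) tuple, and every pair beyond the first with the same tuple is counted as a duplicate.

```python
from collections import defaultdict

def count_duplicates(pairs):
    """Count pairs that share coordinates with an earlier pair.

    pairs: iterable of (chrom, read_start, mate_start) tuples.
    """
    groups = defaultdict(int)
    for key in pairs:
        groups[key] += 1
    # In each group, one pair is kept and the rest are duplicates
    return sum(n - 1 for n in groups.values())

# Hypothetical pairs, not taken from this file
pairs = [
    ("1", 100, 400),
    ("1", 100, 400),
    ("1", 100, 450),
    ("2", 100, 400),
    ("1", 100, 400),
]
n_dup = count_duplicates(pairs)
print(n_dup, n_dup / len(pairs))
```

Because only coordinates are compared, independent fragments that happen to share both coordinates are also counted, which is the depth-related overestimate the paragraph above warns about.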

Duplicates: 26.1 % (233 513 081 reads); remainder: 73.9 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (up to 700M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
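A per-chromosome tally along these lines can be sketched as below, assuming each read is reduced to its reference name and SAM FLAG (0x4 is the SAM "read unmapped" bit). Reads with no chromosome assigned (reference name "*") are skipped. The function name and the sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (rname, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # read was not assigned to any chromosome
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records: three mapped reads and one unmapped mate placed on Y
records = [("Y", 99), ("Y", 147), ("Y", 73), ("Y", 133), ("*", 77)]
rates = per_chromosome_mapped_rate(records)
print(rates)
```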

Chromosome   Mapped    Unmapped
1            98.72%    1.28%
2            98.38%    1.62%
3            98.84%    1.16%
4            98.35%    1.65%
5            98.79%    1.21%
6            98.86%    1.14%
7            98.32%    1.68%
8            98.25%    1.75%
9            97.99%    2.01%
10           97.25%    2.75%
11           98.22%    1.78%
12           98.63%    1.37%
13           98.98%    1.02%
14           98.69%    1.31%
15           98.81%    1.19%
16           98.18%    1.82%
17           98.42%    1.58%
18           97.53%    2.47%
19           97.66%    2.34%
20           98.41%    1.59%
21           97.52%    2.48%
X            98.49%    1.51%
Y            78.67%    21.33%
M            98.99%    1.01%