European Genome-Phenome Archive

File Quality

File Information

EGAF00001689048

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value (the number of times a position in the reference is covered). The y-axis shows the number of bases with that coverage value.

The base counts are plotted on a log scale to compress their wide range. The expected shape is a high peak at the beginning (low coverage) followed by a descending curve.
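As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value. The function name `coverage_distribution`, the `cap` parameter, and the input list are hypothetical; the single overflow bin is only an assumption mirroring the ">1000" label on the chart axis.

```python
from collections import Counter

def coverage_distribution(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin (cap + 1)."""
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))

# Hypothetical per-position depths, not taken from this file.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
dist = coverage_distribution(depths)
print(dist)
```

The key `cap + 1` stands for "more than `cap`", so very deep positions do not stretch the x-axis.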

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (1k to 200M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
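The relationship between score and error probability can be checked with a small sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)    # roughly 1 error in 10,000 calls
q30 = error_prob_to_phred(0.001)  # roughly Q30
print(p40, q30)
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability.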

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 6G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but for more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
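The calculation is simple enough to sketch directly. The function name and the read counts below are hypothetical and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads among all reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to evaluate")
    return mapped / total

# Hypothetical counts for illustration.
rate = mapped_rate(983, 17)
print(f"{rate:.1%}")
```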

Mapped: 98.3 % (103 363 008 reads). Unmapped: 1.7 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 98.3 % (103 363 008 reads). Remaining reads: 1.7 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
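One way to identify singletons in aligned data is through the SAM FLAG field, where bit 0x1 marks a paired read, 0x4 an unmapped read, and 0x8 an unmapped mate. The sketch below is an assumption about how such a classification might be done, not a description of how this report was computed; the function name and category labels are hypothetical.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped

def classify_read(flag):
    """Classify one read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"   # mapped read whose mate is unmapped
    return "both_mapped"

# Example FLAG values: 73 = 0x1 + 0x8 + 0x40, 99 = 0x1 + 0x2 + 0x20 + 0x40.
labels = [classify_read(f) for f in (73, 99, 133, 0)]
print(labels)
```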

Singletons: 0 % (0 reads). Remaining reads: 100 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
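In SAM-formatted alignments, strand is encoded in the FLAG field: bit 0x10 is set when the read maps to the reverse strand. The sketch below computes the forward fraction over mapped reads only. Restricting to mapped reads is an assumption made here, and the function name and example flags are hypothetical.

```python
UNMAPPED = 0x4
REVERSE = 0x10

def forward_fraction(flags):
    """Fraction of mapped reads whose FLAG does not have the reverse-strand bit set."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Two forward reads, two reverse reads, one unmapped read (ignored).
fraction = forward_fraction([0, 16, 0, 16, 4])
print(fraction)
```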

Forward strand: 50 % (52 595 077 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. In particular, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with an unexpectedly large separation would be a signal for this variant, and the reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.
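The arithmetic behind the ~300 base pair figure can be written out as a short sketch. The function name is hypothetical, and the calculation assumes both reads have the same length and do not overlap.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two reads of a pair: fragment length minus both read lengths."""
    gap = fragment_length - 2 * read_length
    if gap < 0:
        raise ValueError("reads overlap: fragment shorter than two read lengths")
    return gap

# The example from the text: ~500 bp fragments with 100 bp reads from each end.
gap = expected_inner_distance(500, 100)
print(gap)
```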

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 98.3 % (103 363 008 reads). Remaining reads: 1.7 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
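The coordinate-based idea described above can be sketched as follows. This is a deliberate simplification under stated assumptions: the record layout (`name`, `chrom`, `pos`, `mate_chrom`, `mate_pos`) and the function name are hypothetical, the first read seen in each group is kept, and details that real duplicate-marking tools handle (strand, clipping, choosing the best-quality representative) are ignored.

```python
def find_duplicates(reads):
    """Return names of reads whose coordinates and mate coordinates were already seen."""
    seen = set()
    duplicates = set()
    for read in reads:
        key = (read["chrom"], read["pos"], read["mate_chrom"], read["mate_pos"])
        if key in seen:
            duplicates.add(read["name"])
        else:
            seen.add(key)
    return duplicates

# Hypothetical reads: r1 and r2 share both their own and their mate's coordinates.
reads = [
    {"name": "r1", "chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},
    {"name": "r2", "chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 400},
    {"name": "r3", "chrom": "1", "pos": 100, "mate_chrom": "1", "mate_pos": 450},
]
dups = find_duplicates(reads)
print(dups)
```

Read r3 starts at the same position as r1 but its mate does not, so it is not flagged.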

Duplicates: 27.6 % (29 032 698 reads). Non-duplicates: 72.4 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
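A quick way to summarise such a distribution is to report the share of ambiguous reads (score 0) and the share of confidently placed reads. In the sketch below, the function name, the default threshold of 60, and the example scores are assumptions for illustration; the maximum score differs between aligners.

```python
def mapq_summary(mapqs, high=60):
    """Fractions of reads with mapping quality 0 and with mapping quality >= `high`."""
    total = len(mapqs)
    if total == 0:
        raise ValueError("no mapping qualities given")
    zero = sum(1 for q in mapqs if q == 0)
    confident = sum(1 for q in mapqs if q >= high)
    return {"mapq0_fraction": zero / total, "high_fraction": confident / total}

# Hypothetical scores: two ambiguous reads, one intermediate, seven confident.
summary = mapq_summary([0, 0, 60, 60, 60, 60, 30, 60, 60, 60])
print(summary)
```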

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 240). Y-axis: number of reads (10M to 80M). Data labels: 7 154 602; 5 646 958; 32 591 368; 84 602 776.]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is similar to the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
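A per-chromosome tally of this kind could be sketched as below, using the chromosome name recorded for each read together with the unmapped bit (0x4) of its SAM FLAG. The input layout of `(chromosome, flag)` tuples and the function name are hypothetical; reads with no chromosome recorded (`*`) are skipped.

```python
from collections import defaultdict

UNMAPPED = 0x4

def per_chromosome_counts(records):
    """Count mapped and unmapped reads per chromosome from (chromosome, flag) pairs."""
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for chrom, flag in records:
        if chrom == "*":
            continue  # read was never placed on any chromosome
        key = "unmapped" if flag & UNMAPPED else "mapped"
        counts[chrom][key] += 1
    return {chrom: dict(c) for chrom, c in counts.items()}

# Hypothetical records: flag 133 is an unmapped read placed at its mate's position.
records = [("chr1", 73), ("chr1", 133), ("chr1", 99), ("chrX", 147), ("*", 77)]
counts = per_chromosome_counts(records)
print(counts)
```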

[Chart: stacked columns of mapped vs unmapped reads per chromosome. Y-axis: percentage of reads (0% to 100%). Every column is labelled 100% mapped and 0% unmapped.]