European Genome-Phenome Archive

File Quality

File Information: EGAF00003609545

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered, and the y-axis shows the number of bases with that coverage.

The y-axis uses a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: # bases (log scale, 100 to 100M).]
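
As a rough illustration of how such a distribution could be tallied, the sketch below counts how many positions have each coverage value and pools everything above 1000 into a single bin, mirroring the ">1000" label on the chart. The function name `coverage_distribution`, the `cap` parameter, and the sample depths are assumptions made for this example, not part of the EGA pipeline.

```python
from collections import Counter

def coverage_distribution(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    `depths` is an iterable of per-position coverage values (hypothetical input).
    The pooled bin is stored under the key `cap + 1`.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Made-up per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_distribution(depths)
for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis then gives the shape described above.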

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # bases (0G to 16G).]
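
To make the Phred scale concrete, here is a minimal sketch of the standard conversion Q = -10 log10(P) in both directions. The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one expected error in 10 000 base calls. The same conversion applies to mapping quality scores, with P read as the probability that the read is placed at the wrong location.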

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped reads: 99.8 % (123 268 234 reads); remainder: 0.2 %
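
The rate calculation described above, mapped / (mapped + unmapped), can be sketched as follows. The function name and the example counts are illustrative assumptions and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Made-up counts, for illustration only.
rate = mapped_rate(mapped=998, unmapped=2)
print(f"{rate:.1%}")
```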

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; the expected distance is what the Proper Pairs chart takes into account.

Both mates mapped: 99.6 % (123 106 516 reads); remainder: 0.4 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (161 718 reads); remainder: 99.9 %
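
One common way to tell these categories apart is to inspect the FLAG field of each alignment record. The sketch below assumes the input is in SAM/BAM format and uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); whether the EGA report is computed this way is an assumption, and the function and category names are hypothetical.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify one read by its SAM FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return "singleton"          # mapped, but its mate is not
    if flag & PAIRED:
        return "both_mates_mapped"  # mapped, and so is its mate
    return "mapped_unpaired"

# Example FLAG values: 99 (mapped pair), 73 (mapped, mate unmapped),
# 77 (unmapped, mate unmapped).
for flag in (99, 73, 77):
    print(flag, classify(flag))
```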

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (61 778 801 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads would map with an unexpected separation; this is a signal for the variant, and the reads would not be considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%; even if the mapping rate itself is low as a result of bacterial contamination, for example.

Proper pairs: 98.7 % (121 995 020 reads); remainder: 1.3 %
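
The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. The sketch below reproduces it and adds a simple distance check; the `tolerance` parameter and both function names are assumptions for illustration, since real aligners decide what counts as a proper pair using their own estimates of the insert-size distribution.

```python
def expected_separation(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

def separation_as_expected(observed, fragment_len, read_len, tolerance=100):
    """Check whether an observed gap is within `tolerance` bp of the expectation.

    `tolerance` is a hypothetical cutoff chosen for this example.
    """
    expected = expected_separation(fragment_len, read_len)
    return abs(observed - expected) <= tolerance

print(expected_separation(500, 100))            # gap for the example in the text
print(separation_as_expected(320, 500, 100))    # close to expectation
print(separation_as_expected(5000, 500, 100))   # far from expectation
```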

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 39.4 % (48 709 348 reads); remainder: 60.6 %
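
The coordinate-based approach described above can be sketched as follows: a pair is flagged when an earlier pair has the same read and mate coordinates. This is a deliberately simplified illustration, not the algorithm of any specific tool; real duplicate markers also consider details such as strand and clipping. The tuple layout and function name are assumptions made for this example.

```python
def mark_duplicates(pairs):
    """Flag pairs whose (chrom, pos, mate_chrom, mate_pos) was already seen.

    Returns a list of booleans, one per input pair; the first occurrence of
    each coordinate set is kept (False), later ones are flagged (True).
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags

# Made-up pairs, for illustration only.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates as the first pair
    ("1", 100, "1", 450),  # same start, different mate position
]
flags = mark_duplicates(pairs)
print(flags, sum(flags) / len(flags))
```

Because only coordinates are compared, independent fragments that happen to share both positions are flagged as well, which is the over-counting at high depth that the paragraph above warns about.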

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # reads (10M to 110M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, because it concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped vs unmapped reads per chromosome (1 to 21, X, Y, M); y-axis: 0% to 100%. Mapped fraction per chromosome ranges from 99.79% to 99.89%; unmapped from 0.11% to 0.21%.]