European Genome-Phenome Archive

File Quality

File Information: EGAF00003610014

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The y-axis uses a log scale to compress the wide range of counts. The expected shape is a high peak at the beginning (low coverage) followed by a descending curve.
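As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value. The function name, the input list of per-position depths, and the pooling of everything above 1000 into one bin (mirroring the chart's ">1000" axis label) are assumptions for illustration, not the archive's actual pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single bin stored under
    the key cap + 1 (standing in for the ">1000" bin).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(hist)

# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
hist = coverage_histogram(depths)
```

Plotting the counts in `hist` on a log-scaled y-axis would give a chart of the kind described above.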

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (100 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.
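The relationship between score and error probability can be checked with a short sketch. It uses the standard Phred definition, Q = -10·log10(P); the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to a 1 in 10,000 chance of error.
p40 = phred_to_error_prob(40)
# An error probability of 0.001 corresponds to a score of 30.
q30 = error_prob_to_phred(0.001)
```

Each increase of 10 in the score therefore means a tenfold drop in the error probability.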

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 14G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
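A minimal sketch of this calculation follows. The mapped count is rebuilt from the both-mates-mapped and singleton counts shown in this report; the unmapped count is a hypothetical value chosen only to make the example run, since the report does not state it.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Counts shown in this report: both mates mapped + singletons.
mapped = 106_792_052 + 70_069

# Hypothetical unmapped count, for illustration only (not from the report).
unmapped_example = 100_000
rate = mapped_rate(mapped, unmapped_example)
```

The sum of the two reported counts equals the 106 862 121 mapped reads shown in the Mapped Reads chart, which is a useful consistency check on the report itself.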

Mapped: 99.9 % (106 862 121 reads). Unmapped: 0.1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates map at an unexpected distance from each other are also counted here; whether the distance is as expected is assessed in the Proper Pairs chart.

Both mates mapped: 99.9 % (106 792 052 reads). Other reads: 0.1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
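The distinction between singletons and fully mapped pairs can be sketched with SAM-style flag bits. This assumes the reads are stored in SAM/BAM format, which the report does not state explicitly; the `classify` function and its category names are illustrative, while the bit values (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) are the standard SAM flag bits.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped

def classify(flag):
    """Classify a read by its SAM flag (illustrative categories)."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate is unmapped
    return "both_mates_mapped"

examples = {flag: classify(flag) for flag in (99, 73, 133, 0)}
```

Here flag 73 (paired, mate unmapped, first in pair) is the singleton case, and flag 133 is its unmapped mate.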

Singletons: 0.1 % (70 069 reads). Other reads: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (53 474 946 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant (e.g. a large insertion or deletion), the mates map with an unexpected separation, which is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.
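The arithmetic behind the ~300 base pair figure can be written out as a short sketch. The function name is illustrative, and real aligners judge "expected distance" from the observed distribution of fragment sizes rather than from a single fixed value.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced gap between two mates read inward from both ends."""
    gap = fragment_len - 2 * read_len
    if gap < 0:
        raise ValueError("reads overlap: fragment shorter than two reads")
    return gap

# The example from the text: ~500 bp fragments, 100 bp reads.
gap = expected_gap(500, 100)
```

With 100 bases read from each end of a 500 base fragment, 300 bases remain between the two mates.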

Proper pairs: 99.1 % (105 993 728 reads). Not in proper pairs: 0.9 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
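The coordinate-based approach described above can be sketched as follows. This is a deliberate simplification: the tuple layout and function names are hypothetical, and production tools also consider strand and orientation and choose which copy to keep by quality, whereas this sketch simply keeps the first pair seen at each coordinate combination.

```python
def mark_duplicates(pairs):
    """Flag pairs whose read and mate coordinates were already seen.

    `pairs` is a list of (chrom, pos, mate_chrom, mate_pos) tuples.
    Returns a list of booleans, True where the pair is a duplicate.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # duplicate if coordinates repeat
        seen.add(key)
    return flags

def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates."""
    flags = mark_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0

# Hypothetical pairs: three share identical coordinates.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),
    ("1", 100, "1", 450),  # same start, different mate position
    ("2", 50, "2", 350),
    ("1", 100, "1", 400),
]
flags = mark_duplicates(pairs)
rate = duplicate_rate(pairs)
```

Because only coordinates are compared, independent fragments that happen to share coordinates are also flagged, which is why the rate becomes less informative at very high depth.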

Duplicates: 60.7 % (64 873 568 reads). Non-duplicates: 39.3 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (up to 100M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
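A per-chromosome tally of this kind can be sketched as below. It assumes SAM-style records in which an unmapped read carries its mate's chromosome and the standard 0x4 "read unmapped" flag bit; the record layout and function name are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM flag bit: this read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: fraction of its reads that are mapped}.

    `records` is an iterable of (chromosome, flag) tuples; unmapped
    reads are expected to carry the chromosome of their mapped mate.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records: one unmapped read placed on chromosome 1.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_mapped_rate(records)
```

In this toy input, the read with flag 133 is unmapped but is counted under chromosome 1 because that is where its mate (flag 73) aligned.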

[Chart: mapped vs unmapped reads per chromosome (x-axis labels: 1 to 21, X, Y, M). Y-axis: 0% to 100%. Mapped fractions range from 99.75% to 99.95%; unmapped fractions range from 0.05% to 0.25%.]