European Genome-Phenome Archive

File Quality

File Information: EGAF00003608796

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to minimise the variability. A high peak at the beginning (low coverage) followed by a descending curve is expected.
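
As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions have each coverage value. The input list `depths`, the function name `coverage_histogram` and the `cap` parameter are assumptions made for this example; they are not taken from the EGA pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed by cap + 1,
    similar in spirit to a final ">1000" bin on the x-axis.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Toy per-position depths, made up for illustration only.
depths = [0, 0, 0, 1, 1, 2, 5, 5, 1500]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a log-scaled y-axis would then give the shape described above: many positions at low coverage and a descending tail.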

[Chart: base coverage distribution. X-axis: Coverage value (100 to >1000); y-axis: # Bases (log scale, 100 to 10M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.
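
The relationship between a Phred score and an error probability can be sketched as follows, using the standard Phred definition Q = -10 × log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (P must be > 0)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of about 1 in 10 000, and every increase of 10 in the score lowers the error probability tenfold.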

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 16G).]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted but, with more significant variation, a read can fail to be placed. Therefore the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
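
The calculation can be written out as a small sketch. The function name and the toy counts are assumptions for illustration; the unmapped count for this file is not stated in the report.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads among all reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Toy counts, made up for illustration only.
rate = mapped_rate(mapped=999, unmapped=1)
print(f"{rate:.1%}")
```

Note the parentheses in the denominator: the rate is taken over all reads, not over the mapped reads alone.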

Mapped reads: 99.9 % (122 941 359 reads); unmapped: 0.1 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates map successfully to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, unlike in the Proper Pairs chart.

Both mates mapped: 99.8 % (122 848 120 reads); remainder: 0.2 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair maps successfully to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
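
One way to separate these cases is to inspect the FLAG field of each alignment record. The sketch below uses the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and the category labels are assumptions for this example, and this is not necessarily how the report itself is computed.

```python
from collections import Counter

# Standard SAM FLAG bits.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify_pair_status(flag):
    """Classify one read by its SAM flag."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"          # this read maps, its mate does not
    return "both_mates_mapped"

# Toy flags: 99 and 147 form a mapped pair, 73 is a singleton,
# 133, 77 and 141 are unmapped reads.
flags = [99, 147, 73, 133, 77, 141]
counts = Counter(classify_pair_status(f) for f in flags)
print(counts)
```

Dividing the "singleton" count by the total number of reads would give a singleton rate comparable to the one charted here.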

Singletons: 0.1 % (93 239 reads); remainder: 99.9 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (61 532 781 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant and would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
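
The arithmetic behind the ~300 base pair figure is sketched below: the gap is the fragment length minus the two read lengths. The function names and the `tolerance` parameter are hypothetical; real aligners decide what counts as a proper pair using their own criteria, which also include read orientation.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced bases between the two mates of one fragment."""
    return fragment_len - 2 * read_len

def gap_is_expected(observed_gap, fragment_len, read_len, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance

print(expected_gap(500, 100))                          # 300
print(gap_is_expected(320, 500, 100, tolerance=100))   # True
print(gap_is_expected(5000, 500, 100, tolerance=100))  # False
```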

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 99.1 % (121 976 520 reads); remainder: 0.9 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
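
The coordinate-based idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm of any particular tool: the record layout, the function name `mark_duplicates` and the rule of keeping the first read seen in each group are all assumptions made for this example.

```python
def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    `reads` is an iterable of tuples:
        (name, chrom, pos, strand, mate_chrom, mate_pos)
    Reads sharing the same coordinates for themselves and their mates are
    grouped; the first read seen in a group is kept, the rest are flagged.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, strand, mate_chrom, mate_pos in reads:
        key = (chrom, pos, strand, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates

# Toy records, made up for illustration only.
reads = [
    ("r1", "1", 1000, "+", "1", 1300),
    ("r2", "1", 1000, "+", "1", 1300),  # same coordinates as r1
    ("r3", "1", 1000, "+", "1", 1450),  # same start, different mate position
    ("r4", "2", 1000, "+", "2", 1300),  # different chromosome
]
dups = mark_duplicates(reads)
print(sorted(dups))
```

Because the key is purely positional, two independent fragments that happen to share coordinates would also be flagged, which is the depth-related effect described in the paragraph above.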

Duplicates: 76.9 % (94 619 726 reads); remainder: 23.1 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (10M to 110M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.
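
A per-chromosome breakdown of this kind can be sketched by tallying reads by the chromosome recorded on each alignment and by the standard SAM unmapped bit (0x4). The input format, a list of `(chromosome, flag)` pairs, and the function name are assumptions for this example.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: read is unmapped

def per_chromosome_mapped_rate(records):
    """Return {chromosome: fraction of reads mapped} from (chrom, flag) pairs.

    An unmapped read still counts towards a chromosome if it carries that
    chromosome name, e.g. because it was placed alongside its mapped mate.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}

# Toy records, made up for illustration only.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
rates = per_chromosome_mapped_rate(records)
print(rates)
```

In the toy data, the read with flag 133 is unmapped but is counted under chromosome 1 because it shares its mate's chromosome.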

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome; y-axis: percentage of reads (0% to 100%). Mapped fractions range from 99.7% to 99.93%; unmapped fractions range from 0.07% to 0.3%.]