European Genome-Phenome Archive

File Quality

File Information: EGAF00006164859

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have that coverage.

The number of bases is plotted on a log scale so that the wide range of counts stays readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.
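As a rough illustration of how such a distribution can be tabulated, the sketch below counts how many positions have each coverage value and then takes log10 of the counts. The input list `depths` and both function names are hypothetical; this is not the archive's own pipeline.

```python
import math
from collections import Counter


def coverage_histogram(depths):
    """Count how many positions have each coverage value."""
    return Counter(depths)


def log10_counts(hist):
    """Return log10 of each count, ordered by coverage value."""
    return {cov: math.log10(n) for cov, n in sorted(hist.items())}


# Hypothetical per-position depths for a tiny reference.
depths = [0, 1, 1, 2, 2, 2, 30]
hist = coverage_histogram(depths)
scaled = log10_counts(hist)
```

On a log scale, a coverage value seen once and one seen a thousand times differ by three units instead of three orders of magnitude, which keeps both the peak and the long tail visible.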

[Chart: base coverage distribution. X-axis: Coverage value (ticks from 100 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.
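The relationship between score and error probability can be checked with a short sketch. It uses the standard Phred definition, Q = -10 log10(P); the function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly 1 error in 10 000 base calls.
p40 = phred_to_error_prob(40)
q_for_1_in_1000 = error_prob_to_phred(0.001)
```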

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 65G).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
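The calculation itself is simple. A minimal sketch follows, with a hypothetical function name and made-up counts (not this file's figures):

```python
def mapping_rate(mapped: int, unmapped: int) -> float:
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts, for illustration only.
rate = mapping_rate(mapped=998, unmapped=2)
```

Note the parentheses around the denominator: the whole read total divides the mapped count.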

Mapped: 99.8 % (501 977 875 reads); unmapped: 0.2 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; the expected-distance criterion is covered by the Proper Pairs chart.

Both mates mapped: 99.6 % (501 127 488 reads); other: 0.4 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate can fall in sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
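In SAM/BAM files, the mapping state of a read and its mate is recorded in the FLAG field. The sketch below classifies a read from its FLAG using the bit values defined in the SAM specification. The `classify` function and its category labels are illustrative assumptions, not the method used to build this report.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify(flag: int) -> str:
    """Classify a read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate is unmapped
    return "both_mates_mapped"


# FLAG 73 = 0x1 + 0x8 + 0x40: paired, mate unmapped, first in pair.
label = classify(73)
```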

Singletons: 0.2 % (850 387 reads); other: 99.8 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (251 450 535 reads); other: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, read pairs spanning it map with an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
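The expected separation in the example above follows from subtracting both read lengths from the fragment length. A minimal sketch, with a hypothetical function name:

```python
def expected_inner_distance(fragment_length: int, read_length: int) -> int:
    """Gap between the two mates: fragment length minus both read lengths.

    A negative result means the two mates overlap.
    """
    return fragment_length - 2 * read_length


# 500 bp fragment, 100 bp reads from either end: 500 - 2 * 100 = 300.
gap = expected_inner_distance(fragment_length=500, read_length=100)
```

Aligners generally tolerate some spread around this value, since real libraries contain a distribution of fragment lengths rather than a single size.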

Proper pairs: 97.7 % (491 141 558 reads); other: 2.3 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
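The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration: the tuple layout `(chrom, pos, strand, mate_pos)` and the function name are assumptions, and real duplicate-marking tools take more information into account.

```python
def count_duplicates(pairs):
    """Count read pairs whose coordinates repeat an earlier pair.

    Each item is a hypothetical (chrom, pos, strand, mate_pos) tuple.
    The first pair seen at a set of coordinates is kept; later ones
    with identical coordinates are counted as duplicates.
    """
    seen = set()
    duplicates = 0
    for key in pairs:
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Made-up pairs: the second repeats the first exactly.
pairs = [
    ("1", 100, "+", 300),
    ("1", 100, "+", 300),
    ("1", 100, "+", 350),
    ("2", 100, "+", 300),
]
n_dup = count_duplicates(pairs)
dup_rate = n_dup / len(pairs)
```

The sketch also shows why depth matters: the more pairs are sampled, the more likely two independent fragments are to share coordinates by chance and be counted as duplicates.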

Duplicates: 16.5 % (83 058 721 reads); other: 83.5 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (50M to 450M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is then given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.
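A per-chromosome breakdown of this kind could be tallied as sketched below. The `(chrom, flag)` record layout and the function name are assumptions made for illustration. The only fact taken from the SAM specification is that FLAG bit 0x4 marks a read as unmapped, even when it carries a chromosome and position.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def per_chromosome_mapped_fraction(records):
    """Return {chrom: mapped fraction} from (chrom, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


# Made-up records. FLAG 133 is an unmapped read placed on its mate's chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99)]
fractions = per_chromosome_mapped_fraction(records)
```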

Chromosome   Mapped    Unmapped
1            99.82%    0.18%
2            99.81%    0.19%
3            99.84%    0.16%
4            99.84%    0.16%
5            99.83%    0.17%
6            99.84%    0.16%
7            99.83%    0.17%
8            99.83%    0.17%
9            99.83%    0.17%
10           99.82%    0.18%
11           99.83%    0.17%
12           99.83%    0.17%
13           99.84%    0.16%
14           99.83%    0.17%
15           99.82%    0.18%
16           99.82%    0.18%
17           99.8%     0.2%
18           99.84%    0.16%
19           99.78%    0.22%
20           99.8%     0.2%
21           99.83%    0.17%
X            99.83%    0.17%
Y            99.89%    0.11%
M            99.45%    0.55%