European Genome-Phenome Archive

File Quality

File Information: EGAF00003058236

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis represents the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis represents the number of bases with that coverage.

Data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
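
As a rough illustration of how such a distribution could be tallied, the sketch below counts how many positions share each coverage value. The function name, the sample depths, and the pooling of values above 1000 into one overflow bin (suggested by the ">1000" tick on the chart) are assumptions, not the archive's actual implementation.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed by cap + 1
    (an assumption modelled on the chart's ">1000" tick).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 5, 1500]
hist = coverage_histogram(depths)
```

Here `hist[1]` is the number of positions covered exactly once, and `hist[1001]` holds every position covered more than 1000 times.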

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases, log scale (10k to 20M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
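
The relation between score and error probability can be sketched as follows, using the standard Phred definition Q = -10·log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


p40 = phred_to_error_prob(40)     # about 1 error in 10 000 base calls
q30 = error_prob_to_phred(0.001)  # about 30
```

A score of 40 therefore corresponds to roughly one wrong call in 10 000 bases, and a score of 20 to one in 100.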

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 70G).]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
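
A minimal sketch of the mapped / (mapped + unmapped) calculation, assuming each read is represented by its SAM FLAG integer and that bit 0x4 marks an unmapped read (as defined in the SAM specification). Whether the archive also filters secondary or supplementary alignments is not stated, so no such filtering is done here; the function name and sample flags are hypothetical.

```python
UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for an iterable of SAM flags."""
    total = 0
    mapped = 0
    for flag in flags:
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    if total == 0:
        raise ValueError("no reads supplied")
    return mapped / total


# Hypothetical flags: three mapped reads and one unmapped read.
rate = mapped_rate([0, 16, 4, 99])
```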

Mapped: 95.8 % (873 681 526 reads). Unmapped: 4.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads whose mates do not map at the expected distance are also included; see the Proper Pairs chart for the expected-distance criterion.

Both mates mapped: 0 % (0 reads). Other: 100 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome, but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. The sample is unlikely to contain many such structural variants, or sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
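
The distinction between a read with both mates mapped and a singleton can be sketched from SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped, per the SAM specification). The function and category names are hypothetical, and this is not necessarily how the report itself classifies reads.

```python
PAIRED = 0x1         # read comes from paired-end sequencing
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def pair_state(flag):
    """Classify one read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate is unmapped
    return "both_mapped"


states = [pair_state(f) for f in (99, 73, 69, 0)]
```

Flag 99 is a mapped first mate with a mapped partner, 73 a mapped first mate whose partner is unmapped, 69 an unmapped first mate, and 0 a read that is not paired at all.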

Singletons: 100 % (873 681 526 reads). Other: 0 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
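
One way to compute this fraction, sketched under the assumption that reads are given as SAM FLAG integers: bit 0x10 marks a read aligned to the reverse strand, so a mapped read without that bit is on the forward strand. The function name and sample flags are illustrative.

```python
UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped
REVERSE = 0x10  # SAM FLAG bit: read aligned to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads aligned to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads supplied")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)


# Hypothetical flags: two forward, two reverse, one unmapped (ignored).
fraction = forward_fraction([0, 16, 0, 16, 4])
```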

Forward strand: 100 % (912 207 087 reads). Other: 0 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpectedly large separation are a signal for this variant, and they are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).
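
The fragment-length arithmetic above can be written out as a small sketch. The tolerance check is a hypothetical simplification: real aligners typically decide what counts as a proper pair from the observed insert-size distribution, not from a fixed window.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between two mates sequenced from opposite ends of a fragment."""
    return fragment_len - 2 * read_len


def within_expected_distance(observed, expected, tolerance):
    """Hypothetical check: is the observed gap close to the expected one?"""
    return abs(observed - expected) <= tolerance


gap = expected_inner_distance(500, 100)            # 500 - 2 * 100
ok = within_expected_distance(320, gap, 100)       # close to expected
too_far = within_expected_distance(5000, gap, 100) # far beyond expected
```

With 500 bp fragments and 100 bp reads the gap is 300 bp, so a pair observed 5000 bp apart would fall outside the window and hint at a structural difference.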

Proper pairs: 0 % (0 reads). Other: 100 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate, and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates, even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
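
The coordinate-based approach described above can be sketched as grouping reads by their own position and their mate's position. The tuple layout and the rule of keeping the first read in each group are assumptions made for illustration; production tools usually keep the best-quality read and handle many more edge cases.

```python
from collections import defaultdict


def find_duplicates(reads):
    """Return names of reads sharing coordinates with an earlier read.

    Each read is a tuple (name, chrom, pos, strand, mate_chrom, mate_pos);
    this layout is a hypothetical simplification of an alignment record.
    """
    groups = defaultdict(list)
    for name, *coords in reads:
        groups[tuple(coords)].append(name)
    duplicates = set()
    for names in groups.values():
        duplicates.update(names[1:])  # keep the first read, flag the rest
    return duplicates


reads = [
    ("r1", "1", 1000, "+", "1", 1300),
    ("r2", "1", 1000, "+", "1", 1300),  # same coordinates as r1
    ("r3", "1", 1000, "+", "1", 1450),  # same start, different mate position
]
dups = find_duplicates(reads)
duplicate_rate = len(dups) / len(reads)
```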

Duplicates: 55.6 % (506 767 009 reads). Other: 44.4 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location that it has been assigned to (specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 240). Y-axis: # reads (100M to 800M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read should have the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together, so if a read hangs off the end of one reference onto another, it will be given the right chromosome and position but will also be classified as unmapped.
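
A per-chromosome tally of this kind might look like the sketch below, assuming each record is reduced to its reference name and SAM FLAG, and that unmapped reads with no chromosome assignment carry the reference name "*". The function name and sample records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: read is unmapped


def per_chromosome_counts(records):
    """Count mapped and unmapped reads per chromosome.

    `records` is an iterable of (reference_name, flag) pairs. Unmapped
    reads placed at their mate's position keep the mate's reference name.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue  # no chromosome assigned, so nothing to attribute
        key = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][key] += 1
    return dict(counts)


records = [("1", 73), ("1", 133), ("X", 0), ("*", 77)]
counts = per_chromosome_counts(records)
```

In this example the read with flag 133 is unmapped but inherits chromosome 1 from its mapped mate (flag 73), so it is counted as an unmapped read on chromosome 1.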

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (1 to 21, X, Y, M as labelled). Y-axis: 0% to 100%. All labelled values are 100% mapped and 0% unmapped.]