European Genome-Phenome Archive

File Quality

File Information: EGAF00004835442

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Data are shown on a log scale to minimise the variability. The expected shape is a high peak at the beginning (low coverage) followed by a descending curve.
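As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value and pools everything above a cap into a single overflow bin, as the chart does with its >1000 bin. The function name `coverage_histogram` and the toy depth list are assumptions made for illustration; this is not the archive's own pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    `depths` is a hypothetical iterable of per-position coverage values.
    Returns a dict mapping each coverage value (or the string '>cap')
    to the number of positions with that coverage.
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return dict(hist)


# Toy input: six reference positions with made-up coverage values.
toy_depths = [0, 0, 1, 3, 3, 1500]
print(coverage_histogram(toy_depths))
```

Plotting the counts on a logarithmic y-axis would then give a curve comparable in shape to the one described above.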

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, e.g. through an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect a distribution skewed towards larger (more confident) scores, typically around 40.
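The relationship between a Phred score and an error probability can be made concrete with a small sketch using the standard Phred definition Q = -10 log10(P). The function names are illustrative, not part of any particular tool.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly one error in 10 000 base calls.
print(phred_to_error_prob(40))
print(error_prob_to_phred(0.001))
```

Each additional 10 points on the Phred scale corresponds to a tenfold drop in the error probability.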

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 3.5G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated but, with more significant variation, a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).
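The calculation described above can be written as a short sketch. The function name and the example counts are hypothetical and chosen only to show the arithmetic.

```python
def mapping_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 950 mapped reads and 50 unmapped reads.
rate = mapping_rate(950, 50)
print(f"{rate:.1%}")
```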

Mapped: 100 % (247 479 381 reads); unmapped: 0 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads that do not map at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 100 % (247 479 381 reads); remainder: 0 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton could occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
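Assuming the alignments are stored in SAM/BAM format (an assumption; the page does not name the file format), the categories above can be derived from the FLAG field, whose bits 0x1 (read paired), 0x4 (read unmapped) and 0x8 (mate unmapped) are defined in the SAM specification. The function name and category labels below are illustrative only.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag):
    """Classify a read by its SAM FLAG into one of four illustrative labels."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        # This read is mapped but its mate is not: a singleton.
        return "singleton"
    return "both_mates_mapped"


for flag in (99, 73, 133):
    print(flag, classify_read(flag))
```

Counting the reads in each category and dividing by the total would give rates comparable to those reported in the charts.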

Singletons: 0 % (0 reads); remainder: 100 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so that after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
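As a quick check of the arithmetic, the sketch below computes the forward-strand fraction from the counts shown on this page, assuming that the 247 479 381 reads reported in the Mapped Reads section are the denominator. The function name is illustrative.

```python
def forward_fraction(forward_reads, mapped_reads):
    """Return the fraction of mapped reads that lie on the forward strand."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return forward_reads / mapped_reads


# Counts taken from this report; the choice of denominator is an assumption.
fraction = forward_fraction(123_279_841, 247_479_381)
print(f"{fraction:.1%}")
```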

Forward strand: 49.8 % (123 279 841 reads); reverse strand: 50.2 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with a large separation would be a signal for this variant, and those reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
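The expected separation in the example above follows from subtracting both read lengths from the fragment length. A minimal sketch, with an illustrative function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Return the expected gap between two mates sequenced from either end.

    A negative result would mean that the two mates overlap.
    """
    return fragment_length - 2 * read_length


# The example from the text: ~500 bp fragments and 100 bp reads.
print(expected_inner_distance(500, 100))
```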

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).

Proper pairs: 100 % (247 479 381 reads); remainder: 0 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
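The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: each pair is represented by a hypothetical tuple of read name, chromosome, position, mate chromosome and mate position, and the first pair seen at a given set of coordinates is kept while later ones are marked. Real duplicate-marking tools apply further criteria.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if its coordinates were seen before.

    `pairs` is a list of (name, chrom, pos, mate_chrom, mate_pos) tuples,
    a hypothetical simplified representation of aligned read pairs.
    """
    seen = set()
    flags = []
    for _name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        flags.append(key in seen)
        seen.add(key)
    return flags


def duplicate_rate(flags):
    """Return the fraction of pairs marked as duplicates."""
    return sum(flags) / len(flags) if flags else 0.0


# Toy data: the second pair shares both coordinates with the first.
toy_pairs = [
    ("r1", "1", 100, "1", 400),
    ("r2", "1", 100, "1", 400),
    ("r3", "1", 100, "1", 450),
    ("r4", "2", 100, "2", 400),
]
flags = mark_duplicates(toy_pairs)
print(flags, duplicate_rate(flags))
```

Note that a scheme like this cannot distinguish true PCR duplicates from independent fragments that happen to share coordinates, which is why the rate becomes less informative at high depth.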

Duplicates: 37.2 % (91 980 270 reads); non-duplicates: 62.8 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect to see this distribution heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 70). Y-axis: # Reads (20M to 180M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads may map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read then has the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (labelled 1 to 21, X, Y, M). Y-axis: 0% to 100%. Every column shows 100% mapped and 0% unmapped.]