European Genome-Phenome Archive

File Quality

File Information: EGAF00000852734

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of bases (positions in the reference file) covered that many times.

Data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
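As a minimal sketch of how such a distribution could be tallied, the snippet below counts how many positions have each coverage value. The function name `coverage_distribution`, the `cap` parameter and the sample depths are all hypothetical and are not taken from the EGA pipeline; pooling values above 1000 into one bin simply mirrors the ">1000" tick on the chart.

```python
from collections import Counter


def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single bin stored under the
    key `cap + 1` (an assumption made to mimic a '>1000' bin).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(sorted(hist.items()))


# Made-up per-position depths, for illustration only.
example_depths = [1, 1, 1, 2, 2, 5, 1500, 2300]
example_distribution = coverage_distribution(example_depths)
```

Plotting the counts on a logarithmic y-axis would then give a curve of the kind described above.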

[Chart: Base Coverage Distribution. X-axis: Coverage value (100 to >1000); y-axis: # Bases (log scale, 2k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
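To make the relationship between score and error probability concrete, here is a small sketch of the standard Phred conversion, Q = -10 log10(P), in both directions. The function names are hypothetical and chosen for illustration.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.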

[Chart: Base Quality. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 1.4G).]

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and both mates). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but for more significant variation the read can fail to be placed. Therefore, the mapped read rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
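The rate calculation described above can be sketched as follows. The function name `mapped_rate` is hypothetical, and the counts in the example are made up for illustration; they are not this file's read counts.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Illustrative counts only: 946 mapped and 54 unmapped reads.
example_rate = mapped_rate(946, 54)
```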

Mapped: 94.6 % (126 846 726 reads); remainder: 5.4 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as occurs with the proper pairs chart.

Both mates mapped: 94.6 % (126 846 726 reads); remainder: 5.4 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
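In SAM/BAM alignments, this kind of classification can be read from the FLAG field. The sketch below uses the standard SAM flag bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). It is an assumption that the report is derived from these flags, and the function names are hypothetical.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def is_singleton(flag):
    """True if the read is paired and mapped while its mate is unmapped."""
    return bool(flag & PAIRED) and not flag & UNMAPPED and bool(flag & MATE_UNMAPPED)


def both_mates_mapped(flag):
    """True if the read is paired and neither it nor its mate is unmapped."""
    return bool(flag & PAIRED) and not flag & (UNMAPPED | MATE_UNMAPPED)
```

For example, a flag of 73 (paired, mate unmapped, first in pair) marks a singleton, whereas 99 (paired, proper pair, mate on reverse strand, first in pair) has both mates mapped.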

Singletons: 0 % (0 reads); remainder: 100 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them should map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (67 016 814 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, the two reads may map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
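The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. A minimal sketch follows; the function names and the tolerance-based check are hypothetical illustrations, not the aligner's actual proper-pair rule.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two reads: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length


def separation_as_expected(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` of the expected gap.

    The tolerance is an illustrative assumption; real aligners typically
    estimate an acceptable range from the data.
    """
    return abs(observed - expected) <= tolerance


# 500 bp fragments sequenced with 100 bp reads from each end.
example_gap = expected_inner_distance(500, 100)
```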

Proper pairs: 94.6 % (126 846 726 reads); remainder: 5.4 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data are analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
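The coordinate-based approach described above can be sketched in a few lines. This is a deliberately simplified illustration: the function name and the record layout are hypothetical, and real duplicate-marking tools also take strand and orientation into account and choose which copy to keep by quality rather than by order of appearance.

```python
def find_duplicates(pairs):
    """Return the names of read pairs flagged as duplicates.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos).
    The first pair seen at a given set of coordinates is kept; any later
    pair with identical coordinates is reported as a duplicate.
    """
    seen = set()
    duplicates = set()
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates


# Made-up read pairs, for illustration only.
example_pairs = [
    ("r1", "1", 1000, "1", 1300),
    ("r2", "1", 1000, "1", 1300),  # same coordinates as r1
    ("r3", "1", 1000, "1", 1450),  # same start, different mate position
    ("r4", "2", 1000, "2", 1300),  # same positions, different chromosome
]
example_duplicates = find_duplicates(example_pairs)
```

Because only coordinates are compared, two independent fragments that happen to share both positions would also be flagged, which is the false-positive effect at high depth noted above.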

Duplicates: 29.8 % (39 925 216 reads); remainder: 70.2 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 240); y-axis: # Reads (10M to 100M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.
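A per-chromosome tally of this kind can be sketched from two SAM fields: the reference name (RNAME) and the FLAG, where bit 0x4 marks a read as unmapped. The function name and the input layout below are hypothetical, and whether the report is computed this way is an assumption.

```python
from collections import defaultdict

UNMAPPED = 0x4  # standard SAM FLAG bit: this read is unmapped


def per_chromosome_counts(records):
    """Count mapped and unmapped reads per chromosome.

    `records` is an iterable of (rname, flag) tuples. Reads with no
    chromosome assigned (RNAME '*') are skipped.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue
        key = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][key] += 1
    return dict(counts)


# Made-up records, for illustration only.
example_records = [
    ("1", 99),   # mapped, proper pair
    ("1", 147),  # mapped, proper pair, reverse strand
    ("1", 73),   # mapped, mate unmapped
    ("1", 133),  # unmapped, placed at its mate's chromosome
    ("X", 0),    # mapped, unpaired
    ("*", 77),   # unmapped pair with no chromosome assigned
]
example_counts = per_chromosome_counts(example_records)
```

In the example, the read with flag 133 is counted as unmapped on chromosome 1 because it carries its mate's chromosome, matching the behaviour described above.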

[Chart: Mapped vs Unmapped. X-axis: chromosome (numbered chromosomes, X, Y, M); y-axis: percentage of reads (0% to 100%); series: mapped, unmapped. Every column shows 100% mapped and 0% unmapped.]