European Genome-Phenome Archive

File Quality

File Information: EGAF00003613221

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: number of bases, log scale (1k to 20M).]
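
As a minimal sketch of how such a distribution could be tabulated, assume the per-position depths are already available as a list of integers. The function name `coverage_histogram`, the `cap` parameter, and the sample depths are illustrative assumptions; pooling everything above 1000 into one bin simply mirrors the ">1000" label on the chart's x-axis and is not a confirmed detail of how the archive builds it.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`
    (an assumption mirroring the '>1000' bin on the chart).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(sorted(hist.items()))

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2000]
print(coverage_histogram(depths))
```

Plotting the resulting counts with a logarithmic y-axis would give a chart of the kind described above.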

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. that there is an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in all cases the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0G to 1.1G).]
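
To make the scale concrete, the following sketch converts between Phred scores and error probabilities using the standard Phred definition, Q = -10 log10(P). The function names are illustrative and not part of any EGA tool.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)

# A score of 40 corresponds to roughly one error in 10 000 calls,
# and a score of 20 to roughly one error in 100.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

The same conversion applies to mapping quality scores, where P is the probability that a read is placed at the wrong location.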

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated but, with more significant variation, a read can fail to be placed. The mapping rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (12 825 218 reads). Unmapped: 0.2 %.
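
The calculation can be sketched as below. The function name and the example counts are hypothetical; only the formula, mapped / (mapped + unmapped), comes from the description above.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=998, unmapped=2)
print(f"{rate:.1%}")
```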

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance from each other are also included here; the expected-distance criterion is what the Proper Pairs chart reports.

Both mates mapped: 99.6 % (12 800 700 reads). Remaining reads: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (24 518 reads). Remaining reads: 99.8 %.
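
As a sketch of how a read could be classified, the example below assumes the alignments are stored in SAM/BAM format and uses the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The page does not state how the archive computes these figures, so the helper names and the logic are illustrative only.

```python
# Standard SAM FLAG bits (from the SAM format specification).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def is_singleton(flag):
    """Mapped read from a pair whose mate is unmapped."""
    return bool(flag & PAIRED) and not flag & UNMAPPED and bool(flag & MATE_UNMAPPED)

def both_mates_mapped(flag):
    """Mapped read from a pair whose mate is also mapped."""
    return bool(flag & PAIRED) and not flag & UNMAPPED and not flag & MATE_UNMAPPED

# Example flags: 73 = paired + mate unmapped + first in pair,
# 99 = paired + proper pair + mate reverse + first in pair,
# 69 = paired + read unmapped + first in pair.
for flag in (73, 99, 69):
    print(flag, is_singleton(flag), both_mates_mapped(flag))
```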

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so that after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (6 425 280 reads). Remaining reads: 50 %.
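
A minimal sketch of this fraction, again assuming SAM-style FLAG values (0x10 marks a read aligned to the reverse strand, 0x4 an unmapped read). Restricting the denominator to mapped reads is an assumption; the page does not say which denominator the archive uses.

```python
REVERSE = 0x10   # standard SAM FLAG bit: read aligned to the reverse strand
UNMAPPED = 0x4   # standard SAM FLAG bit: read unmapped

def forward_fraction(flags):
    """Fraction of mapped reads aligned to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: 99 and 163 are forward, 147 and 83 are reverse,
# 77 is unmapped and therefore ignored.
print(forward_fraction([99, 147, 83, 163, 77]))
```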

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.8 % (12 696 278 reads). Remaining reads: 1.2 %.
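
The arithmetic behind the ~300 base pair figure can be written out as a short sketch. The function name is illustrative; the numbers are the ones used in the example above.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: the fragment length minus both reads."""
    return fragment_len - 2 * read_len

# A ~500 bp fragment with a 100 bp read from each end
# leaves ~300 bp between the two reads.
print(expected_inner_distance(fragment_len=500, read_len=100))  # 300
```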

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. To detect duplicates, algorithms typically look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As the sequencing depth approaches and surpasses the read length, the duplicate rate therefore becomes less indicative of problems.

Duplicates: 63.3 % (8 131 936 reads). Remaining reads: 36.7 %.
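
The coordinate-based approach described above can be sketched as follows. This is a simplified illustration and not the method used by the archive or by any particular duplicate-marking tool. Each read is represented by a hypothetical tuple of (chromosome, position, strand, mate chromosome, mate position), and every copy beyond the first in a group is counted as a duplicate, which is an assumption about the counting convention.

```python
from collections import Counter

def duplicate_rate(reads):
    """Fraction of reads that repeat an already seen coordinate signature.

    `reads` is an iterable of hashable signatures, e.g.
    (chrom, pos, strand, mate_chrom, mate_pos).
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = sum(n - 1 for n in counts.values())  # copies beyond the first
    return duplicates / total

# Hypothetical reads: three share one signature, one is unique.
reads = [("1", 100, "+", "1", 400)] * 3 + [("1", 200, "+", "1", 500)]
print(duplicate_rate(reads))  # 0.5
```

Because the signature is purely positional, independent fragments that happen to share coordinates are counted as duplicates too, which is the effect described above for high sequencing depth.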

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (1M to 12M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: stacked columns of mapped and unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M; y-axis: 0% to 100%. Mapped fractions range from 98.75% to 99.85%, unmapped fractions from 0.15% to 1.25%.]