European Genome-Phenome Archive

File Quality

File Information

EGAF00000655902

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the start (low coverage) followed by a descending curve is expected.
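As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions share each coverage value and pools everything above 1000 into one bin, mirroring the ">1000" label on the chart. The function name, the `cap` parameter, and the depth values are made up for the example; this is not necessarily how the archive computes the chart.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count bases per coverage value; values above `cap` share one overflow bin."""
    hist = Counter()
    for depth in depths:
        # Key cap + 1 stands for the ">cap" overflow bin.
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Made-up per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2000]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a logarithmic y-axis would then give a chart of the kind described above.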

[Chart: base coverage distribution. X-axis: coverage value, 0 to >1000. Y-axis: number of bases, log scale from 10 to 10M.]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
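To make the scale concrete, here is a minimal sketch of the standard Phred conversion, Q = -10 log10(P), in both directions. The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

On this scale a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 20 to 1 in 100.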

[Chart: base quality distribution. X-axis: Phred quality score, 0 to 40. Y-axis: number of bases, 0M to 140M.]

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
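The rate calculation can be sketched as follows; note that the denominator is the sum of mapped and unmapped reads. The function name and the counts are illustrative and are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return the fraction of reads that are mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Illustrative counts only.
print(f"{mapped_rate(992, 8):.1%}")
```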

Mapped reads: 99.2 % (7 223 733 reads); remainder: 0.8 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; the Proper Pairs chart addresses whether mates map at the expected distance.

Both mates mapped: 98.9 % (7 204 092 reads); remainder: 1.1 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
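If the alignments are stored in SAM/BAM format (an assumption; this page does not say how the figures are derived), the distinction between a singleton and a read with both mates mapped can be read from the FLAG field. The sketch below uses the standard SAM flag bits 0x1 (paired), 0x4 (read unmapped), and 0x8 (mate unmapped); the function name and category labels are invented for the example.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Classify a SAM FLAG value as 'unpaired', 'unmapped', 'singleton', or 'both_mapped'."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate is unmapped
    return "both_mapped"

for flag in (99, 73, 133):
    print(flag, classify(flag))
```

Counting the reads in each category and dividing by the total number of reads would give rates comparable to the ones reported on this page.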

Singletons: 0.3 % (19 641 reads); remainder: 99.7 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (3 641 333 reads); remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
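The arithmetic behind the ~300 base pair figure is the fragment length minus the two read lengths. The sketch below works through it and adds a simple distance check; the function names and the fixed `tolerance` are hypothetical, since the actual rule for what counts as "expected" depends on the aligner.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between two mates read from either end of a fragment."""
    return fragment_len - 2 * read_len

def within_expected_distance(observed_gap, expected, tolerance):
    """True if the observed gap is within `tolerance` bases of the expected gap."""
    return abs(observed_gap - expected) <= tolerance

gap = expected_gap(500, 100)
print(gap)

# Hypothetical tolerance of 100 bases, for illustration only.
print(within_expected_distance(320, gap, tolerance=100))
print(within_expected_distance(5000, gap, tolerance=100))
```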

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.5 % (7 172 072 reads); remainder: 1.5 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
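The coordinate-based approach described above can be sketched in a simplified form: a pair is flagged when both its own position and its mate's position repeat those of an earlier pair. The tuple layout, function name, and read names are assumptions for this example; real duplicate-marking tools also consider details such as strand and which copy to keep.

```python
def find_duplicates(pairs):
    """Return names of pairs whose coordinates repeat those of an earlier pair.

    Each pair is a tuple: (name, chrom, start, mate_chrom, mate_start).
    """
    seen = set()
    duplicates = []
    for name, chrom, start, mate_chrom, mate_start in pairs:
        key = (chrom, start, mate_chrom, mate_start)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates

# Made-up pairs, for illustration only.
pairs = [
    ("r1", "1", 100, "1", 400),
    ("r2", "1", 100, "1", 400),  # same coordinates as r1
    ("r3", "1", 100, "1", 450),  # same start, different mate position
    ("r4", "2", 100, "2", 400),  # same positions, different chromosome
]
print(find_duplicates(pairs))
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is the false-positive effect at high depth that the paragraph above describes.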

Duplicates: 30.3 % (2 203 855 reads); remainder: 69.7 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score, 0 to 60. Y-axis: number of reads, 0.5M to 6M.]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still be placed on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads can also arise when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the correct chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (stacked columns). X-axis: chromosome, labelled 1 to 21, X, Y, M. Y-axis: 0% to 100%. Series: mapped, unmapped. Every column is labelled 100% mapped and 0% unmapped.]