European Genome-Phenome Archive

File Quality

File Information: EGAF00005191000

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value, 0 to >1000. Y-axis: # bases, log scale from 1k to 100M.]
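As a minimal sketch of how a distribution like this could be tallied, the snippet below counts how many positions share each coverage value and groups everything above 1000 into a single bin, mirroring the ">1000" label on the chart. The function name, the `cap` parameter, and the sample depths are hypothetical; the report does not say how the chart is actually computed.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin.

    `depths` is an iterable of per-position coverage values (assumed input).
    The overflow bin is stored under the key `cap + 1`.
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 3, 3, 3, 30, 1500]
hist = coverage_histogram(depths)
```

Plotting `hist` with a logarithmic y-axis would give a chart of the same shape as the one described above.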

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score, 0 to 40. Y-axis: # bases, 0G to 30G.]
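To make the score concrete, here is a small sketch of the conversion between a Phred score and an error probability, using the standard Phred definition Q = -10 log10(P). The function names are illustrative and are not taken from any particular tool.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


p40 = phred_to_error_prob(40)   # one expected error in 10 000 base calls
q = error_prob_to_phred(0.001)  # one expected error in 1 000 base calls
```

Under this definition a score of 40 corresponds to an error probability of 0.0001, which is why a distribution concentrated near 40 indicates confident base calls.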

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation a read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.8 % (229 403 678 reads). Unmapped: 0.2 %.
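The rate calculation described above can be sketched as follows. The mapped count is the one shown for this file; the unmapped count is a HYPOTHETICAL value chosen only to illustrate the arithmetic, since the report does not list it.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


mapped = 229_403_678  # mapped read count shown for this file
unmapped = 460_000    # HYPOTHETICAL: not given in the report
rate = mapped_rate(mapped, unmapped)
```

Note the parentheses around the denominator: the division applies to the total read count, not to the mapped count alone.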

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that pairs whose mates are not mapped at the expected distance are also included here (compare the Proper Pairs chart).

Both mates mapped: 99.6 % (228 997 750 reads). Other: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.2 % (405 928 reads). Other: 99.8 %.
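The categories used in this report (unmapped, singleton, both mates mapped) can be sketched as a classification over SAM FLAG bits. This assumes the alignments are stored in SAM/BAM format with the standard FLAG bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped); the report itself does not state how the categories are derived, and the function name is hypothetical.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def classify_read(flag):
    """Classify a read by its SAM FLAG (assumed input format)."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "mapped_unpaired"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mapped"


# Consistency check on the figures shown in this report:
# both-mates-mapped reads plus singletons equals the mapped read count.
both_mapped, singletons, mapped = 228_997_750, 405_928, 229_403_678
consistent = both_mapped + singletons == mapped
```

For this file the two categories add up exactly to the mapped read count, which is what a partition of mapped reads into singletons and fully mapped pairs would produce.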

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (114 933 613 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads around it map with a separation far from the expected one; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.1 % (225 394 290 reads). Other: 1.9 %.
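The worked example in the description (500 bp fragments, 100 bp reads, ~300 bp between the reads) follows from simple arithmetic, sketched below. The `tolerance` parameter and the `separation_as_expected` helper are hypothetical illustrations; aligners generally derive the acceptable range from the observed data rather than from a fixed value.

```python
def expected_separation(fragment_len, read_len):
    """Gap between the two reads of a pair: fragment minus both reads."""
    return fragment_len - 2 * read_len


def separation_as_expected(observed_gap, fragment_len, read_len, tolerance=150):
    """HYPOTHETICAL check: is the observed gap within `tolerance` bp
    of the expected one? The tolerance value is an assumption."""
    expected = expected_separation(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance


gap = expected_separation(500, 100)
ok = separation_as_expected(320, 500, 100)       # close to 300 bp
far = separation_as_expected(5_000, 500, 100)    # far from 300 bp
```

A pair whose gap is thousands of base pairs away from the expectation, as in the last call, is the kind of pair that would not count as proper.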

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 3.1 % (7 130 010 reads). Other: 96.9 %.
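The coordinate-based detection described above can be sketched as follows: a pair is flagged as a duplicate when an earlier pair has the same read coordinate and the same mate coordinate. This is a simplified illustration with hypothetical names and data; real duplicate-marking tools typically also take strand and orientation into account and choose which copy to keep.

```python
def mark_duplicates(pairs):
    """Flag every pair whose (chrom, pos, mate_chrom, mate_pos) key
    was already seen. The first occurrence is kept as the original."""
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


# Hypothetical read pairs: (chrom, pos, mate_chrom, mate_pos).
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same coordinates as the first pair
    ("1", 100, "1", 450),  # same read start, different mate position
    ("2", 100, "2", 400),
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
```

The sketch also shows why very deep sequencing inflates the rate: the more pairs are sampled, the more likely two independent fragments are to share both coordinates by chance.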

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score, 0 to 60. Y-axis: # reads, 20M to 200M.]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

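[Chart: mapped vs unmapped reads per chromosome, stacked columns from 0% to 100%. X-axis labels: 1 to 21, X, Y, M.]

Mapped, in chart order: 99.82%, 99.81%, 99.84%, 99.84%, 99.83%, 99.83%, 99.82%, 99.83%, 99.82%, 99.81%, 99.82%, 99.82%, 99.84%, 99.82%, 99.81%, 99.81%, 99.78%, 99.83%, 99.76%, 99.79%, 99.81%, 99.83%, 99.54%, 99.33%

Unmapped, in chart order: 0.18%, 0.19%, 0.16%, 0.16%, 0.17%, 0.17%, 0.18%, 0.17%, 0.18%, 0.19%, 0.18%, 0.18%, 0.16%, 0.18%, 0.19%, 0.19%, 0.22%, 0.17%, 0.24%, 0.21%, 0.19%, 0.17%, 0.46%, 0.67%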
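A per-chromosome breakdown like this one can be sketched by counting mapped and unmapped reads under the chromosome each record carries. As described above, an unmapped read carries the chromosome of its mapped mate, so it is counted under that chromosome. The record layout and function name are hypothetical.

```python
from collections import Counter


def per_chromosome_mapped_rate(records):
    """Mapped fraction per chromosome.

    `records` is an iterable of (chrom, is_mapped) tuples (assumed layout);
    an unmapped read is expected to carry its mate's chromosome.
    """
    mapped = Counter()
    total = Counter()
    for chrom, is_mapped in records:
        total[chrom] += 1
        if is_mapped:
            mapped[chrom] += 1
    return {chrom: mapped[chrom] / total[chrom] for chrom in total}


# Hypothetical records, for illustration only.
records = [
    ("1", True), ("1", True), ("1", True), ("1", False),
    ("X", True), ("X", False),
]
rates = per_chromosome_mapped_rate(records)
```

The unmapped fraction for each chromosome is simply one minus the mapped fraction, which is why the two series in the chart always sum to 100%.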