European Genome-Phenome Archive

File Quality

File Information

EGAF00004840340

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, that is, the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The data are plotted on a log scale to reduce the visual effect of the large differences between counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (ticks from 100 to 900, then >1000). Y-axis: # bases, log scale from 1k to 100M.]
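To make the chart's construction concrete, here is a minimal sketch of how such a distribution could be tallied from per-position depths. The function name, the `cap` parameter and the sample depths are illustrative assumptions, not the archive's actual pipeline; pooling values above 1000 into a single bin simply mirrors the ">1000" label on the x-axis.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin, stored
    under the key cap + 1 (an assumption made for this sketch).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist

# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500, 2000]
hist = coverage_histogram(depths)

for value in sorted(hist):
    label = f">{1000}" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the resulting counts on a logarithmic y-axis gives a curve of the kind described above.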

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but in general it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 16G).]
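As a worked illustration of the score, the sketch below converts between Phred scores and error probabilities using the standard Phred definition, Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# A score of 40 corresponds to 1 error in 10 000 calls,
# a score of 20 to 1 error in 100.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in the score is a tenfold drop in the error probability, which is why a distribution concentrated near 40 indicates confident base calls.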

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.3 % (157 708 108 reads). Unmapped: 0.7 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that reads not mapped at the expected distance are also included, as is the case in the proper pairs chart.

Both mates mapped: 99.1 % (157 519 412 reads). Remainder: 0.9 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (188 696 reads). Remainder: 99.9 %.
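The mapped, both-mates-mapped and singleton categories can be sketched in terms of the SAM FLAG bits 0x1 (read is paired), 0x4 (read unmapped) and 0x8 (mate unmapped). This is a simplified illustration under the assumption that every record is counted once; it is not necessarily how the figures in this report were computed. The reported counts are at least consistent with the categories being additive: 157 519 412 + 188 696 = 157 708 108.

```python
from collections import Counter

# SAM FLAG bits (from the SAM specification).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify(flag):
    """Assign a read to one category based on its SAM FLAG."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"

def summarise(flags):
    """Return the rates as fractions of all reads (mapped + unmapped)."""
    counts = Counter(classify(f) for f in flags)
    total = sum(counts.values())
    mapped = total - counts["unmapped"]
    return {
        "mapped_rate": mapped / total,
        "both_mates_rate": counts["both_mates_mapped"] / total,
        "singleton_rate": counts["singleton"] / total,
    }

# Hypothetical flags: one fully mapped pair (99, 147) and one pair
# with a mapped mate (73) and an unmapped mate (133).
rates = summarise([99, 147, 73, 133])
print(rates)
```

In this toy input the read with flag 73 is the singleton, and its unmapped mate (flag 133) counts against the mapped rate.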

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (79 449 899 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, the affected reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.5 % (154 848 398 reads). Remainder: 2.5 %.
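The arithmetic behind the ~300 base pair figure, and the way a proper pair is usually recorded, can be sketched as follows. The helper names are illustrative. The 0x2 bit is the SAM FLAG that aligners set for pairs they consider properly aligned; each aligner applies its own distance and orientation criteria, which this sketch does not reproduce.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned

def expected_inner_distance(fragment_len, read_len):
    """Gap expected between the two mates of a fragment."""
    return fragment_len - 2 * read_len

def is_flagged_proper_pair(flag):
    """True if the aligner marked this read as part of a proper pair."""
    return bool(flag & PROPER_PAIR)

# The example from the text: ~500 bp fragments, 100 bp reads.
print(expected_inner_distance(500, 100))

# Flag 99 has the proper-pair bit set; flag 65 does not.
print(is_flagged_proper_pair(99), is_flagged_proper_pair(65))
```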

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 15.3 % (24 292 142 reads). Remainder: 84.7 %.
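The coordinate-based approach described above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any particular tool: the record fields are hypothetical, the first read seen at a given set of coordinates is kept, and real duplicate markers are more careful (for instance about clipped bases and about which copy to keep).

```python
def mark_duplicates(reads):
    """Return one boolean per read: True if an earlier read shares
    the same chromosome, position, strand and mate coordinates."""
    seen = set()
    is_duplicate = []
    for r in reads:
        key = (r["chrom"], r["pos"], r["reverse"],
               r["mate_chrom"], r["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

# Hypothetical reads; the third has the same coordinates as the first.
reads = [
    {"chrom": "1", "pos": 100, "reverse": False, "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 150, "reverse": False, "mate_chrom": "1", "mate_pos": 450},
    {"chrom": "1", "pos": 100, "reverse": False, "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "2", "pos": 100, "reverse": True, "mate_chrom": "2", "mate_pos": 10},
]
dups = mark_duplicates(reads)
duplicate_rate = sum(dups) / len(dups)
print(dups, duplicate_rate)
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also marked, which is the over-counting effect at high depth noted above.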

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, and so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (20M to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as in the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads may still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: stacked columns of mapped vs unmapped reads per chromosome. X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped fractions range from 99.85% to 99.91%; unmapped fractions range from 0.09% to 0.15%.]
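A per-chromosome tally of this kind can be sketched from (reference name, FLAG) pairs. The sketch assumes, as described above, that an unmapped read with a mapped mate carries its mate's chromosome, and it skips reads with no placement at all (reference name `*` in SAM). The function name and sample records are illustrative, not the archive's implementation.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: read unmapped

def per_chromosome_mapped_fraction(records):
    """Fraction of mapped reads per chromosome.

    `records` is an iterable of (chrom, flag) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue  # unplaced read: belongs to no chromosome
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records: flag 133 is an unmapped read placed with its
# mate on chromosome 1; flag 77 is an unplaced read.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133),
           ("X", 99), ("*", 77)]
fractions = per_chromosome_mapped_fraction(records)
print(fractions)
```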