European Genome-Phenome Archive

File Quality

File Information: EGAF00000726525

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of positions (bases) in the reference file that are covered at that value.

The number of bases is plotted on a log scale to compress its wide range. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (up to >1000). Y-axis: # Bases (log scale, 1k to 100M).]
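As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value, pooling everything above a cap into a single overflow bin (mirroring the ">1000" bin on the chart). The function name, the `cap` parameter and the input format (one depth value per reference position) are assumptions for illustration, not the archive's actual implementation.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count reference positions per coverage value.

    depths: iterable of per-position coverage values (hypothetical input).
    Values above `cap` are pooled into one overflow bin keyed as cap + 1.
    """
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else cap + 1
        hist[key] += 1
    return hist

# Toy example: two uncovered positions, and one position far above the cap.
hist = coverage_histogram([0, 0, 1, 5, 2000])
print(sorted(hist.items()))
```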

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect a distribution skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0M to 240M).]
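The relationship between a Phred score and an error probability can be sketched in a few lines of Python, using the standard definition Q = -10 log10(P). The function names are illustrative and do not belong to any EGA tool.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p, with 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 20 corresponds to an error probability of 1 in 100, and a score of 40 to 1 in 10 000.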

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped: 98.9 % (15 322 215 reads). Remainder: 1.1 %.
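The mapped rate calculation described above amounts to the following. This is a minimal sketch; the function name and the counts in the example are hypothetical, since the page reports the mapped count and percentage but not the raw unmapped count.

```python
def mapped_rate(mapped: int, unmapped: int) -> float:
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts chosen only to illustrate the formula.
rate = mapped_rate(mapped=989, unmapped=11)
print(f"{rate:.1%}")
```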

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; compare the Proper Pairs chart, which counts only pairs that map at the expected distance.

Both mates mapped: 98.6 % (15 271 748 reads). Remainder: 1.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (50 467 reads). Remainder: 99.7 %.
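One way to tell singletons apart from reads with both mates mapped is sketched below. It assumes the alignments are stored in SAM/BAM format, where the FLAG field encodes "read paired" (0x1), "read unmapped" (0x4) and "mate unmapped" (0x8) as bits; this page does not say how the archive itself derives these categories, so the function and its labels are illustrative only.

```python
# SAM FLAG bits used here (from the SAM format specification).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def classify_pair_status(flag: int) -> str:
    """Classify a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"  # this read maps, its mate does not
    return "unmapped"

for flag in (99, 73, 133):
    print(flag, classify_pair_status(flag))
```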

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (7 745 337 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads spanning it map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation of each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 98.3 % (15 231 940 reads). Remainder: 1.7 %.
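The arithmetic behind the ~300 base pair figure above is the fragment length minus the two read lengths. A minimal sketch, with a hypothetical function name:

```python
def expected_inner_distance(fragment_length: int, read_length: int) -> int:
    """Expected gap between two mates read from opposite ends of a fragment.

    A negative result means the two reads would overlap.
    """
    return fragment_length - 2 * read_length

# The example from the text: ~500 bp fragments with 100 bp reads.
print(expected_inner_distance(500, 100))  # 300
```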

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, so it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As a result, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 19.8 % (3 062 348 reads). Remainder: 80.2 %.
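The coordinate-based approach described above can be sketched as follows: a pair is flagged as a duplicate when an earlier pair has the same read and mate coordinates. This is a simplified illustration, not the archive's method; the `Pair` record and `find_duplicates` function are hypothetical, and the sketch ignores details such as strand and the choice of which copy to keep.

```python
from typing import Iterable, List, NamedTuple

class Pair(NamedTuple):
    """Hypothetical record for one read pair."""
    name: str
    chrom: str
    pos: int
    mate_chrom: str
    mate_pos: int

def find_duplicates(pairs: Iterable[Pair]) -> List[str]:
    """Return names of pairs whose coordinates match an earlier pair."""
    seen = set()
    duplicates = []
    for pair in pairs:
        key = (pair.chrom, pair.pos, pair.mate_chrom, pair.mate_pos)
        if key in seen:
            duplicates.append(pair.name)
        else:
            seen.add(key)
    return duplicates

pairs = [
    Pair("r1", "1", 1000, "1", 1400),
    Pair("r2", "1", 1000, "1", 1400),  # same coordinates as r1
    Pair("r3", "1", 1000, "1", 1450),  # mate maps elsewhere, so not flagged
]
print(find_duplicates(pairs))  # ['r2']
```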

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores, which describe the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (1M to 13M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (labels 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%). Every column shows 100% mapped and 0% unmapped.]