European Genome-Phenome Archive

File Quality

File Information

EGAF00000658137

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, and the y-axis shows the number of positions (bases) in the reference that are covered that many times.

The y-axis uses a log scale to compress the wide range of counts. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases, log scale (10 to 10M).]
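
As a rough illustration of how such a distribution can be tallied, the sketch below counts how many positions share each coverage value. The function name, the `cap` parameter and the sample depths are hypothetical; grouping everything above 1000 into one bin simply mirrors the ">1000" label on the chart axis and is not taken from the archive's own pipeline.

```python
from collections import Counter

def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` share a single overflow bin stored under cap + 1
    (an assumption that mirrors the ">1000" axis label).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500]
dist = coverage_distribution(depths)
print(dist)
```

Plotting the counts on a log-scaled y-axis would then give a curve of the shape described above.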
0319519859539369619489369058728838788798919148298848428178298388357797968417467477507347467717257377127247527447397587587687287356847076246226666046326346215975725485926196075785965685615355535585365095195055105225015074854474674444314684354134414384334144294134184124233914003653943293723283343483443463323203333373163183232862703152903122702972962532472562552532192352372362602342332171992251961681941911732061801781831661591761551701571421541531831421471511711471601331371171221191281261191341121281221091021011051219996120104118939180888883877410091859397888487968185927996749195959095928586877692787980677372577369736654585657465151635647574736434155434236494344403939274030393954253332283323252634242527253427282727213321283637242625183228222832293323303121332825252627272923272547243018232629302015242323202422211720222121281920211526311930271833342626352427243026242521232321202319211625102116211313229191519172117162622221823172021122214161692215189241420127171882320157915111218101213151712912161610131211615151113101197613779109128119851015136611910131468107121096108968148611117136610931312641112161191271281339987138596871111121353118951181689138811910711910267411734856879461089697109359610671281117641691171211131371111365997141112465345843457569751481177712914871110736106111 840100200300400500600700800900>1000Coverage value101001k10k100k1M10M# Bases

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, e.g. because of a sequencing error. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0M to 110M).]
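
The relationship between a Phred score and the error probability can be made concrete with a small sketch. The function names are illustrative only; the formula is the standard Phred definition, Q = -10 × log10(P).

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P (P must be > 0)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to roughly one wrong call in 10 000, and each step of 10 reduces the error probability tenfold.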

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99 % (6 944 020 reads). Unmapped: 1 %.
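
The rate calculation described above is a simple proportion. The sketch below uses a hypothetical function name and made-up counts, since the page reports only the mapped count and rounded percentages.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=990, unmapped=10)
print(f"{rate:.1%}")
```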

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here, unlike in the Proper Pairs chart.

Both mates mapped: 98.6 % (6 917 896 reads). Other: 1.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.4 % (26 124 reads). Other: 99.6 %.
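
In SAM/BAM files, the mapping status of a read and its mate is encoded in the FLAG bit field, so the categories above can be separated with a few bit tests. The sketch below is one possible way to do this. The bit values (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) come from the SAM format, while the function name and category labels are hypothetical and not necessarily how this report was produced.

```python
# FLAG bits defined by the SAM format.
PAIRED = 0x1          # read is part of a pair
READ_UNMAPPED = 0x4   # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped

def classify(flag):
    """Classify a read by its own and its mate's mapping status."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & READ_UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate is unmapped
    return "unmapped"

for flag in (99, 73, 133):
    print(flag, classify(flag))
```

Here 99 is a typical flag for a mapped read with a mapped mate, 73 for a mapped read whose mate is unmapped, and 133 for the unmapped mate itself.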

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (3 508 392 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, the mates map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.3 % (6 894 490 reads). Other: 1.7 %.
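
The ~300 base pair figure in the example above follows from subtracting both read lengths from the fragment length. The sketch below reproduces that arithmetic and adds a simple distance check. The function names and the `tolerance` parameter are hypothetical; real aligners apply their own criteria, usually based on the observed insert size distribution, when deciding what counts as a proper pair.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len

def is_expected_distance(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is an illustrative assumption, not an aligner setting.
    """
    return abs(observed - expected) <= tolerance

expected = expected_inner_distance(fragment_len=500, read_len=100)
print(expected)
print(is_expected_distance(observed=320, expected=expected, tolerance=100))
print(is_expected_distance(observed=5000, expected=expected, tolerance=100))
```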

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 21 % (1 475 682 reads). Other: 79 %.
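
The coordinate-based detection described above can be sketched as follows. This is a deliberately simplified illustration, assuming each read is a dictionary with hypothetical keys `chrom`, `pos`, `strand`, `mate_chrom` and `mate_pos`; production tools use more refined rules (for example, choosing which copy to keep based on base quality).

```python
def mark_duplicates(reads):
    """Flag a read as a duplicate if an earlier read shares its coordinates.

    Two reads count as duplicates when they and their mates map to
    identical positions; the first read seen with a given key is kept.
    """
    seen = set()
    flags = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"],
               read["mate_chrom"], read["mate_pos"])
        flags.append(key in seen)
        seen.add(key)
    return flags

def duplicate_rate(flags):
    """Fraction of reads flagged as duplicates."""
    return sum(flags) / len(flags) if flags else 0.0

# Hypothetical reads, for illustration only.
reads = [
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 400},
    {"chrom": "1", "pos": 100, "strand": "+", "mate_chrom": "1", "mate_pos": 450},
]
flags = mark_duplicates(reads)
print(flags, duplicate_rate(flags))
```

The third read starts at the same position as the first two but its mate does not, so it is not flagged. This also shows why the rate rises with depth: the more reads are sampled, the more often unrelated fragments share both coordinates by chance.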

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (0.5M to 5.5M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome, stacked columns. X-axis: chromosome. Y-axis: percentage of reads (0% to 100%). Every column is labelled 100% mapped and 0% unmapped.]
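
A per-chromosome breakdown like the one in this chart can be sketched by tallying reads according to the chromosome they are assigned to. The sketch assumes each record is a hypothetical `(chromosome, is_unmapped)` tuple, with unmapped reads already carrying their mate's chromosome as described above.

```python
from collections import defaultdict

def per_chromosome_fractions(records):
    """Return {chromosome: (mapped_fraction, unmapped_fraction)}."""
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for chrom, is_unmapped in records:
        counts[chrom][1 if is_unmapped else 0] += 1
    return {
        chrom: (mapped / (mapped + unmapped), unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }

# Hypothetical records, for illustration only.
records = [("1", False), ("1", False), ("1", True), ("X", False)]
fractions = per_chromosome_fractions(records)
print(fractions)
```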