European Genome-Phenome Archive

File Quality

File Information: EGAF00004199174

File Data

Base Coverage Distribution

This chart shows the base coverage distribution across the reference. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered, and the y-axis shows the number of bases with that coverage.

The y-axis uses a log scale so that both very large and very small counts remain readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value, 0 to >1000. Y-axis: # bases, log scale from 1k to 100M.]
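
As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions have each coverage value and pools everything above 1000 into one overflow bin, as the chart's ">1000" tick suggests. The function name, the `cap` parameter, and the example depths are hypothetical; the report does not state how the chart is actually computed.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed by cap + 1,
    mirroring the ">1000" bin on the chart's x-axis (an assumption).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return hist


# Hypothetical per-position depths, for illustration only.
example_depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500]
hist = coverage_histogram(example_depths)
```

Plotting `hist` with a logarithmic y-axis would give the shape described above: a tall bar at low coverage and a long, thin tail.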

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score, 0 to 40. Y-axis: # bases, 0G to 12G.]
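
To make the score scale concrete, the following minimal sketch converts a Phred score back into an error probability using the standard Phred definition Q = -10·log10(P). The function name is illustrative and not part of any EGA tooling.

```python
def phred_to_error_prob(q):
    """Return the error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


# Error probabilities for a few common scores.
probs = {q: phred_to_error_prob(q) for q in (10, 20, 30, 40)}
```

Under this definition each step of 10 in Q divides the error probability by ten, so Q = 40 corresponds to one expected error in 10,000 base calls.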

Mapped Reads

Number of reads in the sample successfully mapped to the reference genome (singletons plus reads with both mates mapped). Genetic variation, in particular structural variation, ensures that every sequenced sample differs from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more substantial variation a read can fail to be placed. The mapped read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99.7 % (127 887 199 reads). Unmapped: 0.3 %.
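
The rate calculation can be sketched as below. The function name and the toy counts passed to it are illustrative. The two read counts in the consistency check are the ones reported on this page for Both Mates Mapped and Singletons.

```python
def mapped_rate(mapped, unmapped):
    """Proportion of mapped reads among all reads: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Toy counts, for illustration only.
toy_rate = mapped_rate(997, 3)

# Counts reported on this page: mapped reads are singletons plus
# reads whose mate is also mapped.
both_mates_mapped = 127_494_212
singletons = 392_987
mapped_total = both_mates_mapped + singletons
```

The sum of the two reported counts equals the 127 887 199 mapped reads shown in this section, which is consistent with the definition of mapped reads as singletons plus reads with both mates mapped.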

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance, which the Proper Pairs chart treats separately.

Both mates mapped: 99.4 % (127 494 212 reads). Remaining reads: 0.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (392 987 reads). Remaining reads: 99.7 %.
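
The categories used in this report (mapped, both mates mapped, singleton, forward strand, proper pair, duplicate) correspond to bits in the SAM FLAG field. The sketch below assumes the statistics are derived from those flags, which this page does not state. The bit values come from the SAM specification, while the function and category names are hypothetical.

```python
# FLAG bit values as defined in the SAM format specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
REVERSE = 0x10
DUPLICATE = 0x400


def classify(flag):
    """Return the set of report categories for a read with this SAM flag."""
    if flag & UNMAPPED:
        return {"unmapped"}
    cats = {"mapped", "reverse" if flag & REVERSE else "forward"}
    if flag & PAIRED:
        if flag & MATE_UNMAPPED:
            # Mapped read whose mate is unmapped.
            cats.add("singleton")
        else:
            cats.add("both_mates_mapped")
            if flag & PROPER_PAIR:
                cats.add("proper_pair")
    if flag & DUPLICATE:
        cats.add("duplicate")
    return cats


first_mate_proper = classify(99)   # paired, proper pair, mate reverse, first in pair
lonely_mate = classify(73)         # paired, mate unmapped, first in pair
```

Counting how often each category appears over all reads, and dividing by the relevant total, would give percentages of the kind shown in these sections.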

Forward Strand

Fraction of reads mapped to the forward DNA strand. The DNA library preparation step is generally expected to generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them should map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (64 158 842 reads). Reverse strand: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length and 100 base pair reads are sequenced from either end, the two reads are expected to map ~300 base pairs apart (500 - 2 × 100). If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.9 % (125 679 242 reads). Remaining reads: 2.1 %.
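
The arithmetic behind the expected separation in the example above (500 base pair fragments, 100 base pair reads) can be written out as a small sketch. Both function names and the `tolerance` parameter are hypothetical. Real aligners estimate the expected distance from the data and apply their own criteria.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when each end of the fragment is read once."""
    return fragment_length - 2 * read_length


def within_expected_distance(observed_gap, expected_gap, tolerance):
    """Hypothetical check: is the observed gap close enough to the expected one?"""
    return abs(observed_gap - expected_gap) <= tolerance


gap = expected_inner_distance(fragment_length=500, read_length=100)
```

With these inputs `gap` is 300, the separation quoted in the text. A pair separated by several thousand base pairs would fail such a check for any reasonable tolerance.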

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. Analyses of sequencing data assume that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, so it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. In addition, as the sequencing depth increases, it becomes increasingly likely that reads from distinct fragments will map to the same location and be marked as duplicates even though they are not. Consequently, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 4.8 % (6 106 042 reads). Non-duplicates: 95.2 %.
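
The coordinate-based approach described above can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm of any particular duplicate-marking tool: real tools also handle soft clipping, base-quality tie-breaking, and optical duplicates. The tuple layout and the example pairs are hypothetical.

```python
def mark_duplicates(pairs):
    """Flag pairs whose coordinates match an earlier pair.

    Each pair is a tuple (chrom, pos, strand, mate_chrom, mate_pos, mate_strand).
    Returns one boolean per pair: True if an identical tuple was seen before.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags


# Hypothetical read pairs, for illustration only.
pairs = [
    ("chr1", 100, "+", "chr1", 350, "-"),
    ("chr1", 100, "+", "chr1", 350, "-"),  # same coordinates as the first pair
    ("chr1", 100, "+", "chr1", 360, "-"),  # mate maps elsewhere: not a duplicate
    ("chr2", 5, "-", "chr2", 1, "+"),
]
dup_flags = mark_duplicates(pairs)
duplicate_rate = sum(dup_flags) / len(dup_flags)
```

Because the key is purely positional, two independent fragments that happen to share coordinates are flagged as well, which is the depth-dependent overcounting the text warns about.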

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not belong at the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score, 0 to 60. Y-axis: # reads, 10M to 110M.]
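
For the opposite direction, the sketch below turns a probability of misplacement into a Phred-scaled mapping quality using the standard definition Q = -10·log10(P). The cap of 60 is an assumption chosen to match the "typically around 60" noted above. Aligners differ in their maximum score and in how they estimate P, and the function name is illustrative.

```python
import math


def error_prob_to_phred(p, cap=60):
    """Phred-scale a probability p of wrong placement, rounded and capped at `cap`."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if p == 0.0:
        # log10(0) is undefined; treat a zero error probability as the maximum score.
        return cap
    return min(cap, round(-10 * math.log10(p)))


scores = [error_prob_to_phred(p) for p in (1.0, 0.1, 0.001, 1e-9)]
```

A read that is certainly misplaced (P = 1) gets a score of 0, while very small probabilities saturate at the cap.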

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosomes 1 to 21, X, Y, M. Y-axis: 0% to 100%. Mapped fractions range from 99.51% to 99.72% and unmapped fractions from 0.28% to 0.49%.]