European Genome-Phenome Archive

File Quality

File Information

EGAF00002869168

File Data

Base Coverage Distribution

This chart represents the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) that have each coverage value.

The data are plotted on a log scale so that the wide range of counts remains readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

Chart: Base coverage distribution. X-axis: Coverage value (up to >1000). Y-axis: # Bases, log scale (1k to 100M).
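
As an illustration of how such a distribution could be tallied, the following is a minimal sketch, assuming per-position depths are already available as a list of integers. The function name, the `cap` parameter and the sample depths are hypothetical and are not taken from the EGA pipeline.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many reference positions have each coverage value.

    `depths` is an iterable of per-position read depths (hypothetical input).
    Values above `cap` share one overflow bucket keyed by `cap + 1`,
    mirroring the ">1000" bin at the right edge of the chart.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(sorted(hist.items()))


# Made-up depths for ten positions; two of them exceed the cap.
depths = [0, 0, 1, 1, 1, 2, 30, 30, 1500, 2500]
hist = coverage_histogram(depths)
```

Plotting the counts in `hist` on a logarithmic y-axis would give the shape described above: a tall bar at low coverage and a long descending tail.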

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.

Chart: Base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 14G).
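
The standard Phred definition is Q = -10 × log10(P). A small sketch of the conversion in both directions may make the scale easier to read; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score `q` is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability `p` (0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


# Q40 is one expected error in 10,000 base calls; Q20 is one in 100.
p_q40 = phred_to_error_prob(40)
p_q20 = phred_to_error_prob(20)
```

Every 10-point increase in Q corresponds to a tenfold drop in the error probability.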

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but for more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads against the total number of reads: mapped / (mapped + unmapped).

Mapped reads: 99.7 % (138 898 092 reads). Unmapped: 0.3 %.
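
The rate described above is the mapped count divided by the total, mapped / (mapped + unmapped). A minimal sketch follows; the function name and the counts used are made up for illustration.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return mapped / total


# Hypothetical counts: 997 mapped reads and 3 unmapped reads.
rate = mapped_rate(997, 3)

# Without parentheses the expression means (mapped / mapped) + unmapped,
# which is why the denominator is grouped explicitly above.
ungrouped = 997 / 997 + 3
```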

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance; that criterion is covered by the Proper Pairs chart.

Both mates mapped: 99.6 % (138 771 064 reads). Other reads: 0.4 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (127 028 reads). Other reads: 99.9 %.
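
In SAM/BAM files, the distinction between pairs with both mates mapped and singletons can be read from the FLAG bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The sketch below is one possible way to classify a read from its flag; the function name and category labels are hypothetical, secondary and supplementary alignments are ignored, and it is not known whether the EGA pipeline counts reads this way.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_pair_status(flag):
    """Label a read as both_mapped, singleton, unmapped or unpaired."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # this read mapped, its mate did not
    return "unmapped"


# 99 = paired, proper pair, mate reverse, first in pair
# 73 = paired, mate unmapped, first in pair
# 133 = paired, read unmapped, second in pair
labels = [classify_pair_status(f) for f in (99, 73, 133, 0)]
```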

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (69 643 970 reads). Other reads: 50 %.
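
One way to quantify a deviation from the expected 50% is a simple binomial z-score. This is a sketch under the assumption that each read lands on either strand independently with probability 0.5; the function name and the counts are illustrative, and this check is not part of the report itself.

```python
import math


def strand_balance(forward, reverse):
    """Return (forward fraction, z-score against an expected 50/50 split)."""
    n = forward + reverse
    if n == 0:
        raise ValueError("no mapped reads")
    fraction = forward / n
    # Under a fair 50/50 split the forward count has mean n/2, variance n/4.
    z = (forward - n / 2) / math.sqrt(n / 4)
    return fraction, z


# Hypothetical counts: 60 forward reads and 40 reverse reads.
fraction, z = strand_balance(60, 40)
```

At sequencing depths of many millions of reads, even a tiny imbalance produces a large z-score, so the size of the deviation from 50% is usually more informative than the z-score alone.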

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant, and such reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of bacterial contamination, for example.

Proper pairs: 98.1 % (136 624 190 reads). Other reads: 1.9 %.
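
The arithmetic in the example above (a ~500 bp fragment with a 100 bp read at each end leaves ~300 bp between the reads) can be written out as a short sketch. The distance check is a simplified stand-in: real aligners estimate the fragment size distribution from the data and also check read orientation, and the `tolerance` parameter here is hypothetical.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len


def distance_is_plausible(observed_fragment_len, expected_fragment_len, tolerance):
    """True if the observed fragment length is within `tolerance` of expected."""
    return abs(observed_fragment_len - expected_fragment_len) <= tolerance


gap = expected_inner_distance(500, 100)
near = distance_is_plausible(520, 500, tolerance=100)
far = distance_is_plausible(5000, 500, tolerance=100)
```

A pair such as the last one, mapping roughly 5 kb apart, would fail the distance check and so would not count as a proper pair.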

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 8.3 % (11 510 087 reads). Other reads: 91.7 %.
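
The coordinate-based approach described above can be sketched as follows, assuming each read is represented by a tuple of its own and its mate's mapping coordinates. The tuple layout, function names and sample reads are hypothetical. Production duplicate-marking tools apply further rules (for example, deciding which copy to keep), which this sketch omits.

```python
def mark_duplicates(reads):
    """Return one boolean per read: True if an earlier read shares its key.

    Each read is a tuple (chrom, pos, strand, mate_chrom, mate_pos).
    The first read seen at a given key is kept; later ones are duplicates.
    """
    seen = set()
    flags = []
    for key in reads:
        flags.append(key in seen)
        seen.add(key)
    return flags


def duplicate_rate(reads):
    """Fraction of reads flagged as duplicates."""
    flags = mark_duplicates(reads)
    return sum(flags) / len(flags) if flags else 0.0


reads = [
    ("1", 1000, "+", "1", 1300),
    ("1", 1000, "+", "1", 1300),  # same coordinates as the first read
    ("1", 1000, "+", "1", 1350),  # mate maps elsewhere: not a duplicate
    ("2", 5000, "-", "2", 4700),
]
rate = duplicate_rate(reads)
```

Because the key depends only on coordinates, two independent fragments that happen to share both endpoints are also flagged, which is the over-counting effect at high depth noted above.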

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to (specifically, Q = -10 × log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location). The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, since reads originating from repetitive elements in the genome will plausibly map to multiple locations.

Chart: Mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (10M to 120M).
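
To summarise such a distribution numerically, one option is to report the share of reads at mapping quality 0 and the share at or above a chosen cutoff. The sketch below assumes the mapping qualities are available as a list of integers; the function name, the `threshold` default and the sample values are illustrative only.

```python
from collections import Counter


def mapq_summary(mapqs, threshold=30):
    """Fractions of reads with MAPQ 0 and with MAPQ >= threshold."""
    if not mapqs:
        raise ValueError("no mapping qualities given")
    counts = Counter(mapqs)
    total = len(mapqs)
    high = sum(c for q, c in counts.items() if q >= threshold)
    return {"fraction_zero": counts[0] / total, "fraction_high": high / total}


# Made-up qualities: two ambiguous reads, one weak, seven confident.
summary = mapq_summary([0, 0, 10, 60, 60, 60, 60, 60, 60, 60])

# With the standard Phred scale, MAPQ 60 corresponds to a one-in-a-million
# chance that the read is placed at the wrong location.
p_wrong_at_60 = 10 ** (-60 / 10)
```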

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Although the sequenced sample may be from a female, it is possible to get reads on the Y chromosome, as the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto another, it will be given the right chromosome and position, but it will also be classified as unmapped.

Chart: Mapped vs unmapped reads per chromosome (stacked columns). X-axis labels: 1 to 21, X, Y, M. Y-axis: 0% to 100%. Across all columns, mapped ranges from 99.9% to 99.94% and unmapped from 0.06% to 0.1%.
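
A per-chromosome tally of this kind could be computed along the following lines. This is a sketch assuming each record is a (chromosome, FLAG) pair in which an unmapped read with a mapped mate carries its mate's chromosome, as described above. The function name and the records are made up, and this is not the EGA implementation.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped fraction} from (chromosome, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}


records = [
    ("1", 99), ("1", 147),   # a properly mapped pair
    ("1", 73), ("1", 133),   # a singleton and its unmapped mate
    ("X", 99), ("X", 147),
]
rates = per_chromosome_mapped_rate(records)
```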