European Genome-Phenome Archive

File Quality

File Information

EGAF00004841146

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with each coverage value.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
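As a rough sketch of how such a distribution can be tallied: given a list of per-position depths, count how many positions share each coverage value, then take log10 of the counts for plotting. The `depths` input, the function names, and the cap of 1000 (mirroring the ">1000" bin on the chart) are assumptions for illustration; the report does not describe the archive's own pipeline.

```python
import math
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for d in depths:
        hist[min(d, cap + 1)] += 1  # key cap + 1 stands for the ">cap" bin
    return hist

def log_scaled(hist):
    """Return log10 of each count, as it would appear on a log-scale y-axis."""
    return {cov: math.log10(n) for cov, n in hist.items()}

# Hypothetical depths for ten reference positions.
depths = [0, 0, 1, 1, 1, 2, 2, 3, 5, 2000]
hist = coverage_histogram(depths)
```

Here `hist[1]` counts the positions covered exactly once, and `hist[1001]` collects everything above the cap.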

[Chart: number of bases (log scale, 1k to 100M) by coverage value (0 to >1000)]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. an error in the sequencing. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect a distribution skewed towards larger (more confident) scores, typically around 40.
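To make the scale concrete, here is a minimal sketch of the conversion in both directions under the standard Phred definition Q = -10 log10(P). The function names are illustrative and not part of any EGA tool.

```python
import math

def phred_to_error_prob(q):
    """Probability that the call is wrong, given Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 20 to 1 in 100.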

[Chart: number of bases (0G to 16G) by Phred quality score (0 to 40)]

Mapped Reads

Number of reads successfully mapped (singletons and both mates) to the reference genome in the sample. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but for more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads against the total number of reads: mapped / (mapped + unmapped).
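The rate calculation can be sketched as follows. The function name and the counts in the example are hypothetical; they are not taken from this file.

```python
def mapped_rate(mapped, unmapped):
    """Percentage of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarise")
    return 100.0 * mapped / total

# Hypothetical counts: 996 mapped reads and 4 unmapped reads.
rate = mapped_rate(996, 4)
```

Note the parentheses around the denominator: without them the expression would evaluate as (mapped / mapped) + unmapped.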

Mapped: 99.6 % (151 976 452 reads); unmapped: 0.4 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance, which the Proper Pairs chart treats separately.

Both mates mapped: 99.5 % (151 726 126 reads); other: 0.5 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
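As an illustration only, assuming the alignments are stored in SAM/BAM format (this report does not say how the categories were computed), the distinction between a singleton and a read whose mate also mapped can be read from the FLAG bits defined in the SAM specification. The function name and category labels are hypothetical.

```python
# FLAG bits from the SAM format specification.
PAIRED = 0x1         # read is part of a pair
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify_read(flag):
    """Classify one alignment record by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate is unmapped
    return "both_mates_mapped"
```

For example, a FLAG of 73 (paired, mate unmapped, first in pair) is classified as a singleton, while 99 (paired, proper pair, mate on reverse strand, first in pair) falls under both mates mapped.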

Singletons: 0.2 % (250 326 reads); other: 99.8 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (76 267 507 reads); reverse strand: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads may map with an unexpected separation; this is a signal for the variant, and such reads would not be considered proper pairs. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.
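The ~300 base pair figure in the example above follows from subtracting both read lengths from the fragment length. A minimal sketch, with a hypothetical function name:

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when each end of the fragment is read."""
    return fragment_length - 2 * read_length

# Values from the example in the text: ~500 bp fragments, 100 bp reads.
gap = expected_inner_distance(500, 100)
```

Real libraries have a spread of fragment lengths, so aligners compare the observed separation against a range rather than a single value.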

Proper pairs: 97.7 % (149 076 674 reads); other: 2.3 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
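The coordinate-matching idea can be sketched as below. This is a simplified illustration, not the method of any particular duplicate-marking tool: the `Read` record and its fields are hypothetical, and real tools also weigh base qualities, clipping, and library information when choosing which copy to keep.

```python
from collections import namedtuple

# Hypothetical minimal read record; field names are illustrative.
Read = namedtuple("Read", "name chrom pos strand mate_chrom mate_pos")

def mark_duplicates(reads):
    """Return names of reads whose own and mate coordinates were already seen."""
    seen = set()
    duplicates = []
    for r in reads:
        key = (r.chrom, r.pos, r.strand, r.mate_chrom, r.mate_pos)
        if key in seen:
            duplicates.append(r.name)  # later copy of an already seen fragment
        else:
            seen.add(key)
    return duplicates

reads = [
    Read("r1", "1", 1000, "+", "1", 1300),
    Read("r2", "1", 1000, "+", "1", 1350),  # same start, different mate position
    Read("r3", "1", 1000, "+", "1", 1300),  # same coordinates as r1
]
dups = mark_duplicates(reads)
```

In this example only `r3` is reported, because `r2` shares a start position with `r1` but its mate maps elsewhere.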

Duplicates: 12.2 % (18 643 973 reads); non-duplicates: 87.8 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: number of reads (10M to 130M) by Phred mapping quality score (0 to 60)]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. It can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.

[Stacked column chart: percentage of mapped and unmapped reads (0% to 100%) per chromosome; mapped values range from 99.82% to 99.89%, unmapped values from 0.11% to 0.18%]