European Genome-Phenome Archive

File Quality

File Information: EGAF00002340508

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The data are shown on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
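As an illustration of how such a distribution could be built, here is a minimal sketch. It assumes per-position depths are already available as a list of integers; the function names and the overflow bin label are hypothetical and are not taken from the EGA pipeline.

```python
import math
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin."""
    hist = Counter()
    for depth in depths:
        key = depth if depth <= cap else f">{cap}"
        hist[key] += 1
    return dict(hist)

def log10_counts(hist):
    """Log-scale the counts, as a plot with a logarithmic y-axis would."""
    return {key: math.log10(count) for key, count in hist.items()}

# Toy depths, not data from this file.
example = coverage_histogram([0, 0, 1, 1, 1, 30, 1500], cap=1000)
scaled = log10_counts(example)
```

The overflow bin mirrors the ">1000" tick on the chart's x-axis.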

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: # bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but the distribution is expected to be skewed towards larger (more confident) scores, typically around 40.
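To make the relationship between score and error probability concrete, the sketch below converts in both directions. It assumes the standard Phred definition, Q = -10 log10(P); the function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

p40 = phred_to_error_prob(40)    # about 1 error in 10 000 calls
q30 = error_prob_to_phred(0.001)  # about Phred 30
```

Under this definition, each 10-point increase in score corresponds to a tenfold lower error probability.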

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # bases (0G to 16G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads over the total number of reads: mapped / (mapped + unmapped).
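The calculation can be sketched as follows. The function name and the toy counts are illustrative assumptions; they are not this file's actual read counts.

```python
def mapped_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Toy counts: 997 mapped reads and 3 unmapped reads.
rate = mapped_rate(997, 3)
```

The parentheses around the denominator matter: the unmapped reads are part of the total, not added after the division.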

Mapped: 99.7 % (161 757 331 reads); unmapped: 0.3 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other; that criterion is covered by the Proper Pairs chart.

Both mates mapped: 99.4 % (161 297 742 reads); other: 0.6 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
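One way to tell singletons apart from pairs where both mates mapped is to inspect the FLAG field of each alignment record. The sketch below assumes the data are in SAM/BAM format and uses the standard SAM flag bits; the category names and the `classify` function are hypothetical and not part of the EGA report.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def classify(flag):
    """Assign a read to a coarse mapping category based on its SAM flag."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate did not map
    return "both_mapped"

labels = [classify(f) for f in (99, 73, 133, 0)]
```

Flag 73 (paired, mate unmapped, first in pair) is the typical singleton case; flag 133 is its unmapped mate.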

Singletons: 0.3 % (459 589 reads); other: 99.7 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.
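A minimal sketch of the forward-strand fraction, assuming SAM flags as input. The report does not say whether the denominator is mapped reads or all reads; using mapped reads only is an assumption made here, and the function name is illustrative.

```python
UNMAPPED = 0x4  # SAM flag bit: read is unmapped
REVERSE = 0x10  # SAM flag bit: read maps to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Toy flags: two forward reads, two reverse reads, one unmapped read.
fraction = forward_fraction([0, 16, 99, 147, 4])
```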

Forward strand: 50 % (81 158 130 reads); remaining reads: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with an unexpected separation; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (as a result of bacterial contamination, for example).
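The separation in the example follows from simple arithmetic: fragment length minus the two read lengths. The sketch below reproduces it and adds a distance check; the tolerance parameter and both function names are hypothetical, since the report does not say how aligners decide what counts as the expected distance.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when each end of the fragment is read once."""
    return fragment_length - 2 * read_length

def within_expected_distance(observed_gap, expected_gap, tolerance):
    """True if the observed gap is within `tolerance` bases of the expected gap."""
    return abs(observed_gap - expected_gap) <= tolerance

gap = expected_inner_distance(500, 100)  # 500 - 2 * 100 = 300
```

A pair mapping 5 000 bases apart would fail this check with a tolerance of 100, which is the kind of signal a large structural variant produces.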

Proper pairs: 98 % (159 098 988 reads); other: 2 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
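The coordinate-based approach described above can be sketched as follows. This is a deliberate simplification: real duplicate-marking tools usually also take strand and orientation into account and choose which copy to keep based on quality. The tuple layout and function names are assumptions for illustration.

```python
def mark_duplicates(pairs):
    """Flag a pair as duplicate if an earlier pair has identical coordinates.

    Each pair is a tuple (chrom, pos, mate_chrom, mate_pos).
    """
    seen = set()
    is_duplicate = []
    for key in pairs:
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate

def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates."""
    flags = mark_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0

toy_pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),  # same read and mate coordinates: duplicate
    ("1", 100, "1", 450),  # same start, different mate position: kept
    ("2", 100, "2", 400),
]
flags = mark_duplicates(toy_pairs)
rate = duplicate_rate(toy_pairs)
```

Because the key is purely positional, two independent fragments that happen to share coordinates would also be flagged, which is the depth-related effect described above.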

Duplicates: 7.9 % (12 775 985 reads); other: 92.1 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. So the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # reads (20M to 140M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is attributed to a chromosome when one mate of the pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position, but it is also classified as unmapped.
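A per-chromosome tally along these lines can be sketched as follows, assuming each record is reduced to its reference name and SAM flag. Records with reference name "*" (no chromosome assigned) are skipped; the record layout and function name are illustrative assumptions.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: read is unmapped

def per_chromosome_rates(records):
    """Return {chrom: (mapped_fraction, unmapped_fraction)}.

    Each record is a tuple (reference_name, flag). An unmapped read placed
    with its mate carries the mate's reference name, so it is counted there.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue  # read has no chromosome assigned
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (mapped / (mapped + unmapped), unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }

toy_records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
rates = per_chromosome_rates(toy_records)
```

In the toy data, the read with flag 133 is unmapped but carries chromosome "1" from its mate (flag 73), so it contributes to the unmapped share of chromosome 1.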

[Chart: mapped vs unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome   Mapped    Unmapped
1            99.7%     0.3%
2            99.7%     0.3%
3            99.7%     0.3%
4            99.71%    0.29%
5            99.71%    0.29%
6            99.71%    0.29%
7            99.71%    0.29%
8            99.71%    0.29%
9            99.7%     0.3%
10           99.7%     0.3%
11           99.7%     0.3%
12           99.71%    0.29%
13           99.71%    0.29%
14           99.71%    0.29%
15           99.7%     0.3%
16           99.73%    0.27%
17           99.71%    0.29%
18           99.7%     0.3%
19           99.73%    0.27%
20           99.69%    0.31%
21           99.72%    0.28%
X            99.71%    0.29%
Y            99.8%     0.2%
M            99.6%     0.4%