European Genome-Phenome Archive

File Quality

File Information: EGAF00006164797

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases (positions) with that coverage.

The counts are plotted on a log scale to compress their wide range. A high peak at the beginning (low coverage) followed by a descending curve is expected.
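As a rough illustration of how such a distribution could be tabulated, the sketch below counts how many positions share each coverage value and then takes log10 of the counts. The function name, the `cap` parameter (mirroring the chart's ">1000" bin) and the sample depths are assumptions made for this example, not part of the EGA pipeline.

```python
from collections import Counter
import math


def coverage_histogram(depths, cap=1000):
    """Count positions per coverage value; values above `cap` share one bin (cap + 1)."""
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths for a tiny reference (not taken from the report).
depths = [0, 1, 1, 2, 2, 2, 3, 30, 1500]
hist = coverage_histogram(depths)

# Log-scaled counts, as used for the y-axis of the chart.
log_counts = {coverage: math.log10(count) for coverage, count in hist.items()}
```

Here `hist[1001]` plays the role of the ">1000" bin.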

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000); y-axis: number of bases (log scale, 1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but in all cases we expect it to be skewed towards larger (more confident) scores, typically around 40.
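The relationship between score and error probability can be sketched in a few lines, using the standard Phred definition Q = -10·log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly one expected error in 10,000 base calls.
p40 = phred_to_error_prob(40)
```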

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: number of bases (0G to 60G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variation, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. Therefore the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
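The rate calculation amounts to the following minimal sketch. The counts are hypothetical and chosen only to illustrate the arithmetic; the unmapped count for this file is not stated in the report.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped), or 0.0 when there are no reads."""
    total = mapped + unmapped
    return mapped / total if total else 0.0


# Hypothetical counts, not taken from the report.
rate = mapped_rate(mapped=9_990, unmapped=10)
print(f"{rate:.1%}")
```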

Mapped: 455 332 090 reads (99.9 %). Unmapped: 0.1 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also included here; the Proper Pairs chart counts only pairs that do.

Both mates mapped: 454 966 704 reads (99.8 %). Other: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
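In SAM/BAM files, the distinction between singletons and reads with both mates mapped can be read from the FLAG field. The sketch below uses the standard flag bits 0x1 (read paired), 0x4 (read unmapped) and 0x8 (mate unmapped). The function name and category labels are illustrative, and this is not necessarily how the report itself computes the rates.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def pair_status(flag):
    """Classify a read by its SAM FLAG value."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"  # mapped read whose mate failed to map
    return "both_mapped"


# Example flags: 99 = paired, proper pair, mate reverse, first in pair;
# 73 = paired, mate unmapped, first in pair.
statuses = [pair_status(f) for f in (99, 73, 133)]
```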

Singletons: 365 386 reads (0.1 %). Other: 99.9 %.

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 227 887 245 reads (50 %). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with a large separation would be a signal of the variant, and those reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
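The ~300 base pair figure follows from subtracting both read lengths from the fragment length. A minimal sketch of that arithmetic is given below; the `tolerance` parameter is an assumption introduced for illustration, since aligners derive their own acceptable range.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length


def has_expected_separation(observed, fragment_length, read_length, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap."""
    expected = expected_inner_distance(fragment_length, read_length)
    return abs(observed - expected) <= tolerance


# The example from the text: ~500 bp fragments, 100 bp reads.
gap = expected_inner_distance(500, 100)
```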

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 442 863 022 reads (97.2 %). Other: 2.8 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered when looking at the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
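The coordinate-matching idea can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: real duplicate markers typically also take strand and orientation into account and choose which copy to keep by base quality. The record layout and read names are hypothetical.

```python
def find_duplicates(reads):
    """Return names of reads whose (position, mate position) was already seen.

    `reads` is an iterable of (name, chrom, pos, mate_chrom, mate_pos) tuples.
    The first read seen at a given key is kept; later ones are reported.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in reads:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical reads: r2 shares both coordinates with r1; r3 differs in mate position.
reads = [
    ("r1", "1", 1000, "1", 1400),
    ("r2", "1", 1000, "1", 1400),
    ("r3", "1", 1000, "1", 1450),
]
dups = find_duplicates(reads)
```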

Duplicates: 199 813 004 reads (43.8 %). Other: 56.2 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high scores (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.
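One way to summarise such a distribution is the fraction of reads at or above a chosen score. The sketch below assumes the histogram is available as a mapping from score to read count; the histogram values and the function name are hypothetical, not taken from this file.

```python
def fraction_at_or_above(hist, threshold):
    """Fraction of reads whose mapping quality is >= threshold."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    return sum(count for score, count in hist.items() if score >= threshold) / total


# Hypothetical histogram: mapping quality score -> number of reads.
hist = {0: 5, 30: 5, 60: 90}
well_mapped = fraction_at_or_above(hist, 60)

# Using Q = -10 * log10(P), a score of 60 implies a misplacement probability of about 1e-6.
p_wrong = 10 ** (-60 / 10)
```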

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: number of reads (50M to 400M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is female, reads can still be assigned to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position but is also classified as unmapped.
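A per-chromosome tally of this kind could be sketched as below, assuming each record carries the chromosome name and the SAM FLAG value, and using the standard 0x4 (read unmapped) bit. An unmapped read placed at its mate's coordinates is counted against the mate's chromosome. The records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def per_chromosome_mapped_rate(records):
    """Return {chromosome: mapped / (mapped + unmapped)} from (chrom, flag) pairs."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {chrom: m / (m + u) for chrom, (m, u) in counts.items()}


# Hypothetical records; flag 133 is an unmapped read reported on its mate's chromosome.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
rates = per_chromosome_mapped_rate(records)
```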

Chromosome  Mapped   Unmapped
1           99.92%   0.08%
2           99.89%   0.11%
3           99.93%   0.07%
4           99.93%   0.07%
5           99.93%   0.07%
6           99.93%   0.07%
7           99.92%   0.08%
8           99.93%   0.07%
9           99.92%   0.08%
10          99.92%   0.08%
11          99.92%   0.08%
12          99.93%   0.07%
13          99.93%   0.07%
14          99.92%   0.08%
15          99.92%   0.08%
16          99.92%   0.08%
17          99.91%   0.09%
18          99.92%   0.08%
19          99.89%   0.11%
20          99.91%   0.09%
21          99.92%   0.08%
X           99.92%   0.08%
Y           99.95%   0.05%
M           99.78%   0.22%