European Genome-Phenome Archive

File Quality

File Information

EGAF00008414375

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

The number of bases is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
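
As an illustration of how such a distribution can be built, the following minimal sketch counts how many positions share each coverage value. The per-position depths are hypothetical, and pooling everything above 1000 into one overflow bin is an assumption based on the ">1000" label on the chart axis; it is not the archive's actual implementation.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into a single overflow bin keyed `cap + 1`.
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist


# Hypothetical per-position depths for a tiny reference (not taken from this file).
depths = [0, 0, 1, 1, 1, 2, 30, 1500, 2000]
hist = coverage_histogram(depths)
```

Each key of `hist` is a coverage value (x-axis) and each count is the number of bases with that coverage (y-axis).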

[Chart: base coverage distribution. X-axis: coverage value (0 to >1000). Y-axis: # bases, log scale (1k to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores describing the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution depends on the sequencing technology, but in all cases we expect it to be skewed towards larger (more confident) scores, typically around 40.
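
To make the scale concrete, here is a small sketch that converts between Phred scores and error probabilities using the standard Phred scaling, Q = -10 log10(P). The function names are illustrative only.

```python
import math


def phred_to_error_probability(q):
    """Return the error probability P for a Phred score Q (P = 10^(-Q/10))."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Return the Phred score Q for an error probability P (Q = -10 log10 P)."""
    return -10 * math.log10(p)
```

Under this scaling a score of 40 corresponds to an error probability of 1 in 10 000, and a score of 30 to 1 in 1 000.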

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # bases (0G to 60G).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
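
The rate calculation can be sketched as follows. The counts in the example are hypothetical round numbers, not values from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Hypothetical counts: 998 mapped reads and 2 unmapped reads.
rate = mapped_rate(998, 2)
```

With these counts `rate` is 0.998, which would be displayed as 99.8 %.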

Mapped: 99.8 % (428 700 150 reads). Remainder: 0.2 %.

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both of the mates successfully map to the reference genome.

Note that this count also includes pairs whose mates do not map at the expected distance from each other (compare the Proper Pairs chart).

Both mates mapped: 99.8 % (428 324 846 reads). Remainder: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (375 304 reads). Remainder: 99.9 %.
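
One way to separate these categories is by the SAM flag bits of each alignment record. The sketch below assumes the input is SAM/BAM and uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). Whether this report derives its counts this way is an assumption, and the category names are illustrative.

```python
FLAG_PAIRED = 0x1         # read is part of a pair
FLAG_UNMAPPED = 0x4       # this read is unmapped
FLAG_MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify_read(flag):
    """Classify one alignment record by its SAM flag."""
    if flag & FLAG_UNMAPPED:
        return "unmapped"
    if flag & FLAG_PAIRED and flag & FLAG_MATE_UNMAPPED:
        return "singleton"
    if flag & FLAG_PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


# Counts shown on this page: both mates mapped plus singletons equals mapped reads.
both_mates = 428_324_846
singletons = 375_304
mapped_total = both_mates + singletons
```

The sum `mapped_total` is 428 700 150, which matches the count shown in the Mapped Reads chart.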

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (214 676 916 reads). Remainder: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. In particular, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation would be that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, mates that map with an unexpected separation would be a signal for this variant, and the reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.
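
The arithmetic in the example above can be written out as a short sketch. The tolerance parameter is hypothetical; real aligners estimate the acceptable range from the observed fragment-size distribution.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both ends of a fragment are sequenced."""
    return fragment_length - 2 * read_length


def has_expected_separation(observed_gap, fragment_length, read_length, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is a hypothetical, user-chosen bound for this illustration.
    """
    expected = expected_inner_distance(fragment_length, read_length)
    return abs(observed_gap - expected) <= tolerance


gap = expected_inner_distance(500, 100)
```

For 500 bp fragments and 100 bp reads, `gap` is 300 bp, matching the figure given in the text.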

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low as a result of, for example, bacterial contamination.

Proper pairs: 97.8 % (419 771 084 reads). Remainder: 2.2 %.

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and both should be considered together when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
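
The coordinate-based approach described above can be sketched as follows. This is a simplified illustration on hypothetical read pairs: it keeps the first pair seen at each coordinate combination and flags later ones, whereas real duplicate-marking tools take more information into account (for example strand and base qualities).

```python
def find_duplicates(pairs):
    """Return names of pairs whose read and mate coordinates were already seen.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos) tuples.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical read pairs (not taken from this file).
pairs = [
    ("pair_a", "1", 1000, "1", 1400),
    ("pair_b", "1", 1000, "1", 1400),  # same coordinates as pair_a
    ("pair_c", "1", 1000, "1", 1450),  # same start, different mate position
    ("pair_d", "2", 1000, "2", 1400),
]
dups = find_duplicates(pairs)
duplicate_rate = len(dups) / len(pairs)
```

Here only `pair_b` is flagged, giving a duplicate rate of 25% for this toy input.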

Duplicates: 20.5 % (88 231 909 reads). Remainder: 79.5 %.

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # reads (up to 350M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when the alignment is performed with bwa, because it concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.
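
A per-chromosome tally along these lines can be sketched from (reference name, flag) pairs. It assumes SAM-style records in which an unmapped read with a mapped mate carries its mate's reference name, and it skips records with no reference name ("*"). The records below are hypothetical and the function is illustrative, not the report's implementation.

```python
from collections import defaultdict

FLAG_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped


def per_chromosome_rates(records):
    """Return {chromosome: (mapped_percent, unmapped_percent)}.

    `records` is an iterable of (rname, flag) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue  # no placement, so it cannot be assigned to a chromosome
        counts[rname][1 if flag & FLAG_UNMAPPED else 0] += 1
    return {
        rname: (100 * mapped / (mapped + unmapped),
                100 * unmapped / (mapped + unmapped))
        for rname, (mapped, unmapped) in counts.items()
    }


# Hypothetical records: flag 133 is an unmapped second mate placed with its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
rates = per_chromosome_rates(records)
```

In this toy input chromosome 1 has three mapped reads and one unmapped read placed with its mate, so its column would show 75% mapped and 25% unmapped.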

Chromosome   Mapped    Unmapped
1            99.91%    0.09%
2            99.89%    0.11%
3            99.92%    0.08%
4            99.92%    0.08%
5            99.92%    0.08%
6            99.92%    0.08%
7            99.92%    0.08%
8            99.92%    0.08%
9            99.91%    0.09%
10           99.91%    0.09%
11           99.91%    0.09%
12           99.91%    0.09%
13           99.92%    0.08%
14           99.92%    0.08%
15           99.91%    0.09%
16           99.91%    0.09%
17           99.91%    0.09%
18           99.92%    0.08%
19           99.9%     0.1%
20           99.91%    0.09%
21           99.91%    0.09%
X            99.91%    0.09%
Y            99.93%    0.07%
M            99.82%    0.18%