European Genome-Phenome Archive

File Quality

File Information

EGAF00008487349

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The y-axis uses a log scale so that the wide range of counts stays readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: Base Coverage Distribution. x-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 200 to 20M).]
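A coverage distribution of this kind can be sketched as a simple tally over per-position depths. The snippet below is a minimal illustration, not the archive's own pipeline: the function name, the `cap` parameter, and the sample depths are all hypothetical, and the overflow bin simply mirrors the ">1000" label on the chart's x-axis.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` share one overflow bin, keyed as cap + 1,
    mirroring the ">1000" bin on the chart's x-axis (an assumption).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return dict(sorted(hist.items()))


# Hypothetical per-position depths for a tiny reference.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2300]
hist = coverage_histogram(depths)
print(hist)
```

Plotting the resulting counts on a log-scaled y-axis would give the shape described above: a peak at low coverage and a long descending tail.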

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. The distribution varies with the sequencing technology, but it is expected to be skewed towards larger (more confident) scores, typically around 40.

[Chart: Base Quality. x-axis: Phred quality score (0 to 40); y-axis: # Bases (0G to 2.2G).]
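To make the relationship between score and error probability concrete, here is a small sketch using the standard Phred definition, Q = -10 log10(P). The function names are illustrative and not part of any particular library.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

Under this definition a score of 40, the typical peak mentioned above, corresponds to roughly one wrong call in 10,000 bases.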

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences from the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

99.8 % (44 531 121 reads); remaining 0.2 %
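The ratio mapped / (mapped + unmapped) can be sketched from SAM FLAG values, where bit 0x4 marks an unmapped read. This is a minimal sketch under assumptions: the function name and example flags are hypothetical, and how the archive treats secondary or supplementary alignments is not stated in this report.

```python
UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def mapped_rate_from_flags(flags):
    """Return mapped / (mapped + unmapped) for a collection of SAM FLAGs."""
    flags = list(flags)
    if not flags:
        raise ValueError("no reads supplied")
    mapped = sum(1 for flag in flags if not flag & UNMAPPED)
    return mapped / len(flags)


# Hypothetical flags: three mapped reads (99, 147, 73), three unmapped.
example_flags = [99, 147, 73, 133, 77, 141]
rate = mapped_rate_from_flags(example_flags)
print(f"mapped rate: {rate:.1%}")
```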

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; see the Proper Pairs chart for pairs that map at the expected distance.

99.7 % (44 470 200 reads); remaining 0.3 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls inside the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

0.1 % (60 921 reads); remaining 99.9 %
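The distinction between a singleton and a read with both mates mapped can be sketched with the SAM FLAG bits for "paired" (0x1), "read unmapped" (0x4), and "mate unmapped" (0x8). The category labels and function name below are illustrative choices, not the archive's terminology for its internal processing.

```python
PAIRED = 0x1         # read comes from paired-end sequencing
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the other mate is unmapped


def classify_read(flag):
    """Classify one read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mates_mapped"
    if read_mapped:
        return "singleton"
    return "unmapped"


for flag in (99, 73, 133, 77):
    print(flag, classify_read(flag))
```

Counting the "singleton" label over all reads and dividing by the total read count would give a rate comparable to the one reported in this section.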

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them should map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

50 % (22 308 838 reads); remaining 50 %
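As a rough sketch, the forward-strand fraction can be computed from SAM FLAG values, where bit 0x10 marks a read aligned to the reverse strand. The z-score helper is an added assumption, not something this report uses: it treats each read as an independent coin flip, which real libraries only approximate, so it is a heuristic at best.

```python
import math

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped
REVERSE = 0x10  # SAM FLAG bit: read aligned to the reverse strand


def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [flag for flag in flags if not flag & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads supplied")
    forward = sum(1 for flag in mapped if not flag & REVERSE)
    return forward / len(mapped)


def strand_z_score(forward, total):
    """Deviation from a 50/50 split in standard deviations (binomial model)."""
    return (forward - total / 2) / math.sqrt(total / 4)


print(forward_fraction([0, 16, 0, 16, 4]))
print(strand_z_score(60, 100))
```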

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, the reads map with a separation that differs markedly from the expected one; this is a signal for the variant, and the reads are not considered a proper pair. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

98 % (43 725 990 reads); remaining 2 %
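The arithmetic in the example above (a ~500 bp fragment with 100 bp reads from either end leaves ~300 bp between the reads) can be written out as follows. The `tolerance` parameter and both function names are hypothetical; aligners typically estimate the acceptable range from the observed insert size distribution rather than from a fixed value.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both ends of a fragment are read."""
    return fragment_length - 2 * read_length


def has_expected_separation(observed, expected, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap."""
    return abs(observed - expected) <= tolerance


expected = expected_inner_distance(500, 100)
print(expected)
print(has_expected_separation(320, expected, tolerance=100))
print(has_expected_separation(5000, expected, tolerance=100))
```

A full proper-pair check would also verify the relative orientation of the two mates, which this sketch leaves out.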

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-marking algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the two should be considered together when looking at the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that distinct reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

82.9 % (37 006 429 reads); remaining 17.1 %
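The coordinate-based approach described above can be sketched as grouping read pairs by the positions of both mates and flagging every pair after the first in each group. This is a deliberately simplified illustration with hypothetical field names; production duplicate markers also take details such as strand, clipping, and base quality into account.

```python
def mark_duplicates(pairs):
    """Return one boolean per pair: True if an earlier pair shares its coordinates.

    Each pair is a dict with hypothetical keys: chrom, pos, mate_chrom, mate_pos.
    """
    seen = set()
    is_duplicate = []
    for pair in pairs:
        key = (pair["chrom"], pair["pos"], pair["mate_chrom"], pair["mate_pos"])
        is_duplicate.append(key in seen)
        seen.add(key)
    return is_duplicate


pairs = [
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1300},
    {"chrom": "1", "pos": 2000, "mate_chrom": "1", "mate_pos": 2310},
    {"chrom": "1", "pos": 1000, "mate_chrom": "1", "mate_pos": 1300},
]
flags = mark_duplicates(pairs)
duplicate_rate = sum(flags) / len(flags)
print(flags, duplicate_rate)
```

The sketch also shows why depth matters: the more pairs are sampled, the more likely two independent fragments are to collide on the same key.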

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so this distribution is expected to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: Mapping Quality Distribution. x-axis: Phred quality score (0 to 60); y-axis: # Reads (up to 40M).]
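One way to summarise such a distribution is to report the share of reads at mapping quality 0 (ambiguous placement) and the share at or above a chosen high score. The sketch below uses hypothetical values; the threshold of 60 follows the "typically around 60" remark above, but the maximum score depends on the aligner.

```python
from collections import Counter


def mapq_summary(mapqs, threshold=60):
    """Fraction of reads with MAPQ 0 and with MAPQ >= threshold."""
    mapqs = list(mapqs)
    if not mapqs:
        raise ValueError("no mapping qualities supplied")
    hist = Counter(mapqs)
    high = sum(count for q, count in hist.items() if q >= threshold)
    return {
        "frac_zero": hist[0] / len(mapqs),
        "frac_high": high / len(mapqs),
    }


# Hypothetical mapping qualities for ten reads.
summary = mapq_summary([60, 60, 60, 0, 30, 60, 60, 0, 60, 60])
print(summary)
```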

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but per chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of the pair is aligned and the other is not; the unmapped read is then given the same chromosome and POS as its mate. It can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.

[Chart: Mapped vs Unmapped. Stacked columns per reference sequence (x-axis labels: 1 to 21, X, Y, M); y-axis: 0% to 100%; series: mapped, unmapped. Mapped fractions range from 99.81% to 99.89%, unmapped from 0.11% to 0.19%.]
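A per-chromosome tally like the one charted here can be sketched from the SAM RNAME and FLAG fields. Because an unmapped read with a mapped mate conventionally carries its mate's RNAME and POS, it is counted under that chromosome. The function name and records below are hypothetical, and reads with no chromosome at all (RNAME `*`) are skipped.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def per_chromosome_counts(records):
    """Tally mapped and unmapped reads per chromosome.

    `records` is an iterable of (rname, flag) tuples from SAM RNAME and FLAG.
    """
    counts = defaultdict(lambda: {"mapped": 0, "unmapped": 0})
    for rname, flag in records:
        if rname == "*":
            continue  # read has no chromosome assignment
        state = "unmapped" if flag & UNMAPPED else "mapped"
        counts[rname][state] += 1
    return {chrom: dict(tally) for chrom, tally in counts.items()}


records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
counts = per_chromosome_counts(records)
for chrom, tally in counts.items():
    total = tally["mapped"] + tally["unmapped"]
    print(chrom, f"{tally['mapped'] / total:.2%} mapped")
```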