European Genome-Phenome Archive

File Quality

File Information: EGAF00004841094

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered, and the y-axis shows the number of bases with that coverage.

The number of bases is shown on a log scale so that the wide range of counts stays readable. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: coverage value (1 to >1000). Y-axis: number of bases (log scale, 1k to 100M).]
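As a rough illustration of how a distribution like this could be tallied, the sketch below counts positions per coverage value and pools everything above a cap into one overflow bin, similar to the ">1000" bin on the chart. The function name `coverage_histogram`, the cap and the toy depths are assumptions made for this example, not the archive's actual pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin stored under
    the key cap + 1, mirroring a '>1000' bin on the x-axis.
    """
    hist = Counter()
    for d in depths:
        hist[d if d <= cap else cap + 1] += 1
    return hist

# Toy per-position depths (hypothetical values, not taken from this file).
depths = [1, 1, 2, 2, 2, 30, 1500, 2000]
hist = coverage_histogram(depths)
```

Plotting `hist` with a logarithmic y-axis would give a curve of the kind described above.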

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in each case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0G to 16G).]
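The relationship between a Phred score and an error probability can be sketched in a few lines. This uses the standard Phred definition, Q = -10 log10(P); the function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, using P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P, using Q = -10 * log10(P)."""
    return -10 * math.log10(p)

p20 = phred_to_error_prob(20)
p40 = phred_to_error_prob(40)
```

Under this definition, a score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to a 1 in 10 000 chance.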

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted but, for more significant variation, the read can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads against the total number of reads: mapped / (mapped + unmapped).

Mapped: 99.5 % (141 632 729 reads). Unmapped: 0.5 %.
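A minimal sketch of the mapped / (mapped + unmapped) calculation, assuming the input is a list of SAM FLAG values and that a read counts as unmapped when bit 0x4 is set. Whether the archive also filters secondary or supplementary alignments is not stated, so this sketch counts every record it is given; the flag list is a toy example.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped

def mapped_rate(flags):
    """Return mapped / (mapped + unmapped) for a collection of SAM flags."""
    flags = list(flags)
    if not flags:
        raise ValueError("no reads supplied")
    mapped = sum(1 for f in flags if not f & UNMAPPED)
    return mapped / len(flags)

# Toy flags: three mapped reads (99, 147, 73) and one unmapped read (133).
rate = mapped_rate([99, 147, 73, 133])
```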

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that reads that do not map at the expected distance are also included here, unlike in the Proper Pairs chart, where the expected distance is taken into account.

Both mates mapped: 99.4 % (141 449 238 reads). Other: 0.6 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way in which a singleton could occur is if the sample has a large insertion compared with the reference genome: one mate can fall in the sequence flanking the insertion and will be mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (183 491 reads). Other: 99.9 %.
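The distinction between a singleton and a read whose mate is also mapped can be sketched with SAM FLAG bits. The bit values below are the standard ones (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are illustrative assumptions.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # read itself is unmapped
MATE_UNMAPPED = 0x8  # mate of this read is unmapped

def classify(flag):
    """Label a read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "singleton"
    return "both_mates_mapped"

# Toy flags: 99 (pair mapped), 73 (mapped, mate unmapped), 133 (unmapped).
labels = [classify(f) for f in (99, 73, 133)]
```

In this report the two mapped categories add up to the mapped total: 141 449 238 reads with both mates mapped plus 183 491 singletons gives 141 632 729 mapped reads.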

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step will generate DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (71 175 763 reads). Reverse strand: 50 %.
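A small sketch of how a forward-strand fraction could be computed from SAM FLAG values, using bit 0x10 (read aligned to the reverse strand). Restricting the denominator to mapped reads is an assumption of this example; the report does not say which denominator it uses.

```python
UNMAPPED = 0x4  # read itself is unmapped
REVERSE = 0x10  # read is aligned to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads supplied")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Toy flags: 99 and 163 are forward, 147 and 83 are reverse.
fraction = forward_fraction([99, 147, 163, 83])
```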

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the expectation is that the two reads map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads mapping with an unexpected separation would be a signal for this variant and would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 97.4 % (138 625 728 reads). Other: 2.6 %.
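The expected-distance arithmetic from the example above (500 bp fragments, 100 bp reads, ~300 bp gap) can be written out as a short sketch. The fixed `tolerance` of 100 bp is a hypothetical value chosen for illustration; aligners typically derive the acceptable range from the observed insert-size distribution and also check read orientation, which this sketch ignores.

```python
def expected_gap(fragment_len, read_len):
    """Unsequenced distance between the two mates of one fragment."""
    return fragment_len - 2 * read_len

def looks_proper(observed_gap, fragment_len, read_len, tolerance=100):
    """True if the observed gap is within `tolerance` bp of the expected gap."""
    return abs(observed_gap - expected_gap(fragment_len, read_len)) <= tolerance

gap = expected_gap(500, 100)
near = looks_proper(320, 500, 100)   # close to the expected separation
far = looks_proper(5000, 500, 100)   # far outside it
```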

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. It is important to note that as the sequencing depth increases, more reads are sampled from the DNA library, and consequently it is increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it is also increasingly likely that reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 12.5 % (17 801 718 reads). Non-duplicates: 87.5 %.
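The coordinate-based approach described above can be sketched as follows: a pair is treated as a duplicate if an earlier pair had the same read coordinate and the same mate coordinate. The tuple layout and function name are assumptions for this example, and real duplicate-marking tools apply further criteria that are not modelled here.

```python
def mark_duplicates(pairs):
    """Flag every pair after the first that shares a coordinate signature.

    Each pair is a tuple (chrom, pos, mate_chrom, mate_pos).
    Returns a list of booleans, True where the pair is treated as a duplicate.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)
        seen.add(key)
    return flags

# Toy pairs (hypothetical coordinates); the second repeats the first.
pairs = [
    ("1", 100, "1", 400),
    ("1", 100, "1", 400),
    ("1", 100, "1", 450),
    ("2", 100, "2", 400),
]
dup_flags = mark_duplicates(pairs)
dup_rate = sum(dup_flags) / len(dup_flags)
```

The sketch also shows why depth matters: the more pairs are drawn, the more likely it is that two independent fragments share the same signature and are flagged anyway.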

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location, so the larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (10M to 120M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It is the same representation as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

An unmapped read is assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when the alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome. X-axis: chromosome (autosomes, X, Y, M). Y-axis: percentage of reads (0% to 100%). Mapped values range from 99.86% to 99.9% per chromosome; unmapped values range from 0.1% to 0.14%.]
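A per-chromosome breakdown like this could be tallied as in the sketch below, assuming each record is a (reference name, SAM FLAG) pair and that an unmapped read placed alongside its mate carries the mate's reference name, as described above. The record layout and function name are illustrative assumptions.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped

def per_chromosome_rates(records):
    """Return {rname: (mapped_pct, unmapped_pct)} for (rname, flag) records."""
    counts = defaultdict(lambda: [0, 0])  # rname -> [mapped, unmapped]
    for rname, flag in records:
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        rname: (100 * m / (m + u), 100 * u / (m + u))
        for rname, (m, u) in counts.items()
    }

# Toy records: chromosome 1 has one unmapped read (flag 133) placed with its mate.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 99), ("X", 147)]
rates = per_chromosome_rates(records)
```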