European Genome-Phenome Archive

File Quality

File Information

EGAF00001306955

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases that have that coverage.

The number of bases is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000). Y-axis: # Bases (log scale, 1k to 100M).]
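The distribution described above can be sketched in a few lines. This is a minimal illustration, not the archive's actual pipeline: the function name `coverage_distribution` and the input (a plain list of per-position depths) are assumptions, and the overflow bin simply mirrors the ">1000" label on the chart's x-axis.

```python
from collections import Counter

def coverage_distribution(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed `cap + 1`
    (an assumed convention that mimics the chart's ">1000" bin).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(sorted(hist.items()))

# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2300]
hist = coverage_distribution(depths)
print(hist)
```

Plotting the counts on a log axis then keeps the sparse high-coverage bins visible next to the much larger low-coverage bins.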

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly called, i.e. a sequencing error. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the base call is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40). Y-axis: # Bases (0G to 9G).]
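As a quick worked example of the Phred scale, the sketch below converts between scores and error probabilities using the standard definition Q = -10·log10(P). The function names are illustrative, not part of any particular library.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

A score of 40 therefore corresponds to roughly one expected error in 10,000 calls. The same conversion applies to mapping qualities, where the probability refers to the read being placed at the wrong location.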

Mapped Reads

Number of reads in the sample that were successfully mapped (singletons and reads with both mates mapped) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are accepted, but with more significant variation the read can fail to be placed. The mapped reads rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads out of the total number of reads: mapped / (mapped + unmapped).

Mapped: 99.9 % (100 583 829 reads). Unmapped: 0.1 %.
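The rate calculation above is simple enough to write out directly. The function name and the example counts below are hypothetical and chosen only to show the arithmetic.

```python
def mapping_rate(mapped, unmapped):
    """Fraction of reads that mapped: mapped / (mapped + unmapped)."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, not taken from this file.
rate = mapping_rate(mapped=999, unmapped=1)
print(f"{rate:.1%}")
```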

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here; the expected distance is only taken into account in the Proper Pairs chart.

Both mates mapped: 99.8 % (100 516 320 reads). Remaining reads: 0.2 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot be mapped to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.1 % (67 509 reads). Remaining reads: 99.9 %.
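In SAM/BAM files, the distinction between reads with both mates mapped and singletons can be read off the FLAG field. The sketch below uses the flag bits defined in the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped); the function name and category labels are illustrative assumptions, and this is not necessarily how the archive computes its statistics.

```python
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def pair_status(flag):
    """Classify one read by the mapping state of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        return "both_mapped"
    if read_mapped:
        return "singleton"  # mapped read whose mate is unmapped
    return "unmapped"

for flag in (99, 73, 133):
    print(flag, pair_status(flag))
```

Counting the `singleton` reads and dividing by the total number of reads gives a rate comparable to the one shown above.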

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (50 339 282 reads). Remaining reads: 50 %.
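Strand is also encoded in the SAM FLAG field: bit 0x10 marks a read aligned to the reverse strand. The following sketch computes the forward fraction among mapped reads; the function name is hypothetical, and restricting the denominator to mapped reads is an assumption about how the statistic is defined.

```python
UNMAPPED = 0x4   # this read is unmapped
REVERSE = 0x10   # read is aligned to the reverse strand

def forward_fraction(flags):
    """Fraction of mapped reads that align to the forward strand."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        raise ValueError("no mapped reads")
    forward = sum(1 for f in mapped if not f & REVERSE)
    return forward / len(mapped)

# Hypothetical flags: two forward, two reverse, one unmapped.
print(forward_fraction([99, 147, 83, 163, 77]))
```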

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion, reads spanning it will map at an unexpected separation; this is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, each read in the fragment is also expected to have a particular orientation.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low (for example, as a result of bacterial contamination).

Proper pairs: 96.3 % (96 932 894 reads). Remaining reads: 3.7 %.
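The distance arithmetic in the example above (500 bp fragments, 100 bp reads, ~300 bp between the reads) can be written out as below. The function names and the fixed tolerance are assumptions for illustration; real aligners typically estimate the acceptable range from the observed insert-size distribution rather than from a fixed window.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates when both ends of a fragment are read."""
    return fragment_length - 2 * read_length

def separation_is_expected(observed, expected, tolerance):
    """True if an observed gap lies within +/- tolerance of the expected gap."""
    return abs(observed - expected) <= tolerance

gap = expected_inner_distance(fragment_length=500, read_length=100)
print(gap)  # 300
print(separation_is_expected(observed=320, expected=gap, tolerance=100))
print(separation_is_expected(observed=5000, expected=gap, tolerance=100))
```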

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. Note that as the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. As a result, the true duplicate rate is not independent of the depth, and the depth should be taken into account when interpreting the duplicate rate. Additionally, as the sequencing depth increases, it also becomes increasingly likely that independent reads will map to the same location and be marked as duplicates even though they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 7.6 % (7 619 938 reads). Remaining reads: 92.4 %.
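The coordinate-based idea described above can be sketched as follows. This is a deliberately simplified illustration with a hypothetical record layout and function name; production duplicate-marking tools apply further rules (for example, which copy to keep) that are not modelled here.

```python
def find_duplicates(reads):
    """Return names of reads whose coordinates repeat an earlier read.

    Each read is a tuple (assumed layout):
    (name, chrom, pos, strand, mate_chrom, mate_pos).
    The first read seen at a coordinate key is kept; later ones are duplicates.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, strand, mate_chrom, mate_pos in reads:
        key = (chrom, pos, strand, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates

# Hypothetical reads: r2 repeats the coordinates of r1, r3 does not.
reads = [
    ("r1", "1", 1000, "+", "1", 1300),
    ("r2", "1", 1000, "+", "1", 1300),
    ("r3", "1", 1000, "+", "1", 1450),
]
print(find_duplicates(reads))
```

Because the key includes the mate position, two reads that merely start at the same coordinate are not flagged unless their mates coincide as well.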

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10·log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards a high value (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60). Y-axis: # Reads (10M to 90M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is female, it is possible to get reads on the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. They can also arise when alignment is performed with bwa, because it concatenates all the reference sequences together; if a read hangs off the end of one reference sequence onto the next, it is given the right chromosome and position, but it is also classified as unmapped.

[Chart: mapped vs unmapped reads per chromosome (stacked columns, 0% to 100%). Mapped: 99.91% to 99.95% per chromosome. Unmapped: 0.05% to 0.09% per chromosome.]
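A per-chromosome breakdown like this one can be sketched by grouping alignment records on their reference name and checking the unmapped bit (0x4) of the SAM FLAG. The function name and the `(chrom, flag)` record layout are assumptions for illustration; reads with no chromosome assigned (reference name `*` in SAM) are skipped.

```python
from collections import defaultdict

UNMAPPED = 0x4  # this read is unmapped

def per_chromosome_mapped_rate(records):
    """Map each chromosome to its fraction of mapped reads.

    `records` is an iterable of (chrom, flag) pairs. Unmapped reads that
    carry a chromosome (e.g. placed at their mate's position) count as
    unmapped for that chromosome.
    """
    counts = defaultdict(lambda: [0, 0])  # chrom -> [mapped, unmapped]
    for chrom, flag in records:
        if chrom == "*":
            continue  # no chromosome assigned
        counts[chrom][1 if flag & UNMAPPED else 0] += 1
    return {c: m / (m + u) for c, (m, u) in counts.items()}

# Hypothetical records, for illustration only.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("X", 0), ("*", 77)]
print(per_chromosome_mapped_rate(records))
```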