European Genome-Phenome Archive

File Quality

File Information: EGAF00000659677

File Data

Base Coverage Distribution

This chart shows the distribution of base coverage along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference is covered. The y-axis shows the number of bases with that coverage.

The data are plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.

[Chart: Base Coverage Distribution. X-axis: coverage value (0 to >1000). Y-axis: number of bases (log scale, 10 to 10M).]
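As a rough illustration of how such a distribution can be tallied, the sketch below counts positions per coverage value and pools everything above a cap into one overflow bin, mirroring the chart's ">1000" tick. The function name, the `cap` parameter and the sample depths are hypothetical; the report does not say how EGA computes the chart.

```python
from collections import Counter


def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Values above `cap` are pooled into one overflow bin keyed `cap + 1`
    (an assumption made here to mimic a ">1000" bucket).
    """
    hist = Counter()
    for depth in depths:
        hist[depth if depth <= cap else cap + 1] += 1
    return dict(hist)


# Hypothetical per-position depths, for illustration only.
depths = [0, 0, 1, 1, 1, 2, 5, 1500, 2000]
hist = coverage_histogram(depths)
```

Plotting the counts on a logarithmic y-axis keeps both the large low-coverage bins and the small high-coverage bins visible on the same chart.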

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. that a sequencing error occurred. Specifically, Q = -10 · log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in general the distribution should be skewed towards larger (more confident) scores, typically around 40.

[Chart: Base Quality. X-axis: Phred quality score (0 to 40). Y-axis: number of bases (0M to 180M).]
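The relationship between a Phred score and an error probability can be checked with a short sketch. The function names are illustrative, and the standard Phred definition Q = -10 · log10(P) is assumed.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)


p40 = phred_to_error_prob(40)    # about 1 error in 10,000 calls
q30 = error_prob_to_phred(0.001)  # about Q30
```

Under this definition, each increase of 10 in the score corresponds to a tenfold drop in the error probability.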

Mapped Reads

Number of reads in the sample that were successfully mapped to the reference genome (singletons and reads with both mates mapped). Genetic variation, in particular structural variants, ensures that every sequenced sample is genetically different from the reference genome it was aligned to. Small differences against the reference are tolerated, but with more significant variation a read can fail to be placed. The mapped-read rate is therefore not expected to reach 100%, but it should be high (usually >90%). The rate is calculated as the proportion of mapped reads among all reads: mapped / (mapped + unmapped).

Mapped: 99 % (8 080 662 reads). Unmapped: 1 %.
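The rate formula above can be sketched as follows. The function name and the example counts passed to it are illustrative only, since the report does not give the unmapped read count. The final check uses figures that do appear in this report: the both-mates-mapped and singleton counts add up to the mapped total.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total


# Illustrative counts, not taken from the report.
rate = mapped_rate(mapped=990, unmapped=10)

# Counts quoted in this report: both mates mapped + singletons = mapped.
consistent = (8_057_624 + 23_038) == 8_080_662
```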

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here. The expected-distance criterion is applied in the Proper Pairs chart.

Both mates mapped: 98.7 % (8 057 624 reads). Other: 1.3 %.

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. A singleton can occur, for example, if the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, while the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors severe enough to prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).

Singletons: 0.3 % (23 038 reads). Other: 99.7 %.
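If the alignments are stored in SAM/BAM format, the categories described above can be read off each record's FLAG field. The sketch below uses the standard SAM flag bits for "paired", "read unmapped" and "mate unmapped". That this report is derived from those flags is an assumption, and the category labels are made up for illustration.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify(flag):
    """Classify a read by its SAM FLAG into a mapping category."""
    if flag & READ_UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "singleton"  # this read mapped, its mate did not
    if flag & PAIRED:
        return "both_mates_mapped"
    return "mapped_unpaired"


labels = [classify(f) for f in (99, 73, 133, 0)]
```

Here 99 is a typical flag for a mapped first mate in a proper pair, 73 for a mapped first mate whose partner is unmapped, and 133 for an unmapped second mate.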

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so that after mapping the reads to the reference genome approximately 50% of them map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (4 080 438 reads). Other: 50 %.

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains large structural variants, e.g. a large insertion, reads mapping with a large separation would be a signal for the variant, and such reads would not be considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.

Proper pairs: 98.4 % (8 029 436 reads). Other: 1.6 %.
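The ~300 base pair figure in the example above follows from subtracting both read lengths from the fragment length. A minimal sketch of that arithmetic is shown below, with a hypothetical tolerance check added. The function names and the tolerance value are assumptions, and real aligners use their own criteria for flagging proper pairs.

```python
def expected_inner_distance(fragment_len, read_len):
    """Gap between the two mates: fragment length minus both reads."""
    return fragment_len - 2 * read_len


def looks_proper(observed_gap, fragment_len, read_len, tolerance):
    """Hypothetical check: is the observed gap within `tolerance` bp
    of the expected inner distance?"""
    expected = expected_inner_distance(fragment_len, read_len)
    return abs(observed_gap - expected) <= tolerance


gap = expected_inner_distance(fragment_len=500, read_len=100)
near = looks_proper(320, 500, 100, tolerance=100)
far = looks_proper(5000, 500, 100, tolerance=100)
```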

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, each observation (i.e. each read) is assumed to be independent, an assumption that fails in the presence of duplicate reads. Typically, duplicate-detection algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, so it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and the two should be considered together. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. Consequently, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.

Duplicates: 33 % (2 697 163 reads). Other: 67 %.
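The coordinate-based idea described above can be sketched in a simplified form: a pair is flagged as a duplicate when another pair with the same read and mate coordinates has already been seen. The tuple layout and names are hypothetical, and production tools apply further criteria (for example strand and orientation) that are left out here.

```python
def find_duplicates(pairs):
    """Return the names of pairs whose (chrom, pos, mate_chrom, mate_pos)
    was already seen earlier in the input.

    `pairs` is an iterable of (name, chrom, pos, mate_chrom, mate_pos).
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, mate_chrom, mate_pos in pairs:
        key = (chrom, pos, mate_chrom, mate_pos)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Illustrative records, not taken from the report.
pairs = [
    ("r1", "1", 1000, "1", 1400),
    ("r2", "1", 1000, "1", 1400),  # same coordinates as r1
    ("r3", "1", 1000, "1", 1450),  # same start, different mate position
    ("r4", "2", 1000, "2", 1400),
]
dups = find_duplicates(pairs)
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is the false-positive effect the paragraph above attributes to high depth.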

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 · log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so the distribution is expected to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: Mapping Quality Distribution. X-axis: Phred quality score (0 to 60). Y-axis: number of reads (0.5M to 6.5M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, but broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share common regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. This can also happen when alignment is performed with bwa, which concatenates all the reference sequences together: if a read hangs off the end of one reference sequence onto another, it is given the right chromosome and position but is also classified as unmapped.

[Chart: Mapped vs Unmapped. X-axis: chromosome (labelled 1 to 21, X, Y, M). Y-axis: percentage of reads (0% to 100%), stacked as mapped and unmapped. Every column is labelled 100% mapped and 0% unmapped.]