European Genome-Phenome Archive

File Quality

File Information: EGAF00003613153

File Data

Base Coverage Distribution

This chart shows the base coverage distribution along the reference file. The x-axis shows the coverage value, i.e. the number of times a position in the reference file is covered. The y-axis shows the number of bases with that coverage.

Data is plotted on a log scale to compress the wide range of values. A high peak at the beginning (low coverage) followed by a descending curve is expected.
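As a rough illustration of how a distribution like this could be tallied, the sketch below counts how many positions share each coverage value and pools very deep positions into one overflow bin. The function name, the input list of per-position depths and the cap of 1000 are assumptions made for the example, not the archive's actual pipeline.

```python
from collections import Counter

def coverage_histogram(depths, cap=1000):
    """Count how many positions have each coverage value.

    Depths above `cap` are pooled into one overflow bin keyed as cap + 1,
    similar to a ">1000" bin on a chart axis (assumed behaviour).
    """
    hist = Counter()
    for depth in depths:
        hist[min(depth, cap + 1)] += 1
    return hist

# Hypothetical per-position depths, for illustration only.
depths = [0, 1, 1, 2, 2, 2, 5, 1500, 2000]
hist = coverage_histogram(depths)
for value in sorted(hist):
    label = ">1000" if value > 1000 else str(value)
    print(label, hist[value])
```

Plotting the counts on a logarithmic y-axis keeps both the large low-coverage bins and the small high-coverage bins visible.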

[Chart: base coverage distribution. X-axis: Coverage value (0 to >1000); y-axis: # Bases (log scale, 10 to 100M).]

Base Quality

The base quality distribution shows the Phred quality scores, which describe the probability that a nucleotide has been incorrectly assigned, i.e. a sequencing error. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the nucleotide is wrong. The larger the score, the more confident we are in the base call. Different sequencing technologies produce different distributions, but in every case we expect the distribution to be skewed towards larger (more confident) scores, typically around 40.
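To make the score scale concrete, here is a minimal sketch of the conversion in both directions, using the standard Phred definition Q = -10 log10(P). The function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this definition a score of 40 corresponds to an error probability of about 1 in 10 000.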

[Chart: base quality distribution. X-axis: Phred quality score (0 to 40); y-axis: # Bases (0M to 800M).]

Mapped Reads

Number of reads in the sample successfully mapped (singletons and both mates) to the reference genome. Genetic variation, in particular structural variants, ensures that every sequenced sample differs genetically from the reference genome it was aligned to. Small differences against the reference are tolerated, but reads with more significant variation can fail to be placed. Therefore, the mapped reads rate is not expected to reach 100%, but it should be high (usually >90%). The rate is the proportion of mapped reads among all reads: mapped / (mapped + unmapped).
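The calculation described above can be sketched as follows. The function name is hypothetical and the counts in the usage line are made-up values, not figures from this file.

```python
def mapped_rate(mapped, unmapped):
    """Return mapped / (mapped + unmapped) as a fraction between 0 and 1."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to compute a rate from")
    return mapped / total

# Hypothetical counts, for illustration only.
rate = mapped_rate(mapped=9_500, unmapped=500)
print(f"{rate:.1%}")
```

Note the parentheses: the denominator is the total number of reads, not the mapped count alone.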

Mapped reads: 99.7 % (16 417 331 reads); remainder: 0.3 %

Both Mates Mapped

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. This chart shows the fraction of reads in pairs where both mates successfully map to the reference genome.

Note that pairs whose mates do not map at the expected distance are also counted here, which the Proper Pairs chart does not do.

Both mates mapped: 99.4 % (16 372 914 reads); remainder: 0.6 %

Singletons

When working with paired-end sequencing, each DNA fragment is sequenced from both ends, creating two mates for each pair. If one mate in the pair successfully maps to the reference genome but the other is unmapped, the mapped mate is a singleton. One way a singleton can occur is when the sample has a large insertion compared with the reference genome: one mate falls in the sequence flanking the insertion and is mapped, but the other falls in the inserted sequence and so cannot map to the reference genome. There are unlikely to be many such structural variants in the sample, or many sequencing errors that would prevent a read from mapping. Consequently, the singleton rate is expected to be very low (<1%).
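Assuming the alignments are stored in SAM/BAM format, a singleton as defined above can be recognised from the FLAG field of each record. The bit values below come from the SAM specification. The function name is hypothetical, and whether this report derives its counts this way is an assumption.

```python
# FLAG bits as defined in the SAM specification.
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped

def is_singleton(flag):
    """True if the read is paired and mapped while its mate is unmapped."""
    return (
        bool(flag & PAIRED)
        and not (flag & UNMAPPED)
        and bool(flag & MATE_UNMAPPED)
    )

print(is_singleton(73))   # paired, mapped, mate unmapped
print(is_singleton(99))   # paired, both mates mapped
print(is_singleton(133))  # paired, this read unmapped
```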

Singletons: 0.3 % (44 417 reads); remainder: 99.7 %

Forward Strand

Fraction of reads mapped to the forward DNA strand. The general expectation is that the DNA library preparation step generates DNA from the forward and reverse strands in equal amounts, so after mapping the reads to the reference genome, approximately 50% of them will map to the forward strand. Deviations from 50% may be due to problems with the library preparation step.

Forward strand: 50 % (8 232 494 reads); remainder: 50 %

Proper Pairs

A fragment consisting of two mates is called a proper pair if both mates map to the reference genome at the expected distance from each other. For example, if the DNA library consists of fragments ~500 base pairs in length, and 100 base pair reads are sequenced from either end, the two reads are expected to map to the reference genome separated by ~300 base pairs. If the sequenced sample contains a large structural variant, e.g. a large insertion or deletion, read pairs spanning it map with an unexpected separation. This is a signal for the variant, and such reads are not considered proper pairs. Depending on the sequencing technology, there is also an expected orientation for each read in the fragment.

The rate of proper pairs is expected to be well over 90%, even if the mapping rate itself is low, for example as a result of bacterial contamination.
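The fragment arithmetic in this section (a ~500 bp fragment with 100 bp reads leaves ~300 bp between the mates) can be written out as a small sketch. The function names and the tolerance parameter are assumptions for illustration. Real aligners estimate the expected range from the data.

```python
def expected_inner_distance(fragment_length, read_length):
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length

def within_expected_distance(observed_gap, fragment_length, read_length, tolerance):
    """True if the observed gap is within `tolerance` bp of the expected gap.

    `tolerance` is a hypothetical parameter chosen by the caller.
    """
    expected = expected_inner_distance(fragment_length, read_length)
    return abs(observed_gap - expected) <= tolerance

print(expected_inner_distance(500, 100))
print(within_expected_distance(320, 500, 100, tolerance=50))
print(within_expected_distance(5000, 500, 100, tolerance=50))
```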

Proper pairs: 98.9 % (16 281 534 reads); remainder: 1.1 %

Duplicates

PCR duplicates are two (or more) reads that originate from the same DNA fragment. When sequencing data is analyzed, it is assumed that each observation (i.e. each read) is independent, an assumption that fails in the presence of duplicate reads. Typically, algorithms look for reads that map to the same genomic coordinate and whose mates also map to identical genomic coordinates. As the sequencing depth increases, more reads are sampled from the DNA library, and it becomes increasingly likely that duplicate reads will be sampled. The true duplicate rate is therefore not independent of the depth, and depth should be taken into account when interpreting it. Additionally, as the sequencing depth increases, it is also increasingly likely that distinct reads will map to the same location and be marked as duplicates even when they are not. As such, as the sequencing depth approaches and surpasses the read length, the duplicate rate becomes less indicative of problems.
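The coordinate-based approach described above can be sketched in a simplified form. It treats a pair as a duplicate when an earlier pair has the same chromosome, read start and mate start. The tuple layout and function names are assumptions for this example, and real duplicate-marking tools take more information into account.

```python
def mark_duplicates(pairs):
    """Flag each pair that repeats the coordinates of an earlier pair.

    `pairs` is a list of (chromosome, read_start, mate_start) tuples,
    a hypothetical layout used only for this sketch.
    """
    seen = set()
    flags = []
    for key in pairs:
        flags.append(key in seen)  # duplicate if these coordinates were seen before
        seen.add(key)
    return flags

def duplicate_rate(pairs):
    """Fraction of pairs flagged as duplicates."""
    flags = mark_duplicates(pairs)
    return sum(flags) / len(flags) if flags else 0.0

# Made-up coordinates, for illustration only.
pairs = [("1", 100, 300), ("1", 100, 300), ("1", 100, 350), ("2", 100, 300)]
print(mark_duplicates(pairs))
print(duplicate_rate(pairs))
```

Because the key is purely positional, two independent fragments that happen to share coordinates are also flagged, which is the false-positive effect at high depth that the text describes.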

Duplicates: 65.5 % (10 784 623 reads); remainder: 34.5 %

Mapping Quality Distribution

The mapping quality distribution shows the Phred quality scores describing the probability that a read does not map to the location it has been assigned to. Specifically, Q = -10 log10(P), where Q is the Phred score and P is the probability that the read is in the wrong location. The larger the score, the higher the quality of the mapping. Some scores have a specific meaning, e.g. a score of 0 means that the read could map equally well to multiple places in the reference genome. The majority of reads should be well mapped, so we expect this distribution to be heavily skewed towards high values (typically around 60). It is not unusual to see some scores around zero, because reads originating from repetitive elements in the genome will plausibly map to multiple locations.

[Chart: mapping quality distribution. X-axis: Phred quality score (0 to 60); y-axis: # Reads (2M to 14M).]

Mapped vs Unmapped

Stacked column chart of mapped and unmapped reads for each chromosome in the reference file. It shows the same information as the Mapped Reads chart, broken down by chromosome. Even if the sequenced sample is from a female, reads can still map to the Y chromosome, because the X and Y chromosomes share regions called pseudoautosomal regions (PAR1, PAR2).

Unmapped reads are assigned to a chromosome when one mate of a pair is aligned and the other is not: the unmapped read is given the same chromosome and POS as its mate. Unmapped reads with a chromosome can also arise when the alignment is performed with bwa, which concatenates all the reference sequences together. If a read hangs off the end of one reference sequence onto the next, it is given a chromosome and position but is also classified as unmapped.
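A per-chromosome tally of this kind can be sketched as below, assuming each alignment record provides its reference name (RNAME) and FLAG as in the SAM format. The record layout as (rname, flag) tuples and the function name are assumptions for illustration. Reads with no reference name ("*") are skipped because they cannot be attributed to a chromosome.

```python
from collections import defaultdict

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped

def per_chromosome_rates(records):
    """Return {chromosome: (mapped_fraction, unmapped_fraction)}.

    `records` is an iterable of (rname, flag) tuples (hypothetical layout).
    An unmapped read placed at its mate's position still carries that
    chromosome name, so it is counted as unmapped for that chromosome.
    """
    counts = defaultdict(lambda: [0, 0])  # chromosome -> [mapped, unmapped]
    for rname, flag in records:
        if rname == "*":
            continue
        counts[rname][1 if flag & UNMAPPED else 0] += 1
    return {
        chrom: (mapped / (mapped + unmapped), unmapped / (mapped + unmapped))
        for chrom, (mapped, unmapped) in counts.items()
    }

# Made-up records, for illustration only.
records = [("1", 99), ("1", 147), ("1", 73), ("1", 133), ("*", 77)]
rates = per_chromosome_rates(records)
print(rates)
```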

[Chart: mapped vs unmapped reads per chromosome. Y-axis: 0% to 100%.]

Chromosome  Mapped   Unmapped
1           99.73%   0.27%
2           99.72%   0.28%
3           99.72%   0.28%
4           99.72%   0.28%
5           99.72%   0.28%
6           99.72%   0.28%
7           99.72%   0.28%
8           99.73%   0.27%
9           99.74%   0.26%
10          99.7%    0.3%
11          99.75%   0.25%
12          99.74%   0.26%
13          99.72%   0.28%
14          99.74%   0.26%
15          99.69%   0.31%
16          99.74%   0.26%
17          99.74%   0.26%
18          99.72%   0.28%
19          99.76%   0.24%
20          99.72%   0.28%
21          99.73%   0.27%
X           99.72%   0.28%
Y           99.32%   0.68%
M           99.84%   0.16%