OligoCounter parameters

Parameters in the OligoCounter text menu: enter a number to change an option. For a quick explanation, see the example at the bottom of this page.

Lower oligo frequency threshold

This is the minimum number of instances an oligo needs to have (i.e. times it must occur in the genome) to be included in further analysis. This is the first threshold to be applied to the data. Set this threshold very low, perhaps 10 or 20, if you are analysing small genomes (i.e. less than 1MB).

Chi squared significance threshold

OligoCounter typically finds millions of oligos in a genome. To decide which of these can be termed overrepresented, we used a statistical approach based on chi-squared. Chi-squared statistics were calculated according to the formula:

(observed count - expected count)^2 / expected count

Expected counts E of an oligonucleotide in a genome were derived from a zero-order Markov model:

E = N * A^a * C^c * G^g * T^t

where N is the genome size in nucleotides, A is the proportion of adenine in the genome and a is the number of adenines in the oligo (and so on for C, G and T).

The chi-squared statistic is not intended to be an indicator of statistical significance but merely of the level of overrepresentation of each oligo; otherwise Bonferroni corrections for multiple tests would also have been carried out.

Second Chi squared significance threshold

As above. This option allows chi squared data at another threshold level to be filtered from the dataset. Its advantage is that the genome does not have to be read in again: getting two sets of output files after reading the genome in once is a lot faster than running OligoCounter twice, each time with one chi-squared threshold.

Heuristic version, saves memory, 1 indicates on, 0 off

The heuristic version of OligoCounter removes oligos from memory (after every 100kbp) while scanning through the genome if they are present fewer than x times. x is set in the next option (4).
While it saves memory, this option is mainly useful for finding only very highly overrepresented words, since it makes assumptions about the distribution of oligos that are present fewer than x times per 100kbp.

Heuristic frequently removes oligos with counts below : 2

See the option above for details: this sets the value of x used by the heuristic option.

A parameters example

OligoCounter Readout:

Settings saved, starting OligoCounter
List of files to be analysed [AE007871_modified.fna]
The genome file should have been read in and the number encoded intermediate results files called temp_results.txt and resultsHash.txt should have been successfully created in the same directory if no errors occurred in this process
1198192 distinct oligos were found in the genome
3196 were found in the genome more than 10 times
Total nucleotides counted:
A 46548
T 46218
G 61549
C 59792
Total 214107
run 1 chiThreshold 100
Genome GC content is : 56.673
Statistical results are now being sorted
Statistical results are now being printed to file resultsStats.txt
Starting the sort mechanism to sort oligos
Number of oligos above the chi-squared threshold: 22
Positional results are now being sorted by chi squared value
OligoCounter completed this genome successfully provided non-empty files were created; if files are empty check your input settings first (especially lower threshold)

Explanation

1198192 distinct oligos were found in the genome
We have over one million hits: these are the unfiltered oligos found in the genome, most of which are present only once or a few times.

3196 were found in the genome more than 10 times
The "lower oligo frequency threshold" has now been applied and leaves roughly 3000 oligos.

run 1 chiThreshold 100
A chi squared significance threshold of 100 was used.

Number of oligos above the chi-squared threshold: 22
The applied chi-squared threshold leaves 22 oligos; these oligos and their genomic positions are then sorted and printed to file.
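
To make the chi-squared calculation more concrete, the following is a minimal Python sketch of the expected-count formula E = N * A^a * C^c * G^g * T^t and the chi-squared statistic, applied to the nucleotide totals from the readout above. It is an illustration only and not OligoCounter's own code: the function names, the example oligo and its observed count are all hypothetical.

```python
# Sketch of the zero-order Markov expected count and the chi-squared statistic.
# Function names, the example oligo and its observed count are hypothetical;
# only the nucleotide totals are taken from the readout above.

def base_proportions(counts):
    """Convert raw nucleotide counts into genome-wide proportions."""
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

def expected_count(oligo, proportions, genome_size):
    """E = N * A^a * C^c * G^g * T^t, multiplying one factor per base in the oligo."""
    expected = float(genome_size)
    for base in oligo:
        expected *= proportions[base]
    return expected

def chi_squared(observed, expected):
    """(observed count - expected count)^2 / expected count."""
    return (observed - expected) ** 2 / expected

# Nucleotide totals from the example readout.
counts = {"A": 46548, "T": 46218, "G": 61549, "C": 59792}
genome_size = sum(counts.values())
proportions = base_proportions(counts)
gc_percent = 100 * (proportions["G"] + proportions["C"])

# Hypothetical oligo and observed count, chosen only for illustration.
oligo = "GCGGCGAC"
observed = 25
expected = expected_count(oligo, proportions, genome_size)
score = chi_squared(observed, expected)

print(f"Genome size: {genome_size}, GC content: {gc_percent:.3f}")
print(f"{oligo}: observed {observed}, expected {expected:.2f}, chi-squared {score:.1f}")
```

An oligo would pass the filter in this sketch when its score exceeds the chosen chi-squared threshold (100 in the example run). Because the statistic is squared, a strongly underrepresented oligo can also score highly, so a real filter may additionally need to check that the observed count is above the expected count.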