The method is based on the idea that, for a random sequence with a given nucleotide composition, we can calculate the expected number of motifs with a given base composition and a given number of mismatches. These predicted numbers can then be compared with the actual occurrences of the motifs in a query sequence.
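As an illustration of this idea, the following Python sketch estimates how often a motif would be expected to occur by chance with exactly k mismatches. It is not taken from the SiTaR source code and rests on assumptions of our own: positions in the random sequence are treated as independent draws from the nucleotide composition, only one strand is scanned, and the function names (`mismatch_distribution`, `expected_occurrences`, `base_composition`) are hypothetical.

```python
from collections import Counter


def base_composition(sequence):
    """Relative frequencies of a, c, g, t in a sequence (other symbols are ignored)."""
    counts = Counter(sequence.lower())
    total = sum(counts[b] for b in "acgt")
    return {b: counts[b] / total for b in "acgt"}


def mismatch_distribution(motif, base_freqs):
    """P(exactly k mismatches), k = 0..len(motif), at one random window.

    Assumption: each position of the random sequence is drawn independently
    with the probabilities given in base_freqs.
    """
    dist = [1.0]  # dist[k] = probability of k mismatches over the positions seen so far
    for base in motif.lower():
        p_match = base_freqs[base]
        new = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            new[k] += p * p_match            # this position matches
            new[k + 1] += p * (1 - p_match)  # this position mismatches
        dist = new
    return dist


def expected_occurrences(motif, k, seq_length, base_freqs):
    """Expected number of windows with exactly k mismatches (single strand)."""
    windows = seq_length - len(motif) + 1
    if windows <= 0:
        return 0.0
    return windows * mismatch_distribution(motif, base_freqs)[k]


# Example: a 7 bp motif in a 1000 bp sequence with uniform composition.
uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
for k in range(3):
    print(k, expected_occurrences("aggttgc", k, 1000, uniform))
```

The expected count grows quickly with the number of allowed mismatches and with decreasing motif length, which is why short motifs produce many chance hits.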

The workflow is represented in Fig. 1:

Fig. 1. Main steps of the SiTaR algorithm. **Step 1 (Search):** Each searching motif (SM) is used as a search template with different numbers of mismatches. **Step 2:** For each found motif (FM), we record the number of mismatches between it and each searching motif. **Step 3:** For each SM, we write down the number of its occurrences (counted in Step 2) with a given number of mismatches, together with the predicted number of such occurrences by chance. **Step 4:** The weights for each SM are calculated. **Step 5:** The scores are calculated. The ranked list of scores is filtered according to the user’s settings.
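Steps 1 to 3 of the figure can be sketched in Python as follows. This is a minimal illustration, not the SiTaR implementation: it scans a single strand only, the names `hamming`, `scan` and `observed_counts` are hypothetical, and the weighting and scoring of Steps 4 and 5 are left out because their formulas are not given here.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def scan(query, searching_motifs, max_mismatches=1):
    """Steps 1 and 2: use every SM as a template and record the found motifs.

    Returns a list of (position, searching_motif, found_motif, mismatches).
    """
    query = query.lower()
    hits = []
    for sm in searching_motifs:
        sm = sm.lower()
        for pos in range(len(query) - len(sm) + 1):
            window = query[pos:pos + len(sm)]
            d = hamming(sm, window)
            if d <= max_mismatches:
                hits.append((pos, sm, window, d))
    return hits


def observed_counts(hits):
    """Step 3 (observed part): occurrences per (searching_motif, mismatches)."""
    counts = Counter()
    for _pos, sm, _fm, d in hits:
        counts[(sm, d)] += 1
    return counts


# Example with two searching motifs from the sample TFBS set.
query = "ttaggttgcaaacgttgctt"
hits = scan(query, ["aggttgc", "acgttgc"], max_mismatches=1)
for hit in hits:
    print(hit)
print(observed_counts(hits))
```

The observed counts from such a scan are what would be compared with the numbers predicted by chance for each SM and each number of mismatches.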

In principle, SiTaR does not require aligned sequences or even sequences of the same length. However, the literature and, consequently, the databases sometimes report unreasonably long TFBSs (up to 50 bp). The normal TFBS length is 6-12 bp, in rare cases up to 18 bp; a much longer TFBS (30-50 bp) usually means that the authors did not define the binding site precisely and reported an approximate region around it instead. In such cases, to obtain searching motifs of reasonable length, the reported sequences should be aligned with the other TFBSs of the set. If all TFBSs of the set are too long, we recommend re-identifying the “real” TFBSs within these sequences by looking for a common motif of 6-12 bp, using programs such as MEME (Bailey and Elkan, 1994) or Gibbs Sampler (Thompson et al., 2003). If the binding sites of the input set are well defined and of comparable length (so that the same number of mismatches can be allowed), they can be submitted to the program without further treatment. We emphasize that searching for motifs shorter than 6 bp is not recommended in any case, because the probability that such motifs occur by chance is too high.

The dependency of the time on the length of query sequences (L), number of TFBSs (N) and TFBS length (l) is linear and can be expressed by the formula:

The computational time depends on the speed of the web server on which the tool currently runs. There are two ways to speed up a run:

1. Send us the data for a run on a more powerful server, or

2. Download and run the tool locally. To obtain the source code, please write us an email.

**1. Input TFBS sets.**

The TFBS sets should be provided in a special format looking like this:

>TFBS name

aggttgc

agggtgc

agggtgc

acgttgc

acgttgc

…

The length of the input TFBS sequences is theoretically not restricted, but please keep in mind that a TFBS is normally 8-15 bp long, sometimes up to 18 bp, and normally not longer than that. If you are looking for motifs other than TFBSs, you can use sites of any length. The number of input TFBS sequences is not restricted, but counting may become slow when the total number of sites exceeds 200. The number of individual TFBS sets in one search run does not matter. A set should contain at least 10 sites; where possible, we recommend using sets of at least 15-20 sites.
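To check a TFBS file before submitting it, the format and the recommendations above can be expressed as a small Python sketch. It is an illustration only: the function names are hypothetical, and we assume that several sets can be concatenated in one input, each starting with its own `>` header line. The length limits of 6 and 18 bp follow the recommendation given on this page for TFBSs; adjust them if you are searching for other motifs.

```python
def parse_tfbs_sets(text):
    """Parse '>name' headers followed by one site per line; blank lines are skipped."""
    sets = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            sets.setdefault(current, [])
        elif current is None:
            raise ValueError("sequence found before the first '>' header")
        else:
            sets[current].append(line.lower())
    return sets


def check_tfbs_set(name, sites, min_sites=10, min_len=6, max_len=18):
    """Return a list of warnings based on the recommendations in this manual."""
    warnings = []
    if len(sites) < min_sites:
        warnings.append(
            f"{name}: only {len(sites)} sites, at least {min_sites} are needed"
        )
    for site in sites:
        if not min_len <= len(site) <= max_len:
            warnings.append(
                f"{name}: site '{site}' has length {len(site)}, "
                f"outside {min_len}-{max_len} bp"
            )
    return warnings


example = """>TFBS name
aggttgc
agggtgc
acgttgc
"""
for set_name, set_sites in parse_tfbs_sets(example).items():
    for warning in check_tfbs_set(set_name, set_sites):
        print(warning)
```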

Query sequences must be in FASTA format. The number and length of the sequences are not restricted. However, very long inputs, such as a whole-genome set of promoters, may take a long time to compute because our server is not very powerful. If you encounter such problems, please contact us.

1. Paste your input sequences in the corresponding fields. If you are not sure or want to have a test run, you can load the Demo set by clicking on the corresponding button.

IMPORTANT!

For promoter analysis, it is not sensible to consider very long sequences.

We recommend using sequences of up to 1000-1500 bp, preferably shorter.

TFBSs shorter than 6 bp or longer than 18 bp are not useful!

2. Select the number of mismatches for your run. If you don’t do this, the tool will use 1 mismatch as the default value.

3. Click the Start button.

The results table shows the top 5 results sorted by score. You can open as many rows as you want and sort the table by any column.

1. Adjusting the score.

To cut the list of results at a particular score, enter the score value in the corresponding field (which opens after clicking on the “+”).

To get an impression of the behaviour of the true positives (TPs) and false positives (FPs) and of the influence of the score, you can have a look at the plot.

2. If you are not satisfied with the results, you can select another number of mismatches. If you want to change the inputs, you can go back by clicking “Return to the input form”.