The method is based on the idea that, for a random sequence with a given nucleotide composition, we can calculate the expected number of motifs with a given base composition and a given number of mismatches. These predicted numbers can then be compared with the observed occurrences of motifs in a query sequence.
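The text does not spell out how the predicted numbers are obtained. As a minimal sketch, assume a simple background model in which every position of the random sequence is drawn independently from the given nucleotide frequencies, and only one strand is scanned. The function names and the `base_freqs` parameter are hypothetical and are not taken from the SiTaR source code.

```python
def mismatch_distribution(motif, base_freqs):
    """Return a list dist where dist[k] is the probability that a random
    window of len(motif) differs from motif at exactly k positions.
    Assumption: positions are independent (i.i.d. background model)."""
    dist = [1.0]  # before any position is read: 0 mismatches with probability 1
    for base in motif:
        p_match = base_freqs[base]
        new = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            new[k] += p * p_match            # this position matches
            new[k + 1] += p * (1 - p_match)  # this position mismatches
        dist = new
    return dist


def expected_occurrences(motif, k, seq_len, base_freqs):
    """Expected number of windows with exactly k mismatches on one strand."""
    windows = max(seq_len - len(motif) + 1, 0)
    return windows * mismatch_distribution(motif, base_freqs)[k]


# Example: an 8 bp motif in a 1000 bp sequence with uniform base frequencies.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(expected_occurrences("TGACGTCA", 1, 1000, uniform))
```

Because the match probability is taken per base, the result depends on the base composition of the motif as well as that of the sequence, which is the dependence described above.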
The workflow is represented in Fig. 1:
Fig. 1. Main steps of the SiTaR algorithm.
Step 1 (Search): Each searching motif (SM) is used as a search template with different numbers of mismatches.
Step 2: For each found motif (FM), we record the number of mismatches between it and each searching motif.
Step 3: For each SM, we record the number of its occurrences counted in Step 2 with a given number of mismatches, together with the corresponding predicted number of such occurrences by chance.
Step 4: The weights for each SM are calculated.
Step 5: The scores are calculated. The ranked list of scores is filtered according to the user’s preferences.
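Steps 1 to 3 can be illustrated with a short sketch. It scans only the forward strand with a plain sliding window and counts, for each searching motif, how many windows match with exactly k mismatches. The function names are hypothetical. The weights and scores of Steps 4 and 5 are not defined in this section, so they are deliberately left out.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_hits(query, searching_motifs, max_mismatches):
    """Return counts such that counts[sm][k] is the number of windows in
    query that match the searching motif sm with exactly k mismatches
    (0 <= k <= max_mismatches). Forward strand only (an assumption)."""
    counts = {sm: [0] * (max_mismatches + 1) for sm in searching_motifs}
    for sm in searching_motifs:
        for i in range(len(query) - len(sm) + 1):
            d = hamming(query[i:i + len(sm)], sm)
            if d <= max_mismatches:
                counts[sm][d] += 1
    return counts


# Example: count exact and one-mismatch hits of a single searching motif.
print(count_hits("ACGTACGA", ["ACGT"], max_mismatches=1))
```

The observed counts from such a scan are what Step 3 places next to the values predicted for a random sequence.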
In principle, SiTaR does not require aligned sequences or even sequences of the same length. However, unreasonably long TFBSs (up to 50 bp) are sometimes reported in the literature and, consequently, in databases. (The normal TFBS length is 6-12 bp, in rare cases up to 18 bp. A TFBS of 30-50 bp usually means that the authors did not define the binding site precisely and reported an approximate region surrounding it.) In such cases, to obtain searching motifs of reasonable length, the reported sequences should be aligned with the other TFBSs of the set. If all TFBSs of the set are too long, we recommend re-identifying the “real” TFBSs within these sequences by looking for a common motif of 6-12 bp, using programs such as MEME (Bailey and Elkan, 1994) or Gibbs Sampler (Thompson et al., 2003). If the binding sites of the input set are well defined and have comparable lengths (allowing the same number of mismatches), they can be submitted to the program without further treatment. We emphasize that searching for motifs shorter than 6 bp is not recommended in any case, because the probability that such motifs occur by chance is too high.
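Before submitting a set, the length guidance above can be checked with a few lines of code. This is only a sketch: the function name is hypothetical, and the default limits of 6 bp and 18 bp are taken from the lengths quoted in the text rather than from any check performed by SiTaR itself.

```python
def classify_tfbs_lengths(sites, min_len=6, max_len=18):
    """Group input binding sites by length.
    too_short: below min_len, not recommended as searching motifs
    too_long:  above max_len, candidates for alignment or motif re-identification
    ok:        everything in between"""
    report = {"too_short": [], "ok": [], "too_long": []}
    for site in sites:
        if len(site) < min_len:
            report["too_short"].append(site)
        elif len(site) > max_len:
            report["too_long"].append(site)
        else:
            report["ok"].append(site)
    return report


# Example: one site of each kind.
print(classify_tfbs_lengths(["ACGT", "TGACGTCA", "A" * 30]))
```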
The computation time depends linearly on the length of the query sequences (L), the number of TFBSs (N) and the TFBS length (l), and can be expressed by the formula:
The computation time is determined by the speed of the web server on which the tool currently runs. There are two ways to speed up a run:
1. Send us the data for a run on a more powerful server, or
2. Download and run the tool locally. To obtain the source code, please send us an email.
1. Input TFBS sets.
1. Paste your input sequences into the corresponding fields. If you are unsure or want a test run, you can load the Demo set by clicking the corresponding button.
The results table shows the top 5 results sorted by score. You can open as many rows as you want and sort the table by any column.
1. Adjusting the score.
2. If you are not satisfied with the results, you can select a different number of mismatches. If you want to change the inputs, you can go back by clicking “Return to the input form”.