CASSIS

CASSIS ("cluster assignment by islands of sites") is a tool to predict secondary metabolite gene clusters around a given anchor/backbone gene. A gene cluster is a small group of genes that are tightly co-localized and co-regulated and that participate in the same metabolic pathway.

CASSIS utilizes a so-called "motif-based" prediction method. It is mainly based on the hypothesized co-regulation of cluster genes. Hence, CASSIS searches for transcription factor binding sites shared by promoter sequences of putative cluster genes.

The motif-based method applied by CASSIS is complementary to similarity-based methods, such as those exploited by antiSMASH or SMURF.

SMIPS

SMIPS ("secondary metabolites by InterProScan") is a tool for genome-wide prediction of anchor/backbone genes. Anchor genes encode enzymes that play a major role in the biosynthesis of secondary metabolites. SMIPS identifies the three most common classes of anchor genes: polyketide synthases (PKS), non-ribosomal peptide synthetases (NRPS), and dimethylallyltryptophan synthases (DMATS).

The anchor gene predictions made by SMIPS are based on protein domain annotations provided by the InterProScan tool.

General idea:

Promoter-based prediction of secondary metabolite gene clusters

Gene clusters are defined as sets of co-localized and co-regulated genes, the products of which are presumably functionally connected. Co-regulation implies the existence of common regulatory patterns in the cluster promoters, namely binding sites for a shared transcription factor (TF).

Genes involved in secondary metabolite (SM) biosynthesis are often organized in clusters, where the role of the anchor gene is played by a polyketide synthase (PKS), a non-ribosomal peptide synthetase (NRPS), or a dimethylallyltryptophan synthase (DMATS).

In fungi, SM clusters are typically of modest size (normally up to 20 genes), are characterized by tight co-localization of successive genes, and are often regulated by a cluster-specific transcription factor (csTF), which can itself be part of the respective cluster.

How it works

The search is made in two steps:

  • (i) Identification of the anchor genes (also referred to as backbone genes): PKS, NRPS, and DMATS genes. This step can be accomplished with the SMIPS tool or independently.
  • (ii) Detection of the clusters around the anchor genes by the CASSIS tool.

Here is the whole workflow in more detail:

Step 1. Detection of the anchor genes by the SMIPS tool.

This step can be omitted if you already know your anchor genes.

[Workflow figure: InterProScan -> SMIPS -> anchor gene]

Protein sequences are submitted to InterProScan to predict protein domains.

(This step is omitted if you already have InterProScan tables.)

The SMIPS tool predicts the secondary metabolite (SM) anchor genes based on the protein domain annotations provided by InterProScan.

The SMIPS predictions can be used for genome-wide annotation, characterization of particular genes, etc. They also can serve as the input for CASSIS.
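The idea of predicting anchor genes from domain annotations can be illustrated with a minimal sketch. The domain labels and the rule table below are hypothetical and chosen only for illustration; they are not the rules SMIPS actually applies to InterProScan output, which are described on the SMIPS help page.

```python
from typing import Dict, Optional, Set

# Hypothetical rule table: domain labels that must all be present for a
# protein to be assigned to an anchor class. Illustrative only; these are
# NOT the InterPro-based rules used by SMIPS.
REQUIRED_DOMAINS: Dict[str, Set[str]] = {
    "PKS": {"ketosynthase", "acyltransferase"},
    "NRPS": {"adenylation", "condensation", "peptidyl_carrier"},
    "DMATS": {"dmats_prenyltransferase"},
}


def classify_anchor(domains: Set[str]) -> Optional[str]:
    """Return the first anchor class whose required domains are all present."""
    for sm_class, required in REQUIRED_DOMAINS.items():
        if required <= domains:  # subset test
            return sm_class
    return None


def predict_anchors(annotations: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each gene with a recognized domain combination to its class."""
    result: Dict[str, str] = {}
    for gene, domains in annotations.items():
        sm_class = classify_anchor(domains)
        if sm_class is not None:
            result[gene] = sm_class
    return result
```

Genes whose domain sets match none of the (assumed) rules are simply left out of the result, mirroring the fact that only anchor genes are reported.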

Step 2. Detection of the clusters around the anchor genes by the CASSIS tool:

[Workflow figure: promoter sets -> MEME -> motif -> FIMO -> found motifs -> gene cluster]

Sets of interim promoters are selected around the anchor gene (provided by SMIPS).
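One way to picture the selection of promoter sets is to enumerate contiguous windows of promoters that all contain the anchor gene's promoter. The sketch below does this under assumed limits; the function name and the parameters `max_flank` and `min_size` are hypothetical and do not correspond to documented CASSIS settings.

```python
from typing import List, Tuple


def promoter_windows(
    anchor: int,
    n_promoters: int,
    max_flank: int = 3,  # assumed limit on promoters taken on each side
    min_size: int = 2,   # assumed minimum number of promoters per set
) -> List[Tuple[int, int]]:
    """List (start, end) index pairs of contiguous promoter sets that
    include the anchor promoter. Indices are 0-based and inclusive."""
    windows: List[Tuple[int, int]] = []
    for up in range(max_flank + 1):
        for down in range(max_flank + 1):
            start, end = anchor - up, anchor + down
            if start < 0 or end >= n_promoters:
                continue  # window would run off the contig
            if end - start + 1 >= min_size:
                windows.append((start, end))
    return windows
```

Each window would then correspond to one promoter set handed to the motif search.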

The promoter sets are submitted to MEME for motif predictions.

The motifs are then submitted to FIMO for a genome-wide search in all promoter sequences (Pr1, Pr2, ...).

The sequence of promoters, each characterized by the number of found motifs, is considered as a string of numbers. This number string is searched for an "island" of non-zero values, which is regarded as the cluster.

The "island" of sites around the anchor gene corresponds to the cluster region. If different motifs result in different cluster predictions, the most abundant one will be reported.
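The island search described above can be sketched in a few lines. This is a simplified illustration, not the CASSIS implementation: it allows no gaps inside an island, and it interprets "most abundant" as the cluster boundaries predicted by the largest number of motifs, which is an assumption.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def find_island(counts: List[int], anchor: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the run of non-zero motif counts that
    contains the anchor promoter, or None if the anchor has no motif."""
    if counts[anchor] == 0:
        return None
    start = end = anchor
    while start > 0 and counts[start - 1] > 0:
        start -= 1  # extend the island upstream
    while end < len(counts) - 1 and counts[end + 1] > 0:
        end += 1    # extend the island downstream
    return start, end


def most_abundant_island(
    counts_per_motif: Dict[str, List[int]], anchor: int
) -> Optional[Tuple[int, int]]:
    """Pick the island predicted by the largest number of motifs
    (assumed reading of 'most abundant')."""
    islands = [find_island(c, anchor) for c in counts_per_motif.values()]
    found = [i for i in islands if i is not None]
    if not found:
        return None
    return Counter(found).most_common(1)[0][0]
```

For example, with motif counts `[0, 0, 1, 2, 1, 0, 3]` and the anchor at index 3, the island spans promoters 2 to 4; the non-zero value at index 6 is separated by a zero and is therefore not part of it.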

How to start

There are several options to start, depending on which input data you have at hand.

For SMIPS, there are two input options (see SMIPS help page for formats):

  • InterProScan or JGI-InterProScan output files. See SMIPS InterProScan input.
  • A protein FASTA file, to start from scratch. In this case, SMIPS first runs InterProScan and then starts the predictions. See SMIPS protein input.

To start SMIPS, you only need to select the correct input window, upload your input file, and press the Run SMIPS button.

CASSIS requires three inputs (see CASSIS help page for formats and examples):

  • Genomic sequence: A multiFASTA file containing the DNA sequences of all contigs (chromosomes, scaffolds) of the species. Contig numbers must coincide with those in the annotation file.
  • Annotation file: A text file with at least five columns (gene <string> | contig <string> | start position <int> | stop position <int> | strand <+ or ->), sorted by contig, start coordinate, and stop coordinate.
  • Anchor gene name or ID: Feature ID of the cluster's anchor gene. The ID has to coincide with the one used in the annotation file. This gene will be the starting point of the cluster prediction.
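Before submitting, it can help to check the annotation file against the requirements listed above. The following sketch assumes tab-separated columns, which is an assumption (the CASSIS help page defines the actual format); the names `Gene`, `parse_annotation`, and `check_annotation` are hypothetical helpers.

```python
import csv
import io
from typing import List, NamedTuple


class Gene(NamedTuple):
    gene: str
    contig: str
    start: int
    stop: int
    strand: str


def parse_annotation(text: str) -> List[Gene]:
    """Parse the first five columns of each non-empty line.
    Assumes tab-separated values (not confirmed by the source)."""
    genes: List[Gene] = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue
        if len(row) < 5:
            raise ValueError(f"expected at least 5 columns, got {len(row)}: {row}")
        name, contig, start, stop, strand = row[:5]
        if strand not in ("+", "-"):
            raise ValueError(f"invalid strand {strand!r} for gene {name}")
        genes.append(Gene(name, contig, int(start), int(stop), strand))
    return genes


def check_annotation(genes: List[Gene], anchor_id: str) -> None:
    """Raise ValueError if rows are not grouped by contig and sorted by
    (start, stop) within each contig, or if the anchor ID is missing."""
    seen_contigs = set()
    previous = None
    for g in genes:
        if previous is not None and g.contig == previous.contig:
            if (g.start, g.stop) < (previous.start, previous.stop):
                raise ValueError(f"{g.gene} is out of order on {g.contig}")
        elif g.contig in seen_contigs:
            raise ValueError(f"contig {g.contig} appears in separate blocks")
        seen_contigs.add(g.contig)
        previous = g
    if anchor_id not in {g.gene for g in genes}:
        raise ValueError(f"anchor gene {anchor_id} not found in annotation")
```

The check only verifies that each contig forms one block and is internally sorted; it does not prescribe a particular order of the contigs themselves, and it does not compare contig names against the genomic FASTA file.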

CASSIS allows several parameters to be adjusted. Please consult the CASSIS help page and the CASSIS publication for details.

Output files

SMIPS

SMIPS shows the prediction results immediately on screen (InterProScan input) or sends you a link to the result page via email (protein input). In addition, it generates a downloadable text file with the following columns in tabular form (see SMIPS help page for details):

Gene ID | Secondary metabolite type | Domain arrangement | Domain description

CASSIS

CASSIS sends the prediction results via email. The final prediction is a table containing the following columns (see CASSIS help page for details):

  • Gene names: First and last gene name of the gene cluster predicted by CASSIS
  • Promoter numbers: First and last promoter of the gene cluster predicted by CASSIS
  • Length: Length of the gene cluster predicted by CASSIS
  • Motif scores: MEME motif score
  • Abundances: CASSIS abundance score

Additional prediction details are attached to the email as a zip file. It contains the following information:

  • A file with all possible cluster predictions around the given anchor gene (e.g. "Afu6g09660_all_predictions.csv").
  • A folder containing additional information for the final cluster prediction and the corresponding binding site motifs (e.g. "Afu6g09630_to_Afu6g09785").