
Annotate Pipeline

Annotate Protein Sequences

1) Email
- Enter a valid email address; the dbCAN meta server will email you when your job completes.
2) Sequence Type
- To annotate a protein sequence, select "Protein sequence"
3) Run CGC-Finder
- Select "Yes" to have CGC-Finder predict CAZyme gene clusters (CGCs).
- CGCs are defined as genomic regions containing at least one CAZyme gene, one transporter (TC) gene (predicted by searching against TCDB), and one transcription factor (TF) gene (predicted by searching against the CollecTF database, RegulonDB, and DBTBS). The rationale is that CAZymes often work together with each other and with other important genes (e.g. TFs, sugar transporters) to synergistically degrade or synthesize various highly complex carbohydrates. See our paper for why CGCs are interesting to identify.
4) Auxiliary Locations File
- If you chose to run CGC-Finder, you must upload a GFF or BED format file (see here for an example) that contains position data on each gene you upload.
5) Sequence Input
- You can either paste protein sequences into the textbox or upload a file containing protein sequences. In either case, the sequences must be in FASTA format.
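
Before submitting, it can help to check locally that your input really is protein FASTA. The following is a minimal sketch of such a check, not part of the dbCAN meta server; the function names (parse_fasta, non_protein_ids) and the accepted alphabet are assumptions made for illustration, and the server may apply different rules.

```python
from typing import Dict, List

# Assumed alphabet: the 20 standard amino acids plus common ambiguity
# codes (B, X, Z), U, O, and the stop symbol "*".
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYBXZUO*")


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {sequence_id: sequence}.

    The ID is the first whitespace-delimited token after '>'.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            current_id = header[0]
            if current_id in records:
                raise ValueError(f"duplicate sequence ID: {current_id}")
            records[current_id] = ""
        else:
            if current_id is None:
                raise ValueError("sequence data before first '>' header")
            records[current_id] += line.upper()
    return records


def non_protein_ids(records: Dict[str, str]) -> List[str]:
    """Return IDs whose sequence is empty or has characters outside the alphabet."""
    return [
        seq_id
        for seq_id, seq in records.items()
        if not seq or set(seq) - PROTEIN_ALPHABET
    ]
```

For example, calling non_protein_ids(parse_fasta(open("proteins.faa").read())) on a hypothetical file proteins.faa lists the records worth inspecting before upload. Note that a DNA sequence made only of A, C, G, and T passes this check, because those letters are also valid amino acid codes.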

Annotate Nucleotide Sequences

1) Email
- Enter a valid email address; the dbCAN meta server will email you when your job completes.
2) Sequence Type
- To annotate a nucleotide sequence, select "Nucleotide sequence"
3) Nucleotide Sequence Type
- To annotate prokaryote genomes, select "Complete/draft prokaryote genomes". To annotate metagenomes, select "Metagenomes". FragGeneScan is used for gene prediction in metagenomes, and Glimmer is used for gene prediction in prokaryote genomes. If you have a eukaryotic genome, please run gene-finding software elsewhere (e.g. MAKER) and then submit the predicted protein sequences.
4) Run CGC-Finder
- Select "Yes" to have CGC-Finder predict CAZyme gene clusters.
5) Sequence Input
- You can either paste DNA sequences into the textbox or upload a file containing DNA sequences. In either case, the sequences must be in FASTA format.

Output Info

1) Venn Diagram
- This diagram shows how CAZyme annotations are shared among the three tools. The diagram is interactive and downloadable. Clicking on a number will display a popup listing all the genes that belong in that section of the diagram.
2) DIAMOND
- This tab displays the results of the DIAMOND search against the CAZy database. The full output is available for download via a link at the top of the tab.
3) HMMER
- This tab displays the results of the HMMER run against the dbCAN database. The full output is available for download via a link at the top of the tab.
4) Hotpep
- This tab displays the results of the Hotpep run against the PPR database. The full output is available for download via a link at the top of the tab. The second column (CAZy Family) contains links to the conserved peptides file for the corresponding CAZy family. These files are part of the PPR database.
5) CGC-Finder
- This tab displays the output of CGC-Finder, if the user chose to run CGC-Finder. Several files are available for download at the top of this tab: the full input and output files for CGC-Finder, along with the full DIAMOND outputs that were used to annotate genes as transporters (TCs) or transcription factors (TFs).
- CGCs are defined as genomic regions containing at least one CAZyme gene, one transporter (TC) gene (predicted by searching against TCDB), and one transcription factor (TF) gene (predicted by searching against the CollecTF database, RegulonDB, and DBTBS). The rationale is that CAZymes often work together with each other and with other important genes (e.g. TFs, sugar transporters) to synergistically degrade or synthesize various highly complex carbohydrates.
6) Rerun CGC-Finder
- At the bottom of the CGC-Finder tab, users can choose to rerun CGC-Finder with custom settings. The distance setting is the maximum number of non-signature genes allowed between two neighboring signature genes. The signature genes setting specifies which types of signature genes must be present in a cluster for it to be annotated as a CGC.
7) Overview
- This tab shows an overview of all the tools run. Each annotated gene is displayed along with the tools that annotated it and the CAZy family each tool assigned. Each CAZy family is also a link to the CAZy web page for that family. Signal peptide predictions are displayed as well. The full SignalP output is available for download at the top of the tab. The table is also available for download, along with the gene predictions (if a nucleotide sequence was uploaded).
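
To make the two Rerun CGC-Finder settings described above more concrete, here is a minimal sketch of how a distance limit and a set of required signature genes could interact. It is an illustration only and not the actual CGC-Finder implementation: the function name find_clusters, the labels "CAZyme", "TC", and "TF", and the default distance of 2 are all assumptions, and the input is assumed to be the genes of a single contig in genomic order.

```python
from typing import FrozenSet, List, Optional, Sequence, Tuple

# (gene_id, signature type); the type is None for a non-signature gene.
Gene = Tuple[str, Optional[str]]


def _has_required(genes: Sequence[Gene], start: int, last: int,
                  required: FrozenSet[str]) -> bool:
    """True if genes[start..last] contain every required signature type."""
    kinds = {kind for _, kind in genes[start:last + 1] if kind is not None}
    return required <= kinds


def find_clusters(
    genes: Sequence[Gene],
    distance: int = 2,  # assumed default, for illustration only
    required: FrozenSet[str] = frozenset({"CAZyme", "TC", "TF"}),
) -> List[List[str]]:
    """Group signature genes separated by at most `distance` non-signature genes.

    A group is reported only if it contains all `required` signature types.
    Each reported cluster spans from its first to its last signature gene.
    """
    clusters: List[List[str]] = []
    start: Optional[int] = None  # first signature gene of the open group
    last: Optional[int] = None   # most recent signature gene seen
    for i, (_, kind) in enumerate(genes):
        if kind is None:
            continue
        # Too many non-signature genes since the last signature gene:
        # close the open group and start a new one here.
        if last is not None and i - last - 1 > distance:
            if _has_required(genes, start, last, required):
                clusters.append([gid for gid, _ in genes[start:last + 1]])
            start = None
        if start is None:
            start = i
        last = i
    # Close the final open group, if any.
    if start is not None and _has_required(genes, start, last, required):
        clusters.append([gid for gid, _ in genes[start:last + 1]])
    return clusters
```

In this sketch, raising the distance merges neighboring groups into larger clusters, while shrinking the required set (for example, to CAZyme and TC only) lets more groups qualify as CGCs.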

Tool Info