Overall design of dbCAN2 meta server



Tool Info

Annotate Protein Sequences

1) Email
- Enter a valid email address; the dbCAN meta server will email you when your job completes.
2) Sequence Type
- To annotate a protein sequence, select "Protein sequence"; to try out the example, right-click the example, save the file to your computer, and then upload it (see 5 below).
3) Select tools to run
- By default, HMMER, DIAMOND, and Hotpep are checked and CGC-Finder is unchecked. Selecting only HMMER gives the same result as the original dbCAN server.
- Selecting CGC-Finder will display the gene position file upload button; you must upload a gene position file (example provided) to have CGC-Finder predict CAZyme gene clusters (CGCs).
- CGCs are defined as genomic regions containing at least one CAZyme gene, one transporter/TC gene (predicted by searching against the TCDB) and one transcription factor/TF gene (predicted by searching against the collectf DB, the RegulonDB, and the DBTBS). The rationale is that CAZymes often work together with each other and with other important genes (e.g. TFs, sugar transporters) to synergistically degrade or synthesize various highly complex carbohydrates. See our paper for why CGCs are interesting to identify.
4) Gene Positions File
- If you chose to run CGC-Finder, you must upload a GFF or BED format file (see here for a BED example) that contains position data for each gene you upload. Gene IDs used in the FASTA file must exactly match those in the BED/GFF file. If using a GFF file, only rows with 'CDS' in the type column will be considered, and gene IDs should be in the notes column with the Name tag; if no Name tag is present, the gene ID should be in the ID tag.
5) Sequence Input
- You can either paste protein sequences into the textbox or upload a file containing protein sequences. In either case, the sequences must be in FASTA format.
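The ID-matching rule in step 4 can be checked locally before uploading. The following is a minimal sketch, not the server's own parser: the function names are hypothetical, and it assumes that a FASTA gene ID is the first whitespace-delimited token of each header line and that the GFF attributes are `key=value` pairs separated by semicolons.

```python
def gff_gene_ids(gff_text):
    """Collect gene IDs from CDS rows: Name tag first, ID tag as fallback."""
    ids = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue  # only rows with 'CDS' in the type column are considered
        attrs = {}
        for field in cols[8].split(";"):
            if "=" in field:
                key, value = field.strip().split("=", 1)
                attrs[key] = value
        gene_id = attrs.get("Name") or attrs.get("ID")
        if gene_id:
            ids.add(gene_id)
    return ids


def fasta_ids(fasta_text):
    """Assumption: the gene ID is the first token after '>' in each header."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def unmatched_ids(fasta_text, gff_text):
    """Return FASTA IDs that have no matching CDS row in the GFF text."""
    return sorted(fasta_ids(fasta_text) - gff_gene_ids(gff_text))


# Made-up example data for illustration only.
example_gff = (
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=gene1\n"
    "chr1\tsrc\tCDS\t1\t900\t.\t+\t0\tID=cds1;Name=prot1\n"
    "chr1\tsrc\tCDS\t1000\t1900\t.\t-\t0\tID=prot2\n"
)
example_fasta = ">prot1 some description\nMKV\n>prot2\nMLL\n>prot3\nMAA\n"

print(unmatched_ids(example_fasta, example_gff))
```

Any ID reported by `unmatched_ids` would need to be fixed in either the FASTA file or the gene positions file before running CGC-Finder.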

Annotate Nucleotide Sequences

1) Email
- Enter a valid email address; the dbCAN meta server will email you when your job completes.
2) Sequence Type
- To annotate a nucleotide sequence, select "Nucleotide sequence"; to try out the example, right-click the example, save the file to your computer, and then upload it (see 5 below).
3) Nucleotide Sequence Type
- To annotate prokaryote genomes, select "Complete/draft prokaryote genomes". To annotate metagenomes, select "Metagenomes". FragGeneScan is used for gene prediction in metagenomes, and Prodigal is used for gene prediction in prokaryote genomes. If you have a eukaryotic genome, please run gene-finding software elsewhere (e.g. MAKER) and then submit the protein sequences.
4) Run CGC-Finder
- Select "Yes" to have CGC-Finder predict CAZyme gene clusters.
5) Sequence Input
- You can either paste DNA sequences into the textbox or upload a file containing DNA sequences. In either case, the sequences must be in FASTA format.

Result page

1) Venn Diagram
- This diagram shows how CAZyme annotations are shared among the three tools. The diagram is interactive and downloadable in different formats (PNG, SVG, CSV). Clicking on a number will display a popup with all the genes that belong in that section of the diagram.
2) Overview
- This tab shows an overview of all the tools run. Each annotated protein is displayed along with the tools that annotated it and the CAZy family each tool assigned. Each CAZy family is also a link to the CAZy web page for that family. Signal peptide predictions are displayed as well. The full SignalP output is available for download at the top of the tab. The table is also available for download, along with the gene predictions (if a nucleotide sequence was uploaded).
- The table can be sorted by the "# of Tools" column; proteins predicted by more tools are more reliable CAZyme candidates. Our benchmark analysis suggests that keeping proteins found by >=2 tools gives the best CAZome annotation performance.
- Comparison of the three tools: we have also systematically compared the outputs of the three tools against the CAZy pre-annotated CAZomes (i.e., as the gold standard sets) of three bacterial genomes and three eukaryotic genomes. The accuracy is calculated as an F-score = 2 × (Recall × Precision)/(Recall + Precision) for the three methods on each examined genome, following the method presented in our dbCAN-seq paper and PlantCAZyme paper. We removed unclassified CAZymes (e.g. GH0) and families not in the PPR library when calculating F-scores. Table S1 shows the best parsing thresholds that we selected to use for the web server: (i) for HMMER+dbCAN, we use E-value < 1e-15 and coverage > 0.35; (ii) for DIAMOND+CAZy, we use E-value < 1e-102; and (iii) for Hotpep+PPR, we use the number of conserved peptide hits > 6 and the sum of conserved peptide frequencies > 2.6. With these best thresholds, Table 1 shows that DIAMOND+CAZy has the highest F-score (0.89) for bacteria but the lowest F-score for eukaryotes (0.84); in contrast, Hotpep+PPR has the highest F-score (0.94) for eukaryotes but the lowest F-score for bacteria (0.80); HMMER+dbCAN performs very well for both eukaryotes (0.86) and bacteria (0.88) and has a slightly higher overall F-score than the other two tools (Table S1). In terms of running time, DIAMOND runs the fastest, followed by Hotpep and HMMER. More importantly, we found that the best performance of automated CAZyme annotation is achieved by aggregating the outputs of the three methods and keeping candidates found by at least two methods. Table 1 shows that the F-score can go up to 0.93 when keeping proteins found by at least two tools.
- Advantage of HMMER search against dbCAN: The F-score calculation only considered whether a protein is found by any of the three tools. It did not consider whether the protein is assigned to the correct family or families, whether the protein has multiple CAZyme domains, or where the domain boundaries are. The figure below shows two example CAZyme proteins found by all three tools. Both proteins have multiple CAZyme domains according to dbCAN annotation (Figure A). According to HMMER+dbCAN output (Figure C), AT1G11720.1 is annotated as CBM53(154-237)+CBM53(329-423)+CBM53(496-584)+GT5(595-1038) and YP_002573728.1 as GH9(36-466)+CBM3(491-576)+CBM3(724-804)+CBM3(923-1003)+GH48(1134-1753), i.e., all CAZyme domains and domain repeats and their positions are reported (Table 1). According to both Hotpep+PPR output and DIAMOND+CAZy output, AT1G11720.1 is annotated as GT5+CBM53 and YP_002573728.1 as GH9+GH48+CBM3, i.e., proteins are assigned to the multiple families correctly, though without reporting domain repeats and positions (Table 1). It should be mentioned that DIAMOND+CAZy carries a much higher risk than the other two tools of giving a wrong CAZyme family annotation. For example, if a query protein only has a GT5 domain and has AAD30251.1 as its best CAZy hit, transferring the family assignment of AAD30251.1 (GT5+CBM53) to the query would be wrong (as there is no CBM53 in the query). Such mistakes will not happen in HMMER and Hotpep searches, as they are conserved domain and motif-based methods.
- The Gene IDs found by HMMER and Hotpep are clickable and open the protein domain/peptide display page, which shows: 1) dbCAN domains identified by HMMER; 2) signature peptides (6-mer) identified by Hotpep, highlighted in the sequence; 3) each signature peptide with its position and CAZyme family. Note that some proteins are only found by HMMER or by Hotpep.
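The F-score formula, the per-tool thresholds, and the "found by at least two tools" rule described above can be written out as a short Python sketch. This is only an illustration of the stated rules, not the server's code: the function names and the input layout (a dictionary mapping each tool name to the protein IDs that passed its threshold) are assumptions.

```python
def f_score(recall, precision):
    """F-score = 2 x (Recall x Precision) / (Recall + Precision)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Thresholds quoted in the text above.
def passes_hmmer(evalue, coverage):
    return evalue < 1e-15 and coverage > 0.35


def passes_diamond(evalue):
    return evalue < 1e-102


def passes_hotpep(peptide_hits, frequency_sum):
    return peptide_hits > 6 and frequency_sum > 2.6


def keep_consensus(tool_hits, min_tools=2):
    """Keep proteins reported by at least `min_tools` of the tools."""
    counts = {}
    for proteins in tool_hits.values():
        for protein in set(proteins):
            counts[protein] = counts.get(protein, 0) + 1
    return {protein for protein, n in counts.items() if n >= min_tools}


# Made-up protein IDs for illustration only.
hits = {
    "HMMER": ["p1", "p2", "p3"],
    "DIAMOND": ["p2", "p3", "p4"],
    "Hotpep": ["p3", "p5"],
}
print(sorted(keep_consensus(hits)))
```

Raising `min_tools` to 3 keeps only proteins found by all three tools, which trades recall for precision.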
3) HMMER
- This tab displays the results of the HMMER run against the dbCAN database. The full output is available for download via a link at the top of the tab.
4) DIAMOND
- This tab displays the results of the DIAMOND search against the CAZy database. The full output is available for download via a link at the top of the tab.
5) Hotpep
- This tab displays the results of the Hotpep run against the PPR database. The full output is available for download via a link at the top of the tab. The second column (CAZy Family) contains links to the conserved peptides file for the appropriate CAZy family. These files are part of the PPR database.
6) CGC-Finder
- 1) This tab displays the output of CGC-Finder, if the user chose to run CGC-Finder. 2) Several files are available for download at the top of this tab. The full input and output files for CGC-Finder are available, along with the full DIAMOND outputs that were used to annotate genes as TCs (transporters) or TFs (transcription factors).
- CGCs are defined as genomic regions containing at least one CAZyme gene, one transporter/TC gene (predicted by searching against the TCDB) and one transcription factor/TF gene (predicted by searching against the collectf DB, the RegulonDB, and the DBTBS). The rationale is that CAZymes often work together with each other and with other important genes (e.g. TFs, sugar transporters) to synergistically degrade or synthesize various highly complex carbohydrates.
- 3) Rerun CGC-Finder: At the bottom of the CGC-Finder tab, users can choose to rerun CGC-Finder with customized settings. The distance setting is the maximum number of non-signature genes allowed between signature genes. The signature genes setting specifies which signature genes must be present in a cluster for it to be annotated as a CGC. The CGC-Finder rerun is very fast, and the page returns to Overview; click the CGC-Finder tab to view the new CGC result.
- The individual CGC page: clicking on a CGC ID opens a new page showing: 1) the CGC plot made by our GCPU (gene cluster plot utility) program; 2) download links for the PDF of the plot and the text format of the CGC; 3) the detailed genes and their genomic locations, including the distance of a signature gene from its upstream signature gene (Upstream distance) and the distance from its downstream signature gene (Downstream distance), as well as their best DIAMOND hits in the CAZy, TF and TC databases.
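The two rerun settings described above (distance and signature genes) can be illustrated with a small Python sketch. This is not CGC-Finder's actual implementation: the function name, the input layout (an ordered list of genes on a single contig, each labelled "CAZyme", "TC", "TF", or None for a non-signature gene), and the default gap of 2 are all assumptions made for the example.

```python
def find_clusters(genes, max_gap=2, required=("CAZyme", "TC", "TF")):
    """Group signature genes separated by at most `max_gap` non-signature
    genes, then keep groups that contain every required signature type.

    genes: list of (gene_id, kind) in genomic order on one contig.
    """
    groups = []
    current = []  # indices of signature genes in the group being built
    for i, (_, kind) in enumerate(genes):
        if kind is None:
            continue  # non-signature gene: only counts towards the gap
        if current and i - current[-1] - 1 > max_gap:
            groups.append(current)  # gap too large: close the group
            current = []
        current.append(i)
    if current:
        groups.append(current)

    clusters = []
    for idxs in groups:
        kinds = {genes[i][1] for i in idxs}
        if set(required) <= kinds:
            # Report every gene from the first to the last signature gene.
            clusters.append([gid for gid, _ in genes[idxs[0]:idxs[-1] + 1]])
    return clusters


# Made-up gene order for illustration only.
genes = [
    ("g1", "CAZyme"), ("g2", None), ("g3", "TC"), ("g4", "TF"),
    ("g5", None), ("g6", None), ("g7", None), ("g8", "CAZyme"),
]
print(find_clusters(genes))
print(find_clusters(genes, required=("CAZyme",)))
```

With the default settings only g1 to g4 form a cluster, because three non-signature genes separate g4 from g8; relaxing `required` to CAZyme alone also reports g8 as its own cluster.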