Help page

You are here: Home > HELP

* Browse page     click to see this sample page.

    The navigation contains Home, Download, CAZyme Gene Cluster, Metadata, Links, Help and About Us.
  • 1.1 Download
  • The CAZyme sequence and annotation data are available in the download page.
    There are three functions in this page (red boxes in below picture):
    1. You can put key words (GCF ID [NCBI genome assembly ID], species name and taxid) to search the download table
    2. You can choose to show different numbers of entries
    3. There are six properties shown for each genome including the fraction of CAZyme in the genome
    Click the GCF ID, you can download the tar.gz file; this tar ball contains two files:
    1. GCF_ID.fasta contiains the all of the CAZymes sequence of this GCF_ID.
    2. GCF_ID.txt contains the following tab-separated properties of each CAZymes:
    (GCF_ID, CAZyme_ID, Product, RefSeq_ID, Start, End, Strand, CAZyme_domains, Molecular_weight, Isoelectric_point, TMHMM_num, LipoP, Predicted_EC, MetaCyc, SignalP_cleavage_site).

  • 1.2 CAZyme Gene Cluster
  • This page shows the CAZyme Gene Clusters (CGCs) that we predicted from all the bacterial genomes. CGCs are defined as genomic regions containing at least one CAZyme gene, one transporter gene (predicted by searching against the TCDB) and one transcription factor/TF gene (predicted by searching against the collectf DB, the RegulonDB, and the DBTBS. The rational is that CAZymes often work together with each other and with other important genes (e.g. TFs, sugar transporters) to synergistically degrade or synthesize various highly complex carbohydrates.
    The CGC table shows the number of CAZymes, the number of Transporters and the number of TFs. Click on the GCF ID, one can open its CGC page.

  • 1.2.1 CAZyme Gene Cluster - GCF_ID (CGC_GCF)
  • CGC_GCF page contains two parts: (i) the first part is Jbrowser of this CGC, the yellow highlights the location range of this CGC; (ii) the second part is a table showing all the genes in the range above, CGC signature genes are highlighted with yellow color.

    Clicking one gene in the highlighted region of Jbrowser will open a pop window with its primary data, attributes, fasta sequences to download and some subfeatures.
    This function comes with Jbrowse, and is also found in the protein page.

  • 1.3 Metadata
  • Metadata are the macroscopical terms describing different biological properties. Here we downloaded all the bacteria's species metadata from JGI's Integrated Microbial Genomes (IMG) database. We then extracted genomes having at least one metadata property in our dbCAN-seq database. The properties are disease, ecosystem, ecosystem category, ecosystem type, habitat, metabolism, motility, oxygen_requirement, ph, phenotype, salinity, sample body site, sample body subsite, specific ecosystem, temperature range. All these features could be selected from a pull-down menu. When one feature is selected (e.g., disease in the red box), a bar chart will be displayed with y axis showing the different diseases and the bar height showing the number of genomes/GCFs in that disease. The bars are sorted according to height. Clicking one bar will lead to a new page with another bar chart showing the genomes/GCFs in that bar (below).

  • 1.3.1 Metadata-property
  • This bar chart is sorted according to height too. Each bar represents one GCF/genome, with the height proportional to the number of CAZymes found in the genome. Clicking one of them, we will go to the genome page.

* Search Engine    

    Introduction: A search box is shown at the top part of all pages. The search function is built upon the powerful search engine Sphinx, which has been programmatically configured to allow very fast index-organized table search (average response time: 300 millisecond) and highly efficient pagination. It implements a Google-like search supporting both exact and fuzzy query, and users can input a keyword to search 12 different data types. These data types can be largely classified into three groups:

    (i) basic information about CAZymes: such as species name, CAZyme domain, protein ID, GCF ID, taxonomy ID;
    (ii) CAZyme annotation data: such as PDB hits, Swiss-Prot hits, CAZy hits, CDD hits, E2P2 predicted enzyme reaction and EC number;
    (iii) CAZyme genomic context.

    For data types in (ii), it is to search for CAZymes sharing sequence similarity to Swiss-Prot, PDB, and CAZy proteins. For example, one can type in a PDB ID (e.g., 1LZL_A) and choose a sequence identity value (e.g, 50%). The search will return a list of CAZymes sharing similarity with the queried PDB protein with identity larger than the given value: similarity = 20%,1LZL_A. This is very useful to answer questions like, what proteins in dbCAN-seq have a high sequence similarity to some experimentally characterized CAZymes?

    As for (iii), it is a very useful tool to search the gene neighborhood of a query CAZyme. For example, users can type in a CAZyme protein ID (e.g., WP_007212487.1), and select how many upstream and downstream genes of the query gene they want to explore (e.g., 5). The search will return a table with 11 genes with the query gene being the 6th in the table: WP_007212487.1 . If any of the 11 genes are CGC signature genes (i.e., CAZyme, TF, or TC), they will be highlighted with colors. This is a novel tool to answer questions like, is my CAZyme located close to any other CAZyme genes or TF or TC genes, or is my CAZyme potentially located in any CGCs?

    1. CAZyme ID (e.g., NP_212393.2 );
    2. GCF_ID (e.g., GCF_000005825.2);
    3. Tax_ID (e.g., 398511);
    4. Species_Name (e.g.,Bacillus pseudofirmus OF4);
    5. CAZyme_domain (e.g., CE4);
    6. Pdb hit (e.g., similarity = 20%,1LZL_A);
    7. Swiss-Prot hit (e.g.,similarity = 20%,ETHA_MYCTU or P9WNF9 or sp|P9WNF9|ETHA_MYCTU);
    8. Cdd ID (e.g., similarity = 20%,COG2072);
    9. CAZyme hit (e.g., similarity = 20%,AHF23796.1);
    10. MetaCyc (e.g.,;
    11. Predited_EC (e.g.,;
    12. CAZyme ID (CGC) (e.g., WP_007212487.1);

    From 1 to 5 and 10 to 12, one can type the keyword to search.

    From 6 to 9, one has to specify a sequence identity value in addition to a keyword. For example, if choose to search pdb_hit, on the left an identity value, e.g., 20%, has to be selected, followed by a pdb ID. The result will be a list of proteins in the database that share > 20% identity to the pdb protein.

* Browse page     click here to this sample page.

  • 3 Browse by Taxonomy
  • 3.1 
  • This page shows the 31 phyla of bacteria. Choose one of phylum,the six levels of classification (phylum, class, lineage_order, family, genus, species) are shown when choose Taxonomic Group.

  • 3.2  Browse by Family
  • Genome Bar version shows the sorted number of CAZymes of each taxnomy|gcf_id, where all of the numbers are sorted. Choose one of them, we go to the "Browse by Genome" page.

  • 3.3  Genomes Bar
  • Genomes bar shows the sorted number of CAZymes of each taxnomy|gcf_id, where all of the numbers are sorted. Choose one of them, we go to the "Browse by Genome" page.

  • 3.4 metadata
  • If the genome has metadata in our database, the page will show the metadata of all lot of properties.
  • 3.5 Browse by Genome
  • This page has two tabs(Browse by Family, CAZyme list).
    The CAZymes number of each family of this Genome are caculated and shown in the bracket behind each family name.

    CAZyme lists (sample)shows the detailed information of CAZyme.
    1.Red rectangle 1 links to the protein page (sample).
    2.Red rectangle 2 links to the NCBI database(sample).
    3.Red rectangle 2 links to the NCBI taxnomy page(sample).

* Protein page     click here to this sample page.

  • 4 Browse by Family
  • We can choose different family from the fourthpart of index page, this families contains:
    - Auxiliary Activities (AAs) : redox enzymes that act in conjunction with CAZymes.
    - Carbohydrate-Binding Module (CBMs) :a contiguous amino acid sequence within a carbohydrate-active enzyme with a discreet fold having carbohydrate-binding activity.
    - Carbohydrate Esterases (CEs) : hydrolysis of carbohydrate esters.
    - Glycoside Hydrolases (GHs) : hydrolysis and/or rearrangement of glycosidic bonds.
    - GlycosylTransferases (GTs) : formation of glycosidic bonds.
    - Polysaccharide Lyases (PLs) : non-hydrolytic cleavage of glycosidic bonds.
    Choose one of them(select CMB74 as example), then go to he family page.
  • 4.1 Family page
  • This page contain all of species|genome of the CBM74 family, # of CAZymes of one species are caculated, the bar plot are sorted.
    Chooese one of them, we go to the genome_family page.
  • 4.2 Genome Family page
  • This page shows the all of the proteins of GCF_000737885.1 genome of CBM74 family.
    Click one of them, we go to the protein page(protein help page).

* Protein page     , click here to this sample page.

  • 5.1 Basic Information
  • The basic information descibes the property of CAZyme and its genome. All of them have the outer link to related databse.
  • 5.2 Genomic Context
  • Jbrowse use the GFF3 file to label the gene location of one genome.
  • 5.3 Full sequence
  • We support the full sequence of protein and the download link.
  • 5.4 Enzyme prediction
  • Ensemble Enzyme Prediction Pipeline annotates protein sequences with Enzyme Function classes comprised of full, four-part Enzyme Commission numbers and MetaCyc reaction identifiers.
  • 5.5 CAZyme Signature Domains
  • CAZyme Signature Domains These are the family cazyme domains, the sequence was labled by red rectangles, which are different doamins.
  • 5.6 CDD search
  • RPS-BLASTwas run with full-length CAZyme protein sequences as query and the NCBI CDD database (hyperlink) as the database. CDD is a protein annotation resource that contains well annotated sequence models. E-value < 1e-2 was used to keep the CDD domain match.
  • 5.7 CAZyme Hits
  • We use the Diamond search against DBCAN CAZyme Database instead of blast because of its fast search speed.
  • 5.8 PDB hits
  • Protein Data Bank protein sequence database was downloaded. Diamond was run with full-length proteins as query to search for homologous protein matches. Value <10 (E-value< 1e-5)was used to keep significant match. If there is a significant PDB match, that means the browsed protein has a close homolog with 3D structure solved. Orthologous groups.
  • 5.9 Swissprot hits
  • Swiss-Prot database was downloaded. Diamond was run with full-length proteins as query to search for homologous protein matches. Value <10 (E-value< 1e-5)was used to keep significant match. If there is a significant PDB match, that means the browsed protein has a close homolog with 3D structure solved. Orthologous groups.
  • 5.10 Signalp annotations
  • Swiss-Prot database was downloaded. Signal peptide was predicted using SignalP. Both Gram-negative bacteria and Gram-positive bacteria are selected and their predicted results are offerd on the page(Gram-positve-SP and Gram-negative-SP).
  • 5.11 TMHMM
  • Full-length sequences were taken to run TMHMMto predict the transmembrance regions.
  • 5.12 PPSpred
  • The secondary structure prediction was predicted by PPSpred.
    We use different color(red - coil, green - helix, black - beta) to represent the different location.
  • 5.13 lipoP
  • The lipoproteins prediction was predicted by lipoP.