Metabolic Modeling Tutorial
discounted EARLY registration ends Dec 31, 2014
Metabolic Modeling Tutorial
discounted EARLY registration ends Dec 31, 2014
Metabolic Modeling Tutorial
discounted EARLY registration ends Dec 31, 2014
Metabolic Modeling Tutorial
discounted EARLY registration ends Dec 31, 2014
Metabolic Modeling Tutorial
discounted EARLY registration ends Dec 31, 2014

News

Update History

Information

Introduction to BioCyc
5500 Databases
Guided Tour
Pathway Tools Software
Pathway Tools Blog
Publications
Linking to BioCyc
Webinars
Contact Us

Services

Subscribe to BioCyc
Metabolic Posters
Genome Posters
Software/Database Downloads
Registry
Web Services

Guide to the BioCyc Database Collection

Contents

    1  Introduction

    2  The BioCyc Databases

    3  Methodology for Generation of Tier 3 PGDBs

    4  BioCyc Ortholog Data

    5  Discovering Functional Gene Clusters Using Genome Context Methods
        5.1  How are FunGCs Computed?
        5.2  BioCyc Ortholog Data and the Reference Genomes
        5.3  Computing Pairwise Functional Linkage Scores
        5.4  Computation of Functionally Linked Gene Clusters
        5.5  How to Use Functional Gene Cluster Information in BioCyc

    6  Adoption and Curation of BioCyc Databases

    7  Comparison of BioCyc to Other Pathway Databases

    8  How Do I Learn More About PGDBs and BioCyc?

    9  Acknowledgements

1  Introduction

This document provides an overview of the BioCyc collection of Pathway/Genome Databases.

Although its content is limited at the current time, it will expand over time to cover additional aspects of BioCyc.

The information in this document pertains to all BioCyc databases (DBs), and to most other DBs created using the Pathway Tools software. More detailed information about specific members of the BioCyc family is available as follows:

2  The BioCyc Databases

The BioCyc collection of Pathway/Genome Databases (PGDBs) provides electronic reference sources on the pathways and genomes of many organisms. BioCyc databases describe organisms with sequenced genomes. BioCyc is primarily microbial. In addition, BioCyc contains databases for humans; for important model organisms such as yeast, fly, and mouse; and for other eukaryotes whose PGDBs have been curated. One reason for collecting all these PGDBs together within BioCyc is to enable the comparative analyses that become possible when multiple PGDBs are available within one site (see Tools → Comparative Analysis). PGDBs have been created for many other organisms, including microbes, fungi, plants (see PlantCyc.org and animals [details].

The databases (DBs) within the BioCyc collection are organized into tiers according to the amount of manual review and updating they have received.

  • Tier 1 PGDBs have been created through intensive manual efforts, and receive continuous updating [details of Tier 1]. The BioCyc Tier 1 DBs are:

    • EcoCyc, which describes Escherichia coli K-12.

    • MetaCyc, which describes experimentally elucidated enzymes and metabolic pathways from more than 1,800 organisms. MetaCyc does not seek to model the complete metabolic network of any one organism, but to provide a comprehensive collection of experimental pathways.

    • AraCyc

    • HumanCyc

    • LeishCyc

    • YeastCyc

  • Tier 2 PGDBs were computationally generated by the PathoLogic program, and have undergone moderate amounts of review and updating. There are 39 PGDBs in Tier 2. [details of Tier 2]

  • Tier 3 PGDBs were computationally generated by the PathoLogic program, and have undergone no review and updating. There are 2,947 PGDBs in Tier 3. [details of Tier 3]

We encourage scientists to adopt the Tier 2 and Tier 3 DBs for ongoing curation and updating.

Most microbial PGDBs within BioCyc have been generated computationally by SRI and are regenerated every 6-12 months to take advantage of improvements in our pathway prediction algorithms and in the MetaCyc pathway database. The PGDBs within BioCyc that have been provided by outside groups are updated with variable frequencies. Usually the date on which a PGDB was generated or last updated can be determined by selecting that PGDB as the current PGDB and then viewing the page at Tools → Reports → Summary Statistics or Tools → Reports → History of Updates.

3  Methodology for Generation of Tier 3 PGDBs

Tier 3 PGDBs are generated using an automated computational pipeline. The genome data source for a specific PGDB can be determined by selecting that PGDB as current database and then executing Tools → Summary Statistics. Genome data sources include:

  1. Most Tier 3 genomes are obtained from RefSeq: ftp://ftp.ncbi.nih.gov/genomes/Bacteria

  2. Human Microbiome Project genomes are obtained from: ftp.ncbi.nih.gov/genomes/HUMAN_MICROBIOM/Bacteria

The processing steps used to create BioCyc from an annotated genome are as follows.

  1. Each genome is converted to files in the PathoLogic format that serves as input to the PathoLogic component of the Pathway Tools software.

  2. PathoLogic generates a BioCyc PGDB for the organism (see further details below). The following PathoLogic inference components are executed: pathway predictor, pathway hole filler, operon predictor, transport inference parser.

  3. A combination of automated and manual quality assurance checks are applied to the generated PGDBs to ensure they are free of errors. Some genomes are not included in BioCyc because their data fail these quality checks, e.g., some genomes have such a small number of proteins with predicted functions that they are not included.

Although all processing steps are overseen by our staff, no human curation is applied to the Tier 3 PGDBs.

The specific processing steps used to generate Tier 3 PGDBs are as follows.

  • Predict metabolic pathways [7].

  • Predict operons [13].

  • Predict pathway hole fillers [6].

  • Predict transporter reactions [9].

  • Generate an organism-specific metabolic map diagram (cellular overview) [8].

  • Import predicted and experimental protein feature data (e.g., protein phosphorylation sites, active sites, and transmembrane regions) and Gene Ontology terms from UniProt [1]. (Note that UniProt lacks these data for some organisms.)

  • Import protein subcellular localizations from PSORTDB [14] for available organisms.

  • Compute ortholog relationships among proteins within all BioCyc PGDBs.

  • Create web links from each BioCyc genome and each BioCyc gene/protein, to as many external bioinformatics DBs as possible, including to RefSeq, UniProt, and to additional DBs via the UniProt ID. The full list of DBs to which BioCyc PGDBs can include links are:

    • Cazy, dip, disprot, interpro, mint, panther, pdb, phosphosite, pride, prints, prodom, proteinmodelportal, prosite, pfam, smart, smr, string

4  BioCyc Ortholog Data

Operationally, BioCyc uses the term “ortholog” not in the strict evolutionary sense defined by Fitch [5] but in the looser sense of genes that are likely to be counterparts of one another because they are the most closely related in a pair of organisms. Although such genes are likely to be evolutionarily related in the sense of Fitch, we do not perform a detailed evolutionary analysis to compute BioCyc orthologs.

BioCyc ortholog information is generated by running NCBI Blast (“legacy” Blast version 2.2.23) pair-wise between all proteomes of all PGDBs. We do not compute sequence similarity information for RNA genes or pseudo genes, therefore BioCyc does not currently contain orthology information for such genes.

We use an E-value cut-off of 0.001 when running Blast, with all other parameters being left at their default settings. We define two proteins A and B as orthologs if protein A from proteome PA and protein B from proteome PB are bi-directional “best” Blast hits of one another, meaning that protein B is the best Blast hit of protein A within proteome PB, and protein A is the best Blast hit of protein B within proteome PA. In rare cases, protein A might have multiple orthologs in proteome PB, as explained below.

The “best” hit of protein A in proteome PB is defined by finding the minimal E-value among all hits in proteome PB in the Blast output, and collecting all the hits for A in proteome PB that have the same minimal E-value. In other words, ties are possible, as in the case of exact gene duplications. We attempt to break ties using two methods: taking the hit with the maximum alignment length; and then taking the hit with the maximum alignment amino acid residue identity. For the first method, we compare the alignment lengths among all the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum alignment length is selected. For the second method, we compare the number of identical amino acid residues in the alignments between protein A and the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum number of identical amino acid residues is selected. In the case that ties still remain (as in the case of exact gene duplications), all ties are included in the final set of orthologs used by BioCyc. Thus, protein A could have multiple orthologs in PB, such as if multiple proteins B1, B2, etc, exist in PB, and have exactly the same regions align against protein A. BioCyc does not calculate paralogs.

Although orthologs are defined for most pairs of PGDBs in BioCyc, some PGDBs lack ortholog data. A given pair of organisms might lack orthologs within BioCyc for the following possible reasons.

  • One of the organism PGDBs might not have sequence data

  • One of the organism PGDBs may have been built by a third-party (outside SRI) – due to time constraints some third-party PGDBs do not have computed orthologs

  • The two organisms are evolutionarily distant and their genes do not have any bi-directional best Blast hits

5  Discovering Functional Gene Clusters Using Genome Context Methods

This aspect of BioCyc is meant to enable the discovery of new pathways, and to provide clues as to the functions of genes of unknown function. A functionally linked gene cluster, or FunGC, refers to a group of genes that are estimated by the system to operate in a common pathway, a common cellular process, or a common protein complex (note the term “cluster” does not imply physical clustering on a chromosome). There is no guarantee as to what type of functional linkages exist among the genes in a FunGC. FunGCs are predicted using comparative-genomics methods called genome-context methods [4312210] that search for patterns across hundreds of genomes. Details of the methods used by BioCyc are described in the next sections.

BioCyc includes FunGCs for 10 genomes as of October 2014. In later releases we plan to expand the number of genomes for which FunGCs are available.

For example, E. coli genes bioA, bioB, bioD, and bioF, which participate in the metabolic pathway for biotin biosynthesis, are all predicted by our method to form a FunGC. Another E. coli FunGC consists of genes xylF, xylH, xylA, and xylB. The xylF and xylH genes encode a xylose ABC transporter, while xylA and xylB encode a xylose isomerase and a xylulokinase, respectively, which participate in a xylose degradation pathway. The full list of E. coli FunGCs is available at http://biocyc.org/ECOLI/genome-context.

We will use E. coli gene yqeB to illustrate BioCyc FunGCs in more detail. yqeB is a gene of unknown function; it is annotated as a conserved protein with a NAD(P)-binding Rossman fold. yqeB forms a FunGC with genes xdhC (xanthine dehydrogenase, Fe-S subunit), xdhB (xanthine dehydrogenase subunit, FAD-binding domain), xdhA (xanthine dehydrogenase subunit), and paoA (aldehyde dehydrogenase, Fe-S subunit). Thus, one might infer because of its assignment to a common FunGC with these four other genes that yqeB forms a common pathway with these genes.

5.1  How are FunGCs Computed?

FunGCs are computed by a two-step process. Given all genes in a given organism,

  • Step 1: Compute a pairwise functional-linkage score between every pair of genes in the organism.

  • Step 2: Compute FunGCs by searching for highly connected sets of functionally linked genes from Step 1

In a moment we will consider these steps in more detail. But first we discuss the reliance of these methods on the ortholog data within BioCyc.

5.2  BioCyc Ortholog Data and the Reference Genomes

Computation of functional linkages between two genes G1 and G2 is based on comparative-genomics analyses of the orthologs of G1 and G2 that exist in many genomes. We will refer to a given such pair of orthologs as G1′ and G2′.

BioCyc computes orthologous genes (e.g., that G1 and G1′ are orthologs) as bidirectional best hits obtained from BLAST outputs among most (but not yet all) pairs of genomes within BioCyc.

To determine whether yqeB and xdhA are functionally linked, we examine the orthologs of those two genes across a special set of 1800 BioCyc genomes that we call the reference genomes. The reference genomes are chosen to be taxonomically diverse because selection of multiple closely related species could bias the relationships that we consider during computational of functional relatedness (e.g., because gene order will be highly conserved in highly related genomes).

5.3  Computing Pairwise Functional Linkage Scores

To estimate whether genes G1 and G2 are functionally linked we consider two factors. First is the co-occurrence of their orthologs in many genomes. Are the orthologs typically found in the same genomes? If so their pairwise functional-linkage score will be higher. But the algorithm also considers how many genomes they are found in. For example, if orthologs of G1 and G2 are found in virtually every reference genome, they would have no choice but to co-occur, thus decreasing their functional linkage score. The genome-occurrence profile for a gene is called its phylogenetic profile, and is represented as a vector of ones and zeros — a one when an ortholog of G is found in a given genome, and a zero when no ortholog of G is found in a given genome. If the phylogenetic profile vectors of G1 and G2 are similar then we infer that the genes are likely to be functionally linked. The exact phylogenetic profile method we use is called pp-pval (znorm) in [4].

The second factor used to infer functional linkage is the spatial proximity of orthologs of G1 and G2. That is, when orthologs G1′ and G2′ do co-occur in the same genome, are G1′ and G2′ frequently close to one another on the same chromosome (e.g., separated by few genes or adjacent)? If so, this conserved genome proximity also provides evidence of functional relatedness. The genomic proximity method we use is a variation of the one called gn-lnX (znorm) in [4], where correlation between organisms is taken into account, with a resulting improvement in performance. Given this improvement and the fact that both methods are fused to get an additional boost in performance, the resulting score is significantly better than the best method presented in that paper. To our knowledge, this makes the fused scores used for the system, the best genome context scores in the literature.

Pairwise functional-linkage information alone is useful in providing clues regarding the function of one gene based on the known functions of other genes that it is related to. Even more useful is to find larger sets of genes that are functionally related to one another and that therefore may function in a common pathway or biological process.

5.4  Computation of Functionally Linked Gene Clusters

The method described in this section computes Functionally Linked Gene Clusters, or FunGCs. The algorithm starts with the set of pairwise functional linkages computed in the previous step. The linkages are encoded in a graph whose nodes are the genes of a given organism, and whose edges are the inferred functional linkages among those genes. The algorithm searches the full graph to find highly interconnected subgraphs (they need not be fully interconnected), that is, sets of genes that have a high fraction of mutual functional linkages. Some genes will be removed from the set if they are not sufficiently highly interconnected to other genes in the set. The algorithm we use for computing FunGCs as highly interconnected graphs is CFinder [11].

When computing FunGCs we must make empirical choices regarding two thresholds. One is the strength of pairwise functional linkage that we deem sufficient for inclusion in the network. The other one is the clique size CFinder uses as starting point, which, in turn, determines the minimum size of the resulting clusters (currently 4). We may perform some adjustment of these thresholds over time to try to seek a proper balance between finding sufficient numbers of interesting new pairwise linkages and FunGCs, versus limiting the number of incorrect pairwise linkages and FunGCs. The current values were chosen to give a good trade-off on E. coli̇ We have performed extensive evaluation and tuning of our methods [34], and in fact the methods currently used for BioCyc have improved somewhat over the published methods. For example, our use of CFinder is a new addition; it does not require full connectivity among genes within a cluster as did the previous method. And because highly-overlapping cliques are merged by CFinder into a bigger groups, this code results in fewer groups which overlap each other much less and are also more accurate (make better predictions of the pathways or protein complexes that we know of in the target organisms). CFinder is run on a network that was created with a very high threshold on the pairwise scores of 0.9995 resulting in relatively few high-confidence groups being found. Inspection of the FunGC listing page for E. coli will show that the FunGCs present in BioCyc for genes of known function overlap very strongly with known pathways.

5.5  How to Use Functional Gene Cluster Information in BioCyc

FunGC information is available on two new BioCyc page types:

  • Genome-Context Analysis Page

  • FunGC listing page

5.5.1  Genome-Context Analysis Page

This page is available for genes with at least one pairwise functional linkage that exceeds our minimum cutoff.

Navigate to this page from the gene page by clicking on “Genome Context Analysis” in the right-sidebar menu of the gene page. Not every gene page will have such a page available.

The Genome-Context Analysis Page has two sections: the first section lists the FunGC(s) to which a gene gene has been assigned. The second section lists the pairwise functional linkages that have been computed for a gene. Not every functionally linked gene will appear in a FunGC.

The first section depicts the FunGC as a graph with edge color indicating the strength of pairwise functional linkages. The tabs to the right of the FunGC diagram produce different views of the genes within a FunGC, such as showing any known pathways and operon structures containing these genes.

5.5.2  FunGC Listing Page

The Functional Gene Clusters page lists all FunGCs available for an organism.

Navigate to this page from the Genome-Context Analysis Page for any gene, using the link near the top (“See all FunGCs for this organism”).

6  Adoption and Curation of BioCyc Databases

As genomic information becomes available for an ever growing number of organisms, it becomes essential to have up-to-date knowledge resources for organisms with large experimental communities. For example, how can an experimentalist hope to accurately interpret gene-expression microarray results without having all available knowledge about the functions of each gene product in the organism at their fingertips? We see it as essential that experts on the biology of specific organisms begin curation of organism-specific databases describing the genome and biochemical networks of those organisms.

All of the Tier 3 BioCyc PGDBs, and some of the Tier 2 PGDBs, are available for adoption. The DBs are available under an open license agreement, meaning that they are freely available to all, they may be modified, and they may be freely redistributed.

The Pathway Tools software system on which BioCyc is based provides a package of graphical editing tools that allow biologists to create and update database entities such as genes, proteins, metabolic pathways, metabolites, operons, and transcriptional regulatory interactions. Scientists can also use Pathway Tools to publish their PGDBs on their own website, in a manner analogous to the BioCyc website (which is powered by Pathway Tools).

Suggested steps for adoption of PGDBs are as follows.

  • Initiate the adoption process by contacting biocyc-support at ai dot sri dot com.

  • Begin to refine the adopted PGDB by comparing its contents against known experimental knowledge for the organism. Update the PGDB to reflect errors such as false-positive pathway predictions, and to add experimentally known pathways that are not in the PGDB. Update gene names and protein functions to reflect recent discoveries.

  • Enlist colleagues who are specialists in different aspects of the organism’s biology to help. A group of collaborating curators can all run Pathway Tools on separate computers to update a PGDB stored in a commonly accessed relational database manager such as MySQL.

  • Publish the PGDB on your website using Pathway Tools, or request that SRI include your PGDB within BioCyc. To facilitate this process, SRI will ask you to deposit your PGDB in the database registry as described in the previous step.

For more details on post-adoption care of a PGDB, see Post-Adoption Care of a BioCyc Database.

7  Comparison of BioCyc to Other Pathway Databases

Please see the comparison section of the MetaCyc Guide .

8  How Do I Learn More About PGDBs and BioCyc?

BioCyc has a number of advanced operations including a number of comparative genomics tools, programs for analysis of high throughput datasets such as gene-expression data, and metabolic network analysis tools.

The following information resources describe BioCyc in more detail.

9  Acknowledgements

BioCyc is grateful for the following groups:

  • ChemAxon for use of the Marvin chemoinformatics tool

References

[1]   The Universal Protein Resource (UniProt) 2009. Nuc Acids Res, 37(Database issue):D169–174, 2009.

[2]   A.J. Enright, I. Iliopoulos, N.C. Kyrpides, and C.A. Ouzounis. Protein interaction maps for complete genomes based on gene fusion events. Nature, 402:86–90, 1999.

[3]   L. Ferrer, J.M. Dale, and P.D. Karp. A systematic study of genome context methods: calibration, normalization and combination. BMC Bioinformatics, 11:493, 2010.

[4]   L. Ferrer, A.G. Shearer, and P.D. Karp. Discovering novel subsystems using comparative genomics. Bioinformatics, 27:2478–85, 2011.

[5]   W.M. Fitch. Distinguishing homologous from analogous proteins. Systemic zoology, 19:99–113, 1970.

[6]   M.L. Green and P.D. Karp. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics, 5(1):76, 2004.

[7]   P.D. Karp, S.M. Paley, M. Krummenacker, M. Latendresse, J.M. Dale, T. Lee, P. Kaipa, F. Gilham, A. Spaulding, L. Popescu, T. Altman, I. Paulsen, I.M. Keseler, and R. Caspi. Pathway Tools version 13.0: Integrated software for pathway/genome informatics and systems biology. Brief Bioinform, 11:40–79, 2010. doi: 10.1093/bib/bbp043.

[8]   M. Latendresse and P. D. Karp. Web-based metabolic network visualization with a zooming user interface. BMC Bioinformatics, 12:176–84, 2011.

[9]   T.J. Lee, I. Paulsen, and P.D. Karp. Annotation-based inference of transporter function. Bioinformatics, 24:i259–67, 2008.

[10]   R. Overbeek, M. Fonstein, M. D’Souza, G.D. Pusch, and N. Maltsev. Use of contiguity on the chromosome to predict functional coupling. In Silico Biol., 1(2):93–108, 1999.

[11]   G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435:814–8, 2005.

[12]   M. Pellegrini, E.M. Marcotte, M.J. Thompson, D. Eisenberg, and T.O. Yeates. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc National Academy of Sciences, USA, 96:4285–8, 1999.

[13]   P. Romero and P.D. Karp. Using functional and organizational information to improve genome-wide computational prediction of transcription units on Pathway/Genome Databases. Bioinformatics, 20:709–17, 2004.

[14]   N. Y. Yu, J. R. Wagner, M. R. Laird, G. Melli, S. Rey, R. Lo, P. Dao, S. C. Sahinalp, M. Ester, L. J. Foster, and F. S. Brinkman. PSORTb 3.0: Improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics, 26(13):1608–15, 2010.