Pathway/Genome Database Concepts Guide

2 What Mechanisms Exist for Accessing PGDB Data?

4  PGDB Reactions
4.1  Reaction Types
4.2  Reaction Direction
4.3  EC Numbers
4.4  Reaction Locations and Compartments of Metabolites
4.5  Reaction Balancing and Protonation State in PGDBs
4.5.1  Background and Motivations
4.5.2  Protonation State Normalization
4.5.3  Computational Reaction Balancing for Hydrogen
4.5.4  Statistics on Reaction Balance and Protonation
4.6  Reaction Atom Mappings
4.6.1  Downloading Atom Mappings
4.6.2  Atom Mapping Encoding
4.6.3  Atom Mapping Data using SMILES
4.6.4  Canonicalization of Atom Mapping Encoding

5 Computing Gibbs Free Energy of Compounds and Reactions in MetaCyc

6  PGDB Pathways
6.1  How are Pathway Boundaries Defined?
6.2  Super Pathways and Base Pathways
6.3  Pathway Variants
6.4  Conspecific and Chimeric Pathways
6.5  Do We Force a Pathway View of the Metabolic Network?

7  Ontologies Used in PGDBs
7.1  Evidence Code Ontology
7.2  Gene Ontology
7.3  Cell Component Ontology
7.4  Enzyme Commission Ontology
7.5  Pathway Ontology
7.6  Regulatory Mechanism Ontology

8  Quality Checking of PGDB Data
8.1  Editor Tool Quality Checks
8.2  Consistency Checker Checks
8.3  Checks on Newly Generated PGDBs

9 How to Learn More

1 Overview

This document describes concepts involved in Pathway/Genome Databases (PGDBs) managed by the Pathway Tools software, such as those in the BioCyc PGDB collection.

2 What Mechanisms Exist for Accessing PGDB Data?

PGDB data is accessible in several ways, which are described in more detail on the downloads page.

Query and visualization access is available through the BioCyc Web site and other Pathway Tools based Web sites.
Data files for PGDBs are available for download in multiple file formats.
SRI provides a downloadable "software/database bundle" that couples Pathway Tools with selected BioCyc PGDBs. It supports querying, visualization, and analysis of PGDBs both in a desktop mode of operation and as a Web server. It also allows users to create their own Pathway/Genome Databases. The software/database bundle includes functionality not available through the BioCyc Web site.
Web services support many PGDB data queries and visualization operations.
The software/database bundle also allows users to query PGDB data via APIs for the Python, Java, Perl, and Common Lisp languages.

3 Pan-Genome PGDBs

As DNA sequencing has become very affordable, many fairly similar strains have been sequenced for some popular organisms of research interest, leading to an explosion of mostly similar data, but hopefully containing some interesting differences somewhere. A Pan-Genome PGDB collects similar strains into one combined PGDB, to highlight the core genomic essence of a species, and to show what fringe parts of the strain genomes are undergoing active evolution and diversification. The core, shared genes are determined by orthology.

The following steps are taken to construct a Pan-Genome PGDB:

An empty PGDB is created and initialized with the schema from MetaCyc.
A set of strain PGDBs, which have orthologs with each other, has to be chosen. For example, all strains under a species in a taxonomic classification could be chosen.
A so-called lead PGDB has to be determined among the set of strain PGDBs. Usually, many strains of a species have been sequenced because an important model organism was investigated historically and sequenced many years ago. This historic strain is the natural candidate for the lead organism of the Pan-Genome PGDB. Because of its status as a model organism, its PGDB may also have received substantial expert curation and thus likely contains high-quality data.
Into the Pan-Genome PGDB, import from the lead organism’s PGDB all replicons, genes, proteins, reactions, and pathways.
Visit each other strain PGDB from the chosen set, and check for each gene in turn, whether it encodes a protein and whether it is orthologous to any of the genes already residing in the Pan-Genome PGDB. If an ortholog match is found, record this fact for the gene in the Pan-Genome PGDB by adding a link pointing to the ortholog.

If no ortholog was found, then import the gene from the other strain PGDB, along with its proteins and any reactions and pathways that are not yet in the Pan-Genome PGDB. Finally, add the nucleotide sequence of the newly added gene to an "artificial replicon", which accumulates all these other genes (separated by spacers consisting of several N nucleotides).
Build the Cellular Overview.
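The construction steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Pathway Tools implementation: the `Gene` class, the `build_pan_genome` function, the `find_ortholog` callback, and the spacer length are all hypothetical, and the sketch tracks genes only (proteins, reactions, and pathways would be imported alongside them). It also assumes that genes that do not encode a protein are skipped, which the steps above do not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    gene_id: str
    strain: str
    sequence: str
    encodes_protein: bool = True
    # IDs of orthologous genes found in other strains
    orthologs: list = field(default_factory=list)


def build_pan_genome(lead_genes, other_strains, find_ortholog, spacer="N" * 10):
    """Merge strain gene lists into one pan-genome gene table.

    lead_genes:    genes of the lead PGDB (imported wholesale)
    other_strains: list of gene lists, one per remaining strain
    find_ortholog: hypothetical callable (gene, candidates) -> Gene or None
    """
    pan_genes = {g.gene_id: g for g in lead_genes}
    artificial_replicon = []  # sequences of imported genes without an ortholog

    for strain_genes in other_strains:
        for gene in strain_genes:
            if not gene.encodes_protein:
                continue  # assumption: non-coding genes are not merged
            match = find_ortholog(gene, list(pan_genes.values()))
            if match is not None:
                # Record a link from the pan-genome gene to its ortholog.
                match.orthologs.append(gene.gene_id)
            else:
                # No ortholog: import the gene and extend the artificial replicon.
                pan_genes[gene.gene_id] = gene
                artificial_replicon.append(gene.sequence)

    return pan_genes, spacer.join(artificial_replicon)
```

Because imported genes are added to `pan_genes` immediately, genes from later strains can be matched against genes contributed by earlier non-lead strains, in line with the phrase "already residing in the Pan-Genome PGDB".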

When viewing the Cellular Overview for a Pan-Genome PGDB, two special highlighting commands are available. Highlighting the Core Genes shows all the reactions of the genes that are shared among all the strain PGDBs; in other words, each such gene has orthologs in all the other strains. Highlighting the Unique Genes shows all the reactions of the genes that have no orthologs at all and are thus contributed by a single strain. An additional set of genes is shared by some, but not all, strains. Currently, there is no simple way to show these intermediate levels of sharing.
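As a rough illustration of the distinction between core, unique, and partially shared genes, the following Python sketch classifies genes by the set of strains in which they (or their orthologs) occur. The function name `classify_genes` and the input layout are assumptions made for this example and are not part of Pathway Tools.

```python
def classify_genes(ortholog_strains, all_strains):
    """Split pan-genome genes into core, unique, and partially shared sets.

    ortholog_strains: dict mapping each pan-genome gene ID to the set of
                      strains in which the gene or one of its orthologs occurs
    all_strains:      set of all strain names in the pan-genome
    """
    core, unique, partial = set(), set(), set()
    for gene_id, strains in ortholog_strains.items():
        if strains == all_strains:
            core.add(gene_id)      # present in every strain
        elif len(strains) == 1:
            unique.add(gene_id)    # contributed by a single strain
        else:
            partial.add(gene_id)   # shared by some, but not all, strains
    return core, unique, partial
```

The third set corresponds to the genes for which no dedicated highlighting command is described here.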

4 PGDB Reactions

This section introduces concepts relevant to PGDB reactions.

4.1 Reaction Types

PGDBs can encompass multiple types of reactions, which are classified using the PGDB reaction ontology. Most “standard” metabolic reactions are instances of the class Chemical-Reactions, which is a child of class Simple-Reactions. The main classes in the reaction ontology are as follows.

Simple-Reactions: Reactions that are considered indivisible at the current level of abstraction.
- Binding-Reactions: Binding reactions involve binding or dissociation of non-covalent bonds; they do not involve formation or dissociation of covalent bonds. Example: association of two protein monomers to form a complex.
- Chemical-Reactions: Chemical-Reactions are those for which at least one substrate molecule is chemically modified, meaning that either a chemical bond (covalent, ionic or coordination) is formed and/or broken, or that a redox modification has occurred.
- Redox-Half-Reactions: Redox half reactions are elementary reactions, in which explicitly stated electrons are reducing an oxidized molecular species. These reactions do not stand alone, because electrons do not occur freely. Instead, a half reaction must be paired with another half reaction to form a complete, overall transformation.
- Transport-Reactions: Transport reactions are reactions in which at least one species is transported (passively or actively) across a membrane. The species may or may not be chemically modified in the course of the reaction.
Complex-Processes: Instances of this type of reaction are known to be multi-step processes, but the individual steps are not currently represented. Example: Transcription.
Unknown-Conversions: Reactions for which it is not known whether they are simple reactions or complex processes.

4.2 Reaction Direction

How do PGDBs handle reaction direction?

In a PGDB, each reaction is stored as an object that is an instance of the Reactions class. That object includes two slots, Left and Right, each of which contains a list of chemicals that are the reactants and products (not necessarily respectively) of the reaction. That is, in some cases the Left slot stores the reaction reactants, in other cases the Left slot stores the products.

Following the conventions used by the IUBMB Enzyme Commission [1], the direction in which a reaction is stored in a PGDB has no implication for the physiological directionality of that reaction. In the IUBMB EC system, all reactions within a given class are written in a single consistent direction (e.g., all hydrolases are written in the hydrolysis direction). Reactions categorized by the IUBMB EC system are stored within a PGDB such that the EC left side of the reaction is stored in the Left slot.

The PGDB framework for defining reaction direction is designed both for flexibility in encoding a diverse range of biological situations, and to minimize the work curators must do to define reaction directions. The diverse biological situations to be encoded include the notions that the directionality of some reactions is invariant, whereas other reactions have a directionality that depends on the enzyme that is catalyzing the reaction, and on the organism in which the reaction occurs.

The directionality of some reactions is explicitly stored within the PGDB. The directionality of other reactions is not stored, but is computed on demand by Pathway Tools. The best way to query the directionality of a reaction is via the slot Reaction-Direction in reaction objects. Even when no value is explicitly stored in this slot, a method attached to this slot will attempt to compute a value for the slot. Possible values of this slot are as follows. The values PHYSIOL-LEFT-TO-RIGHT, IRREVERSIBLE-LEFT-TO-RIGHT, and LEFT-TO-RIGHT mean that the Left slot should be treated as containing the reactants; the values PHYSIOL-RIGHT-TO-LEFT, IRREVERSIBLE-RIGHT-TO-LEFT, and RIGHT-TO-LEFT mean that the Right slot contains the reactants.

REVERSIBLE: Reaction occurs in both directions in physiological settings.
PHYSIOL-LEFT-TO-RIGHT or PHYSIOL-RIGHT-TO-LEFT: The reaction occurs in the specified direction in physiological settings, because of several possible factors including the energetics of the reaction, local concentrations of reactants and products, and the regulation of the enzyme or its expression.
IRREVERSIBLE-LEFT-TO-RIGHT or IRREVERSIBLE-RIGHT-TO-LEFT: For all practical purposes, the reaction occurs only in the specified direction in physiological settings, because of chemical properties of the reaction.

LEFT-TO-RIGHT or RIGHT-TO-LEFT: The reaction occurs in the specified direction in physiological settings, but it is unknown whether the reaction is considered irreversible.
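The interpretation of these values can be summarized in a small Python sketch. The function name `reactants_and_products` is hypothetical, and the handling of REVERSIBLE (returning the stored orientation) is an assumption made for illustration only.

```python
LEFT_TO_RIGHT = {"PHYSIOL-LEFT-TO-RIGHT", "IRREVERSIBLE-LEFT-TO-RIGHT", "LEFT-TO-RIGHT"}
RIGHT_TO_LEFT = {"PHYSIOL-RIGHT-TO-LEFT", "IRREVERSIBLE-RIGHT-TO-LEFT", "RIGHT-TO-LEFT"}


def reactants_and_products(left, right, direction):
    """Return (reactants, products) given the Left and Right slot values
    and a Reaction-Direction value."""
    if direction in LEFT_TO_RIGHT:
        return left, right
    if direction in RIGHT_TO_LEFT:
        return right, left
    if direction == "REVERSIBLE":
        # Assumption: for a reversible reaction, report the stored orientation.
        return left, right
    raise ValueError(f"unknown Reaction-Direction value: {direction!r}")
```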

The software computes values of the Reaction-Direction slot by integrating information within the enzymes and pathways associated with the reaction. Consider these examples where no information is stored in the Reaction-Direction slot of reaction R:

A single enzyme E is associated with R, and the Enzymatic-Reaction that connects E with R has a Reaction-Direction slot value of LEFT-TO-RIGHT; the software infers that the directionality of R is LEFT-TO-RIGHT.
Two enzymes E1 and E2 are associated with R. E1 has a Reaction-Direction of LEFT-TO-RIGHT; E2 has a Reaction-Direction of RIGHT-TO-LEFT. The software infers that R is reversible.
R is associated with a single pathway P, and within that pathway, R proceeds in the right-to-left direction. The software infers that the directionality of R is RIGHT-TO-LEFT.

In general, if the software finds information indicating that R proceeds in both the left-to-right and the right-to-left directions, then it infers that R is reversible.
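The general rule can be sketched as follows. This is a simplified illustration, not the method attached to the Reaction-Direction slot: the function name `infer_direction` is hypothetical, the PHYSIOL- and IRREVERSIBLE- qualifiers are collapsed into plain directions, and treating any REVERSIBLE evidence as decisive is an assumption.

```python
def infer_direction(evidence):
    """Infer a reaction's direction from the directions recorded for its
    enzymatic reactions and pathways.

    evidence: iterable of Reaction-Direction values, one per enzyme or pathway
    """
    seen = set()
    for value in evidence:
        if value == "REVERSIBLE":
            return "REVERSIBLE"  # assumption: reversible evidence is decisive
        if value.endswith("LEFT-TO-RIGHT"):
            seen.add("LEFT-TO-RIGHT")
        elif value.endswith("RIGHT-TO-LEFT"):
            seen.add("RIGHT-TO-LEFT")
    if len(seen) == 2:
        return "REVERSIBLE"      # evidence for both directions
    if len(seen) == 1:
        return seen.pop()
    return None                  # no information available
```

For instance, evidence of `["LEFT-TO-RIGHT", "RIGHT-TO-LEFT"]` from two enzymes yields `"REVERSIBLE"`, matching the two-enzyme example above.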

The equilibrium constant and the change in Gibbs free energy stored for the reaction (if any) refer to the direction of the reaction as stored, that is, assuming that the Left slot contains the reactants.

4.3 EC Numbers

The Enzyme Commission (EC) classifies enzymes based on the reactions they catalyze. In addition to a description of the enzymatic activity, each classified enzyme receives a descriptive and accurate name and a unique number, known as an EC number. The use of EC numbers makes it possible for scientists to refer to enzymes in a consistent and unambiguous way. In addition, by annotating genes with EC numbers, it is possible to computationally link those genes to precise enzymatic activities.

MetaCyc contains four types of EC numbers. The following list explains the differences among the different types.

Formal EC numbers. Formal EC numbers contain only numerical digits (e.g., 3.2.1.45). These numbers are fully defined by the Enzyme Commission and can be found in the ExplorEnz database. Reactions that are assigned these EC numbers can be marked as official, indicating the reaction is identical to that specified in the EC entry, or non-official, indicating that while it is known to be catalyzed by the enzyme defined in the EC entry (and by the same active site), it includes alternative substrates that are not part of the reaction(s) specified by the EC.
M-numbers. M-numbers contain numerical digits in all four fields, but the number in the fourth field is preceded by a capital M (e.g., 3.2.1.M7). These numbers are specific to MetaCyc and are assigned within MetaCyc to enzymes that do not have a formal EC number and are not in the process of being classified by the Enzyme Commission. The UniProt and BRENDA databases create related N-numbers and B-numbers, respectively.
Temporary EC numbers. Temporary EC numbers contain numerical digits in the first three fields, but alphabetic characters in the fourth field (e.g., 2.1.1.ba). These numbers are used internally by the Enzyme Commission for new entries that are being drafted, and are replaced at the end of the process by a formal number. They should not be cited, as they are short-lived.
Partial EC numbers. Partial EC numbers contain a hyphen character in one or more fields. These numbers should be used to indicate that the function of an enzyme is not well defined. For example, if sequence analysis suggests that an enzyme is a methyltransferase, yet the identity of the substrate is not known, it should be assigned the partial EC number 2.1.1.-. Ideally, this type of number should appear only in annotated genomes, and no such numbers should exist in MetaCyc. However, many such numbers are still found in MetaCyc because in the past they were used where M-numbers should have been used. They are being slowly converted to other types of EC numbers.
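The four types can be told apart purely from their syntax, as the following Python sketch shows. The function name `classify_ec_number` and the exact regular expressions are assumptions derived from the descriptions above; they are not an official grammar for EC numbers.

```python
import re

# Patterns follow the textual descriptions of each type.
_PATTERNS = [
    ("formal", re.compile(r"\d+\.\d+\.\d+\.\d+")),           # e.g. 3.2.1.45
    ("m-number", re.compile(r"\d+\.\d+\.\d+\.M\d+")),        # e.g. 3.2.1.M7
    ("temporary", re.compile(r"\d+\.\d+\.\d+\.[A-Za-z]+")),  # e.g. 2.1.1.ba
]


def classify_ec_number(ec):
    """Return 'formal', 'm-number', 'temporary', 'partial', or None."""
    ec = ec.strip()
    for name, pattern in _PATTERNS:
        if pattern.fullmatch(ec):
            return name
    fields = ec.split(".")
    # Partial: four fields, each numeric or a hyphen, with at least one hyphen.
    if (len(fields) == 4 and "-" in fields
            and all(f == "-" or f.isdigit() for f in fields)):
        return "partial"
    return None
```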

4.4 Reaction Locations and Compartments of Metabolites

An accurate representation of metabolic processes must take into account the existence of multiple cellular compartments and structures, and that the processes and metabolites are partitioned among these compartments. Even a simple cell consisting of a single compartment has a membrane boundary to the extracellular space, which is crossed by transport reactions. This section describes how Pathway Tools records the compartments where reactions and metabolites occur.

In Pathway Tools, compartments are specified using the controlled vocabulary in the CCO (Cell Component Ontology). Every CCO term is represented by a PGDB frame that is a child of the class CCO-SPACE or of the class CCO-MEMBRANE. At the time a PGDB is created, we infer the set of CCO terms applicable to the organism based on its taxonomic class (for example, eukaryotes will have all the terms associated with the cell nucleus, whereas prokaryotes will not — curators can edit the set of cell components present in an organism if it differs from the inferred default) and create instance frames for each such CCO class. In general the ids for CCO classes start with CCO-, whereas the ids for instances of CCO classes start with CCI-. In PGDBs with a single cellular architecture, it makes little effective difference whether a reaction location refers to a CCO instance frame or its parent class, but the reaction editor will not allow curators to assign reactions to non-abstract (see below) CCO classes that do not have any instances (because those components are not present in cells for that organism).

For S reactions (defined below), the default compartment for metabolites and reactions is assumed to be CCO-CYTOSOL, unless the reaction’s RXN-LOCATIONS slot specifies otherwise. For T reactions (also defined below), nothing can be assumed about the default locations if no membrane and no mappings for CCO-IN and CCO-OUT are specified (see below). Such T reactions cannot usefully become part of a reaction network model.

To avoid unnecessarily duplicating information, we store frames for metabolites and reactions once within each PGDB, even if they may simultaneously be present in more than one compartment in a PGDB. Instead, the compartments are specified by auxiliary information attached to reactions.

There are two types of reactions:

S Reactions: Reactions for which all of the substrates (meaning the reactants plus products) are in the same compartment (e.g., a metabolic reaction that occurs in the cytosol). In a given PGDB, such a reaction could potentially occur in more than one compartment.
T Reactions: Reactions whose substrates occur in multiple compartments (e.g., a transport reaction with a reactant in the periplasm and a product in the cytosol). This condition can only happen at membranes, involving transport reactions or electron transfer reactions (ETRs). T reactions annotate their substrates to specify their locations. Those annotations may use the abstract directional compartments CCO-IN and CCO-OUT. These abstract compartments are mapped to the actual compartments in a given PGDB.

By default (meaning if no substrate locations are specified), S reactions are assumed to occur in the cytosol (CCO-CYTOSOL). To store non-default compartment information, reactions have a slot called RXN-LOCATIONS, the values of which differ between the S and T type reactions, as follows:

S Reactions: If the reaction occurs in a non-default compartment, or in several compartments, then the RXN-LOCATIONS slot stores for every compartment in which the reaction occurs, the corresponding identifier of the CCO term for that compartment. The metabolites in the reaction’s LEFT and RIGHT slots do not have any COMPARTMENT annotations, as those would be redundant and could cause conflicts.

The following example depicts an S-type reaction in the periplasm. Because its RXN-LOCATIONS slot specifies CCO-PERI-BAC (the periplasm), every substrate of the reaction is interpreted as occurring in the periplasm.
```
      LEFT: GLC

      RIGHT: |D-Glucose|

      RXN-LOCATIONS: CCO-PERI-BAC

      SPONTANEOUS?: T
```
T Reactions: The slot RXN-LOCATIONS contains one or more frames that are children of class CCO-MEMBRANE (or potentially symbols that have to be unique in this slot, for situations where the metabolites are in spaces that are not directly adjacent to one membrane, or when 3 spaces are involved). If the reaction has not been assigned to any particular membrane, then no value needs to be stored at all.

Additionally, each slot value in RXN-LOCATIONS will have annotations with the labels CCO-IN and CCO-OUT, and in the rare case of 3 compartments involved, also another label called CCO-MIDDLE. These annotations should be single-valued; each value should be a child of CCO-SPACE. These annotations define the mappings between the COMPARTMENT annotation values of the metabolites that are listed in the reaction’s LEFT and RIGHT slots, and the final compartments in this PGDB.

Every metabolite in the reaction’s LEFT and RIGHT slots needs to have a COMPARTMENT annotation, the value of which needs to be one of CCO-IN, CCO-OUT, or possibly CCO-MIDDLE in complex situations. The purpose of CCO-IN, CCO-OUT, and CCO-MIDDLE, is to allow one transport reaction to be mapped to multiple membranes.

Example of slot values for a transport reaction in the periplasmic membrane:
```
      LEFT:
        AMMONIUM
        ---COMPARTMENT: CCO-OUT

      RIGHT:
        AMMONIUM
        ---COMPARTMENT: CCO-IN

      RXN-LOCATIONS:
        CCO-PM-BAC-NEG
        ---CCO-IN: CCO-CYTOSOL    ---CCO-OUT: CCO-PERI-BAC
```
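To make the mapping concrete, the following Python sketch resolves compartments for both reaction types using plain dictionaries and lists. The function names and data layout are assumptions chosen for illustration; they do not reflect how Pathway Tools stores or queries frames.

```python
DEFAULT_COMPARTMENT = "CCO-CYTOSOL"


def s_reaction_compartments(rxn_locations):
    """Compartments of an S reaction: the RXN-LOCATIONS values, or the
    cytosol default when the slot is empty."""
    return list(rxn_locations) or [DEFAULT_COMPARTMENT]


def t_reaction_compartments(substrates, location_annotations):
    """Map the abstract compartments of a T reaction's substrates to actual ones.

    substrates:           list of (metabolite, abstract compartment) pairs,
                          e.g. ("AMMONIUM", "CCO-OUT")
    location_annotations: annotations of one RXN-LOCATIONS value,
                          e.g. {"CCO-IN": "CCO-CYTOSOL", "CCO-OUT": "CCO-PERI-BAC"}
    """
    # A missing mapping raises KeyError: without it, no location can be assumed.
    return [(met, location_annotations[abstract]) for met, abstract in substrates]
```

Applied to the ammonium example, the Left-side AMMONIUM annotated with CCO-OUT resolves to CCO-PERI-BAC and the Right-side AMMONIUM annotated with CCO-IN resolves to CCO-CYTOSOL. Keeping the abstract labels on the substrates and the concrete mapping on the location value is what lets one transport reaction be reused for several membranes.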

Whenever a reaction is transferred between PGDBs (by import or schema upgrade operations), all values in the RXN-LOCATIONS slot are filtered away (i.e., not copied). This prevents inapplicable compartments from being introduced into other PGDBs. In the future, PathoLogic and TIP will infer compartment information. The values of the RXN-LOCATIONS slot are listed in the flat files of the PGDB.

If the reaction is catalyzed by more than one enzyme (i.e. it has more than one enzymatic-reaction attached), then each value in the RXN-LOCATIONS slot must have an annotation called ENZRXNS, which has as its values the frame IDs of the corresponding enzymatic-reactions. This approach allows determining the precise compartment(s) in which the catalyzed reaction is occurring.

4.5 Reaction Balancing and Protonation State in PGDBs

4.5.1 Background and Motivations

This section addresses the state of reaction mass balance and protonation state of chemical compounds in PGDBs. Because these issues are still evolving and are influenced to a large degree by history, we include a historical discussion of these issues.

Our long-term goal is for all PGDB reactions to be fully mass balanced and charge balanced, and for all chemical compounds to be properly protonated at cellular pH. Although in some cases such a treatment may yield reactions or chemical structures that look non-traditional to biochemists, we believe this approach provides the most consistent and correct treatment. In addition, it provides a treatment that will facilitate automatic generation of flux-balance models from PGDBs.

Historically, the chemical structure data within BioCyc databases has been obtained from many different sources, including textbooks, articles from the primary research literature, and downloading from certain open databases. In the early years of the project we developed programs to check the mass balance and element balance of reactions within PGDBs. We found that these programs were extremely valuable because identification of unbalanced reactions allowed us to identify errors in both the reaction equations, and in the chemical structures. However, we also found that, because of the diverse sources from which we obtained chemical structure data, the structures were protonated inconsistently. Therefore, for many years we ignored element imbalances due to hydrogen only, while correcting imbalances due to other elements.

In 2008, we began to address the problem of inconsistent protonation to facilitate automatic generation of flux-balance models by automatically protonating unbalanced reactions. The first release of the newly mass-balanced MetaCyc and EcoCyc DBs was version 13.0 in early 2009. In 2015, improved software tools for checking charge balance were used to obtain fully mass- and charge-balanced reactions in MetaCyc. In time, other BioCyc PGDBs will become fully mass- and charge-balanced as well, as they are regenerated from newer versions of MetaCyc.

The following sections describe the methodology by which the protonation-state normalization and reaction mass balancing were achieved.

4.5.2 Protonation State Normalization

For a given chemical compound, there can be atoms that will bind a variable number of hydrogen atoms, depending on their chemical structure and the pH of their environment. A term for the isomers of a compound that differ in the number of hydrogens bound to these atoms is proto-isomer. A term for the atoms with variable numbers of bonded hydrogens is the proto-isomerization centers of a compound. Oxygen, sulfur, phosphorus, and nitrogen are examples of typical proto-isomerization centers.

In order to bring a greater degree of consistency to our PGDBs, we protonated (i.e., assigned the correct number of bound hydrogens to the proto-isomerization centers of a compound) the compounds of EcoCyc with a reference pH value of 7.3, using the Marvin (version 5.1.02) computational chemistry software available from ChemAxon, Ltd [2]. The pH value of 7.3 was selected based on a paper on the measurement of cytoplasmic pH of E. coli [3]. In order to easily exchange compound data between MetaCyc and EcoCyc, MetaCyc was also protonated with a reference pH value of 7.3. This step is an approximation since MetaCyc contains reactions and compounds from many organisms and many cellular compartments.

The Marvin software calculates the protonation state of a compound’s proto-isomerization centers by first determining their pK_A. The pK_A values of the proto-isomerization centers of a compound are obtained by computing the partial charge distribution. This, in turn, is calculated using a numerical partial differential equation solver, which computes the distribution from the structure of the compound and the known electronegativities of the constituent atoms. Although we have worked with ChemAxon to improve the accuracy of their calculations to match the experimentally verified pK_A values of many biochemically relevant compounds, this calculation is still based on an approximation technique, and will not necessarily yield fully correct pK_A values for every substance.

Some caveats about our protonation of compounds:

Some compounds are present in multiple reactions that take place in various different compartments in a cell, or across membranes, where the pH might vary from our stated value of 7.3.
For any given compound, only one proto-isomer is present in our PGDBs. We do not represent the other proto-isomers, nor do we represent the proto-isomerization reactions that inter-convert the various proto-isomers of a compound.
Sometimes a pK_A value for a proto-isomerization center is very close to the pH of the solution, and therefore there is approximately a 50 / 50 split between the relative abundance of the two proto-isomers of that compound in solution. The Marvin software will select the most likely proto-isomer based on a comparison of the floating point value of the relative abundance in such situations.
Our compounds might have a slightly different structure than what you will find for the same compound in an alternate chemical compound database. Please ensure that you are comparing the two compounds for the correct protonation state at a reference pH value of 7.3.
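The caveat about pK_A values close to the reference pH follows from the Henderson-Hasselbalch relation for a single ionizable site. The Python sketch below uses that textbook relation only; it is not the Marvin calculation, which also has to account for interactions among multiple sites. The function name `deprotonated_fraction` is hypothetical.

```python
def deprotonated_fraction(pka, ph=7.3):
    """Fraction of a single ionizable site that is deprotonated at the given
    pH, from the Henderson-Hasselbalch relation."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))
```

When the pK_A equals the pH, the function returns 0.5, the 50 / 50 split mentioned above. A site whose pK_A lies 2.5 units below the pH is more than 99% deprotonated, so the choice of a single proto-isomer is far less ambiguous in that case.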

4.5.3 Computational Reaction Balancing for Hydrogen

Once the compounds of EcoCyc and MetaCyc were protonated, all reactions that had a mass-imbalance were identified and either balanced or labeled as unbalanceable. Some caveats about our computational reaction balancing:

You might notice some reactions that have more or fewer protons participating than you would typically see depicted. This might be most evident in our EC reactions. One reason for this, beyond our computational reaction balancing, is that traditionally protons and other small, ubiquitous chemical moieties were considered auxiliary to the main function of a reaction and thus not depicted. In general, our EC reactions may vary from those specified by the Enzyme Commission by including more or fewer protons than the original reaction.
For use in FBA models, be aware that we represent only one of the possibly many proto-isomers of a compound. We also do not represent the fast protonation reactions that inter-convert the proto-isomers. Thus, an FBA model that attempts to simulate the flux of hydrogen in a PGDB may be inaccurate.
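The idea of balancing a reaction for hydrogen can be illustrated with a minimal Python sketch. It is an assumption-laden simplification rather than the procedure used for MetaCyc and EcoCyc: compounds are represented as hypothetical dictionaries with a `formula` and a `charge`, and the only correction considered is adding protons (H+) to one side.

```python
from collections import Counter


def side_totals(compounds):
    """Sum element counts and charge over (coefficient, compound) pairs."""
    elements, charge = Counter(), 0
    for coeff, cpd in compounds:
        for element, n in cpd["formula"].items():
            elements[element] += coeff * n
        charge += coeff * cpd["charge"]
    return elements, charge


def protons_needed(left, right):
    """Number of protons to add to the right side (negative: to the left side)
    so that the reaction is mass and charge balanced, or None if adding
    protons alone cannot balance it."""
    l_el, l_ch = side_totals(left)
    r_el, r_ch = side_totals(right)
    diff = Counter(l_el)
    diff.subtract(r_el)        # left minus right; entries may be negative
    h_diff = diff.pop("H", 0)
    if any(diff.values()):     # imbalance in an element other than hydrogen
        return None
    if l_ch - r_ch != h_diff:  # protons would fix mass but not charge
        return None
    return h_diff
```

For example, with CO2 and H2O on the left and HCO3 (charge -1) on the right, the function returns 1: one proton on the right balances both hydrogen and charge.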

4.5.4 Statistics on Reaction Balance and Protonation

The table below provides information on the small-molecule reaction balance state for both EcoCyc and MetaCyc at different points in time. The categories below represent reactions that are balanced, unbalanced, and those for which it is not possible to determine the balance state.

For the category of reactions where it is not possible to determine the balance state, these are mainly due to:

Reactions that have compound classes as substrates
Polymerization reactions. We are planning to extend our representation of polymerization reactions in the future to allow for mass and charge balance.
Reactions with substrates that lack a chemical structure
Reactions with substrates that include R-groups

| Category | 2009 Reaction Count | 2015 Reaction Count |
|---|---|---|
| EcoCyc: Balanced Reactions | 801 | 2,007 |
| EcoCyc: Reactions that cannot be balanced | 160 | 456 |
| MetaCyc: Balanced Reactions | 5,098 | 12,742 |
| MetaCyc: Reactions that cannot be balanced | 1,143 | 1,475 |

4.6 Reaction Atom Mappings

The atom mapping of a reaction describes for each non-hydrogen atom in a reactant compound its corresponding atom in a product compound. The bonds broken and made by a reaction can be inferred from an atom mapping of that reaction. On the reaction page this correspondence is shown using matching colors for atoms and bonds between reactants and products, and/or numbers labeling the atoms. If a bond is broken or made by the reaction, the bond is black. Some reactions have more than one atom mapping. In some cases not all mappings are biologically relevant (e.g., the enzyme never produces them). It might also be the case that some of these multiple atom mappings are essentially the same due to symmetries within compound structures, or the inability to distinguish among some atoms. For some reactions no atom mappings are shown since none were computed for them. This lack of an atom mapping might result from several possible factors, including the complexity of the reaction (e.g., large substrates), substrates that have no structures, or a reaction that is not mass balanced.

When Pathway Tools displays a reaction, it first tries to obtain an atom mapping from that reaction in its PGDB. However, typically reactions in PGDBs other than MetaCyc do not contain atom mappings (except for reactions not found in MetaCyc). Thus, the software next tries to find the same reaction in MetaCyc and use its atom mapping, if any, for that reaction.

The atom mappings in MetaCyc were computationally predicted without manual curation, but we expect a very low rate (< 3%) of errors. The approach used to compute these atom mappings was published in [4]. Essentially, this approach computes atom mappings that minimize the overall cost of bonds broken and made in the reaction, given assigned propensities for bond creation and breakage.

4.6.1 Downloading Atom Mappings

Atom mapping data are available in three ways. Note: some reactions have more than one atom mapping.

All atom mappings data for MetaCyc, as described in Section 4.6.2, may be downloaded as part of the MetaCyc downloadable data file package. See file atom-mappings.dat in the MetaCyc bundle. This file may contain more than one atom mapping per reaction.
All atom mappings data for MetaCyc, using the SMILES syntax as described in Section 4.6.3, may be downloaded as part of the MetaCyc downloadable data file package. See file atom-mappings-smiles.dat in the MetaCyc bundle. This file may contain more than one atom mapping per reaction.
On a reaction Web page, a textual representation of its atom mappings can be downloaded by using the right side menu command Download atom mapping(s) for this reaction.

4.6.2 Atom Mapping Encoding

As an example, the encoding of one atom mapping for reaction R524-RXN is:

REACTION - R524-RXN
NTH-ATOM-MAPPING - 1
MAPPING-TYPE - NO-HYDROGEN-ENCODING
FROM-SIDE - (HCO3 0 3) (CPD-69 4 6) 
TO-SIDE - (CARBAMATE 0 3) (CARBON-DIOXIDE 4 6) 
INDICES - 4 5 6 3 0 2 1 
//

Each atom mapping of a reaction is encoded using six fields:

REACTION: the unique id of the reaction the mapping applies to
NTH-ATOM-MAPPING: the index of that atom mapping in the set of atom mappings for that reaction
MAPPING-TYPE: only one type is currently used, namely NO-HYDROGEN-ENCODING
FROM-SIDE: list of compounds on the From-Side with their starting and ending indices
TO-SIDE: list of compounds on the To-Side with their starting and ending indices
INDICES: list of atom indices describing the permutation of atoms from the From-Side to the To-Side.
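A record in this encoding can be read with a few lines of Python. The parser below is a sketch based only on the example record shown above; the function name `parse_atom_mapping` is hypothetical, and the sketch assumes that compound ids contain no spaces and that each field occupies a single line.

```python
import re


def parse_atom_mapping(record):
    """Parse one atom-mapping record (text up to the '//' terminator)."""
    fields = {}
    for line in record.strip().splitlines():
        line = line.strip()
        if not line or line == "//":
            continue
        # Split on the first " - " separator between field name and value.
        key, _, value = line.partition(" - ")
        fields[key] = value.strip()

    def spans(text):
        # "(HCO3 0 3) (CPD-69 4 6)" -> [("HCO3", 0, 3), ("CPD-69", 4, 6)]
        return [(name, int(start), int(end))
                for name, start, end in re.findall(r"\((\S+) (\d+) (\d+)\)", text)]

    return {
        "reaction": fields["REACTION"],
        "nth": int(fields["NTH-ATOM-MAPPING"]),
        "type": fields["MAPPING-TYPE"],
        "from_side": spans(fields["FROM-SIDE"]),
        "to_side": spans(fields["TO-SIDE"]),
        "indices": [int(x) for x in fields["INDICES"].split()],
    }
```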

Before discussing this specific example, note that the direction of the reaction in the PGDB (as given by the PGDB slots Left and Right) is not used in atom mappings. In particular, the From-Side in the mapping could be the left or the right side of the PGDB reaction. Also, as discussed below, the order of the compounds given by the From-Side and To-Side might not be the same as the order given by the Left side or Right side of the reaction. Reconstructing the atom mapping relative to the reaction must be performed relative to the chemical structures of the compounds, not the encoding of the reaction.

For this example we consider reaction R524-RXN (which is the unique id (i.e. frame id) of the reaction in MetaCyc). That reaction has only one atom mapping, which is identified as NTH-ATOM-MAPPING 1. The mapping type is NO-HYDROGEN-ENCODING. This mapping type tells us that the hydrogen atoms are not mapped. The From-Side compounds are bicarbonate (frame id HCO3) and cyanate (CPD-69), and the To-Side compounds are carbamate and carbon dioxide.

In this atom mapping only the non-hydrogen atoms are indexed. For the From-Side, HCO3 has 4 atoms mapped; they are indexed 0 to 3 (its one hydrogen atom is not indexed). CPD-69 has 3 atoms mapped; they are indexed 4 to 6. For the To-Side, CARBAMATE has 4 atoms mapped that are indexed 0 to 3. CARBON-DIOXIDE has three atoms mapped that are indexed 4 to 6. Note that the indices of the compounds form a continuous sequence on each side, which means that the indices of the atoms within these compounds are shifted accordingly. The example below shows this aspect more precisely.

The permutation indices (INDICES) are the final component in describing the mapping of atoms from the From-Side to the To-Side. An integer j located at position i of INDICES maps atom j of the From-Side to atom i of the To-Side (index values and positions both start at 0). For the R524-RXN example above, the first number in INDICES is 4 (j = 4), and it is in position 0 (i = 0). Since j refers to the From-Side, the 4 identifies the atom with index 4 on the From-Side. The FROM-SIDE data tells us that HCO3 spans atoms 0–3 and CPD-69 spans atoms 4–6, so j = 4 identifies atom 4 in the overall list that spans both compounds, which is the 0th atom of CPD-69. Similarly, i = 0 identifies the 0th atom of the To-Side. The TO-SIDE data tells us that CARBAMATE spans atoms 0–3 and CARBON-DIOXIDE spans atoms 4–6, so i = 0 refers to the 0th atom in the overall list, which is also the 0th atom of CARBAMATE. When we say in the first row of the table below that “4 is atom 0 (C) of cyanate,” we are translating from the indexing system for the overall From-Side (index 4 in the overall list) to the indexing system for cyanate alone (0th atom within the chemical structure for cyanate).

j → i	From-Side atoms	To-Side atoms
4 → 0	4 is atom 0 (C) of cyanate →	0 is atom 0 (C) of carbamate
5 → 1	5 is atom 1 (N) of cyanate →	1 is atom 1 (N) of carbamate
6 → 2	6 is atom 2 (O) of cyanate →	2 is atom 2 (O) of carbamate
3 → 3	3 is atom 3 (O) of bicarbonate →	3 is atom 3 (O) of carbamate
0 → 4	0 is atom 0 (C) of bicarbonate →	4 is atom 0 (C) of carbon dioxide
2 → 5	2 is atom 2 (O) of bicarbonate →	5 is atom 1 (O) of carbon dioxide
1 → 6	1 is atom 1 (O) of bicarbonate →	6 is atom 2 (O) of carbon dioxide
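
The translation from overall side indices to compound-specific atom indices can be sketched as follows. This is an illustrative sketch under stated assumptions, not code from Pathway Tools: the function names locate and decode_mapping are hypothetical, and the sides are assumed to be already parsed into (compound id, start, end) tuples.

```python
def locate(side, overall_index):
    """Return (compound_id, local_atom_index) for an index in the overall side list."""
    for compound_id, start, end in side:
        if start <= overall_index <= end:
            # Local index is the offset from the first atom of the compound.
            return compound_id, overall_index - start
    raise ValueError(f"index {overall_index} is not covered by {side}")


def decode_mapping(from_side, to_side, indices):
    """List ((from_compound, atom), (to_compound, atom)) pairs, one per mapped atom.

    The value j stored at position i of INDICES maps From-Side atom j
    to To-Side atom i.
    """
    return [
        (locate(from_side, j), locate(to_side, i))
        for i, j in enumerate(indices)
    ]


# Values taken from the R524-RXN record shown earlier in this section.
from_side = [("HCO3", 0, 3), ("CPD-69", 4, 6)]
to_side = [("CARBAMATE", 0, 3), ("CARBON-DIOXIDE", 4, 6)]
indices = [4, 5, 6, 3, 0, 2, 1]

pairs = decode_mapping(from_side, to_side, indices)
for (src, src_atom), (dst, dst_atom) in pairs:
    print(f"{src} atom {src_atom} -> {dst} atom {dst_atom}")
```

The local atom indices returned here still have to be matched against the atom numbering of each compound structure (for example the atom block of a MOL file) to identify the elements involved.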

The chemical structures of the compounds (and the numbering of atoms for each compound) are not encoded within the atom mapping. They are stored in the PGDB objects that describe each compound, and are available in several forms:

Via Pathway Tools Web services
Via Pathway Tools APIs
Via MOLfiles that can be downloaded for MetaCyc

The compound-specific atom indices used in atom mappings refer to the index of each atom in the chemical structure as stored in the PGDB, such as in the atom section of a MOL file.

4.6.3 Atom Mapping Data using SMILES

The SMILES syntax allows not only the representation of chemical structures and of reactions, but also representation of reaction atom mappings. The full description of the SMILES syntax is given at the following external link: SMILES Tutorial. We will only give a summary of that syntax and one example.

The reaction THIOSULFATE-SULFURTRANSFERASE-RXN, described by the equation

   thiosulfate + hydrogen cyanide => sulfite + thiocyanate + 2 H+

has one atom mapping in MetaCyc. Using SMILES, this atom mapping is represented in the following way:

[C:2](#[N:3])[S-:4].[O-:1][S:5]([O-:1])=[O:1]>>[C:2]#[N:3].[O:1]=[S:5](=[O:1])([O-:1])[S:4]

The symbol >> separates the two sides of the reaction. In this example, the compounds written as reactants in the equation above appear to the right of >> and the products to the left. This is expected, because the SMILES encoding does not record the direction of the reaction.

The first part on the left, before the dot, namely [C:2](#[N:3])[S-:4], is the compound thiocyanate. Next comes [O-:1][S:5]([O-:1])=[O:1], which represents the compound sulfite. As can be seen, the dot separates the compounds. Hydrogen cyanide and thiosulfate are encoded, in that order, on the right of >>.

The atom mapping is encoded by using integer labels between square brackets. For example, [C:2] labels the carbon atom of thiocyanate and hydrogen cyanide with the integer 2, which means that these carbon atoms are mapped to each other. Each atom is uniquely labeled on each side of the reaction, which creates a one-to-one mapping between the atoms of reactants and products. Note that the hydrogen atoms are not mapped in this representation.

As mentioned in Section 4.6.1, atom mapping data based on SMILES can be obtained by downloading the file atom-mappings-smiles.dat from the MetaCyc download bundle. That file contains one reaction per line. The first element on each line is the frame id of the reaction, followed by a tab and then one or more SMILES strings separated by spaces, one for each atom mapping of that reaction.
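
A line in the layout just described could be processed as sketched below. This is a hedged illustration: the sample line is constructed from the description and the example above rather than copied from the actual file, the function names are hypothetical, and the regular expression only recognizes bracket atoms of the form [symbol:label] as they appear in this section.

```python
import re

# A mapped bracket atom such as [C:2] or [S-:4]: atom text, then ':' and the label.
MAPPED_ATOM = re.compile(r"\[([^\]:]+):(\d+)\]")


def parse_smiles_line(line):
    """Split 'FRAME-ID<tab>SMILES1 SMILES2 ...' into (frame_id, [smiles, ...])."""
    frame_id, _, rest = line.rstrip("\n").partition("\t")
    return frame_id, rest.split()


def split_sides(reaction_smiles):
    """Return the compound SMILES on each side of '>>' as two lists."""
    left, right = reaction_smiles.split(">>")
    return left.split("."), right.split(".")


def side_labels(compounds):
    """Sorted list of the integer atom labels found in a list of compound SMILES."""
    return sorted(
        int(label)
        for smiles in compounds
        for _, label in MAPPED_ATOM.findall(smiles)
    )


# Sample line assembled from the example in this section (an assumption,
# not a verbatim line of atom-mappings-smiles.dat).
line = (
    "THIOSULFATE-SULFURTRANSFERASE-RXN\t"
    "[C:2](#[N:3])[S-:4].[O-:1][S:5]([O-:1])=[O:1]"
    ">>[C:2]#[N:3].[O:1]=[S:5](=[O:1])([O-:1])[S:4]"
)

frame_id, mappings = parse_smiles_line(line)
left, right = split_sides(mappings[0])
labels_match = side_labels(left) == side_labels(right)
```

Comparing the labels found on the two sides is a simple sanity check that every labeled atom on one side has a counterpart on the other.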

4.6.4 Canonicalization of Atom Mapping Encoding

The atom mapping encoding presented in Section 4.6.2 is the result of a canonicalization (i.e., normalization) process. You do not need to know how this process works to use and decode the atom mappings, but it can be useful if you want to create your own atom mapping encoding implementation that produces the same atom mappings as the ones in your PGDB.

Canonicalization has one major goal: The atom mapping encoding does not depend on the manner in which the compounds and reactions are stored in a PGDB.

The encoding was also designed such that the atom mapping can be reconstructed from the compound structures without referring to the reaction.

The encoding depends on the compound InChI strings and on the ordering of their atoms given by the program that computes the InChI string. The process of canonicalization is as follows:

Determine the From-Side as the side of the reaction that has the compound with the (lexicographically) smallest InChI string. If that compound occurs on both sides, the second smallest InChI string is used to decide, and so on. The To-Side is the other side of the reaction.
For each side of the reaction, order the compounds based on their InChI strings. This determines the order of all atoms for each side.
The indices giving the permutation of the atoms from the From-Side to the To-Side are then completely determined by the computed atom mapping. This permutation can be represented by a series of From-Side atom indices, listed in the order of the To-Side atoms.
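
The first two steps above can be sketched as follows. This is only one possible reading of the procedure, not the Pathway Tools implementation. It assumes that InChI strings are compared with ordinary Python string comparison, that ties on the smallest string are broken by comparing the sorted lists element by element, and that the number of non-hydrogen atoms per compound is known. The function names and the placeholder strings (which are not real InChI strings) are hypothetical.

```python
def canonical_sides(left, right):
    """Choose and order the From-Side and To-Side.

    left, right: lists of (compound_id, inchi) pairs for the two reaction sides.
    Returns (from_side, to_side), each sorted by InChI string.
    """
    left_sorted = sorted(left, key=lambda c: c[1])
    right_sorted = sorted(right, key=lambda c: c[1])
    left_keys = [inchi for _, inchi in left_sorted]
    right_keys = [inchi for _, inchi in right_sorted]
    # List comparison checks the smallest strings first, then the second
    # smallest when those are equal, and so on.
    if left_keys <= right_keys:
        return left_sorted, right_sorted
    return right_sorted, left_sorted


def assign_spans(side, heavy_atom_counts):
    """Give each compound a contiguous (start, end) range of atom indices."""
    spans, start = [], 0
    for compound_id, _ in side:
        count = heavy_atom_counts[compound_id]
        spans.append((compound_id, start, start + count - 1))
        start += count
    return spans


# Placeholder data: compound Y occurs on both sides, so the second
# smallest string decides which side becomes the From-Side.
left = [("X", "inchi-c"), ("Y", "inchi-a")]
right = [("Z", "inchi-b"), ("Y", "inchi-a")]
from_side, to_side = canonical_sides(left, right)
from_spans = assign_spans(from_side, {"X": 4, "Y": 2, "Z": 3})
```

The third step, computing the permutation itself, requires the atom mapping algorithm and the InChI atom ordering of each compound, which are outside the scope of this sketch.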

5 Computing Gibbs Free Energy of Compounds and Reactions in MetaCyc

The standard Gibbs free energies of reaction and of formation in MetaCyc, Δ_r G′° and Δ_f G′° respectively, are computed (i.e., estimated) at pH 7.3 and ionic strength 0.25. We used pH 7.3 because the protonation state of all compounds in MetaCyc was computed at that value. The standard Gibbs free energy of formation of a compound is first estimated at pH 0 and ionic strength 0 (Δ_f G°) using the technique presented in [5], which decomposes the compound into known “contribution groups”. Then the standard Gibbs free energy of formation at pH 7.3 and ionic strength 0.25 (Δ_f G′°) is computed using a technique developed by Robert A. Alberty [6]. Alberty's technique proposes using several protonation states for some compounds, but we simplified it by always using a single protonation state per compound, namely the unique one stored in MetaCyc.

For the standard Gibbs free energy of reactions, Δ_r G′°, the computation is based on the Δ_f G′° values of the compounds involved in the reaction. The Δ_f G′° could not be computed for some compounds in MetaCyc because they cannot be decomposed into the contribution groups provided by the technique of [5]. Consequently, Δ_r G′° is not computed for any reaction that has a substrate whose Δ_f G′° is not stored in MetaCyc.
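
The relationship between the two quantities is the usual thermodynamic sum, Δ_r G′° = Σ ν_i Δ_f G′°(i), where ν_i is the stoichiometric coefficient of compound i (negative for reactants, positive for products). The sketch below illustrates that sum together with the rule that no value is produced when any substrate lacks a Δ_f G′°. The function name, the dictionary layout, and all numeric values are made up for illustration and are not MetaCyc data.

```python
def reaction_gibbs(stoichiometry, formation_energies):
    """Sum coefficient * formation energy over all substrates of a reaction.

    stoichiometry: {compound_id: coefficient}, negative for reactants and
                   positive for products (an assumed sign convention).
    formation_energies: {compound_id: formation energy, or None if unknown}.
    Returns None when any substrate has no stored formation energy.
    """
    total = 0.0
    for compound_id, coefficient in stoichiometry.items():
        energy = formation_energies.get(compound_id)
        if energy is None:
            return None  # the reaction energy is left uncomputed
        total += coefficient * energy
    return total


# Hypothetical values for a reaction A + B -> C.
formation = {"A": -10.0, "B": -25.5, "C": -40.0, "D": None}
complete = reaction_gibbs({"A": -1, "B": -1, "C": 1}, formation)
incomplete = reaction_gibbs({"A": -1, "D": 1}, formation)
```

Returning None rather than a partial sum mirrors the policy described above: a reaction value is only meaningful when every substrate contributes.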

6 PGDB Pathways

This section introduces concepts relevant to PGDB metabolic and signaling pathways. Many of the issues discussed here are also explored in [7].

6.1 How are Pathway Boundaries Defined?

Pathway boundaries [7] are defined heuristically, using the judgment of expert curators. More information about curation practices is available in the Curator Guide. Curators consider the following aspects of a pathway when defining its boundaries.

What boundaries were defined historically for the pathway?
When possible, we prefer to define boundaries at the 13 common currency metabolites:
Coincidence with regulatory units
Coincidence with metabolic units that are evolutionarily conserved

The preceding philosophy toward pathway boundary definition contrasts sharply with KEGG maps. KEGG maps are on average 4.2 times larger than MetaCyc pathways because KEGG tends to group into a single map multiple biological pathways that converge on a single metabolite [Pathway05].

6.2 Super Pathways and Base Pathways

We define a super-pathway as a cluster of related pathways. Typically, a super-pathway consists of a linked set of smaller pathways that share a common metabolite. For example, the super-pathway named “superpathway of phenylalanine, tyrosine, and tryptophan biosynthesis” consists of several pathways that converge at the metabolite chorismate. The components of super-pathways include base pathways (pathways that are not themselves super-pathways), other super-pathways, and individual reactions that have not necessarily been assigned to base pathways. Those reactions typically serve to connect the component pathways within a super-pathway.

Super-pathways are stored within each PGDB – they are not computed dynamically.
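
Because a super-pathway can contain other super-pathways, finding all of its base pathways and connecting reactions requires a recursive expansion. The sketch below shows one way to do this over a hypothetical in-memory representation. The dictionary layout, the function name, and all identifiers are invented for illustration and do not reflect the Pathway Tools schema or API.

```python
def expand_super_pathway(pathway_id, sub_components, reaction_ids):
    """Collect the base pathways and individual reactions under a super-pathway.

    sub_components: {super_pathway_id: [component ids]} (assumed layout).
    reaction_ids: set of ids known to be reactions.
    Any component that is neither a reaction nor a key of sub_components
    is treated as a base pathway.
    """
    base_pathways, reactions = set(), set()
    stack, seen = [pathway_id], set()
    while stack:
        current = stack.pop()
        if current in seen:       # guard against repeated components
            continue
        seen.add(current)
        if current in reaction_ids:
            reactions.add(current)
        elif current in sub_components:
            stack.extend(sub_components[current])
        else:
            base_pathways.add(current)
    return base_pathways, reactions


# Made-up identifiers: SUPER-1 contains another super-pathway, a base
# pathway, and one connecting reaction.
sub_components = {
    "SUPER-1": ["SUPER-2", "BASE-A", "RXN-X"],
    "SUPER-2": ["BASE-B", "BASE-C"],
}
base, rxns = expand_super_pathway("SUPER-1", sub_components, {"RXN-X"})
```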

6.3 Pathway Variants

Experimentalists have elucidated variants of a given metabolic pathway in different organisms. These pathway variants [7] are defined as distinct pathway objects in MetaCyc when they differ in their constituent reactions. For example, MetaCyc contains several variations of the TCA cycle that have been observed in different organisms. Variant pathways are assigned names such as “TCA cycle variation II” and “arginine degradation III” (meaning the third form of arginine degradation in MetaCyc). Two pathways are not considered variants of one another in MetaCyc if their constituent reactions are identical. For example, if the glycolysis pathway occurs with identical constituent reactions in two different organisms, but with different enzymes (different protein sequences), these occurrences of glycolysis are not considered to be pathway variants. When copied to other Pathway/Genome databases by the PathoLogic pathway prediction program, the names of variant pathways are not changed. This approach provides for consistent naming across DBs, but does have the side effect that names may seem inconsistent if considered only in the context of one DB, such as a DB that contains only “arginine degradation III” and “arginine degradation VIII”.

6.4 Conspecific and Chimeric Pathways

We define [7] a conspecific pathway as a pathway that belongs to a specific species, whereas a chimeric pathway is a pathway that combines elements from many species but does not occur in any one species in its entirety. MetaCyc contains both types of pathways, which are labeled as such.

6.5 Do We Force a Pathway View of the Metabolic Network?

No. Pathways comprise a layer defined on top of the reaction-based metabolic network. Users can compute with the metabolic (reaction) network directly, ignoring the pathway layer, if they so choose. Note that every PGDB contains some metabolic reactions that are not assigned to any metabolic pathway.

7 Ontologies Used in PGDBs

This section summarizes the multiple ontologies used to structure information within PGDBs.

7.1 Evidence Code Ontology

PGDB objects contain evidence codes that describe the types of evidence that support and justify the inclusion of that entry in the PGDB. For example, evidence codes on an enzyme entry indicate the type of evidence supporting the inference that the enzyme catalyzes its associated biochemical reaction. See [8] for a general description of the PGDB evidence-code system. The evidence-code ontology is described here and the latest version of the ontology can be browsed here.

Because the data in MetaCyc are derived from the experimental literature, the vast majority of evidence codes within MetaCyc will be experimental evidence codes that identify different classes of experimental methods that support data within MetaCyc. However, the evidence-system supports a larger set of evidence types, including evidence based on computational inferences, which are used more extensively in PGDBs other than MetaCyc.

Evidence codes appear as icons in the upper-right corner of display pages such as pathway and enzyme pages (example). For example, a computer icon in the upper-right corner of a pathway page means the presence of that pathway in that organism was predicted computationally; a flask icon in that same page location indicates that experimental evidence supports the existence of that pathway. Often a detailed evidence code has been assigned that indicates a specific type of experimental method. Click on the flask icon to see a description of the type(s) of experimental method(s) used to elucidate the pathway (when available), and for associated citations (when available).

7.2 Gene Ontology

Gene Ontology (GO) terms from all three GO taxonomies (biological process, molecular function, cellular component) can be associated with polypeptides and with protein complexes in PGDBs. Ideally, GO terms that refer to the activity of a complex should be assigned to the complex directly rather than to its component gene products. However, since GO terms imported from external sources are often associated only with gene products, either is acceptable. Each GO term is represented by a class object that includes its description, links to its parent and child terms via both the is-a and part-of relationships, and links to all the objects in the PGDB that are directly annotated with that term. Each PGDB contains objects only for those GO terms actually used in the PGDB, i.e., those to which proteins are annotated plus all ancestors of those terms. The complete GO taxonomy is maintained in a separate database and updated periodically – terms from this database are imported into a PGDB if necessary in the course of annotating a protein. Pre-release consistency checks ensure that the GO terms in a PGDB match those in the master database, and any proteins annotated to obsolete terms are flagged for attention from a curator.
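
The rule that a PGDB holds only the annotated GO terms plus all of their ancestors amounts to an upward closure over the parent links. The sketch below illustrates that closure on a hypothetical parent table. The function name and the term identifiers are placeholders rather than real GO ids, and the table is assumed to merge the is-a and part-of parents of each term.

```python
def terms_to_store(annotated_terms, parents):
    """Return the annotated terms together with all of their ancestors.

    parents: {term: [parent terms]}, assumed to combine is-a and part-of links.
    """
    needed = set()
    stack = list(annotated_terms)
    while stack:
        term = stack.pop()
        if term in needed:
            continue
        needed.add(term)
        stack.extend(parents.get(term, ()))
    return needed


# Placeholder hierarchy: term-other is never annotated, so it is not stored.
parents = {
    "term-leaf": ["term-mid"],
    "term-mid": ["term-root"],
    "term-other": ["term-root"],
}
stored = terms_to_store({"term-leaf"}, parents)
```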

The assignment of a GO term to a protein should be annotated with an evidence code and citation. The evidence code is one from the Pathway Tools evidence code ontology described above, but a straightforward mapping between Pathway Tools evidence codes and Gene Ontology evidence codes allows easy conversion from one to the other. The evidence code assignment may include the GO with/from field (although it is not required and there are no restrictions on its use or format). No other GO qualifiers (e.g. not, contributes-to) are supported.

The GO terms for a protein are listed with their evidence codes and citations in a separate tab on the web protein page. Clicking on a GO term shows the page for that term, including links to all proteins annotated with that term. GO terms can be supplied as search criteria on the Gene/Protein/RNA Search page. In addition, the cellular component GO terms serve as the repository of protein cellular location information in Pathway Tools. A mapping between cellular component GO terms and Pathway Tools CCO terms (described in Section 4.4) allows the protein LOCATIONS slot (which describes the protein location using CCO terms) to be computed directly from the GO terms.

7.3 Cell Component Ontology

PGDBs refer to cellular locations in a number of contexts. When annotating the cellular locations of proteins we use Gene Ontology terms. But in several other contexts PGDBs use a Cell Component Ontology (CCO) developed at SRI. It is a bit more extensive than GO because it contains a number of additional terms not present in GO, as well as relationships not found in GO. For example, the CCO Surrounds relationship defines the topological relationships between cellular compartments: a membrane X surrounds a cellular space Y if X encloses the space Y (e.g., the plasma membrane of a bacterial cell surrounds its cytosol).

CCO is used to define the source and destination compartments used by transport reactions in PGDBs, and is used to define the cellular architecture of different cell types in PGDBs, i.e., the set of cellular components defined in a given cell type as well as their topological relationships.

CCO is described here and the latest version of the CCO ontology can be browsed here.

7.4 Enzyme Commission Ontology

The Enzyme Commission system is present within PGDBs and its ontology of enzymes is used to classify PGDB reactions.

7.5 Pathway Ontology

PGDBs contain an ontology of metabolic pathways that divides pathways into classes such as biosynthetic, catabolic, and energy metabolism pathways. The latest version of the pathway ontology can be browsed here.

7.6 Regulatory Mechanism Ontology

Regulatory interactions within PGDBs are encoded using a regulatory-mechanism ontology. It includes regulatory classes such as Regulation-of-Transcription-Initiation and Transcriptional-Attenuation. The latest version of the ontology can be browsed here.

8 Quality Checking of PGDB Data

We have developed a number of procedures and software tools to ensure the quality of PGDB data.

8.1 Editor Tool Quality Checks

The interactive editing tools that curators use to update PGDBs implement a number of quality checks and warn curators when entered data does not pass these checks. For example, the reaction editor warns the user if an entered reaction is not balanced. The editors warn the user if required fields have not been entered. They ensure that entered data is of the proper type, such as allowing the user to choose from a controlled vocabulary of choices, or ensuring that specific fields are filled with objects of a specific type (e.g., the entities listed as reactants and products of a reaction must be chemical compounds). Syntax checking is also performed, such as checking that any HTML markup within mini-review text is properly formatted.

8.2 Consistency Checker Checks

The Consistency Checker is a large set of data checking tools that are run individually or as a group on a given PGDB. For example, they are run by curators shortly before each database release, on every PGDB that has been manually updated. Some of the tools identify data quality problems and report them so that curators can manually fix the problems. Other tools both find problems and automatically repair them.

The Consistency Checker includes the following tools.

Check for duplicate reaction objects. Check the elemental and charge balance of reactions. Ensure that EC numbers used within reaction objects are valid (have been neither transferred nor deleted).
Check for duplicate chemical compound objects (compounds that have the same chemical structure or name). Check chemical structures for redundant bonds and malformed bonds (e.g., bonds that refer to non-existent atoms).
Check transcription unit (operon) objects for invalid conditions such as containing no genes, and containing sets of genes that are not contiguous.
Run consistency constraints defined in the database such as checking numeric ranges of data values (for example, pH values can’t be outside the range of 0–14).
Check for database inverse links. For example, the slots Gene and Product are defined as inverses of one another, meaning that if the value of the Product slot for gene G is protein P, then the value of the Gene slot for P must be G.
Update Gene Ontology (GO) terms in each PGDB with respect to the most recently downloaded GO data files, such as ensuring that term definitions are current.
Remove stray newlines, trailing spaces, and tabs from common names and synonyms of objects. Search for malformed citations and malformed HTML tags within the text of mini-reviews.
Identify orphan objects, for example, an enzymatic activity object that does not point to both the reaction catalyzed and the enzyme catalyzing it.
Identify internal hyperlinks pointing at non-existing objects.
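
As an illustration of the inverse-link check in the list above, the sketch below reports slot values whose inverse is missing. It is not the Consistency Checker itself. The function name and the dictionary layout are assumptions, and the gene and protein ids are invented.

```python
def missing_inverse_links(forward, inverse):
    """Return (source, target) pairs present in forward but absent from inverse.

    forward: {frame_id: set of slot values}, e.g. the Product slot of genes.
    inverse: {frame_id: set of slot values}, e.g. the Gene slot of proteins.
    """
    problems = []
    for source, targets in forward.items():
        for target in targets:
            if source not in inverse.get(target, set()):
                problems.append((source, target))
    return sorted(problems)


# Invented data: gene G2 lists protein P2 as its product, but P2 does not
# point back to G2.
product_slot = {"G1": {"P1"}, "G2": {"P2"}}
gene_slot = {"P1": {"G1"}, "P2": set()}

gene_to_protein_problems = missing_inverse_links(product_slot, gene_slot)
protein_to_gene_problems = missing_inverse_links(gene_slot, product_slot)
```

Running the check in both directions, as in the last two lines, covers the case where either slot holds a value that the other does not mirror.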

The Consistency Checker also includes tools that compute derived information for a PGDB, such as the following. These tools are run on both curated and uncurated PGDBs.

Import from PubMed the bibliographic information for PubMed IDs cited in the PGDB.
Compute polypeptide molecular weights from the amino-acid sequence of the protein.
Compute and cache summary statistics for the PGDB, such as the number of pathways, number of metabolites, and number of reactions.

8.3 Checks on Newly Generated PGDBs

A number of checks are performed on newly generated PGDBs. Curators examine sample PGDBs manually to look for incorrect inferences or other problems. In addition, we compute statistics across newly generated PGDBs and discard PGDBs that do not contain at least 300 genes and 10 pathways, because such genomes are usually annotated so poorly that their data is of little value.
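
The size thresholds above can be expressed as a simple filter. This is only a sketch of the stated rule, read as requiring both minimums to be met. The function name, the statistics layout, and the sample counts are hypothetical.

```python
MIN_GENES = 300      # threshold stated in the text
MIN_PATHWAYS = 10    # threshold stated in the text


def keep_pgdb(stats):
    """True when a PGDB meets both minimum sizes (assumed reading of the rule)."""
    return stats["genes"] >= MIN_GENES and stats["pathways"] >= MIN_PATHWAYS


# Invented summary statistics for three newly generated PGDBs.
new_pgdbs = {
    "ORG-A": {"genes": 4200, "pathways": 250},
    "ORG-B": {"genes": 280, "pathways": 35},
    "ORG-C": {"genes": 900, "pathways": 4},
}
kept = sorted(name for name, stats in new_pgdbs.items() if keep_pgdb(stats))
```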

9 How to Learn More

BioCyc User’s Guide
EcoCyc User’s Guide
MetaCyc User’s Guide
How to Use a Pathway Tools Website
Guide to the Pathway Tools Schema and articles describing the Pathway Tools ontology [9, 10, 11]
How to download Pathway Tools and organism flat-file databases

References

[1]	Edwin C. Webb. Enzyme Nomenclature, 1992: Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes. Academic Press, 1992.
[2]	Marvin Chemical Editor. http://www.chemaxon.com/products/marvin/.
[3]	J.C. Wilks and J.L. Slonczewski. pH of the cytoplasm and periplasm of Escherichia coli: rapid measurement by green fluorescent protein fluorimetry. J Bacteriol, 189(15):5601–7, 2007. Epub 2007 Jun 1.
[4]	M. Latendresse, J.P. Malerich, M. Travers, and P. D. Karp. Accurate atom-mapping computation for biochemical reactions. J Chem Inf Model, 2012.
[5]	M. D. Jankowski, C. S. Henry, L. J. Broadbelt, and V. Hatzimanikatis. Group contribution method for thermodynamic analysis of complex metabolic networks. Biophys J, 95(3):1487–1499, Aug 2008.
[6]	Robert A. Alberty. Thermodynamics of Biochemical Reactions. Wiley-InterScience, John Wiley & Sons, Hoboken, New Jersey, 2003.
[7]	R. Caspi, K. Dreher, and P. D. Karp. The challenge of constructing, classifying, and representing metabolic pathways. FEMS Microbiol Lett, 345(2):85–93, 2013.
[8]	P. D. Karp, S. Paley, C.J. Krieger, and P. Zhang. An evidence ontology for use in pathway/genome databases. In R. Altman and T. Klein, editors, Proc Pacific Symposium on Biocomputing, pages 190–201, Singapore, 2004. World Scientific. http://www.ai.sri.com/pkarp/pubs/04psb-evidence.pdf.
[9]	P. D. Karp and M. Riley. Representations of metabolic knowledge. In L. Hunter, D. Searls, and J. Shavlik, editors, Proc First International Conference on Intelligent Systems for Molecular Biology, pages 207–215, Menlo Park, CA, 1993. AAAI Press.
[10]	P. D. Karp and S. M. Paley. Representations of metabolic knowledge: Pathways. In R. Altman, D. Brutlag, P. Karp, R. Lathrop, and D. Searls, editors, Proc Second International Conference on Intelligent Systems for Molecular Biology, pages 203–211, Menlo Park, CA, 1994. AAAI Press.
[11]	P. D. Karp. An ontology for biological function based on molecular interactions. Bioinformatics, 16(3):269–285, 2000.
