Kumar et al. BMC Bioinformatics 2012, 13:6
http://www.biomedcentral.com/1471-2105/13/6
DATABASE Open Access
MetRxn: a knowledgebase of metabolites and
reactions spanning metabolic models and
databases
Akhil Kumar1, Patrick F Suthers2 and Costas D Maranas2*
Abstract
Background: Increasingly, metabolite and reaction information is organized in the form of genome-scale
metabolic reconstructions that describe the reaction stoichiometry, directionality, and gene to protein to reaction
associations. A key bottleneck in the pace of reconstruction of new, high-quality metabolic models is the inability
to directly make use of metabolite/reaction information from biological databases or other models due to
incompatibilities in content representation (i.e., metabolites with multiple names across databases and models),
stoichiometric errors such as elemental or charge imbalances, and incomplete atomistic detail (e.g., use of generic
R-group or non-explicit specification of stereo-specificity).
Description: MetRxn is a knowledgebase that includes standardized metabolite and reaction descriptions by
integrating information from BRENDA, KEGG, MetaCyc, Reactome.org and 44 metabolic models into a single unified
data set. All metabolite entries have matched synonyms, resolved protonation states, and are linked to unique
structures. All reaction entries are elementally and charge balanced. This is accomplished through the use of a
workflow of lexicographic, phonetic, and structural comparison algorithms. MetRxn allows for the download of
standardized versions of existing genome-scale metabolic models and the use of metabolic information for the
rapid reconstruction of new ones.
Conclusions: The standardization in description allows for the direct comparison of the metabolite and reaction
content between metabolic models and databases and the exhaustive prospecting of pathways for
biotechnological production. This ever-growing dataset currently consists of over 76,000 metabolites participating
in more than 72,000 reactions (including unresolved entries). MetRxn is hosted on a web-based platform that uses
relational database models (MySQL).
Background
The ever accelerating pace of DNA sequencing and annotation information generation [1] is
spearheading the global inventorying of metabolic functions across all kingdoms of life.
Increasingly, metabolite and reaction information is organized in the form of community [2],
organism, or even tissue-specific genome-scale metabolic reconstructions. These reconstructions
account for reaction stoichiometry and directionality, gene to protein to reaction associations,
organelle reaction localization, transporter information, transcriptional regulation and biomass
composition. Already over 75 genome-scale models are in place for eukaryotic, prokaryotic and
archaeal species [3] and are becoming indispensable for computationally driving engineering
interventions in microbial strains for targeted overproductions [4-7], elucidating the organizing
principles of metabolism [8-11] and even pinpointing drug targets [12,13]. A key bottleneck in the
pace of reconstruction of new high quality metabolic models is our inability to directly make use of
metabolite/reaction information from biological databases [14] (e.g., BRENDA [15], KEGG [16],
MetaCyc, EcoCyc, BioCyc [17], BKM-react [18], UM-BBD [19], Reactome.org, Rhea, PubChem, ChEBI etc.)
or other models [20] due to incompatibilities of representation, duplications and errors, as
illustrated in Figure 1.
*Correspondence: [email protected]
2 Department of Chemical Engineering, The Pennsylvania State University, University Park, PA 16802, USA
Full list of author information is available at the end of the article
© 2012 Kumar et al; licensee BioMed Central Ltd. This is an Open Access article distributed under
the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original
work is properly cited.
1) Naming Inconsistencies
2-Oxoglutarate + L-Alanine <=> Pyruvate + L-Glutamate
KEGG C00026 + C00041<=>C00022+C00025
BRENDA alpha-ketoglutarate + L-alanine<=>L-glutamate + pyruvate
Escherichia coli iAF1260 [c] : akg + ala-L --> glu-L + pyr
Acinetobacter baylyi iAbaylyi 1 GLT + 1 PYRUVATE <-> 1 2-KETOGLUTARATE + 1 L-ALPHA-ALANINE
Leishmania major iAC560 [m] : akg + ala-L -> glu-L + pyr
Mannheimia succiniciproducens PYR + GLU --> AKG + ALA
2) Elemental and charge imbalances
Balanced
KEGG (R)-Lactate + NAD+<=>Pyruvate + NADH + H+
Escherichia coli iAF1260 [c] : lac-D + nad --> h + nadh + pyr
Unbalanced
Acinetobacter baylyi iAbaylyi 1 D-LACTATE + 1 NAD <=>1 NADH + 1 PYRUVATE
3) Errors/incompleteness/ambiguity in structural information
[Plot: number of metabolites versus number of structures, for models and databases]
Figure 1 Typical incompatibilities and inconsistencies in genome-scale models and databases.
Roadblocks to using genome-scale models and databases include ambiguities and differences in naming
conventions, lack of balanced reactions, and incompleteness of structural information.
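The elemental and charge imbalance shown in panel 2 of Figure 1 can be checked mechanically once each metabolite has a formula and a charge. The following is a minimal sketch, not the MetRxn implementation: the formulas and charges in `METABOLITES` are assumed values for near-neutral pH that are not given in the text, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed formulas and charges at near-neutral pH (illustrative values,
# not taken from the article).
METABOLITES = {
    "lac-D": ("C3H5O3", -1),
    "nad": ("C21H26N7O14P2", -1),
    "nadh": ("C21H27N7O14P2", -2),
    "pyr": ("C3H3O3", -1),
    "h": ("H", +1),
}


def parse_formula(formula):
    """Turn a formula such as 'C3H5O3' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def imbalance(reactants, products):
    """Return (element surplus, charge surplus), reactants minus products.

    Both arguments map a metabolite abbreviation to its stoichiometric
    coefficient. An empty dict and a charge of 0 mean the reaction balances.
    """
    elements = Counter()
    charge = 0
    for side, sign in ((reactants, 1), (products, -1)):
        for met, coeff in side.items():
            formula, met_charge = METABOLITES[met]
            for element, n in parse_formula(formula).items():
                elements[element] += sign * coeff * n
            charge += sign * coeff * met_charge
    return {e: n for e, n in elements.items() if n != 0}, charge


# iAF1260-style entry (with the proton) and iAbaylyi-style entry (without it).
with_proton = imbalance({"lac-D": 1, "nad": 1}, {"h": 1, "nadh": 1, "pyr": 1})
without_proton = imbalance({"lac-D": 1, "nad": 1}, {"nadh": 1, "pyr": 1})
print(with_proton)     # ({}, 0)
print(without_proton)  # ({'H': 1}, 1)
```

Under these assumed formulas, the entry without the proton leaves one hydrogen and one positive charge unaccounted for on the product side, which is the pattern expected when H+ is omitted from a reaction involving NAD/NADH.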
A major impediment is the presence of metabolites with multiple names across databases and models,
and in some cases within the same resource, which significantly slows down the pooling of
information from multiple sources. Therefore, the almost unavoidable inclusion of multiple
replicates of the same metabolite can lead to missed opportunities to reveal (synthetic) lethal gene
deletions, repair network gaps and quantify metabolic flows. Moreover, most data sources
inadvertently include some reactions that may be stoichiometrically inconsistent [21] and/or
elementally/charge unbalanced [22,23], which can adversely affect the prediction quality of the
resulting models if used directly. Finally, a large number of metabolites in reactions are partly
specified with respect to structural information and may contain generic side groups (e.g., alkyl
groups -R), varying degree of a repeat unit participation in oligomers, or even just compound class
identification such as “an amino acid” or “electron acceptor”. Over 3% of all metabolites and 8% of
all reactions in the aforementioned databases and models exhibit one or more of these problems.
There have already been a number of efforts aimed at addressing some of these limitations. The Rhea
database, hosted by the European Bioinformatics Institute, aggregates reaction data primarily from
IntEnz [24] and ENZYME [25], whereas Reactome.org is a collection of reactions primarily focused on
human metabolism [26,27]. Even though they crosslink their data to one or more popular databases
such as KEGG, ChEBI, NCBI,
Ensembl, Uniprot, etc., both retain their own representation formats. More recently, the BKM-react
database is a non-redundant biochemical reaction database containing known enzyme-catalyzed
reactions compiled from BRENDA, KEGG, and MetaCyc [18]. The BKM-react database currently contains
20,358 reactions. Additionally, the contents of five frequently used human metabolic pathway
databases have been compared [28]. An important step forward for models was the BiGG database, which
includes seven genome-scale models from the Palsson group in a consistent nomenclature and
exportable in SBML format [29-31]. Research towards integrating genome-scale metabolic models with
large databases has so far been even more limited. Notable exceptions include the partial
reconciliation of the latest E. coli genome scale model iAF1260 with EcoCyc [32] and the aggregation
of data from the Arabidopsis thaliana database and KEGG for generating genome-scale models [33] in a
semi-automated fashion. Additionally, ReMatch integrates some metabolic models, although its primary
focus is on carbon mappings for metabolic flux analysis [34]. Also, many metabolic models retain the
KEGG identifiers of metabolites and reactions extracted during their construction [35,36]. An
important recent development is the web resource Model SEED that can generate draft genome-scale
metabolic models drawing from an internal database that integrates KEGG with 13 genome scale models
(including six of the models in the BiGG database) [37]. All of the reactions in Model SEED and BiGG
are charge and elementally balanced.
In this paper, we describe the development and highlight applications of the web-based resource
MetRxn that integrates, using internally consistent descriptions, metabolite and reaction
information from 8 databases and 44 metabolic models. The MetRxn knowledgebase (as of October 2011)
contains over 76,000 metabolites and 72,000 reactions (including unresolved entries) that are charge
and elementally balanced. By conforming to standardized metabolite and reaction descriptions, MetRxn
enables users to efficiently perform queries and comparisons across models and/or databases. For
example, common metabolites and/or reactions between models and databases can rapidly be generated
along with connected paths that link source to target metabolites. MetRxn supports export of models
in SBML format. New models are being added as they are published or made available to us. It is
available as a web-based resource at http://metrxn.che.psu.edu.

Construction and Content
MetRxn construction
The construction of MetRxn largely followed these steps, as illustrated in Figure 2: 1) download of
primary sources of data from databases and models, 2) integration of metabolite and reaction data,
3) calculation and reconciliation of structural information, 4) identification of overlaps between
metabolite and reaction information, 5) elemental and charge balancing of reactions, 6) successive
resolution of remaining ambiguities in description.
Step 1: Source data acquisition
Metabolite and reaction data was downloaded from BRENDA, KEGG, BioCyc, BKM-react and other databases
using a variety of methods based on protocols such as SOAP, FTP and HTTP. We preprocessed the data
into flat files that were subsequently imported into the knowledgebase. All original information
pertaining to metabolite name, abbreviations, metabolite geometry, related reactions, catalyzing
enzyme and organism name, gene-protein-reaction associations, and compartmentalization was retained.
For all 44 initial genome-scale models listed, the online information from the corresponding
publications was also imported. The source codes for all parsers used in Step 1 are available on the
MetRxn website.
Step 2: Source data parsing
The “raw data” from both databases and models was unified using standard SQL scripts on a MySQL
server. The description schema for metabolites includes source, name, abbreviations used in the
source, chemical formula, and geometry. The schema for reactions accounts for source, name, reaction
string (reactants and products), organism designation, associated enzymes and genes, EC number,
compartment, reversibility/direction, and pathway information. Once a source has been imported into
the MySQL server, a data source-specific dictionary is created to map metabolite abbreviations onto
names/synonyms and structures and metabolites to reactions.
Step 3: Metabolite charge and structural analysis
We used Marvin (Chemaxon) to analyze all 218,122 raw metabolite entries containing structural
information (out of a total of 322,936, including BRENDA entries). Inconsistencies were found in
12,965 entries typically due to wrong atom connectivity, valence, bond length or stereochemical
information, which were corrected using APIs available in Marvin. A final corrected version of the
metabolite geometries was calculated at a fixed pH of 7.2 and converted into standard Isomeric
SMILES format. The structure/formula used corresponded to the major microspecies found during the
charge calculation, which effectively rounds the charge to an integer value in accordance with
previous model construction conventions. This format includes both chiral and stereo information, as
it allows specification of molecular configuration [38-40]. Metabolites were also annotated with
Canonical SMILES using the OpenBabel Interface from ChemSpider. The canonical representation encodes
[Figure 2 flowchart: metabolite and reaction information extraction from databases and genome-scale
models; download/identification of metabolite bond connectivity information (PubChem, ChemSpider);
raw database; metabolite identity analysis using canonical/isomeric SMILES at pH 7.2 (metabolites
with full, partial, or no atomistic detail); elemental and charge balancing of reactions; reaction
identity analysis (all metabolites have full atomistic detail, or at least one metabolite lacks full
atomistic detail); disambiguation of metabolites using structure and phonetic comparisons; MetRxn]
Figure 2 Flowchart outlining the construction of MetRxn. After download of primary sources of data
from databases and models, we integrated metabolite and reaction data, followed by calculation and
reconciliation of structural information. By identifying overlaps between metabolite and reaction
information, we generated elemental and charge balancing of reactions. The procedure for developing
MetRxn was iterative with subsequent passes making use of previous associations to resolve remaining
ambiguities.
only atom-atom connectivity while ignoring all conformers for a metabolite. Using bond connectivity
information from the primary sources and resources such as PubChem and ChemSpider we used Canonical
SMILES [41,42] to resolve the identity of 34,984 metabolites and 32,311 reactions. Another 6,100
metabolites and 11,401 reactions involved, in various degrees, lack of full atomistic detail in
their description (e.g., use an R or X as
side-chains, are generic compounds like “amino acid” or “electron acceptor”). Over 25,000 duplicate
metabolites and 27,000 reaction entries were identified and consolidated within the database. The
metabolites and reactions present in the resolved repository were further classified with respect to
the completeness of atomistic detail in their description.
Step 4: Metabolite synonyms and initial reaction reconciliation
Raw metabolite entries were assigned to Isomeric SMILES representations whenever possible. If
insufficient structural information was available for a downloaded raw metabolite then it was
assigned temporarily with the Canonical SMILES and revisited during the reaction reconciliation.
Canonical SMILES retain atom connectivity but not stereo-specificity and are used as the basic
metabolite topology descriptors as many metabolic models lack stereo-specificity information. After
generating the initial metabolite associations, we identified reaction overlaps using the reaction
synonyms and reaction strings along with the metabolite SMILES representations. Directionality and
cofactor usage were temporarily ignored. During this step, reactions were flagged as
single-compartment or two-compartment (i.e., transport reactions). MetRxn internally retains the
original compartment designations, but currently only displays these simplified compartment
designations. In analogy to metabolites, reactions were grouped into families that shared
participants but in the source data sets occurred in different compartments or differed only in
protonation.
Step 5: Reaction charge and elemental balancing
Once metabolites were assigned correct elemental composition and protonation states, reactions were
charge and elementally balanced. To this end, for charge balancing we relied on a linear programming
representation that minimizes the difference in the sum of the charge of the reactants and the sum
of the charge on the products. The complete formulation is provided in the documentation at MetRxn.
Step 6: Iterative reaction reconciliation
Reactions with one (or more) unresolved reactants and/or products were string compared against the
entire resolved collection of reactions. This step was successively executed as newly resolved
metabolites and reactions could enable the resolution of previously unresolved ones. After the first
pass 164 metabolites were resolved, while subsequent passes (up to 18 for some models) helped
resolve a total of 8,720 entries. Reactions with significant (but not complete) overlapping sets of
reactants/products are additionally sent to the curator GUI including phonetic information. Briefly,
the phonetic tokens of synonyms with known structures were compared against the ones without any
associated structure. The algorithm suppresses keywords/tokens depicting stereo information such as
cis, trans, L-, D-, alpha, beta, gamma, and numerical entries because they change the phonetic
signature of the synonym under investigation. In addition, the algorithm ignores non-chemistry
related words (e.g., use, for, experiment) that are found in some metabolite names. Certain tokens
such as “-ic acid” and “-ate” are treated as equivalent. PubChem and ChemSpider sources were accessed
through the GUI so that the curator gets as much information as possible to identify the data
correctly. Phonetic matches provided clues for resolving over 159 metabolites. The iterative
application of string and phonetic comparison algorithms resolved as many as 8,879 metabolites after
18 rounds of reconciliation.
Upon completion of this workflow, all genome-scale models are reformatted into a computation-ready
form and Flux Balance Analysis [43] is performed on both the source model and the standardized model
in MetRxn to ascertain the ability of the model to produce biomass before and after standardization.
We performed the calculations using GAMS version 12.6. MetRxn is accessible through a web interface
that indirectly generates MySQL queries. In order to facilitate analysis and use of the data, a
number of tools are provided as part of MetRxn.

Data export and display
MetRxn supports a number of export capabilities. In general, any list that is displayed contains
live links to the metabolite or reaction entities. These lists can consist of an entire model, data
from a comparison, or query results. All items can be exported to SBML format. In addition, the
public MySQL database will be made available upon request. Because of licensing limitations, the
BRENDA database cannot be exported and is not part of the public MySQL database. However, we plan to
provide Java source code that allows for the integration of a local copy of the public MySQL
database with the BRENDA database (provided upon request).

Source comparisons and visualization
In addition to listing the content (number of metabolites, reactions, etc.) of the selected data
source(s), MetRxn contains tools for comparing two or more models and visualizing the results. These
associations can be for metabolites or reactions. During these comparisons compartment information
and reversibility are suppressed. Comparison tables are generated by comparing the associations
between the selected data source(s) using the canonical structures.
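A comparison of this kind, keyed on canonical metabolite identity with compartment and reversibility suppressed, might look like the following sketch. It is an illustration under stated assumptions, not the MetRxn code: the per-source dictionaries are hypothetical, KEGG compound IDs from Figure 1 stand in for canonical structures, the generic ALA and GLU abbreviations are assumed to mean the L-isomers, and stoichiometric coefficients are ignored for brevity.

```python
# Hypothetical source-specific dictionaries mapping each model's
# abbreviations onto a shared canonical identifier.
DICTIONARIES = {
    "iAF1260": {
        "akg": "C00026", "ala-L": "C00041",
        "pyr": "C00022", "glu-L": "C00025",
    },
    "M. succiniciproducens": {
        "AKG": "C00026", "ALA": "C00041",
        "PYR": "C00022", "GLU": "C00025",
    },
}


def reaction_key(source, reactants, products):
    """Build a key that ignores naming, compartment, and direction.

    Each side becomes a frozenset of canonical IDs, and the two sides are
    stored in an unordered frozenset so A -> B and B -> A give the same key.
    """
    to_id = DICTIONARIES[source]
    left = frozenset(to_id[m] for m in reactants)
    right = frozenset(to_id[m] for m in products)
    return frozenset((left, right))


# The alanine transaminase entries from Figure 1, compartment tags removed.
raw_ecoli = "akg + ala-L --> glu-L + pyr"
raw_msucc = "PYR + GLU --> AKG + ALA"

ecoli = {reaction_key("iAF1260", ["akg", "ala-L"], ["glu-L", "pyr"])}
msucc = {reaction_key("M. succiniciproducens", ["PYR", "GLU"], ["AKG", "ALA"])}

shared = ecoli & msucc
print(raw_ecoli == raw_msucc)  # False: a plain string match misses the overlap
print(len(shared))             # 1: the canonical keys coincide
```

Using sets for both the sides and the pair of sides is a deliberate simplification: it makes the key independent of the order in which metabolites are written and of the direction in which the reaction is stored, at the cost of discarding stoichiometry.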
MetRxn Scope
An initial repository of 154,399 reaction and 322,936 metabolite entries was downloaded from 8
databases and 44 genome-scale metabolic models. We compiled a non-redundant list of 42,540
metabolites and 35,474 reactions (after consolidating duplicate entries) containing full atomistic
and bond connectivity detail. Another 6,100 metabolites and 11,401 reactions have partial atomistic
detail typically containing generic side-chains (R) and/or an unspecified number of polymer repeat
units. Finally, 5,436 metabolites in metabolic models and 8,000 metabolites in databases are
retained with no atomistic detail. In some cases lack of atomistic detail reflects complete lack of
identity specificity (e.g., electron donor) whereas in other cases even though the chemical species
is fully defined, atomistic level description is not warranted (e.g., gene product of dsbC protein
disulfide isomerase II (reduced)). Figure 3 shows the distribution of metabolite resolution across
models and databases in MetRxn. In general, metabolites without fully-specified structures tend to
participate in a relatively small number of reactions.
The workflow followed in the creation of the MetRxn knowledgebase identified a number of
inconsistencies. For instance, the same metabolite name may map to molecules with different numbers
of repeat units (e.g., lecithin) or completely different structures (e.g., AMP could refer to either
adenosine monophosphate or ampicillin). Notably, even for the most well-curated metabolic model,
E. coli iAF1260 [32], we found minor errors or omissions (a total of 17) arising from
inconsistencies or incompleteness of representation in the data culled from other sources. For
example, the metabolite abbreviation arbtn-fe3 was mistakenly associated with the KEGG ID and
structure of aerobactin instead of ferric-aerobactin. The number of inconsistencies is dramatically
increased
[Figure 3 bar chart: number of metabolites with full, partial, or no atomistic detail per source.
Main panel: genome-scale models (e.g., Zea mays iRS1563, Arabidopsis thaliana iRS1597, Escherichia
coli iAF1260). Inset: BRENDA, BKM-react, ChEBI, KEGG, MetaCyc, HMDB, BiGGDB, ChemSpider, with axis
ticks from 0 to 40000.]
Figure 3 Various levels of structural information were available for models (main) and databases
(inset). For every model, the majority of metabolites had full atomistic detail (blue). The smaller
number of metabolites with partial atomistic detail (orange) such as generic side chains, or with no
atomistic detail (green) such as gene products, participated in few reactions.
for less-curated metabolic models. We used a variety of procedures to disambiguate the identity of
metabolites lacking structural information ranging from reaction matching to phonetic searches. For
example, in the Corynebacterium glutamicum model [44], 7,8-aminopelargonic acid (DAPA) has no
associated structural information. Reaction matching found the same reaction in the E. coli iAF1260
model.

C. glutamicum: DAPA + ATP + CO2 ⇔ DTBIOTIN + ADP + PI

iAF1260 [c]: atp + co2 + dann → adp + dtbt + (3) h + pi

which implies that 7,8-aminopelargonic acid (DAPA) is identical to 7,8-Diaminononanoate (dann).
Examination of pelargonic acid and nonanoate reveals that they were indeed known synonyms. In many
cases, we were also able to assign stereo-specific information to metabolite entries in models
(e.g., stipulate the L-lysine isomer for lysine). We made use of an iterative approach that allowed
us to map structures from models with explicit links to structures (e.g., to KEGG or CAS numbers) to
models that only provided metabolite names. Furthermore, by using a phonetic algorithm that uses
tokens for equivalent strings in metabolite names (e.g., ‘-ic acid’ and ‘-ate’ are equivalent) we
were able to resolve more than an additional 159 metabolites. For example, phonetic searches flagged
cis-4-coumarate and COUMARATE in the Acinetobacter baylyi model [45] as potentially identical
compounds. Additional checks revealed that indeed both metabolites should map to the same structure.
A more complex matching example involved
1-(5’-Phosphoribosyl)-4-(N-succinocarboxamide)-5-aminoimidazole from the Bacillus subtilis model [46]
and 1-(5’-Phosphoribosyl)-5-amino-4-(N-succinocarboxamide)-imidazole from the Aspergillus nidulans
model [47]. We note that the phonetic algorithm only makes suggestions and orders the possible
matches for the curator. Next, we detail three examples that provide an insight into the type of
tasks that MetRxn can facilitate.

Utility and Discussion
1. Charge and elementally balanced metabolic models
The standardized description of metabolites and balanced reactions afforded by MetRxn enables the
expedient repair of existing models for metabolite naming inconsistencies and reaction balancing
errors. Here we highlight one such metabolic model repair for Acinetobacter baylyi iAbaylyiv4 [45].
We identified that 189 out of 880 reactions are not elementally or charge balanced. Most of the
reactions with charge balance errors involved a missed proton in reactions involving cofactor pairs
such as NAD/NADH. For example, a proton had to be added to the reactants side in the reaction
(R, R)-Butanediol-dehydrogenase in which butanediol reacts with NAD to form acetoin. In addition,
the stoichiometric coefficient of water in GTP cyclohydrolase I was erroneously set at -2 which
resulted in an imbalance in oxygen atoms. The re-balancing analysis changed the coefficient to -1
(as listed in BRENDA) and added a proton to the list of reactants (absent from BRENDA) in order to
also balance charges.
We performed flux balance analysis (FBA) on both the published and MetRxn-based rebalanced version
of the Acinetobacter baylyi model using the uptake constraints listed in [45] to assess the effect
of re-balancing reaction entries on FBA results. We found that the maximum biomass using the
glucose/ammonia uptake environment decreased by 9% primarily due to the increased energetic costs
associated with maintaining the proton gradient. This result demonstrates the significant effect
that lack of reaction balancing may cause in FBA calculations. Overall, we found that nearly
two-thirds of the models had at least one unbalanced reaction, with over 2,400 entities across all
models that were either charge or elementally imbalanced. Frequently, the same reaction was
imbalanced in multiple models (each occurrence was counted separately).

2. Contrasting existing metabolic models
At the onset of creating MetRxn, we conducted a brief preliminary study to quantify the
extent/severity of naming inconsistencies by contrasting the reaction information contained in an
initial collection of 34 of the most popular genome-scale models spanning 21 bacterial, 10
eukaryotic and three archaeal organisms. Across all branches of life, most metabolic processes are
largely conserved (e.g., glycolysis, pentose phosphate pathway, amino acid biosynthesis, etc.);
therefore, we expected to uncover a large core of common reactions shared by all models.
Surprisingly, we found that only three reactions (i.e., phosphoglycerate mutase, phosphoglycerate
kinase, and CO2 transport) were directly recognized as common across those 34 models using a simple
string match comparison. Even when examining models for only a few bacterial organisms (Bacillus
subtilis, Escherichia coli, Mycobacterium tuberculosis, Mycoplasma genitalium, and Salmonella
Typhimurium) simple text searches recognized only 40 common reactions (out of a possible 262, which
is the size of the M. genitalium model). The reason for this glaring inconsistency is that differing
metabolite naming conventions, compartment designations, stoichiometric ratios, reversibility, and
water/proton balancing issues prevent the automated recognition of genuinely shared reactions across
models. Using the glucose-6-phosphate dehydrogenase reaction as a representative example, Table 1
reveals some of the
Table 1 Representation of glucose-6-phosphate dehydrogenase in selected metabolic models

Reaction | Model/Citation
[c] : g6p + nadp <==> 6pgl + h + nadph | Escherichia coli [32]; Lactobacillus plantarum [48]; Pseudomonas aeruginosa [49]; Staphylococcus aureus [50]
[c] : g6p + nadp --> 6pgl + h + nadph | Bacillus subtilis [51]; Mycobacterium tuberculosis [13]; Pseudomonas putida [52]; Rhizobium etli [53]; Saccharomyces cerevisiae [54]
[c]g6p + nadp <==> 6pgl + h + nadph | Saccharomyces cerevisiae [55]; Escherichia coli [56]
[c] : f420-2 + g6p --> 6pgl + f420-2h2 | Methanosarcina barkeri [57]
G6P + NADP <-> D6PGL + NADPH | Escherichia coli [58]; Mus musculus [59]; Saccharomyces cerevisiae [60,61]
G6P + NADP -> D6PGL + NADPH | Aspergillus nidulans [47]; Mannheimia succiniciproducens [62]; Streptomyces coelicolor [63]
G6P + NAD -> D6PGL + NADH | Helicobacter pylori [64]
C01172 + C00006 = C01236 + C00005 + C00080 | GSM mouse [35]
C00092 + C00006 <=> C01236 + C00005 + C00080 | Halobacterium salinarum [36]
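The effect of such notational differences on a string match can be illustrated with a minimal Python sketch. It is not MetRxn's procedure, which matches metabolites by structure; it only shows, under stated assumptions, how stripping compartment tags, case, arrow style and protons lets several Table 1 representations collapse onto one key. The alias map, the choice to ignore protons and direction, and the function name `canonical` are all hypothetical choices made for illustration.

```python
import re

# Hypothetical alias map: model-specific IDs -> one shared ID (illustration only).
ALIASES = {"d6pgl": "6pgl"}
# Assumption: protons are dropped before comparing, since many models omit them.
IGNORED = {"h"}

# Longer arrows are listed first so that "-->" is not read as "->".
ARROW = re.compile(r"\s*(<==>|<->|<=>|-->|->|=)\s*")
# Optional leading compartment tag such as "[c] : " or "[c]".
COMPARTMENT = re.compile(r"^\[\w+\]\s*:?\s*")


def canonical(reaction: str):
    """Return an order- and direction-insensitive key for a reaction string."""
    body = COMPARTMENT.sub("", reaction.strip().lower())
    left, _arrow, right = ARROW.split(body, maxsplit=1)

    def side(text):
        names = (name.strip() for name in text.split("+"))
        return frozenset(ALIASES.get(n, n) for n in names if n and n not in IGNORED)

    # An unordered pair of sides, so reversibility and direction are ignored.
    return frozenset({side(left), side(right)})


ecoli_a = canonical("[c] : g6p + nadp <==> 6pgl + h + nadph")
yeast = canonical("[c]g6p + nadp <==> 6pgl + h + nadph")
a_nidulans = canonical("G6P + NADP -> D6PGL + NADPH")
h_pylori = canonical("G6P + NAD -> D6PGL + NADH")

print(ecoli_a == yeast == a_nidulans)  # same reaction once notation is normalized
print(ecoli_a == h_pylori)             # different cofactor, so a different key
```

Discarding direction and protons is a deliberate simplification here; it also discards exactly the information that balancing needs, and the KEGG-style identifiers in the last two rows would still require a cross-reference table that this sketch does not attempt.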
reasons for failing to automatically recognize common reactions across selected models [13,32,35,36,47-64]. As many as nine different representations of the same reaction exist due to incomplete elemental and charge balancing, alternate cofactor usage among different organisms, and lack of universal metabolite naming conventions. We have found that this level of discord between models is representative for most metabolic reactions. This lack of consistency renders direct pathway comparisons across models meaningless and the aggregation of reaction information from multiple models precarious. This deficiency motivated the development of MetRxn. Given standardization in metabolite naming and elementally/charge balanced reaction entries, MetRxn allows for the identification of shared reactions as well as differences between any two metabolic models (assuming that all the metabolites in the compared reaction entries have full atomistic information). When making the comparison of those same metabolic models, MetRxn found an additional 15 reactions in common (for a total of 55, a 38% increase) and that 142 reactions are shared by B. subtilis, E. coli and Salmonella Typhimurium.

The Web interface of MetRxn allows for any number of models to be simultaneously compared. As a demonstration of this capability we chose to contrast the metabolic content of two clostridia models: Clostridium acetobutylicum [65] and Clostridium thermocellum [66]. Figure 4 shows the results in the form of a Venn diagram. Some of the differences between the clostridia species are not surprising, as they arise from their differing lifestyles (C. acetobutylicum contains solventogenesis pathways and a CoB12 pathway, whereas C. thermocellum contains cellulosome reactions). However, we found many differences that appear to reflect different conventions adopted when the two models were generated rather than genuine differences in metabolism. In particular, in the C. thermocellum model [66] charged/uncharged tRNA metabolites are explicitly tracked whereas they are not included in the C. acetobutylicum model [65]. Surprisingly, both clostridia models are more similar, at the metabolite level, to the Bacillus subtilis iBsu1103 model [46] than to each other (see Figure 4). Charged/uncharged tRNA metabolites account for most of the increased overlap between C. thermocellum and B. subtilis. Most of the reaction overlaps are in the amino acid biosynthesis pathways, carbohydrate metabolism, and nucleoside metabolism. It is important to note that 48 reactions in C. acetobutylicum, 67 reactions in C. thermocellum, and 120 reactions in B. subtilis lack full atomistic information (see Figure 3) and thus were excluded from any comparisons. It is possible that additional shared reactions between the two models can be deduced by further examining comparisons between not fully structurally specified metabolite entries. The string/phonetic comparison algorithms described under Step 6 along with assisted curation could be adapted for this task.

3. Using MetRxn to Bio-Prospect for Novel Production Routes
A “Grand Challenge” in biotechnological production is the identification of novel production routes that allow for the conversion of inexpensive resources (e.g., various sugars) into useful products (e.g., succinate, artemisinin) and bio-fuels (e.g., ethanol, butanol, biodiesel, etc.). Selected production routes must exhibit high yields, avoid thermodynamic barriers, bypass toxic intermediates and circumvent existing intellectual property restrictions. Historically, the incorporation of heterologous pathways relied largely on human intuition and literature review followed by experimentation [67,68]. Currently, rapidly expanding compilations of biotransformations such as KEGG [69] and BRENDA [70] are
[Venn diagrams: metabolites (left) and reactions (right) for C. acetobutylicum, C. thermocellum and B. subtilis]
Figure 4 Comparison of metabolite and reaction overlaps for C. acetobutylicum, C. thermocellum, and B. subtilis. Although the two Clostridium organisms belong to the same genus, the models of these two species had significant numbers of unique metabolites (left) and reactions (right), and comparisons revealed that there was more similarity in metabolite usage with a model of B. subtilis than with each other. In part, these overlaps were driven by the explicit accounting for charged tRNA species in the C. thermocellum and B. subtilis models, which was also reflected in the reaction overlaps through reactions involving these metabolites.
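Once metabolites carry standardized identifiers, the seven regions of a three-model Venn diagram such as Figure 4 reduce to plain set arithmetic. The following Python sketch shows one way to compute them; the function name `venn_regions`, the metabolite identifiers and the toy model contents are hypothetical and are not taken from the actual models or from the MetRxn interface.

```python
from itertools import combinations


def venn_regions(models: dict[str, set[str]]) -> dict[frozenset, set[str]]:
    """Map each combination of model names to the items found in exactly those models."""
    names = list(models)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Items present in every model of the combination ...
            inside = set.intersection(*(models[n] for n in combo))
            # ... minus items that also occur in any model outside it.
            others = [models[n] for n in names if n not in combo]
            outside = set().union(*others)
            regions[frozenset(combo)] = inside - outside
    return regions


# Toy, made-up metabolite sets keyed by standardized (hypothetical) identifiers.
toy_models = {
    "C. acetobutylicum": {"pyr", "accoa", "btoh"},
    "C. thermocellum": {"pyr", "accoa", "trnaala"},
    "B. subtilis": {"pyr", "trnaala", "dann"},
}

for combo, items in venn_regions(toy_models).items():
    print(sorted(combo), len(items))
```

The comparison is only as good as the identifiers: entries without a resolved structure would have to be excluded first, as was done for the reactions lacking full atomistic information.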
increasingly being prospected using search algorithms to identify biosynthetic routes to important product molecules. Several optimization and graph-based methods have been employed to computationally assemble novel biochemical routes from these sources. OptStrain [71] used a mixed-integer linear optimization representation to identify the minimal number of reactions to be added (i.e., knock-ins) into a genome-scale metabolic model to enable the production of the new molecule. However, the combinatorial nature of the problem poses a significant challenge to the OptStrain methodology as the number of reaction database entries increases from a few to tens of thousands. At the expense of not enforcing stoichiometric balances, graph-based algorithms have inherently better-scaling properties for exhaustively identifying all min-path reaction entries that link a source with a target metabolite. Hatzimanikatis et al. [72] introduced a graph-based heuristic approach (BNICE) to identify all possible biosynthetic routes from a given substrate to a target chemical by hypothesized enzymatic reaction rules. In addition, the BNICE framework was used to identify novel metabolic pathways for the synthesis of 3-hydroxypropionate in E. coli [73]. Based on a similar approach, a new scoring algorithm [74] was introduced to evaluate and compare novel pathways generated using enzyme-reaction rules. In addition, several techniques such as PathMiner [75], PathComp [76], Pathway Tools [77,78], MetaRoute [79], PathFinder [80] and UM-BBD Pathway Prediction System [81] have been used to search databases for bioconversion routes.

We recently published [82] a graph-based algorithm that used reaction information from BRENDA and KEGG to exhaustively identify all connected paths from a source to a target metabolite using a customized min-path algorithm [83]. We first demonstrated the min-path procedure by identifying all synthesis routes for 1-butanol from pyruvate using a database of 9,921 reactions and 17,013 metabolites manually extracted from both BRENDA and KEGG. Here, we revisited the same task using the full list of reactions and metabolites present in MetRxn to assess its discovery potential. Figure 5 illustrates all identified pathways from pyruvate to 1-butanol before MetRxn (29, shown in blue) and the ones discovered after using MetRxn (112, shown in green). As many as 83 new avenues for 1-butanol production were revealed as a consequence of using the expanded and standardized MetRxn resource. In addition, the search algorithm recovered known [84-88] synthesis routes using E. coli for the production of 1-butanol (shown in orange). The first pathway involves the fermentative transformation of pyruvate and acetyl-CoA to 1-butanol using enzymes from C. acetobutylicum [89]. The second pathway uses ketoacid
[Network diagram: metabolite nodes linking pyruvate to 1-butanol]
Figure 5 Pathways from pyruvate to 1-butanol. Using the MetRxn knowledgebase, we identified a large number of new pathways (green) as well as previously established ones (orange) and those identified in a previous study [82] (blue).
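The idea of exhaustively listing connected routes between a source and a target metabolite can be sketched in a few lines of Python. This is a plain depth-limited search over substrate-to-product edges, given only as an assumption-laden illustration; it is not the customized min-path algorithm of [82,83], it ignores stoichiometry and cofactors, and the small edge list is a simplified toy network rather than data from MetRxn.

```python
from collections import defaultdict


def all_paths(edges, source, target, max_steps):
    """Enumerate simple paths (no repeated metabolite) of at most max_steps reactions."""
    graph = defaultdict(list)
    for substrate, product in edges:
        graph[substrate].append(product)

    found = []

    def extend(path):
        node = path[-1]
        if node == target:
            found.append(list(path))
            return
        if len(path) - 1 == max_steps:  # reaction budget used up
            return
        for nxt in graph[node]:
            if nxt not in path:  # keep the path simple
                path.append(nxt)
                extend(path)
                path.pop()

    extend([source])
    return found


# Toy, simplified edges (illustrative only; intermediate steps are collapsed).
TOY_EDGES = [
    ("pyruvate", "acetyl-CoA"),
    ("acetyl-CoA", "acetoacetyl-CoA"),
    ("acetoacetyl-CoA", "crotonoyl-CoA"),
    ("crotonoyl-CoA", "butanoyl-CoA"),
    ("butanoyl-CoA", "1-butanal"),
    ("butanoyl-CoA", "butanoate"),
    ("butanoate", "1-butanal"),
    ("1-butanal", "1-butanol"),
]

for route in all_paths(TOY_EDGES, "pyruvate", "1-butanol", max_steps=7):
    print(" -> ".join(route))
```

Capping the number of steps keeps the enumeration finite on large reaction sets, at the cost of missing longer routes; enlarging the edge list, as the move to the full MetRxn content does, is what increases the number of routes returned.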
precursors [84]. This example demonstrates how the biotransformations stored in MetRxn can be used to traverse a multitude of production routes for targeted bioproducts.

Conclusions
MetRxn enables the standardization, correction and utilization of rapidly growing metabolic information for over 76,000 metabolites participating in 72,000 reactions