Abstract:
Functional genomics encompasses the ’new biology’
that follows genome sequencing – such as assigning gene functions;
understanding protein expression and processing; as well as modeling
reactions in metabolism, signal transduction, and other networks.
At the conference, researchers showed how they use
comparative analysis of all the available genomes, reconstruction
of the cell’s metabolic network, and structural mapping of the
unassigned genes in order to increase the annotation rate and
find more new genes.
Several talks from universities participating in the
large Danish Biotechnology Instrument Center (DABIC) grant demonstrated
the impressive technique development and buildup of instrumentation
that it has made possible. In the proteomics field in particular,
impressive scientific accomplishments were also presented.
A major achievement for industrial biotechnology was
the genome sequencing of A. niger, which opens many possibilities
to find new industrial enzyme products as well as opportunities
for optimizing recombinant protein production in this industrially
useful host.
Led by the genome sequencing programs, a massive gathering
of information has characterized the field of biotechnology over
the last several years. This has made necessary a whole new set
of scientific approaches to store and communicate the data, to
analyse its significance, and, importantly, to put it into a relevant
biological context. ’Functional genomics’ is the best common descriptor
of all the systematic, information-intensive approaches for analysis
of this data ranging from gene sequencing to mathematical models
of entire cells.
In May, a record crowd of 187 gathered in Munkebjerg,
Vejle for a two-day conference on the ’Impact of Functional Genomics
on Biotechnology’. This, the 8th annual ’Danish Biotechnology
Conference’, is aimed at scientists working in Denmark and the
surrounding regions. It is organized by the Danish Biotechnology
Forum (DBF) – this year with support from DABIC.
Genome analysis:
The scientific program was kicked off by Niels Larsen
from Integrated Genomics, USA – a company which focuses on sequencing,
expression analysis, and gene annotation. With 60 prokaryotic
genomes sequenced at present, he predicted that the sequencing rate
would increase to 100-200 genomes/year by 2010 – even in the absence
of technological leaps in sequencing such as nanopore sequencing
or others. With so much information available, he proposed that
data analysis would be the most fruitful (and profitable) area
to focus on.
In a thought-provoking slide, he showed the well-known
’Boehringer chart’ of metabolic pathways including 800 reactions,
and explained that for 22% of the reactions, no gene is known,
while for 17% only a prokaryotic gene is known. The high frequency
of unknown genes underscores the gap between the high level of
genome sequence information and the relatively limited understanding
of cellular processes. There are still many fundamentally important
gene functions to be discovered by innovative assignment algorithms.
The high proportion of genes specific for prokaryotes represents
an interesting set of targets for novel antibiotics, which due
to their action on prokaryotic genes or gene products would be
selective for the infectious agent while harmless to the patient.
Having set this stage for the importance of assignment,
Niels Larsen illustrated several clever ways to use the availability
of many genomes to extract annotations which would not be possible
from mere sequence homology analysis. These included incorporating,
into the genome analysis, the strain’s biochemistry and data on
the clustering of open reading frames in the genome.
For instance, Integrated Genomics incorporates general
metabolic network information in order to identify which pathways
are represented in a given genome. Pathway genes that cannot
readily be identified in the genome will most likely be found
in the unannotated fraction. Further, they link the genome
analysis to metabolite information specific for that organism,
as it is often found that genes for proteins acting on similar
metabolites reside close together in the genome. In the case of
Thermotoga maritima – a thermophilic bacterium which is the source
of many interesting leads for industrial enzymes – the application
of these methods led to functional assignment of an additional 10%
of the genes compared to simple sequence homology searches (Nucl.
Acids Res. 2000, 28, 123-125).
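The two ideas described above (looking for the missing steps of a pathway that is otherwise present, and using genome neighbourhood as a hint) can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the pathway table, step names, gene names, the 50% presence threshold, and the window size are all hypothetical and are not taken from Integrated Genomics' actual software.

```python
# Toy illustration only: pathway, step and gene names are made up.

# Reference pathways: each maps to the enzymatic steps it requires.
REFERENCE_PATHWAYS = {
    "pathway_A": ["step1", "step2", "step3", "step4"],
    "pathway_B": ["step5", "step6"],
}


def find_pathway_gaps(annotated_steps, min_fraction=0.5):
    """Return {pathway: [missing steps]} for pathways judged present.

    A pathway is judged present when at least `min_fraction` of its
    steps are already annotated in the genome (an assumed cut-off).
    Its remaining steps are candidates to look for among the
    unannotated genes.
    """
    gaps = {}
    for name, steps in REFERENCE_PATHWAYS.items():
        found = [s for s in steps if s in annotated_steps]
        if len(found) / len(steps) >= min_fraction:
            missing = [s for s in steps if s not in annotated_steps]
            if missing:
                gaps[name] = missing
    return gaps


def unannotated_neighbours(gene_order, annotated_genes, window=1):
    """Unannotated genes within `window` positions of an annotated gene."""
    hits = []
    for i, gene in enumerate(gene_order):
        if gene in annotated_genes:
            continue
        lo, hi = max(0, i - window), min(len(gene_order), i + window + 1)
        if any(g in annotated_genes for g in gene_order[lo:hi]):
            hits.append(gene)
    return hits
```

In this sketch, a missing step of a mostly complete pathway and an unannotated gene sitting next to that pathway's known genes would together point to a candidate assignment, which would still need to be checked experimentally.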
Lars Juhl Jensen from Søren Brunak’s group at DTU,
Lyngby gave a concise and assertive talk, in which he highlighted
the group’s achievements in sequence-based functional assignment.
A simple property like gene length is selectively associated with
the group of transport and binding proteins, which could be predicted
at a sensitivity of 0.9 and with only 10% false positives. However,
the approach came up short for prokaryotes, which in my opinion
underlined the need to move beyond single gene analysis, and include
gene-specific biochemical information and genome localisation
for multiple genomes, as was highlighted in the first talk.
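To make the quoted figures concrete, the following minimal sketch shows how sensitivity and a false positive rate can be computed for a predictor based on a single property such as length. It rests on several assumptions not stated in the talk: that '10% false positives' means false positives as a fraction of all true negatives, that longer sequences are the ones predicted to be transport and binding proteins, and that a plain threshold is used. The data and the threshold are invented.

```python
def evaluate_length_threshold(proteins, threshold):
    """Score a 'length >= threshold' predictor on labelled examples.

    `proteins` is a list of (length, is_transport_protein) pairs.
    Returns (sensitivity, false_positive_rate).
    """
    tp = fp = tn = fn = 0
    for length, is_transport_protein in proteins:
        predicted = length >= threshold  # assumed direction of the effect
        if predicted and is_transport_protein:
            tp += 1
        elif predicted and not is_transport_protein:
            fp += 1
        elif not predicted and is_transport_protein:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)          # share of true members recovered
    false_positive_rate = fp / (fp + tn)  # share of non-members wrongly called
    return sensitivity, false_positive_rate


# Invented example data: (length in residues, is a transport/binding protein)
TOY_PROTEINS = [
    (700, True), (650, True), (300, True),
    (720, False), (250, False), (200, False), (310, False), (180, False),
]

sens, fpr = evaluate_length_threshold(TOY_PROTEINS, threshold=500)
```

Moving the threshold trades one measure against the other, which is why the two numbers are reported together.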
Julian Gough from MRC Laboratory of Molecular Biology
in Cambridge, UK brought the audience into the three dimensional
world of structural genomics. His aim is to use protein sequence
information to predict which structural superfamily a protein belongs
to. In many cases, this information leads to implied functional
annotation of otherwise unannotated genes, as proteins from the
same superfamily usually have identical or related functions.
His approach involves building a library of so-called
Hidden Markov Models (HMM) representing each of the more than
1000 superfamilies of protein structures classified in the SCOP
database of protein motifs. All available protein structures were
used to build the models. Thus a sequence of interest can be matched
to the superfamily to which it belongs by searching for similarity
to the underlying HMMs rather than by simple homology searching
to the members of the superfamilies. This improves the number
of hits – especially among the members of a superfamily which
are more distantly related to the other members.
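The advantage of matching against a family model rather than against individual members can be illustrated with a much simplified stand-in for an HMM: a position-specific log-odds profile built from an ungapped alignment. This is a sketch under stated assumptions, not the method presented in the talk. A real profile HMM also models insertions and deletions, and the sequences, the pseudocount, and the uniform background used here are invented for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(aligned_seqs, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background.

    All sequences must have the same length (ungapped alignment).
    """
    length = len(aligned_seqs[0])
    background = 1.0 / len(ALPHABET)
    total = len(aligned_seqs) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_seqs)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, seq):
    """Sum of column scores; positive means 'more family-like than random'."""
    return sum(column[aa] for column, aa in zip(profile, seq))


# Invented toy family and queries.
FAMILY = ["ACDKL", "ACEKL", "GCDKI", "ACDRL"]
PROFILE = build_profile(FAMILY)
distant_member_score = score(PROFILE, "GCERI")  # mixes rare family variants
unrelated_score = score(PROFILE, "WWWWW")
```

The distant query is not identical to any single family member, yet it scores positively because every one of its residues has been seen at that position somewhere in the family. The unrelated sequence scores below zero.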
The method is fast enough to analyse entire genomes,
leading to classification of close to half the genes of a genome
to the appropriate SCOP superfamily. At the structural level,
this showed that eukaryotes contain genes belonging to around
600 different superfamilies, with 97% identity of the superfamilies
represented in the human genome compared to yeast or plant genomes!
Bacteria, predictably, use fewer of the superfamily motifs for
their proteins.
Focusing on the unannotated parts of the genomes,
he showed that this method could identify the superfamily relationship
for several hundred of the unannotated genes in prokaryotic genomes
and likewise for several thousand unidentified genes in eukaryotic
genomes. This is equivalent to ~15% of the genes in each genome,
which were formerly designated ’unknown function’, but which now
could at least be assigned to a superfamily fold. Of these novel
assignments, approximately half (more for eukaryotes than for
prokaryotes) could have been obtained using BLAST searches against
the SCOP domains as published; hence the HMM approach significantly
increases the assignment level, especially in prokaryotic genomes.
The method is made available for submission of sequences for analysis
etc. at www.supfam.org.
Annotation of genes to different functional classes
is central to the use of genome information; however, there is
surprisingly little consistency in the way genes are annotated
in different genomes. Research groups choose different functionality
classification schemes, and even with the same set of categories,
groups may have different criteria for whether the match to a
category is significant enough to allow assignment. This inconsistency
makes life more difficult for companies like Integrated Genomics,
which make a living from extracting information from cross-genome
comparisons, as highlighted by Niels Larsen. In contrast, Gough’s
method provides a consistent classification system, which is computationally
efficient enough to allow classification of all public genomes
to be carried out for a single publication (J. Mol. Biol., 2001,
313, 903-919). It will be interesting to see what can be achieved
by combining the approaches taken by these two speakers.
Danish Biotechnology Instrument Center (DABIC):
Established in Jan 2001, this 135M DKK government
initiative links five universities (DTU, SDU, AU, KU, KVL) to
strengthen centers of excellence in biotechnology instrumentation.
The facilities include structural analysis, nucleotide sequence
analysis, proteomics, bioinformatics, microelectronics, pathway
analysis, animal genetics, and bioimaging. The build-up will be
complete with the inauguration next year of the last of the five
Cassiopeia beamlines for macromolecular structure determination
at the synchrotron at the MAX2-lab in Lund – an aspect of DABIC
for which Sine Larsen (University of Copenhagen) is responsible.
In general, the DABIC facilities are available at low cost to
industry and academics (www.dabic.dk) providing a fertile ground
for focus on the instrumentation-intensive functional genomics
studies.
Several talks highlighted development of technologies
and facilities under the DABIC grant. Sine Larsen and
Flemming Poulsen (University of Copenhagen) spoke about structure
solving by X-ray diffraction and NMR spectroscopy, respectively.
Jens Nielsen (DTU) explained his group’s ambitious ’systems
biology’ approach to model stoichiometrically the metabolism of
the entire cell, represented by 1200 reactions, 700 genes, and
700 metabolites in 3 distinct compartments – the cytosol, the
mitochondria, and the extracellular space. The model has been used
for solving several problems in central metabolism and in amino
acid sensing. However, there are still many unknown aspects, as
20% of the reactions are not yet associated with a gene, and the
model does not represent reaction kinetics at all – only
stoichiometry. Perhaps the most fascinating contributions from
the DABIC-funded centers came from the proteomics experts at the
University of Southern Denmark in Odense. Ole Nørgaard Jensen
elegantly and efficiently combined gel electrophoresis with liquid
chromatography and tandem MS for identification of phosphorylated
peptides and proteins in signal transduction events, while Stephen
Fey told a fascinating story about identifying crucial
post-translational modifications in the transfected cells involved
in type 1 diabetes.
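The 'stoichiometry only' character of the whole-cell model mentioned above can be illustrated with a minimal sketch. A stoichiometric model records how many molecules of each metabolite every reaction makes or consumes, and asks which sets of reaction rates (fluxes) leave the internal metabolite pools unchanged, without saying anything about how fast an enzyme works. The three-reaction network below is hypothetical and bears no relation to the 1200-reaction model presented at the conference.

```python
# Toy network (hypothetical, not the model presented at the conference):
#   v1: uptake -> A,   v2: A -> B,   v3: B -> secreted
METABOLITES = ["A", "B"]
STOICHIOMETRY = [
    [1, -1, 0],   # A: made by v1, consumed by v2
    [0, 1, -1],   # B: made by v2, consumed by v3
]


def net_production(stoichiometry, fluxes):
    """Return S . v, the net production rate of each metabolite."""
    return [sum(c * v for c, v in zip(row, fluxes)) for row in stoichiometry]


def is_steady_state(stoichiometry, fluxes, tol=1e-9):
    """True when no internal metabolite accumulates or is depleted."""
    return all(abs(rate) <= tol
               for rate in net_production(stoichiometry, fluxes))


balanced = net_production(STOICHIOMETRY, [2, 2, 2])    # every pool constant
unbalanced = net_production(STOICHIOMETRY, [2, 1, 1])  # A accumulates
```

The balance condition constrains which flux patterns are possible, but it cannot predict how fluxes respond over time. That would require the kinetic information the model leaves out.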
In his introduction to DABIC, Peter Roepstorff ironically
noted that now that the investments are almost completed, the
3-year grant period is coming to an end, leaving it up to the
individual investigator’s fundraising capabilities to maintain
the instrument park. Anybody who has negotiated a service contract
on major instruments will know that this will put a heavy drain
on grant support over the next years. Roepstorff suggested a longer
horizon on grants for this type of equipment, to secure full scientific
value for the investment.
Aspergillus niger genome sequence:
In one of the scientific highlights of the conference
(especially for those of us with vested interests in producing
large amounts of protein in fungi), Han de Winde from DSM in
Delft disclosed some of the insights gained from having sequenced
the Aspergillus niger genome. This filamentous fungus has for
years been both a major source of new enzyme products and a widely
used host for recombinant production of heterologous proteins.
It is of central importance to the Life Sciences division of the
DSM industry group, which produces biomass, enzymes, and beta-lactams
using only three microbial work horses: Saccharomyces cerevisiae,
Penicillium chrysogenum, and Aspergillus niger.
The sequence was achieved in a consortium with Qiagen
(sequencing) and Biomax (bioinformatics). Using the Bacterial
Artificial Chromosome (BAC) technique, sequencing the 34 Mbase
genome was completed in just 15 months. In contrast to the popular
shotgun sequencing, the BAC approach is based upon individual
sequencing and assembly of hundreds of BACs each containing 50-100 kb.
The genome contains 14,400 genes, of which 45% could be assigned,
leaving a major information trove for new discoveries. Even among
the assigned half of the genome, close to 400 proteases and carbohydrases
were found. These enzyme classes are of major interest as products
for food production, for animal feed, and in industrial applications.
However, the most interesting applications of the genome information
may be for optimization of heterologous protein production. Such
optimization is, to a large extent, based upon accumulating beneficial
mutations through successive rounds of random mutagenesis and
screening. Identifying where the genome is actually mutated in
the improved strains is central to understanding the mechanisms
of improved protein production. This may be achieved by genome-wide
expression analysis of the different mutants – a technique which
also has large potential in the subsequent optimization of inoculation
and fermentation conditions for the selected strain – but which
requires a comprehensive array of markers for the host organism’s
genes. For that use, Han de Winde revealed that Affymetrix will
soon make available a gene chip based on the A. niger sequence
information. In an admirable initiative, the sequence information
is made available to the academic community, provided that each
group makes a specific disclosure agreement.
For the final talk of the conference, Kristian Almstrup
of Copenhagen University Hospital had been selected to present
his poster on spermatogenesis. Inspired by the fact that 40% of
Danish conscripts to military service have sperm counts that indicate
they may have fertility problems, he set out to identify affected
genes by a set of different functional genomics approaches. Using
differential display competitive PCR, he randomly amplified a
large number of gene fragments from cDNA libraries and picked the
ones which were differentially expressed in testicular tissues
from low-sperm-count mice. These genes were arranged on glass
slides for microarray analysis of gene expression during development
of these cells, and also used for in-situ hybridization probes.
The microarray data were in good agreement with the differential
display; however, the in-situ studies showed that an apparent
down-regulation in a tissue slide can mask up-regulation in certain
cell types which constitute only a minor part of the tissue. Nevertheless,
the differentially expressed gene set proved useful for analysing
the effects of adding endocrine disrupter chemicals suspected
of causing reduced sperm quality.
This presentation implicitly carried a simple but
important message about using several overlapping approaches to
address a biological problem. This was appropriate advice for
an audience that for two days had been discussing the most specialized
and refined methodologies. Users of such exquisite tools are inherently
at risk of relying so heavily on the technological advances
of their own particular tool that they neglect other supplementary
approaches. While the ’new biology’ is driven forward on technological
wizardry, one must remember the virtues of traditional scientific
methodology when trying to find useful answers to biological problems.
DBC IX:
With a new topic for each year’s Danish Biotechnology
Conference, a regular participant will appreciate the breadth
of biotechnology research practiced in this region. The first
few conferences focused on the industrial and food applications
of biotechnology; however, in recent years the congress program
has been increasingly open to technology topics, including their
application in medicine. DBC IX will be held May 22-23, 2003 with
the topic ’Eukaryotic Cell Factory and Protein Expression’.