EMBL–EBI Hinxton, Monday, 15 December 2003

UniProt consortium goes on-line

Press Release 15 December 2003
Today the EMBL–European Bioinformatics Institute
[EMBL–EBI], the Swiss Institute of Bioinformatics
[SIB] and Georgetown University Medical Center's
Protein Information Resource [PIR] announce the
launch of UniProt, a new universal protein resource
that will be the world's most comprehensive catalogue
of information on proteins. UniProt will provide
a 'one-stop shop,' allowing easy access to all
the publicly available information on proteins.
Protein sequence databases have become a crucial resource for molecular
biologists, allowing them to analyse the proteomes of newly
sequenced organisms, to make intelligent predictions about the functions
of newly identified proteins, and to move towards understanding
how proteins interact to create pathways, networks and entire systems.
To do this efficiently they need access to a defined set of features
describing all the proteins that are known to exist or have been predicted
to exist by extrapolation from their gene sequences.
Until recently there have been two major efforts to make this information publicly
available. One was a collaboration between SIB
and EMBL–EBI that resulted in two complementary
databases, Swiss-Prot
[renowned for providing a great depth of information
on proteins through high-quality manual curation]
and TrEMBL
[a much larger database in which information on
protein function is derived computationally by
comparison with other proteins]. The other was
the PIR-International Protein Sequence Database
[PIR-PSD], the world's first database of classified
and functionally annotated proteins. These databases
held different, but overlapping, subsets of proteins.
"The launch of UniProt is tremendously exciting
because databases that have been running independently
for years have come together for the benefit of
their users," explains Maria-Jesus Martin, Sequence
Database Group coordinator at the EBI.
This unification was made possible by funding from the
National Institutes of Health, totalling US $15 million
over 3 years. The National Human Genome
Research Institute [NHGRI] is the primary funding
institute, contributing $3 million annually. Other
NIH participants are the National Institute of
General Medical Sciences [$1 million], the National
Library of Medicine [$460,000], the National Institute
of Mental Health [$300,000], the National Center for
Research Resources [$100,000] and the National Institute
of Dental and Craniofacial Research [$50,000].
"Scientists today must face the challenge of understanding an increasingly
large amount of data generated by the Human Genome Project and
related resources. The UniProt databases will be a critical resource for
investigators trying to unlock the secrets in genome sequences, both to
understand biology and to translate basic research into improvements in
health care," says Peter Good, Ph.D., the NHGRI programme director in
charge of the UniProt project.
The UniProt databases launched today are the result of a hectic but immensely
productive year of collaboration among the three
institutions that make up the UniProt Consortium.
"UniProt's structure resembles that of a wedding
cake," explains Rolf Apweiler, UniProt's Principal
Investigator. "Each tier of the cake represents
a different database, optimized for different
uses."
Underpinning the entire project is the UniProt
Archive [UniParc] – the most comprehensive
publicly accessible non-redundant protein sequence
database available. Protein sequences are loaded
daily from the public databases, including not
only Swiss-Prot,
TrEMBL and PIR-PSD, but also the EMBL-Bank/DDBJ/GenBank
nucleotide sequence databases, the Ensembl database
of animal genomes, the International Protein Index
[IPI], the Protein Data Bank [PDB], the NCBI's
Reference Sequence Collection [RefSeq], model
organism databases such as FlyBase and WormBase,
and protein sequences from the European, American,
and Japanese Patent Offices. UniParc provides
cross-references to the source databases, sequence
versions and status.
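
The idea of a non-redundant archive with cross-references can be illustrated with a minimal Python sketch. This is not UniParc's implementation: the class names, the `ARC` identifier format and the example accessions are all hypothetical, chosen only to show how identical sequences loaded from different sources could share a single record while each source's accession, sequence version and status are retained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossRef:
    """One pointer back to a source database (all fields are illustrative)."""
    database: str
    accession: str
    version: int
    active: bool


class Archive:
    """Toy non-redundant archive: one record per distinct sequence."""

    def __init__(self):
        self._ids = {}    # sequence -> archive identifier
        self._xrefs = {}  # archive identifier -> list of CrossRef

    def load(self, sequence, xref):
        """Store a sequence once and attach the cross-reference to its record."""
        key = sequence.upper()
        if key not in self._ids:
            # Hypothetical identifier scheme, assigned in loading order.
            archive_id = f"ARC{len(self._ids) + 1:06d}"
            self._ids[key] = archive_id
            self._xrefs[archive_id] = []
        archive_id = self._ids[key]
        if xref not in self._xrefs[archive_id]:
            self._xrefs[archive_id].append(xref)
        return archive_id

    def xrefs(self, archive_id):
        """Return a copy of the cross-references held for one record."""
        return list(self._xrefs[archive_id])

    def __len__(self):
        return len(self._ids)


archive = Archive()
# The same sequence arriving from two sources maps to one archive record.
first = archive.load("MKTAYIAKQR", CrossRef("Swiss-Prot", "X00001", 1, True))
second = archive.load("MKTAYIAKQR", CrossRef("PIR-PSD", "Y00001", 2, True))
third = archive.load("MSLLTEVETP", CrossRef("TrEMBL", "Z00001", 1, False))
```

In this sketch `first` and `second` are the same identifier, so the archive holds two records for three loads, and the shared record lists both source databases.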
The next layer of the wedding cake – and the centerpiece
of the UniProt Consortium's activities – is the UniProt
Knowledgebase [UniProt], unified from Swiss-Prot,
TrEMBL and PIR-PSD. "This is the place to go if you want
to know everything there is to know about a specific protein,"
explains Maria-Jesus Martin. The Knowledgebase
contains a non-redundant set of entries that include information
on protein function and classification, as well as
cross-references to more than 40 other resources. The
UniProt Knowledgebase consists of two parts, one containing
fully manually annotated records and another with
computationally analysed records awaiting full manual
annotation. Sequences for which new functional, structural
and biochemical data have been published are prioritized
for annotation. The two sections will continue to be
referred to as Swiss-Prot and TrEMBL, respectively.
Researchers will also be able to submit protein sequences
directly to the Knowledgebase using a new web-based
submission tool called SPIN. SPIN replaces Swiss-Prot's email-
based submission system, making it much easier for
researchers to submit sequences. "SPIN's forms allow
researchers to submit more information about a protein's
features in a more structured way," explains Vincent
Lombard, who coordinated the development of SPIN.
"This improves the efficiency of submission for both submitters
and curators."
The top tier of the wedding cake contains three sub-layers
– UniRef100, UniRef90 and UniRef50 – collectively
known as UniRef [for UniProt non-redundant reference].
"The UniRef databases will use newly developed automatic
procedures to combine closely related sequences into a
single record," explains Cathy Wu, whose group at PIR is
responsible for their creation. Wu continues, "UniRef100 is
a non-redundant version of all the sequences in the
Knowledgebase, UniRef90 collapses all the sequences that
are 90% or more identical into a single record, and
UniRef50 collapses sequences that are at least 50% identical.
UniRef50 speeds up searching significantly and doesn't
reduce the effectiveness of homology searching. The
three UniRef databases allow the user to choose between
a fast search and a truly comprehensive one."
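
The effect of the three identity thresholds can be illustrated with a small Python sketch. It is not the procedure developed at PIR, which this release does not describe: the greedy clustering strategy, the function names and the position-by-position identity measure (no alignment) are simplifying assumptions made only to show how a lower threshold collapses more sequences into fewer records.

```python
def identity(a, b):
    """Toy identity: matching positions divided by the longer length.

    A real pipeline would align the sequences first; this stand-in
    compares them position by position and is an assumption of the sketch.
    """
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster(sequences, threshold):
    """Greedily group sequences whose identity to a representative
    is at least `threshold`. Returns {representative: [members]}."""
    clusters = {}
    # Exact duplicates are dropped first; longer sequences become
    # representatives before shorter ones (ties broken alphabetically).
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        for representative in clusters:
            if identity(representative, seq) >= threshold:
                clusters[representative].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


sequences = [
    "MKTAYIAKQR",
    "MKTAYIAKQR",  # exact duplicate of the first
    "MKTAYIAKQL",  # differs from the first at one position
    "MKTAYGGGGG",  # shares only the first five positions
    "WWWWWWWWWW",  # unrelated
]

for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(cluster(sequences, threshold)))
```

With these five made-up sequences, the sketch yields four records at 100% identity, three at 90% and two at 50%, which mirrors the trade-off between a comprehensive search and a fast one.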
"With UniProt we can address some aspects of the challenges
that life scientists are currently facing," says Amos
Bairoch, the founder of Swiss-Prot. "There has been a
tremendous growth in the quantity of biomolecular information
that has become available in the past 10 years, yet
this is only the beginning!" He adds, "Thanks to UniProt
we can continue to provide a wealth of knowledge on the
fascinating universe of proteins." "Such integrated knowledge
in UniProt will facilitate scientific discovery at various
levels of biological organization from genes and proteins
to metabolic pathways, cellular networks, and organisms,"
agrees Cathy Wu.
UniProt can be accessed at www.uniprot.org. The
individual members of the UniProt consortium have their own web pages at
www.ebi.uniprot.org,
expasy.uniprot.org and
www.pir.uniprot.org.
Website: www.uniprot.org

Scientific Contacts
Rolf Apweiler
EMBL–European Bioinformatics Institute,
Wellcome Trust Genome Campus, Hinxton, Cambridge
CB10 1SD, United Kingdom
Tel: +44 [0] 1223 494435
E-mail: apweiler@ebi.ac.uk
Amos Bairoch
Swiss Institute of Bioinformatics, CMU, 1 Michel-Servet,
CH-1211 Geneva 4, Switzerland
Tel: +41 22 379 5050
E-mail: amos.bairoch@isb-sib.ch
Cathy H. Wu
Director, Protein Information Resource, Georgetown
University Medical Center, Box 571455, 3900 Reservoir
Road, NW, Washington, DC 20057-1455, USA
Tel: +001 202 687 1039
E-mail: wuc@georgetown.edu

Press Contacts
Cath Brooksbank
Scientific Outreach Officer, EMBL–European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
Tel: +44 [0] 1223 492525
Fax: +44 [0] 1223 494468
E-mail: cath@ebi.ac.uk
Vivienne Gerritsen
Science communication group of Swiss-Prot, Swiss
Institute of Bioinformatics, CMU, 1 Michel-Servet,
CH-1211 Geneva 4, Switzerland
Tel: +41 22 379 5882
Fax: +41 22 379 5858
E-mail: Vivienne.Gerritsen@isb-sib.ch
Lindsey Spindle
Director of Media Relations, Georgetown University
Medical Center, Box 571405, 3900 Reservoir Road, NW,
Washington, DC 20057-1405, USA
Tel: +001 202 687 7707
E-mail: las46@georgetown.edu
Geoff Spencer
National Human Genome Research Institute, National
Institutes of Health, Bethesda, MD 20892, USA
Tel: +001 301 402 0911
E-mail: spencerg@mail.nih.gov
Trista Dawson
EMBL Press Officer, Meyerhofstrasse 1, D-69117
Heidelberg, Germany
Tel: +49 [0] 6221 387 452
Fax: +49 [0] 6221 387 525
E-mail: dawson@embl.de