Press Releases 2003
EMBL–EBI Hinxton, Monday, 15 December 2003
UniProt consortium goes on-line

Today the EMBL–European Bioinformatics Institute [EMBL–EBI], the Swiss Institute of Bioinformatics [SIB] and Georgetown University Medical Center's Protein Information Resource [PIR] announce the launch of UniProt, a new universal protein resource that will be the world's most comprehensive catalogue of information on proteins. UniProt will provide a 'one-stop shop,' allowing easy access to all the publicly available information on proteins.

Protein sequence databases have become a crucial resource for molecular biologists, allowing them to analyse the proteomes of newly sequenced organisms, to make intelligent predictions about the functions of newly identified proteins, and to move towards understanding how proteins interact to create pathways, networks and entire systems. To do this efficiently they need access to a defined set of features describing all the proteins that are known to exist or have been predicted to exist by extrapolation from their gene sequences.

Until recently there have been two major efforts to make this information publicly available. One was a collaboration between SIB and EMBL–EBI that resulted in two complementary databases, Swiss-Prot [renowned for providing a great depth of information on proteins through high-quality manual curation] and TrEMBL [a much larger database in which information on protein function is derived computationally by comparison with other proteins]. The other was the PIR-International Protein Sequence Database [PIR-PSD], the world's first database of classified and functionally annotated proteins. These databases held different, but overlapping, subsets of proteins. "The launch of UniProt is tremendously exciting because databases that have been running independently for years have come together for the benefit of their users," explains Maria-Jesus Martin, Sequence Database Group coordinator at the EBI.

This unification was made possible by funding from the National Institutes of Health, totalling US $15 million over 3 years. The National Human Genome Research Institute [NHGRI] is the primary funding institute, contributing $3 million annually. Other NIH participants are the National Institute of General Medical Sciences [$1 million], the National Library of Medicine [$460,000], the National Institute of Mental Health [$300,000], the National Center for Research Resources [$100,000] and the National Institute of Dental and Craniofacial Research [$50,000].

"Scientists today must face the challenge of understanding an increasingly large amount of data generated by the Human Genome Project and related resources. The UniProt databases will be a critical resource for investigators trying to unlock the secrets in genome sequences, both to understand biology and to translate basic research into improvements in health care," says Peter Good, Ph.D., the NHGRI programme director in charge of the UniProt project.

The UniProt databases launched today are the result of a hectic but immensely productive year of collaboration among the three institutions that make up the UniProt Consortium. "UniProt's structure resembles that of a wedding cake," explains Rolf Apweiler, UniProt's Principal Investigator. "Each tier of the cake represents a different database, optimized for different uses."

Underpinning the entire project is the UniProt Archive [UniParc] – the most comprehensive publicly accessible non-redundant protein sequence database available. Protein sequences are loaded daily from the public databases, including not only Swiss-Prot, TrEMBL and PIR-PSD, but also the EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, the Ensembl database of animal genomes, the International Protein Index [IPI], the Protein Data Bank [PDB], the NCBI's Reference Sequence Collection [RefSeq], model organism databases such as FlyBase and WormBase, and protein sequences from the European, American, and Japanese Patent Offices. UniParc provides cross-references to the source databases, sequence versions and status.
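
The idea of a non-redundant archive with cross-references can be illustrated with a minimal Python sketch. This is not UniParc's actual schema or loading pipeline, which the release does not describe; the class names, the SHA-256 key and the accession strings are all hypothetical and chosen only for illustration. Each distinct sequence is stored once, and every source record that carries it is attached as a cross-reference holding the source database, accession, version and status.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    """One unique sequence plus every source record that carries it."""
    sequence: str
    # Each cross-reference is (source database, accession, version, active)
    xrefs: list = field(default_factory=list)


class SequenceArchive:
    """Hypothetical non-redundant archive; not UniParc's real design."""

    def __init__(self):
        self._entries = {}  # sequence key -> ArchiveEntry

    def load(self, source, accession, version, sequence, active=True):
        # Normalise case and whitespace so identical sequences share a key.
        normalised = "".join(sequence.split()).upper()
        key = hashlib.sha256(normalised.encode("ascii")).hexdigest()
        # Store the sequence only the first time it is seen.
        entry = self._entries.setdefault(key, ArchiveEntry(normalised))
        xref = (source, accession, version, active)
        if xref not in entry.xrefs:
            entry.xrefs.append(xref)
        return key

    def xrefs(self, key):
        return list(self._entries[key].xrefs)

    def __len__(self):
        return len(self._entries)


archive = SequenceArchive()
# The accessions below are made-up placeholders, not real identifiers.
k1 = archive.load("Swiss-Prot", "ACC0001", 1, "MKTAYIAKQR")
k2 = archive.load("PIR-PSD", "ACC0002", 3, "mktay iakqr")
k3 = archive.load("TrEMBL", "ACC0003", 1, "MKTAYIAKQL")
```

In this sketch the first two loads resolve to the same entry, because the sequences are identical after normalisation, while the third creates a new entry.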

The next layer of the wedding cake – and the centerpiece of the UniProt Consortium's activities – is the UniProt Knowledgebase [UniProt] unified from Swiss-Prot, TrEMBL and PIR-PSD. "This is the place to go if you want to know everything there is to know about a specific protein," explains Maria-Jesus Martin. The Knowledgebase contains a non-redundant set of entries that include information on protein function and classification, as well as cross-references to more than 40 other resources. The UniProt Knowledgebase consists of two parts, one containing fully manually annotated records and another with computationally analysed records awaiting full manual annotation. Sequences for which new functional, structural and biochemical data have been published are prioritized for annotation. The two sections will continue to be referred to as Swiss-Prot and TrEMBL, respectively.

Researchers will also be able to submit protein sequences directly to the Knowledgebase using a new web-based submission tool called SPIN. SPIN replaces Swiss-Prot's email-based submission system, making it much easier for researchers to submit sequences. "SPIN's forms allow researchers to submit more information about a protein's features in a more structured way," explains Vincent Lombard, who coordinated the development of SPIN. "This improves the efficiency of submission for both submitters and curators."

The top tier of the wedding cake contains three sub-layers – UniRef100, UniRef90 and UniRef50 – collectively known as UniRef [for UniProt non-redundant reference]. "The UniRef databases will use newly developed automatic procedures to combine closely related sequences into a single record," explains Cathy Wu, whose group at PIR is responsible for their creation. Wu continues, "UniRef100 is a non-redundant version of all the sequences in the Knowledgebase, UniRef90 collapses all the sequences that are 90% or more identical into a single record, and UniRef50 collapses sequences that are at least 50% identical. UniRef50 speeds up searching significantly and doesn't reduce the effectiveness of homology searching. The three UniRef databases allow the user to choose between a fast search and a truly comprehensive one."
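
How collapsing at different identity thresholds changes the number of records can be shown with a small Python sketch. It is an illustration only: the release does not describe the procedures used to build UniRef, so the greedy clustering and the toy identity measure below (matching positions without alignment, divided by the longer length) are assumptions, as are the function names and the example sequences.

```python
def identity(a, b):
    """Toy identity: matching positions (ungapped, left-aligned)
    divided by the length of the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster(sequences, threshold):
    """Greedy clustering: each sequence joins the first cluster whose
    representative it matches at or above the threshold; otherwise it
    becomes the representative of a new cluster."""
    clusters = []  # list of (representative, members)
    # Longest sequences first, so representatives are the longest members.
    for seq in sorted(sequences, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


# Made-up sequences, not real proteins.
seqs = [
    "MKTAYIAKQR",
    "MKTAYIAKQR",  # exact duplicate of the first
    "MKTAYIAKQL",  # 9 of 10 positions match the first
    "MKTAYGGGGG",  # 5 of 10 positions match the first
    "WWWWWWWWWW",  # unrelated
]

sizes = {t: len(cluster(seqs, t)) for t in (1.0, 0.9, 0.5)}
```

With these five toy sequences, the 100% threshold leaves four records, the 90% threshold three and the 50% threshold two, which mirrors the trade-off Wu describes between a comprehensive search and a fast one.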

"With UniProt we can address some aspects of the challenges that life scientists are currently facing," says Amos Bairoch, the founder of Swiss-Prot. "There has been a tremendous growth in the quantity of biomolecular information that has become available in the past 10 years, yet this is only the beginning!" He adds, "Thanks to UniProt we can continue to provide a wealth of knowledge on the fascinating universe of proteins." "Such integrated knowledge in UniProt will facilitate scientific discovery at various levels of biological organization from genes and proteins to metabolic pathways, cellular networks, and organisms," agrees Cathy Wu.

UniProt can be accessed at www.uniprot.org. The individual members of the UniProt consortium have their own web pages at www.ebi.uniprot.org, expasy.uniprot.org and www.pir.uniprot.org.

Website: www.uniprot.org
Scientific Contacts

Rolf Apweiler
EMBL–European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
Tel: +44 [0] 1223 494435
E-mail: apweiler@ebi.ac.uk

Amos Bairoch
Swiss Institute of Bioinformatics, CMU, 1 Michel-Servet, CH-1211 Geneva 4, Switzerland
Tel: +41 22 379 5050
E-mail: amos.bairoch@isb-sib.ch

Cathy H. Wu
Director, Protein Information Resource, Georgetown University Medical Center, Box 571455, 3900 Reservoir Road, NW, Washington, DC 20057-1455, USA
Tel: +001 202 687 1039
E-mail: wuc@georgetown.edu
Press Contacts

Cath Brooksbank
Scientific Outreach Officer, EMBL–European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
Tel: +44 [0] 1223 492525
Fax: +44 [0] 1223 494468
E-mail: cath@ebi.ac.uk

Vivienne Gerritsen
Science communication group of Swiss-Prot, Swiss Institute of Bioinformatics, CMU, 1 Michel-Servet, CH-1211 Geneva 4, Switzerland
Tel: +41 22 379 5882
Fax: +41 22 379 5858
E-mail: Vivienne.Gerritsen@isb-sib.ch

Lindsey Spindle
Director of Media Relations, Georgetown University Medical Center, Box 571405, 3900 Reservoir Road, NW, Washington, DC 20057-1405, USA
Tel: +001 202 687 7707
E-mail: las46@georgetown.edu

Geoff Spencer
National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
Tel: +001 301 402 0911
E-mail: spencerg@mail.nih.gov

Trista Dawson
EMBL Press Officer, Meyerhofstrasse 1, D-69117 Heidelberg, Germany
Tel: +49 [0] 6221 387 452
Fax: +49 [0] 6221 387 525
E-mail: dawson@embl.de
Last updated by: Office of Information and Public Affairs, 5 October 2006