Protein Sequence Database
In biology, a protein structure database is modeled around experimentally determined protein structures. Most protein structure databases aim to organize and annotate these structures, giving the biological community access to the experimental data in a useful way. Protein sequence databases, by contrast, are organized around the sequences themselves. A variety of protein sequence databases exist, ranging from simple sequence repositories, which store data with little or no manual intervention in the creation of the records, to expertly curated universal databases that cover all species and in which the original sequence data are enhanced by the manual addition of further information in each sequence record. As the focus of researchers moves from the genome to the proteins encoded by it, these databases will play an even more important role as central, comprehensive resources of protein information. Several of the leading protein sequence databases are discussed here, with special emphasis on the databases now provided by the Universal Protein Knowledgebase (UniProt) consortium.
The two protein sequence databases SWISS-PROT and PIR differ from the nucleotide databases in that both are curated. This means that groups of designated curators (scientists) prepare the entries from the literature and/or from contacts with external experts.
What Is SWISS-PROT?
SWISS-PROT (1) is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now the EMBL Outstation, the European Bioinformatics Institute; 2). The SWISS-PROT protein sequence data bank consists of sequence entries. Each entry is composed of different line types, each with its own format. For standardization purposes, the format of SWISS-PROT (3) follows as closely as possible that of the EMBL nucleotide sequence database. SWISS-PROT distinguishes itself from other protein sequence databases by three criteria: annotation, minimal redundancy, and integration with other databases.
In SWISS-PROT, annotation is mainly found in the comment lines (CC), the feature table (FT), and the keyword lines (KW). Most comments are classified by ‘topics’, an approach that permits easy retrieval of specific categories of data from the database.
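The line-type layout lends itself to simple parsing. The following is a minimal sketch, assuming the usual flat-file convention that each line starts with a two-character line code followed by three spaces, and that CC topics are introduced by `-!- TOPIC:`. The sample entry is a toy record written for this example, not a real SWISS-PROT entry, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy entry written for this example; it is NOT a real SWISS-PROT record.
SAMPLE_ENTRY = """\
ID   TEST_HUMAN     STANDARD;      PRT;    12 AA.
CC   -!- FUNCTION: Hypothetical example protein.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
KW   Hypothetical protein; Example.
FT   DOMAIN        1     12       Illustrative region.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQRQI
//
"""


def group_by_line_code(entry_text):
    """Map each two-character line code (ID, CC, KW, ...) to its data strings."""
    groups = defaultdict(list)
    for line in entry_text.splitlines():
        code = line[:2]
        # Skip the '//' terminator and lines without a code (sequence data).
        if code == "//" or not code.strip():
            continue
        groups[code].append(line[5:].rstrip())
    return dict(groups)


def comment_topics(cc_lines):
    """Turn CC data strings of the form '-!- TOPIC: text' into {topic: text}."""
    topics = {}
    current = None
    for text in cc_lines:
        if text.startswith("-!- "):
            topic, _, body = text[4:].partition(":")
            current = topic.strip()
            topics[current] = body.strip()
        elif current is not None:
            # Continuation line of the previous topic.
            topics[current] += " " + text.strip()
    return topics


if __name__ == "__main__":
    groups = group_by_line_code(SAMPLE_ENTRY)
    for topic, text in comment_topics(groups["CC"]).items():
        print(f"{topic}: {text}")
```

Grouping by line code first keeps the topic lookup independent of the other line types, which mirrors how topic-based retrieval is described above.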
Many sequence databases contain, for a given protein sequence, separate entries that correspond to different literature reports. In SWISS-PROT, data are merged as far as possible to minimize the redundancy of the database. If conflicts exist between the various sequencing reports, they are indicated in the feature table of the corresponding entry.
Integration with other databases
It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences, and protein tertiary structures), as well as with specialized data collections. SWISS-PROT is currently cross-referenced with 24 different databases. Cross-references are provided in the form of pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT.
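In the flat-file format these pointers appear on DR lines. A minimal sketch of collecting them by target database, assuming the layout `DR   DATABASE; PRIMARY_ID; ...` with a trailing period; the DR lines and identifiers below are placeholders invented for illustration, and `parse_cross_references` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up DR lines for illustration; the identifiers are placeholders.
SAMPLE_DR_LINES = [
    "DR   EMBL; X00000; CAA00000.1; -.",
    "DR   PDB; 0XYZ; 01-JAN-00.",
    "DR   PROSITE; PS00000; EXAMPLE_PATTERN; 1.",
    "DR   EMBL; X00001; CAA00001.1; -.",
]


def parse_cross_references(lines):
    """Return {database name: [primary identifiers]} from DR lines."""
    refs = defaultdict(list)
    for line in lines:
        if not line.startswith("DR"):
            continue  # ignore every other line type
        # Drop the line code, the trailing period, then split the fields.
        data = line[5:].rstrip().rstrip(".")
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2:
            continue  # malformed line: no identifier to record
        database, primary_id = fields[0], fields[1]
        refs[database].append(primary_id)
    return dict(refs)


if __name__ == "__main__":
    for database, ids in parse_cross_references(SAMPLE_DR_LINES).items():
        print(database, ids)
```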
SWISS-PROT is a protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.
It was started in 1986 by Amos Bairoch in the Department of Medical Biochemistry at the University of Geneva. This database is generally considered one of the best protein sequence databases in terms of the quality of the annotation. Its size is given in the table below.
TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated into SWISS-PROT. The procedure used to produce it was developed by Rolf Apweiler. The annotation of an entry in TrEMBL has not (yet) reached the standards required for inclusion in SWISS-PROT proper. Its size is given in the table below.
|Date||SWISS-PROT release||SWISS-PROT entries||TrEMBL release||TrEMBL entries|
|24 Oct 2001||40.1||101,737||18.0||484,388|
|2 Oct 2000||39.7||88,757||14.17||300,152|
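The release figures in the table can be turned into growth rates. A small sketch, assuming the first release/entries pair in each row belongs to SWISS-PROT and the second to TrEMBL, as the surrounding text suggests; the function names are hypothetical.

```python
# Entry counts copied from the table. The column assignment (first pair =
# SWISS-PROT, second pair = TrEMBL) is an assumption based on the text.
RELEASES = {
    "SWISS-PROT": {"2 Oct 2000": 88_757, "24 Oct 2001": 101_737},
    "TrEMBL": {"2 Oct 2000": 300_152, "24 Oct 2001": 484_388},
}


def percent_growth(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


def growth_report(releases, start, end):
    """Return {database: percent growth in entries between two release dates}."""
    return {
        name: percent_growth(counts[start], counts[end])
        for name, counts in releases.items()
    }


if __name__ == "__main__":
    report = growth_report(RELEASES, "2 Oct 2000", "24 Oct 2001")
    for name, pct in report.items():
        print(f"{name}: {pct:.1f}% more entries")
```

On these figures the computer-annotated supplement grows several times faster than the curated database, which is what one would expect when manual annotation is the bottleneck.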
SWISS-PROT and TrEMBL are developed by the SWISS-PROT groups at the Swiss Institute of Bioinformatics (SIB) and at EBI. The databases can be accessed and searched through the SRS system at ExPASy, or the entire database can be downloaded as a single flat file. An example of what an entry looks like is given for the human RAF oncogene protein, ID KRAF_HUMAN.
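Working with the downloaded flat file usually starts with splitting it into entries. A minimal sketch, assuming each entry ends with a `//` line and that the entry name is the first token on the ID line; the two-entry file below is toy data, not real database content, and the helper names are hypothetical.

```python
import io

# Toy flat file with two invented entries; not real database content.
TOY_FLAT_FILE = """\
ID   AAAA_HUMAN     STANDARD;      PRT;     5 AA.
SQ   SEQUENCE   5 AA;
     MKTAY
//
ID   BBBB_MOUSE     STANDARD;      PRT;     4 AA.
SQ   SEQUENCE   4 AA;
     MKQR
//
"""


def iter_entries(handle):
    """Yield each entry as a list of lines, reading the file one line at a time.

    Lines after the last '//' terminator (an incomplete entry) are discarded.
    """
    current = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if current:
                yield current
            current = []
        else:
            current.append(line)


def entry_name(entry_lines):
    """Return the entry name from the ID line, or None if there is no ID line."""
    for line in entry_lines:
        if line.startswith("ID   "):
            return line[5:].split()[0]
    return None


def find_entry(handle, name):
    """Return the first entry whose name matches, or None."""
    for entry in iter_entries(handle):
        if entry_name(entry) == name:
            return entry
    return None


if __name__ == "__main__":
    entry = find_entry(io.StringIO(TOY_FLAT_FILE), "BBBB_MOUSE")
    print("\n".join(entry))
```

With the real file, the same call would take an open file object and a name such as KRAF_HUMAN; streaming line by line avoids loading the whole database into memory.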
The SWISS-PROT database has some legal restrictions: the entries themselves are copyrighted, but they are freely accessible and usable by academic researchers. Commercial companies must purchase a license from SIB.
The Protein Information Resource (PIR) is a division of the National Biomedical Research Foundation (NBRF) in the US. It collaborates with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID). The PIR-PSD (Protein Sequence Database) release 70.01 (22 Oct 2000) contains 254,293 entries.
PIR grew out of Margaret Dayhoff’s work in the mid-1960s. It strives to be comprehensive, well organized, accurate, and consistently annotated. However, its entry annotation is generally believed to be less complete than that of SWISS-PROT. Although SWISS-PROT and PIR overlap extensively, many sequences can still be found in only one of them.