Gene-Protein Database of Escherichia coli K-12, Edition 6
Chapter 115
RUTH A. VanBOGELEN, KELLY Z. ABSHIRE, ALEXANDER PERTSEMLIDIS, ROBERT L. CLARK, and FREDERICK C. NEIDHARDT
Determination of the complete sequence of E. coli K-12 strains MG1655 and W3110 and analysis of the biological import of the sequences have proceeded rapidly since publication of several of the Escherichia coli and Salmonella 2nd Edition chapters. As an interim measure until new chapters are written, the following Internet list provides links to many individual Web sites, each emphasizing a different aspect of the genomic analysis of E. coli: http://www.grs.nig.ac.jp/WGR/link/link_E.coli_e.html
The gene-protein database is unique among the databases being constructed for Escherichia coli because it is built on a global approach that allows the cell’s total complement of polypeptides to be examined at one time (27). Two-dimensional (2-D) polyacrylamide gel electrophoresis (PAGE) permits this global approach by separating complex mixtures of polypeptides into individual polypeptide species (called spots on the 2-D gel) by two independent separation steps, isoelectric focusing and sodium dodecyl sulfate (SDS)-PAGE (38). However, the reason for creating this database is not to build the "master" 2-D gel for E. coli, because the small number of investigators routinely using 2-D gels would not justify this enormous venture. The purpose is to provide other investigators with physiological and regulatory data on the entire set of E. coli proteins.
The ultimate goal of this database is to catalog when, why, and to what level each protein-encoding gene is expressed. Two projects that tackle this problem are under way. The first project, called the Genome Expression Map, is designed to link each of the protein-encoding genes to a spot on the 2-D gel. The second project, called the Response/Regulation Map, is focused on cataloging the conditions under which each of these genes is expressed and on determining what molecules regulate their expression. This database includes two types of information. First, for identified proteins, we provide the following information: gene name, protein name, EC number, SWISS-PROT accession number, GenBank code, metabolic class, position and orientation of the gene on the chromosome, molecular weight (MW), and pI (calculated from DNA sequence information). This provides sufficient information to allow a user to do a literature search or to access more information in other databases. Second, for identified as well as unidentified proteins, information obtained from 2-D gels is included: MW and pI of the protein (estimated from its migration on the gels), abundance of individual proteins grown under different conditions, and memberships of proteins in particular regulons and/or stimulons. Some of the other databases (e.g., SWISS-PROT [5]) also provide linkage to this database by including the 2-D spot name (called an alpha-numeric, or A-N, name) in their information list. The entire gene-protein database (including the 2-D gel images) is available electronically and can be obtained through anonymous ftp at the database repository at the National Center for Biotechnology Information (see the section on Information Exchange).
The gene-protein database was begun immediately after the introduction of the 2-D gel method (38). The first set of data (which is still included in the database) was a catalog of 140 individual proteins (21 were identified) that reported variations in the levels of each protein in cultures grown under different growth conditions (43). The first important step in establishing the structure of the database, the alpha-numeric naming system used to uniquely identify each 2-D gel spot, was described in that catalog.
In 1980, the first set of reference 2-D gels was published along with the identities of 81 more proteins (6). Five years into the development of the database, it became apparent that in order to track each 2-D gel spot through numerous gels, a standard cumulative map of each type of 2-D gel had to be established. Each reference 2-D gel was overlaid with a grid to give each spot a unique x and y coordinate. The alpha-numeric naming system was maintained to match proteins among the reference gels.
In 1983, the information in the database was linked to the chromosome in the first gene-protein index (33). In that update and review, the identities of 157 proteins were listed, and in addition, many unidentified proteins were mapped to a small region on the chromosome (33). Throughout the 1980s, many published reports on the responses of proteins observed on 2-D gels used (and added to) the protein identifications found in the index. Most of these reports gave physiological and regulatory information about protein spots on 2-D gels (both those identified and those not identified).
In 1990, all of this information from the previous gene-protein indexes and from many separate reports was gathered together, put into an electronic database, and published as the gene-protein database (58). A year later, edition 4 was published. That edition introduced a new standard 2-D gel that was generated by using a standardized 2-D gel method (61). The switch to this new standardized method was important: it allowed other investigators to reproduce the protein pattern so that they could access and contribute to the information in the database independently of the database laboratory. Edition 5 (63) included the first set of identifications made using the T7 expression system on Kohara clones and also announced that an electronic version of the database had been released to the database repository at the National Center for Biotechnology Information so that it would be available in a more usable form and could be updated more frequently.
This sixth edition of the database introduces several changes necessary to accommodate the input of data from many sources. A new naming system was started, not to replace the alpha-numeric naming system but to prevent redundancy in this system. The alpha-numeric names will now be reserved for proteins that have been identified as the product of a particular gene. The new naming system is being used for the Response/Regulation Map and will be used for the Genome Expression Map to name proteins that have been observed but await identification. In the Response/Regulation Map project, as many proteins as possible are matched to proteins already in the database and are assigned that alpha-numeric name, but others are matched only within the Response/Regulation Map project. These will be given a Response/Regulation Map name, which is an R followed by a four-digit number (e.g., R1698). In the Genome Expression Map project, the proteins matched to a single open reading frame (ORF) will be assigned alpha-numeric names and will be added to Table 2 (see p. 2094) under the appropriate gene name and also in the SWISS-PROT (5) and E. coli (24) databases (as a reference between these databases). The proteins that cannot be matched to a single ORF will be given only a Genome Expression Map name, which is an X followed by a four-digit number (e.g., X2404). These proteins appear in Table 2 with reference only to their chromosomal map positions until further analysis allows a match to an ORF. Table 1 (see p. 2076) will continue to serve as a list of all proteins found on 2-D gels, which are listed in order of alpha-numeric name, Genome Expression Map name, and Response/Regulation Map name.
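The three kinds of spot names described above can be told apart mechanically. The following is a minimal sketch, not part of the database software: the R and X patterns follow the text (a letter followed by a four-digit number), whereas the alpha-numeric pattern is an assumption generalized from the single example C013.5, and the function names are hypothetical.

```python
import re

# R and X forms follow the text: a letter plus a four-digit number.
RESPONSE_REGULATION = re.compile(r"R\d{4}")
GENOME_EXPRESSION = re.compile(r"X\d{4}")
# ASSUMPTION: alpha-numeric names look like the example C013.5
# (one capital letter, three digits, a point, one digit).
ALPHA_NUMERIC = re.compile(r"[A-Z]\d{3}\.\d")


def classify_spot_name(name):
    """Return the naming system that a 2-D gel spot name appears to use."""
    if ALPHA_NUMERIC.fullmatch(name):
        return "alpha_numeric"
    if GENOME_EXPRESSION.fullmatch(name):
        return "genome_expression"
    if RESPONSE_REGULATION.fullmatch(name):
        return "response_regulation"
    return "unknown"


# Order in which Table 1 is said to list the three name types.
_TABLE1_ORDER = {
    "alpha_numeric": 0,
    "genome_expression": 1,
    "response_regulation": 2,
    "unknown": 3,
}


def table1_sort_key(name):
    """Sort key: name type first, then the name itself."""
    return (_TABLE1_ORDER[classify_spot_name(name)], name)


if __name__ == "__main__":
    names = ["R1698", "X2404", "C013.5"]
    print(sorted(names, key=table1_sort_key))
```

Because each pattern is matched against the whole name, an alpha-numeric name that happens to begin with R or X is not mistaken for a map name.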
Table 1 (parts 01 to 18). Protein spots on 2-D gels
Table 2 (parts 01 to 06). E. coli proteins identified on 2-D gels
It is predicted that within the next few years, the entire DNA sequence of E. coli will be determined. The next steps in the analysis of E. coli will be (i) to confirm that the proposed ORFs encode proteins, (ii) to determine how these genes are regulated, and (iii) to elucidate the function of these proteins. The plans for this database are designed to assist in this analysis. Two complementary projects to develop this database are under way. Each of these projects will provide a separate data set, and a third data set will be provided by the DNA sequencing projects. Eventually, the three data sets will converge, because each contains information on the same set of 3,500 to 4,000 E. coli proteins.
The initial concept of the Genome Expression Map was published in 1980 (34). All of those early protein identifications were made one at a time, primarily using purified proteins as markers to identify the spots. The supply of purified E. coli proteins was quickly exhausted, and so other methods to identify proteins were tried (44). The Genome Expression Map was intended to provide a method for identifying all of the proteins on the E. coli chromosome without relying on biochemists to purify the proteins or geneticists to construct mutants in each protein-encoding gene. Expressing genes carried on plasmids seemed like the ideal approach. At that time, a recombinant plasmid library, constructed by Clarke and Carbon (11), was available. One method for expressing proteins from recombinant plasmids had been described (49), and two more expression methods were developed (34, 47). Although many identifications have been made by using these three expression methods, all of the methods failed to consistently express all of the proteins encoded by the plasmids. The primary reason for the failure was that each of these methods relied on the E. coli transcription system, and gene expression was thus controlled by the cell’s own regulatory mechanisms, which do not allow equal transcription of all genes.
The new approach currently used for the Genome Expression Map project focuses on simultaneously identifying many gene products. This will be accomplished by using the sets of ordered clones produced and sequenced by other laboratories (13, 23), by expressing the genes on these clones with a selective expression system, and by matching the proteins produced by each clone to ORFs found on the clone. Because these clones have been mapped to a position on the chromosome and completely sequenced, the cloning is easier (all restriction sites are known), and a list of potential protein products is already available. The expression system uses phage transcription systems (56), which offer two advantages over the E. coli transcription system. First, because phage RNA polymerases appear to ignore the transcription signals (encoded within the DNA sequence) used by the E. coli RNA polymerase to start and stop transcription, every ORF on a plasmid should be transcribed within a single transcription unit. Second, by taking advantage of the sensitivity of the E. coli RNA polymerase and the resistance of the phage RNA polymerases to the antibiotic rifampin, the plasmid-encoded genes can be expressed exclusively. The sizes of the ordered chromosomal fragments allow 10 to 20 proteins to be identified simultaneously (based on the assumption that the average gene is 1 kb long), and yet this number of proteins is still small enough to allow unambiguous matching of proteins to ORFs in most cases, because of the variation in the charges and masses of the proteins (and hence in their migration on the 2-D gel) and also because all 20 genes will rarely be expressed from a single strand. The experimental methods used for this project have been described in detail elsewhere (50) and are presented here only briefly.
To express the genes from ordered sets of clones, the E. coli DNA from these clones is moved into a special plasmid vector, and then the recombinant plasmid is transformed into a special E. coli strain. The special vector possesses several important features, including (i) a low-copy replicon to minimize the effects of certain genes that are lethal to E. coli when present in high copy, (ii) the lacZ gene within the multiple cloning site to allow simple screening for plasmids containing inserts, and (iii) two different phage promoters flanking the multiple cloning site (oriented opposite to each other) to provide a means of independently expressing the protein-encoding genes on each DNA strand. Each of the special strains used to express the genes on these plasmids carries one of the phage RNA polymerase genes under the control of an inducible promoter to prevent the expression of the plasmid-encoded genes until the inducer is added, again minimizing the effects of lethal genes. The strains are also recA mutants; thus, recombination between the E. coli DNA on the plasmid and the chromosome is prevented.
To tag the proteins produced from the plasmid-encoded genes, a mixture of 3H-amino acids is added to a culture in which the phage RNA polymerase has been induced and the E. coli RNA polymerase has been inhibited by rifampin. These 3H-labeled proteins are separated on 2-D gels. Because there is virtually no contamination from chromosomally encoded proteins to serve as landmark spots on the 2-D gels, the 3H-labeled extracts are also comigrated on a 2-D gel with a whole-cell extract made from a culture (strain W3110) labeled with [14C]glucose in order to map each plasmid-encoded protein to a precise location on the reference 2-D images (50).
To match the ORFs found in the DNA sequence to spots on the 2-D gels, standard curves (shown in Fig. 5) were prepared by using the large set of proteins that have been identified on 2-D gels and whose genes have been sequenced. From the sequence of the gene, the amino acid composition is deduced, and from the amino acid composition, the pI and MW of the protein are calculated. Plots of pI versus migration in the first dimension and MW versus migration yield the equations that give an estimate of where the products of other genes should migrate. By themselves, these estimates are not sufficient to make a spot identification. However, when the number of candidate spots is reduced to 10 or so through the use of the selective expression system described, matches between ORFs and their protein products can be found. In some cases, no unambiguous assignment of a protein to an ORF can be made. In these cases, the protein will be assigned a Genome Expression Map name until further analysis clarifies which ORF matches the protein. This system of many-at-a-time protein identifications should rapidly increase the information compiled in the Genome Expression Map section of the database.
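The standard-curve step can be illustrated with a short sketch for the MW dimension; pI could be handled in the same way. Several things here are assumptions rather than the method used for Fig. 5: the log-linear relation between MW and migration, the least-squares fit, the synthetic calibration points, the function names, and the spot names in the usage note.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# SYNTHETIC calibration points: (gel y coordinate, calculated MW in Da).
# They lie on a straight line for illustration and are NOT from Fig. 5.
calibration = [(0.0, 100000.0), (25.0, 31622.7766), (50.0, 10000.0)]

# ASSUMPTION: log10(MW) is modeled as a linear function of migration.
slope, intercept = fit_line(
    [y for y, _ in calibration],
    [math.log10(mw) for _, mw in calibration],
)


def predicted_y(mw):
    """Estimate the second-dimension coordinate for a calculated MW."""
    return (math.log10(mw) - intercept) / slope


def rank_candidates(mw, spots):
    """Order candidate spots, given as (name, y), by distance from the prediction."""
    target = predicted_y(mw)
    return sorted(spots, key=lambda spot: abs(spot[1] - target))
```

For example, `rank_candidates(10000.0, [("X0001", 10.0), ("X0002", 48.0)])` places the hypothetical spot X0002 first. As the text notes, such an estimate only narrows the choice; it becomes decisive only when the selective expression system has already reduced the candidates to about 10 spots.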
The Genome Expression Map project specifically addresses the question of whether each ORF identified within the DNA sequence actually expresses a protein. By themselves, the results of such work would be a significant contribution to the study of E. coli. However, because the expression of the ORFs will be determined through analysis on 2-D gels, this project also provides the Response/Regulation Map project with the necessary linkage to the chromosome.
The Response/Regulation Map really began with the first publication describing the 2-D gel method (38). O’Farrell used protein extracts from E. coli and revealed the proteins synthesized under given growth conditions. Many other global studies that used 2-D gels have since been published. Two factors have limited the growth of this part of the database. First, lack of a standardized 2-D gel method (prior to 1991) hindered other investigators from contributing global studies to the database. Only one independent investigator has ever contributed to this part of the database (15). Second, although 2-D gels can resolve about 1,200 protein spots, the methods used to quantify the spots restrict global quantitation to the most abundant 200 to 600 proteins. Manual methods of quantitation are restricted by the low specific activities of radiolabeled amino acids and the time required to punch out individual spots for counting in scintillation counters. Computer-aided image analysis systems were introduced in the 1980s, but quantitation by this method is limited by the slow processing time of the computers, the immaturity of the software, and the narrow linear optical density range of the X-ray film used to capture the gel image.
With the development of faster computers, better image analysis software, and a new method to measure radioactivity in the spots on the gel (phosphorimagers [41]), it is now possible to quickly quantify 1,000 to 1,200 spots per gel and to match the spots among multiple images. When these methods of detection and analysis are used, proteins with steady-state levels of more than 50 molecules per cell or those with synthesis rates accounting for 0.04% of the total protein synthesized during a pulse-label can be detected and included in the analysis of each 2-D gel. The results from the two different analysis methods are expressed differently, as discussed in the footnotes to Table 3 (Table 3 on p. 2101; footnotes on p. 2111). The results of four comprehensive analyses were added to this edition of the database, and eventually, all of the partial analyses will be redone and entered into the database along with data from additional experiments. For each of these, only the data are given; the interpretations and conclusions made from these experiments are given elsewhere (see the footnotes to Tables 1, 2, 3, 4).
Table 3 (parts 01 to 11). Amount of various proteins in different steady-state and transient growth conditions
Table 4. Proteins belonging to stimulons and regulons
The Response/Regulation Map project is cataloging, through 2-D gel analysis, when (the response) and how (the regulation) each individual protein is expressed. Although this catalog will seldom define the exact function for any individual protein, it is expected to provide many of the clues that will direct the study of each protein’s function and to provide the physiological data needed to help define regulatory elements contained in the DNA sequence by revealing which proteins belong to a particular regulon. Because the information in the Response/Regulation Map has accumulated over many years and from both quantitative and qualitative analyses of the 2-D gels, the data are presented in two tables (Tables 3 [p. 2101] and 4 [p. 2112]). Quantitative data for all proteins measured in each experiment are listed in Table 3 and are expressed as a ratio for each test condition. The level of proteins (during the labeling period) is expressed either as an a' value or as parts per million. Table 4 gives results of experiments that were qualitatively analyzed (induction of proteins determined visually) or in which only small numbers of proteins were quantified. In these tables, investigators can look up the responses of individual proteins to different conditions. When the induction or repression of a set of proteins occurs via a common regulatory molecule, the term regulon is used to define these coregulated proteins (27). Four regulons are listed in Table 4: the HTP regulon, controlled by σ32 (31); the OXY regulon, controlled by the OxyR protein (9); the SOS regulon, controlled by the LexA protein (64); and the LRP regulon, controlled by the leucine response regulator (15). For identified proteins, membership in a particular regulon has usually been determined by genetic analysis; in other cases, 2-D gel analysis of mutants in the regulatory molecule revealed that proteins belonged to a regulon.
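A ratio of the kind listed in Table 3 can be computed as in the sketch below. It assumes that the ratio is the level under the test condition divided by the level under a reference condition, both in parts per million. The levels, the twofold cutoff, and the function names are hypothetical and serve only to show the calculation.

```python
def response_ratios(test_ppm, reference_ppm):
    """Ratio of test to reference level for spots measured in both conditions."""
    ratios = {}
    for spot, reference in reference_ppm.items():
        # Skip spots missing from the test gel or absent in the reference.
        if spot in test_ppm and reference > 0:
            ratios[spot] = test_ppm[spot] / reference
    return ratios


def induced_spots(ratios, cutoff=2.0):
    """Spots whose ratio meets a (hypothetical) induction cutoff."""
    return sorted(spot for spot, ratio in ratios.items() if ratio >= cutoff)


if __name__ == "__main__":
    # HYPOTHETICAL levels in parts per million; not taken from Table 3.
    reference = {"C013.5": 1500.0, "R1698": 200.0, "X2404": 400.0}
    test = {"C013.5": 4500.0, "R1698": 190.0, "X2404": 400.0}
    ratios = response_ratios(test, reference)
    print(ratios)
    print(induced_spots(ratios))
```

Grouping spots by which conditions push their ratio past a cutoff is one simple way to look for proteins that respond similarly; any real cutoff would have to reflect the measurement error of the method used.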
Eventually, many more regulons will be analyzed and added to the database. For example, the proteins belonging to the stimulons induced both by phosphate starvation and by growth in phosphonate should include the members of the regulon controlled by the PhoB transcriptional regulator (65). To verify that these proteins belong to this regulon, phoB mutants will be analyzed.
Part of ECO2DBASE, the electronic version of the database at the National Center for Biotechnology Information (see the section on Information Exchange), is a file not presented as a table here. This file, called Just Genes, lists the genes that encode E. coli proteins that are defined by genetic or biochemical criteria or that are proposed to exist on the basis of analysis of DNA sequence but that have not been identified on 2-D gels. This file was added to the database to serve as a reference for both projects and to assist other investigators interested in a single protein that has not been identified on 2-D gels. For the Genome Expression Map project, this file provides the list of ORFs, which are then matched to proteins produced by the clones. For the Response/Regulation Map, this table provides an estimate of where proteins already known to be induced by different conditions should migrate. In some cases, an identification can be made.
Although the information in the database is accessible independently from the 2-D gels, it is often seen merely as a master 2-D gel database for E. coli. While it does serve this purpose, there are many other applications for this database. The global approach of this database offers a special set of data to E. coli investigators because it links a genome analysis (Genome Expression Map) with physiological and regulatory analyses (Response/Regulation Map). These types of cellular protein databases are also being constructed for Drosophila melanogaster, mice, rats, and humans (reviewed in reference 62). Each eukaryotic database focuses on a specific type of cell, tissue, or body fluid. The aim of the Drosophila database is to study the variations in individual proteins in different developmental processes (51). The mammalian databases are trying to find proteins altered by disease states of cells and are also examining the effects of drug therapies. Many of the human databases are also linked to the human genome project (8).
The database primarily serves two types of applications: (i) for individual proteins, the database lists how the level and/or synthesis rate varies under different conditions and in different mutant strains; and (ii) for diagnosing the physiological state of a culture, the database identifies sets of proteins that are known by 2-D gel analysis to respond to a particular condition. Table 3 lists the responses of individual proteins to several conditions, making it relatively easy to identify the groups of proteins that respond similarly. With the recent developments in image analysis, more proteins can be analyzed per experiment. In all cases, when the protein spot has been identified as the product of a gene (through the Expression Map), subsequent analysis can go much further.
Perhaps one of the best examples of the use of the database to study an individual gene is the universal stress protein. This protein, C013.5, is a fairly abundant protein under the standard growth conditions used for the database (aerobic growth in glucose minimal medium at 37°C) but was observed to be induced by almost all of the stress conditions tested (Table 4). By reverse genetics (using protein purified on 2-D gels), the gene was identified and cloned, the DNA sequence was determined, and mutants were made (36). None of the known regulatory proteins for the stress responses appear to control this gene. Several phenotypes for the null mutant have been observed, which suggests that the protein is involved in regulating the utilization of glucose and the intermediates of glucose metabolism and also in regulating the steps involved in the differentiation of cells into an easily recoverable postexponential state (37). No other studies of either of these processes had ever identified this protein.
The physiological state of a culture is very difficult to diagnose. Many techniques for measuring or examining a single molecule or enzymatic activity have been developed. A more global look at the physiological states of cells can be taken by means of 2-D gel analysis. This approach allows investigators to alternate between 2-D gel analysis of physiological states of cells and genetic and biochemical analyses of individual genes. The best example of this application is the study of the heat shock response. One of the first global studies done by 2-D gels was of the response to a temperature shift (26). Early studies of the responses to changes in temperature had indicated that protein synthesis was unaffected (for shifts from 37 to 42°C in which the growth rate is unchanged). However, examination of pulse-labeled proteins on 2-D gels revealed that the synthesis rates of almost all proteins change transiently (26). The rate of synthesis of a small set of proteins was found to increase dramatically after a temperature shift-up. Later, 2-D gel analysis of a temperature-sensitive mutant revealed that this set of proteins was part of a regulon (30). Many genetic and biochemical studies that characterized the regulatory gene and its protein followed (31). Many of the members of this regulon had previously been characterized through genetic and biochemical analyses and were subsequently identified as heat shock proteins by means of 2-D gel analysis (e.g., see references 55 and 57). Even the signal transduction pathway for this regulon has been partially studied by 2-D gel analysis (60). Many of the stress conditions listed in Table 4 were used as part of the study of inducers of heat shock proteins. 
This type of global analysis is beginning to play an important part in expanding our information on other regulons as well (e.g., the LRP regulon [15]), which had previously been studied extensively through the biochemical and genetic analyses of one (or a small set) of the regulon members.
The database contains information on the levels of certain proteins at various growth rates that can prove useful in yet another way. For example, this information was used as the basis for experiments that used a novel approach to estimating the growth rate of Salmonella typhimurium (official designation, Salmonella enterica serovar Typhimurium) while these bacteria resided within macrophage host cells (1). Within a certain range, the levels of various translation factors and ribosomal proteins vary directly with growth rate (43). The level of ribosomal protein L7/L12 seen on 2-D gels produced from intracellular S. typhimurium suggested that the intracellular bacteria were growing rapidly. Prior to these experiments, the growth rates of intracellular bacteria had been estimated solely by counting viable bacteria following lysis of the host cells. The viable-count approach had indicated that intracellular S. typhimurium cells were growing quite slowly. These contrasting results led to further experiments in which it was determined that the intracellular bacteria consisted of at least two populations, one not dividing but viable and the other rapidly dividing (1).
A third type of query of the database is used to identify cellular trends for proteins. For example, Savageau used the database to look at the distribution of MW of proteins (52). Similar types of distributions for pI, amino acid usage, abundances of different classes of proteins, and even consensus sequences within the promoter regions for sets of coregulated genes could be determined by using the information in the database, especially as the number of identified proteins (in the Genome Expression Map project) and the number of conditions (in the Response/Regulation Map project) increase to represent a larger fraction of the total number of E. coli proteins. Once 2-D gel databases for other bacterial species are initiated, interesting comparative studies will be possible.
Five figures are included in the database: the three reference 2-D gels published in a previous edition of the database (63) (Fig. 1, 2, 3), one new reference 2-D gel that represents the Response/Regulation Map (Fig. 4), and a figure that gives the distributions of MW and pI for the proteins identified on these reference gels (Fig. 5). The reference gels are overlaid with grids, and the exact coordinates for each protein in the database are listed in Table 1 under the spot name. The coordinates for Fig. 4 are assigned by the computer program. A coarse grid was placed on the figure to locate the spots. The equations listed in Fig. 5 were used to estimate the MWs and pIs of the proteins listed in Table 1.
The volume of data found in this database is difficult to present as tables, especially considering the numerous starting points for posing questions of the database. Users are encouraged to obtain the electronic version of the database (see section on Information Exchange).
Table 1 (p. 2076) gives the positions of protein spots on 2-D gels and the MW and pI for each protein. This table is sorted in order by the spot name, first by alpha-numeric names and then by the Response/Regulation Map names. All of the protein spots listed in other tables of the database are listed in Table 1. All of the spots observed in Fig. 1 and 4 have been assigned names, but some have no data entered and thus have not been included in this tabular version of the database. They are listed in the electronic version, ECO2DBASE (see the section on Information Exchange). Table 1 lists the coordinate positions (on Fig. 1, 2, 3, 4) for the spots, the calculated MW and pI of each identified protein, and an estimated MW and pI for every protein in the table.
Table 2 (p. 2094) lists all of the proteins that have been identified as products of particular genes (or ORFs found within the DNA sequence) or are known proteins. The table is sorted by gene name, and it references all of the information in the Expression Map. The following types of information for each protein spot are included: gene name, protein name (if one has been assigned), alpha-numeric name, category of function (48), EC number, SWISS-PROT number, GenBank codes, direction of the gene on the chromosome, genetic map location, physical map location (using the Kohara miniset to approximate the location), basis of the identification, and donor of the material used in the identification. Table 2 also lists some proteins that were expressed from a specific Kohara clone but have not been linked to a gene contained on that clone.
Table 3 (p. 2101) lists all proteins included in a global study in which the level or synthesis rates of proteins were measured. Columns 3 to 14 represent steady-state growth conditions; the next 5 columns list growth transition conditions. The table presents the data, and the footnotes give a brief description of the experiment and/or the paper that originally presented the data. Included in this table are the gene names associated with identified proteins.
Table 4 (p. 2112) lists the protein spots induced by one or more of the conditions not listed in Table 3. Y indicates that the proteins appeared to be induced, according to visual analysis of the 2-D gels, and Y followed by a number indicates the induction ratio of that protein. This table also lists proteins belonging to one or more regulons (only the HTP, SOS, OXY, and LRP regulons have been included so far).
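The Y notation used in Table 4 can be read mechanically. The Python sketch below is one possible reading, not part of the database itself: it assumes that a cell holds either "Y", "Y" followed by a number (with or without a space), or nothing, and the function name parse_induction is hypothetical.

```python
import re

# "Y" alone, or "Y" followed by an optional space and a number.
_ENTRY = re.compile(r"^Y\s*(\d+(?:\.\d+)?)?$")


def parse_induction(cell):
    """Return (induced, ratio) for one Table 4 cell.

    "Y"      -> (True, None)  induced according to visual analysis
    "Y 3.5"  -> (True, 3.5)   induced, with an induction ratio
    other    -> (False, None) assumed to mean no induction recorded
    """
    match = _ENTRY.match(cell.strip())
    if not match:
        return (False, None)
    ratio = match.group(1)
    return (True, float(ratio) if ratio is not None else None)


# Hypothetical cell values, not entries from the database.
for cell in ["Y", "Y 3.5", ""]:
    print(repr(cell), parse_induction(cell))
```

Keeping the ratio as None when only a visual call was made preserves the distinction the table draws between the two kinds of entry.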
Information exchange is a priority issue for the database. By 1990, information from numerous publications, laboratory notebooks, and the gene-protein index had all been entered into an electronic version of the database. In 1992, the electronic version was deposited at the database repository at the National Center for Biotechnology Information, and updates were submitted to make all of the information accessible to investigators. Large-volume information databases are best used in electronic form, and users are encouraged to obtain the database through anonymous ftp from the repository. The Internet address is ncbi.nlm.nih.gov or 130.14.20.1 in the directory /ncbi/repository/ECO2DBASE. The reference 2-D gels are in the GELS directory, and the database and information files are in the edition6 directory. For those users who do not have access to the Internet, a copy of the database can be obtained from the authors (please specify a disk format).
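For readers who script their transfers, the anonymous ftp retrieval described above might be sketched in Python with the standard ftplib module. This is an illustration under assumptions: the helper names are hypothetical, GELS and edition6 are assumed to sit directly under the repository directory, and the host is passed in by the caller because the server may have moved since this was written.

```python
from ftplib import FTP
import posixpath

REPOSITORY_DIR = "/ncbi/repository/ECO2DBASE"  # directory given in the text


def eco2dbase_dirs(base=REPOSITORY_DIR):
    """Paths of the two directories named in the text.

    Assumption: GELS and edition6 are subdirectories of the repository
    directory; the text does not state this explicitly.
    """
    return {
        "gels": posixpath.join(base, "GELS"),
        "database": posixpath.join(base, "edition6"),
    }


def list_directory(host, directory):
    """Log in anonymously and return the names found in one directory."""
    with FTP(host) as ftp:
        ftp.login()        # no arguments means an anonymous login
        ftp.cwd(directory)
        return ftp.nlst()


print(eco2dbase_dirs()["database"])
```

A call such as list_directory(host, eco2dbase_dirs()["gels"]), with host set to the Internet address given above, would return the names of the reference gel files if the repository is still reachable at that address.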
The alphanumeric names of proteins that have been identified have been incorporated into the other databases, including the SWISS-PROT protein database (5) and the ECD database (24), so that users can easily and accurately move among the different databases. A new database for E. coli (based on the Caenorhabditis elegans database) is being developed. It will serve as an encyclopedia of all the information known about E. coli (Staffan Bergh, personal communication). All of the independent databases are being included in this encyclopedia. The gene-protein database, including the 2-D reference gels, has already been entered.
Other investigators can contribute information to the database. For the Genome Expression Map project, samples of purified proteins can be sent to assist in the identification project. For the Response/Regulation Map, investigators are encouraged to submit physiological and regulatory information from their own 2-D gel analyses (as was done by B. Ernsting and R. Matthews [15]), although this requires that the 2-D gel pattern closely match that of the reference gels.
The Genome Expression Map project is supported by grant DMB-8903787 from the National Science Foundation and grant GM17892 from the National Institutes of Health (NIH). Current work on the Response/Regulation Map is supported through Parke-Davis Pharmaceutical Research. A. Pertsemlidis was supported by NIH grant GM08352–784525–31002.
We thank the many investigators (listed in Table 2) who have contributed biological material for protein identifications. We thank Amos Bairoch for assistance with the gene names and SWISS-PROT accession numbers and Manfred Kroger and Kenn Rudd for their assistance with map positions of genes. We also acknowledge all of the scientists who have worked on the database in the past: David Appleby, Philip L. Bloch, Jacqueline A. Bogan, Madhumita Ghosh, Sherrie Herendeen, M. Elizabeth Hutton, Douglas Irvine, Peggy LeMaux, Steen Pedersen, Teresa A. Phillips, Sankar P. Reddy, Solvejg Reeh, and Vicki Vaughn.
References
1. Abshire, K. Z., and F. C. Neidhardt. 1993. Growth rate paradox of Salmonella typhimurium within host macrophages. J. Bacteriol. 175:3744–3748.
2. Ames, G. F.-L., and K. Nikaido. 1976. Two-dimensional gel electrophoresis of membrane proteins. Biochemistry 15:616–622.
2a. Allen, S. P., J. O. Polazzi, J. K. Gierse, and A. M. Easton. 1992. Two novel heat shock genes encoding proteins produced in response to heterologous protein expression in Escherichia coli. J. Bacteriol. 174:6938–6947.
3. Ang, D., G. N. Chandrasekhar, M. Zylicz, and C. Georgopoulos. 1986. Escherichia coli grpE gene codes for heat shock protein B25.3, essential for both lambda DNA replication at all temperatures and host growth at high temperature. J. Bacteriol. 167:25–29.
4. Bachmann, B. J. 1990. Linkage map of Escherichia coli K-12, edition 8. Microbiol. Rev. 54:130–197.
5. Bairoch, A., and B. Boeckmann. 1993. The SWISS-PROT protein sequence data bank, recent developments. Nucleic Acids Res. 21:3093–3096.
6. Bloch, P. L., T. A. Phillips, and F. C. Neidhardt. 1980. Protein identifications of O’Farrell two-dimensional gels: locations of 81 Escherichia coli proteins. J. Bacteriol. 141:1409–1420.
7. Blumenthal, R. M., P. G. Lemaux, F. C. Neidhardt, and P. P. Dennis. 1976. The effects of the relA gene on the synthesis of aminoacyl-tRNA synthetases and other transcription and translation proteins in Escherichia coli A. Mol. Gen. Genet. 149:291–296.
8. Celis, J. E., H. H. Rasmussen, E. Olsen, P. Madsen, H. Leffers, B. Honore, K. Dejgaard, P. Gromov, H. J. Hoffmann, and M. Nielsen. 1993. The human keratinocyte two-dimensional gel protein database: update 1993. Electrophoresis 14:1091–1198.
9. Christman, M. F., R. W. Morgan, F. S. Jacobson, and B. N. Ames. 1985. Positive control of a regulon for defenses against oxidative stress and some heat-shock proteins in Salmonella typhimurium. Cell 41:753–762.
10. Chuang, S.-E., and F. R. Blattner. 1993. Characterization of twenty-six new heat shock genes of Escherichia coli. J. Bacteriol. 175:5242–5252.
11. Clarke, L., and J. Carbon. 1976. A colony bank containing synthetic ColE1 hybrid plasmids representative of the entire E. coli genome. Cell 9:91–99.
12. Copeland, B. R., R. J. Richter, and C. E. Furlong. 1982. Renaturation and identification of periplasmic proteins in two-dimensional gels of Escherichia coli. J. Biol. Chem. 257:15065–15071.
13. Daniels, D. L., G. Plunkett, V. Burland, and F. Blattner. 1992. DNA sequence of E. coli. I. The region from 84.5 to 86.5 minutes. Science 257:771–778.
14. Engstrom, P., and G. L. Hazelbauer. 1980. Multiple methylation of methyl-accepting chemotaxis proteins during adaptation of E. coli to chemical stimuli. Cell 20:165–171.
15. Ernsting, B. R., M. R. Atkinson, A. J. Ninfa, and R. G. Matthews. 1992. Characterization of the regulon controlled by the leucine-responsive regulatory protein in Escherichia coli. J. Bacteriol. 174:1109–1118.
16. Gage, D. J., and F. C. Neidhardt. 1993. Adaptation of Escherichia coli to the uncoupler of oxidative phosphorylation 2,4-dinitrophenol. J. Bacteriol. 175:7105–7108.
17. Goldstein, J., N. S. Pollitt, and M. Inouye. 1990. Major cold shock protein of Escherichia coli. Proc. Natl. Acad. Sci. USA 87:283–287.
18. Goodlove, P. E., P. R. Cunningham, J. Parker, and D. P. Clark. 1989. Cloning and sequence analysis of the fermentative alcohol-dehydrogenase-encoding gene of Escherichia coli. Gene 85:209–214.
19. Gudas, L. J., and D. W. Mount. 1977. Identification of the recA (tif) gene product of Escherichia coli. Proc. Natl. Acad. Sci. USA 74:5280–5284.
20. Herendeen, S. H., R. A. VanBogelen, and F. C. Neidhardt. 1979. Levels of major proteins of Escherichia coli during growth at different temperatures. J. Bacteriol. 139:185–194.
21. Jones, P. G., R. A. VanBogelen, and F. C. Neidhardt. 1987. Induction of proteins in response to low temperature in Escherichia coli. J. Bacteriol. 169:2092–2095.
22. Kaltschmidt, E., and H. G. Wittmann. 1970. Ribosomal proteins. VII. Two-dimensional polyacrylamide gel electrophoresis for fingerprinting of ribosomal proteins. Anal. Biochem. 36:401–412.
23. Kohara, Y., K. Akiyama, and K. Isono. 1987. The physical map of the whole E. coli chromosome: application of a new strategy for rapid analysis and sorting of a large genomic library. Cell 50:495–508.
24. Kroger, M., R. Wahl, and P. Rice. 1993. Compilation of DNA sequences of Escherichia coli (update 1993). Nucleic Acids Res. 21:2973–3000.
25. Kroh, H. E., and L. D. Simon. 1990. The ClpP component of Clp protease is the sigma-32 dependent heat shock protein F21.5. J. Bacteriol. 172:6026–6034.
26. Lemaux, P. G., S. L. Herendeen, P. L. Bloch, and F. C. Neidhardt. 1978. Transient rates of synthesis of individual polypeptides in E. coli following temperature shifts. Cell 13:427–434.
27. Neidhardt, F. C. 1987. Multigene systems and regulons, p. 1313–1317. In F. C. Neidhardt, J. L. Ingraham, K. B. Low, B. Magasanik, M. Schaechter, and H. E. Umbarger (ed.), Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology, vol. 2. American Society for Microbiology, Washington, D.C.
28. Neidhardt, F. C., P. L. Bloch, S. Pedersen, and S. Reeh. 1977. Chemical measurement of steady-state levels of ten aminoacyl-transfer ribonucleic acid synthetases in Escherichia coli. J. Bacteriol. 129:378–387.
29. Neidhardt, F. C., P. L. Bloch, and D. F. Smith. 1974. Culture media for enterobacteria. J. Bacteriol. 119:736–747.
30. Neidhardt, F. C., and R. A. VanBogelen. 1981. Positive regulatory gene for temperature-controlled proteins in Escherichia coli. Biochem. Biophys. Res. Commun. 100:894–900.
31. Neidhardt, F. C., and R. A. VanBogelen. 1987. Heat shock response, p. 1334–1345. In F. C. Neidhardt, J. L. Ingraham, K. B. Low, B. Magasanik, M. Schaechter, and H. E. Umbarger (ed.), Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology, vol. 2. American Society for Microbiology, Washington, D.C.
32. Neidhardt, F. C., R. A. VanBogelen, and V. Vaughn. 1984. The genetics and regulation of heat-shock proteins. Annu. Rev. Genet. 18:295–329.
33. Neidhardt, F. C., V. Vaughn, T. A. Phillips, and P. L. Bloch. 1983. Gene-protein index of Escherichia coli K-12. Microbiol. Rev. 47:231–284.
34. Neidhardt, F. C., R. Wirth, M. W. Smith, and R. VanBogelen. 1980. Selective synthesis of plasmid-coded proteins by Escherichia coli during recovery from chloramphenicol treatment. J. Bacteriol. 143:535–537.
35. Nomenclature Committee of the International Union of Biochemistry. 1984. Enzyme Nomenclature. Academic Press, Inc., New York.
36. Nystrom, T., and F. C. Neidhardt. 1992. Cloning, mapping, and nucleotide sequence of a gene encoding a universal stress protein in Escherichia coli. Mol. Microbiol. 6:3187–3198.
37. Nystrom, T., and F. C. Neidhardt. 1993. Isolation and properties of a mutant of Escherichia coli with an insertional inactivation of the uspA gene, which encodes a universal stress protein. J. Bacteriol. 175:3949–3956.
38. O’Farrell, P. H. 1975. High resolution two-dimensional electrophoresis of proteins. J. Biol. Chem. 250:4007–4021.
39. O’Farrell, P. Z., H. M. Goodman, and P. H. O’Farrell. 1977. High resolution two-dimensional electrophoresis of basic as well as acidic proteins. Cell 12:1133–1142.
40. Parker, J. 1984. Identification of the purC gene product of Escherichia coli. J. Bacteriol. 157:712–717.
41. Patterson, S. D., and G. I. Latter. 1993. Evaluation of storage phosphor imaging for quantitative analysis of 2-D gels using the Quest II system. BioTechniques 15:1076–1083.
42. Patton, W. F., M. F. Lopez, P. Barry, and W. M. Skea. 1992. A mechanically strong matrix for protein electrophoresis with enhanced silver staining properties. BioTechniques 12:580–585.
43. Pedersen, S., P. L. Bloch, S. Reeh, and F. C. Neidhardt. 1978. Patterns of protein synthesis in E. coli: a catalog of the amount of 140 individual proteins at different growth rates. Cell 14:179–190.
44. Phillips, T. A., P. L. Bloch, and F. C. Neidhardt. 1980. Protein identifications on O’Farrell two-dimensional gels: locations of 55 additional Escherichia coli proteins. J. Bacteriol. 144:1024–1033.
45. Phillips, T. A., V. Vaughn, P. L. Bloch, and F. C. Neidhardt. 1987. Gene-protein index of Escherichia coli K-12, edition 2, p. 919–966. In F. C. Neidhardt, J. L. Ingraham, K. B. Low, B. Magasanik, M. Schaechter, and H. E. Umbarger (ed.), Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology, vol. 2. American Society for Microbiology, Washington, D.C.
46. Reeh, S., and S. Pedersen. 1979. Post-translational modification of Escherichia coli ribosomal protein S6. Mol. Gen. Genet. 173:183–187.
47. Reeve, J. 1979. The use of minicells for bacteriophage directed polypeptide synthesis. Methods Enzymol. 68:493–503.
48. Riley, M. 1993. Functions of the gene products of Escherichia coli. Microbiol. Rev. 57:862–952.
49. Sancar, A., A. M. Hack, and W. D. Rupp. 1979. Simple method for identification of plasmid-coded proteins. J. Bacteriol. 137:692–693.
50. Sankar, P., M. E. Hutton, R. A. VanBogelen, R. L. Clark, and F. C. Neidhardt. 1993. Expression analysis of cloned chromosomal segments of Escherichia coli. J. Bacteriol. 175:5145–5152.
51. Santaren, J. F. 1990. Towards establishing a protein database of Drosophila. Electrophoresis 11:254–267.
52. Savageau, M. A. 1986. Proteins of Escherichia coli come in sizes that are multiples of 14 kDa: domain concepts and evolutionary implications. Proc. Natl. Acad. Sci. USA 83:1198–1202.
52a. Sood, P., C. G. Lerner, T. Shimamoto, Q. Lu, and M. Inouye. 1994. Characterization of Era, essential Escherichia coli GTPase. Mol. Microbiol. 12:201–208.
53. Smith, M. W., and F. C. Neidhardt. 1983. Proteins induced by anaerobiosis in Escherichia coli. J. Bacteriol. 154:336–343.
54. Smith, M. W., and F. C. Neidhardt. 1983. Proteins induced by aerobiosis in Escherichia coli. J. Bacteriol. 154:344–350.
55. Squires, C. L., S. Petersen, B. M. Ross, and C. Squires. 1991. ClpB is the Escherichia coli heat shock protein F84.1. J. Bacteriol. 173:4254–4262.
56. Studier, F. W., and B. A. Moffatt. 1986. Use of bacteriophage T7 RNA polymerase to direct selective high-level expression of cloned genes. J. Mol. Biol. 189:113–130.
57. Tilly, K., R. A. VanBogelen, C. Georgopoulos, and F. C. Neidhardt. 1983. Identification of the heat-inducible protein C15.4 as the groES gene product in Escherichia coli. J. Bacteriol. 154:1505–1507.
58. VanBogelen, R. A., M. E. Hutton, and F. C. Neidhardt. 1990. Gene protein database of Escherichia coli K-12: edition 3. Electrophoresis 11:1131–1166.
59. VanBogelen, R. A., P. M. Kelley, and F. C. Neidhardt. 1987. Differential induction of heat shock, SOS, and oxidation stress regulons and accumulation of nucleotides in Escherichia coli. J. Bacteriol. 169:26–32.
60. VanBogelen, R. A., and F. C. Neidhardt. 1990. Ribosomes as sensors of heat and cold shock in Escherichia coli. Proc. Natl. Acad. Sci. USA 87:5589–5593.
61. VanBogelen, R. A., and F. C. Neidhardt. 1991. The gene-protein database of Escherichia coli K-12: edition 4. Electrophoresis 12:955–994.
62. VanBogelen, R. A., and E. R. Olson. Application of 2-D protein gels in biotechnology. Biotech. Annu. Rev., in press.
63. VanBogelen, R. A., P. Sankar, R. L. Clark, J. A. Bogan, and F. C. Neidhardt. 1992. The gene-protein database of Escherichia coli K-12: edition 5. Electrophoresis 13:1014–1054.
64. Walker, G. C. 1987. The SOS response of Escherichia coli, p. 1346–1357. In F. C. Neidhardt, J. L. Ingraham, K. B. Low, B. Magasanik, M. Schaechter, and H. E. Umbarger (ed.), Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology, vol. 2. American Society for Microbiology, Washington, D.C.
65. Wanner, B. L. 1992. Is cross regulation by phosphorylation of two-component response regulator proteins important in bacteria? J. Bacteriol. 174:2053–2058.