Journal of Molecular Biology
Regular article
Evolution of protein sequences and structures1
Introduction
Two general models attempt to explain how the tertiary structure of a protein is encoded in its linear sequence of amino acids: (1) the local model, in which fold specificity is coded in just a few critical residues (10–20 % of the sequence); and (2) the global model, in which the fold is formed by interactions involving the entire sequence (Lattman & Rose, 1993). The most obvious confirmation of the local model is the misfolding mutations associated with certain diseases, such as cystic fibrosis (Thomas et al., 1995). The global model is supported by numerous mutation studies showing that most mutations at any position in a protein sequence have no measurable impact on protein function and therefore, by implication, on structure (Bowie et al., 1990; Lattman & Rose, 1993; Matthews, 1987).
The local model receives considerable support from examples of structurally similar proteins that do not share significant sequence similarity, e.g. actin and hexokinase (Kabsch & Holmes, 1995). Since actin and hexokinase share similarity of overall structure and ATP-binding sites but lack significant sequence similarity, they are frequently referred to as remote homologues. The structural similarity of remote homologues can be explained as the conservation of certain critical “core” folding residues, as the local model predicts.
If protein folding information is localized to critical residues, as the structures of remote homologues apparently imply, we would expect that the non-critical residues would be poorly constrained during sequence evolution. Such heterogeneity in functional constraint of amino acids would produce proteins with modest sequence similarity, but nearly identical structures. Alternatively, a strong, continuous correlation between sequence and structural similarity would imply that the folding information is distributed throughout the sequence, not localized to particular residues. In a continuous correlation of sequence and structural similarity, each amino acid position contributes to the overall structural similarity.
Early studies by Chothia & Lesk (1986, 1987) showed a strongly non-linear relationship between sequence and structural similarity (Figure 1(a)). Very similar sequences showed modest structural differences, but structural differences increased dramatically as sequence identities dropped below 15–20 %. This observation supports the local model for protein folding: changes in sequences at 80–100 % identity have small effects on structure, but changes at 15–25 % identity, which are more likely to involve critical “core” residues, have a much larger effect. Recent studies have confirmed these findings with larger sets of protein structures (Flores et al., 1993; Russell et al., 1997). All these studies used the percent sequence identity and the root-mean-square difference (RMSD) of superimposed Cα atoms to measure sequence and structural similarity, respectively. The percent sequence identity and RMSD have shortcomings as measures of sequence and structural similarity (Brenner et al., 1998; Levitt & Gerstein, 1998), and thus a new evaluation of the correlation of sequence and structural similarity of homologous proteins is warranted.
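The two classical measures named above can be illustrated with a minimal sketch. It assumes the sequences are already aligned (gaps written as "-"), that the Cα coordinates are already superimposed and paired, and that percent identity is taken over non-gap aligned columns; other denominators are in common use. The function names are hypothetical and are not taken from the studies cited.

```python
import math


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: the denominator is the number of gap-free aligned columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def rmsd(coords_a, coords_b):
    """Root-mean-square difference between paired (x, y, z) points.

    Assumption: the two coordinate sets are already optimally superimposed;
    no rotation or translation is applied here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))
```

Both quantities depend on how the alignment and the superposition were produced, which this sketch takes as given.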
To measure the correlation of sequence and structural similarity, we used modern database searching programs to estimate the significance of the sequence and structure similarity for 36 protein families (Table 1) with five or more known structures from sequences that are less than 80 % identical. We find that most of the evolutionary structural change in a protein family is linearly related to changes in sequence similarity, when plotted in terms of statistical significance or as RMSD versus percent identity. Although we detected significant non-linear components in the relationship between sequence and structural similarity, these additional components explained very little of the structural variance, supporting a largely global view of protein fold specificity. The slope of the linear fit of sequence/structure similarity defines how much the structure of a protein is expected to change with a given amount of sequence change. We call this quantity the structural mutation sensitivity and show that it differs among protein families and is not correlated with protein structural class or protein family mutation rate.
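As an illustration of how such a slope could be obtained, the sketch below fits a straight line to paired sequence and structure similarity scores for one family. It assumes ordinary least squares with sequence similarity on the x-axis; the function name and the data in the usage note are hypothetical, not values or procedures from this study.

```python
def linear_fit(seq_scores, struct_scores):
    """Ordinary least-squares fit of structure similarity on sequence similarity.

    Returns (slope, intercept, r_squared). Under the assumptions stated in the
    lead-in, the slope plays the role of a structural mutation sensitivity.
    """
    n = len(seq_scores)
    if n != len(struct_scores) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(seq_scores) / n
    mean_y = sum(struct_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in seq_scores)
    if sxx == 0:
        raise ValueError("sequence scores must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(seq_scores, struct_scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Fraction of structural variance explained by the linear term.
    ss_tot = sum((y - mean_y) ** 2 for y in struct_scores)
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(seq_scores, struct_scores))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared
```

For example, `linear_fit([5, 10, 15, 20], [3, 5, 7, 9])` describes made-up points lying exactly on a line, so the whole of the structural variance is explained by the linear term; a family with important non-linear components would show a lower `r_squared`.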
Sequence and structural similarity
Although percent sequence identity is routinely used to quantify sequence similarity, it has been known for more than 20 years that similarity scores based solely on sequence identity perform poorly when compared to substitution matrices that recognize conservative substitutions between residues with similar biochemical properties (Pearson, 1995; Schwartz & Dayhoff, 1978); recently, shortcomings in the percent identity measure were demonstrated on a database of sequences whose structures are known (Levitt &
Discussion
We have examined the relationship between sequence change and structural change in 36 protein families with five or more diverse members whose structures are known. For most of the protein families that we examined, changes in structural similarity are linearly dependent on changes in sequence similarity. In the globin family (Figure 2, Figure 3), a change in a sequence z-score from z = 15 to 25 standard deviations above the mean (24.9 % identity at z = 15 to 30.3 % identity at z = 25) will change
Sequence and structure comparisons
Sequence and structure similarity searches were performed on a database of 1770 sequences (pdb.nr80) for which structures have been determined. This database was produced from the 9039 sequences in the PDB (release 80) by a simple selection and database search process that identified related sequences in the fully redundant database (E() < 10^-4) that were 100 % identical (pdb.nr100) or more than 80 % identical (pdb.nr80). The pdb.nr100 and pdb.nr80 databases are available from //ftp.virginia.edu/pub/fasta
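The kind of redundancy filtering described above can be sketched as a greedy selection. This is only an illustration under stated assumptions: the source does not spell out the selection procedure, the identity function here is a toy positional comparison rather than an alignment-based search, and all names are hypothetical.

```python
def toy_identity(seq_a, seq_b):
    """Toy stand-in for an alignment-based identity (assumption).

    Compares positions directly, without gaps, over the shorter sequence.
    """
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    same = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return 100.0 * same / n


def select_nonredundant(entries, threshold=80.0, identity_fn=toy_identity):
    """Greedy filter over (name, sequence) pairs.

    An entry is kept only if it is not more than `threshold` percent
    identical to any entry already kept; input order decides which member
    of a redundant group survives.
    """
    kept = []
    for name, seq in entries:
        if all(identity_fn(seq, kept_seq) <= threshold
               for _, kept_seq in kept):
            kept.append((name, seq))
    return kept
```

In practice `identity_fn` would be replaced by the identity reported from a real pairwise alignment or database search, and the result depends on the order in which sequences are considered.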
Acknowledgements
This work was supported by a grant from the National Library of Medicine (LM04961). We thank Bob Kretsinger for his careful reading of the manuscript and for helpful suggestions and Frank Harrell for his statistical advice.
References
- et al. Local alignment statistics. Methods Enzymol. (1996)
- et al. A basic local alignment search tool. J. Mol. Biol. (1990)
- et al. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. (1993)
- Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull. Math. Biol. (1992)
- et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995)
- Effective protein sequence comparison. Methods Enzymol. (1996)
- Empirical statistical estimates for sequence similarity searches. J. Mol. Biol. (1998)
- et al. Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J. Mol. Biol. (1997)
- et al. Identification of common molecular subsequences. J. Mol. Biol. (1981)
- Detecting structural similarities: a user’s guide. Methods Enzymol. (1996)
- Defective protein folding as a basis of human disease. Trends Biochem. Sci.
- SARFing the PDB. Protein Eng.
- Issues in searching molecular sequence databases. Nature Genet.
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res.
- The Swiss-Prot protein sequence data bank and its new supplement TrEMBL. Nucl. Acids Res.
- Deciphering the message in protein sequences: tolerance to amino acid substitutions. Science
- Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA
- The relation between the divergence of sequence and structure in proteins. EMBO J.
1 Edited by J. M. Thornton