Regular article
Evolution of protein sequences and structures¹

https://doi.org/10.1006/jmbi.1999.2972

Abstract

The relationship between sequence similarity and structural similarity has been examined in 36 protein families with five or more diverse members whose structures are known. The structural similarity within a family (as determined with the DALI structure comparison program) is linearly related to sequence similarity (as determined by a Smith-Waterman search of the protein sequences in the structure database). The correlation between structural similarity and sequence similarity is very high; 18 of the 36 families had linear correlation coefficients r⩾0.878, and only nine had correlation coefficients r⩽0.815. Inclusion of higher-order terms in the structure/sequence relationship improved the fit by less than 7 % in 27 of the 36 families. Differences in sequence/structure correlations are distributed evenly among the four protein structural classes, α, β, α/β, and α+β. While most protein families show high correlations between sequence similarity and structural similarity, the amount of structural change per sequence change, i.e. the structural mutation sensitivity, varies almost fourfold. Protein families with high and low structural mutation sensitivity are distributed evenly among protein structure classes. In addition, we did not detect strong correlations between structural mutation sensitivity and either protein family mutation rates or protein size. Our results are more consistent with models of protein structure that encode a protein family’s fold throughout the protein sequence, and not just in a few critical residues.

Introduction

Two general models attempt to explain how the tertiary structure of a protein is encoded in its linear sequence of amino acids: (1) the local model, in which fold specificity is coded in just a few critical residues (10–20 % of the sequence); and (2) the global model, in which the fold is formed by interactions involving the entire sequence (Lattman & Rose, 1993). The most obvious support for the local model comes from the misfolding mutations associated with certain diseases, such as cystic fibrosis (Thomas et al., 1995). The global model is supported by numerous mutation studies which show that most mutations at any position in a protein sequence have no measurable impact on the protein function, and therefore on the structure (Bowie et al., 1990; Lattman & Rose, 1993; Matthews, 1987).

The local model receives considerable support from examples of structurally similar proteins that do not share significant sequence similarity, e.g. actin and hexokinase (Kabsch & Holmes, 1995). Since actin and hexokinase share similarity of overall structure and ATP-binding sites but lack significant sequence similarity, they are frequently referred to as remote homologues. The structural similarity of remote homologues can be explained as the conservation of certain critical “core” folding residues, as the local model predicts.

If protein folding information is localized to critical residues, as the structures of remote homologues apparently imply, we would expect that the non-critical residues would be poorly constrained during sequence evolution. Such heterogeneity in functional constraint of amino acids would produce proteins with modest sequence similarity, but nearly identical structures. Alternatively, a strong, continuous correlation between sequence and structural similarity would imply that the folding information is distributed throughout the sequence, not localized to particular residues. In a continuous correlation of sequence and structural similarity, each amino acid position contributes to the overall structural similarity.

Early studies by Chothia & Lesk (1986, 1987) showed a strongly non-linear relationship between sequence and structural similarity (Figure 1(a)). Very similar sequences showed modest structural differences, but structural differences increased dramatically as sequence identities dropped below 15–20 %. This observation supports the local model for protein folding: changes in sequences at 80–100 % identity have small effects on structure, but changes at 15–25 % identity, which are more likely to involve critical “core” residues, have a much larger effect. Recent studies have confirmed these findings with larger sets of protein structures (Flores et al., 1993; Russell et al., 1997). All these studies used the percent sequence identity and the root-mean-square difference (RMSD) of superimposed Cα atoms to measure sequence and structural similarity, respectively. Both measures have shortcomings (Brenner et al., 1998; Levitt & Gerstein, 1998), and thus a new evaluation of the correlation of sequence and structural similarity of homologous proteins is warranted.
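For readers unfamiliar with the two classical measures named above, the following is a minimal sketch of how each can be computed. It assumes the sequences are already aligned (gaps written as "-") and the Cα coordinates are already superimposed and paired; the function names and the choice to count only non-gap columns in the identity denominator are illustrative assumptions, not the procedure of the cited studies.

```python
import math

def percent_identity(aligned_a, aligned_b):
    """Percent of aligned, non-gap columns holding identical residues.

    Assumption: the denominator is the number of columns where neither
    sequence has a gap; other conventions exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

def rmsd(coords_a, coords_b):
    """Root-mean-square difference between paired (x, y, z) points.

    Assumption: the two coordinate sets are already optimally
    superimposed; no fitting is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy inputs, invented for illustration only.
print(percent_identity("ACDE-F", "ACEEGF"))
print(rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
           [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]))
```

Neither function weights conservative substitutions or local structural deviations, which is one way to see why both measures are considered coarse.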

To measure the correlation of sequence and structural similarity, we used modern database searching programs to estimate the significance of the sequence and structure similarity for 36 protein families (Table 1) with five or more known structures from sequences that are less than 80 % identical. We find that most of the evolutionary structural change in a protein family is linearly related to changes in sequence similarity, when plotted in terms of statistical significance or as RMSD versus percent identity. Although we detected significant non-linear components in the relationship between sequence and structural similarity, these additional components explained very little of the structural variance, supporting a largely global view of protein fold specificity. The slope of the linear fit of sequence/structure similarity defines how much the structure of a protein is expected to change with a given amount of sequence change. We call this quantity the structural mutation sensitivity and show that it differs among protein families and is not correlated with protein structural class or protein family mutation rate.
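The slope described above can be illustrated with an ordinary least-squares fit. The sketch below treats sequence similarity as x and structural similarity as y, and returns the slope (the structural mutation sensitivity in the sense used here), the intercept, and the linear correlation coefficient r. The function name and the data points are hypothetical; they are not values from the 36 families, and the fitting procedure actually used in the study may differ.

```python
import math

def linear_fit(seq_scores, struct_scores):
    """Least-squares line struct = slope * seq + intercept, plus Pearson r."""
    n = len(seq_scores)
    if n != len(struct_scores) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(seq_scores) / n
    mean_y = sum(struct_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in seq_scores)
    syy = sum((y - mean_y) ** 2 for y in struct_scores)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(seq_scores, struct_scores))
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined slope or correlation")
    slope = sxy / sxx                  # structural change per unit sequence change
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)     # linear correlation coefficient
    return slope, intercept, r

# Hypothetical similarity scores for one family (illustration only).
seq = [5.0, 10.0, 15.0, 20.0, 25.0]
struct = [4.2, 6.8, 10.1, 12.9, 16.0]
slope, intercept, r = linear_fit(seq, struct)
print(f"sensitivity (slope) = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
```

Comparing slopes between families is only meaningful when both axes use the same similarity measures for every family.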

Sequence and structural similarity

Although percent sequence identity is routinely used to quantify sequence similarity, it has been known for more than 20 years that similarity scores based solely on sequence identity perform poorly when compared to substitution matrices that recognize conservative substitutions with similar biochemical properties (Pearson, 1995; Schwartz & Dayhoff, 1978); recently, shortcomings in the percent identity measure were demonstrated on a database of sequences whose structures are known (Levitt &

Discussion

We have examined the relationship between sequence change and structural change in 36 protein families with five or more diverse members whose structures are known. For most of the protein families that we examined, changes in structural similarity are linearly dependent on changes in sequence similarity. In the globin family (Figure 2, Figure 3), a change in a sequence z-score from z = 15 to 25 standard deviations above the mean (24.9 % identity at z = 15 to 30.3 % identity at z = 25) will change

Sequence and structure comparisons

Sequence and structure similarity searches were performed on a database of 1770 sequences (pdb.nr80) for which structures have been determined. This database was produced from the 9039 sequences in the PDB (release 80) by a simple selection and database search process that identified related sequences in the fully redundant database (E() < 10⁻⁴) that were 100 % identical (pdb.nr100) or more than 80 % identical (pdb.nr80). The pdb.nr100 and pdb.nr80 databases are available from //ftp.virginia.edu/pub/fasta
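The selection step is described only briefly, so the following is a hedged sketch of one common way to build such a non-redundant set: a greedy filter that keeps a sequence only if it is not more than 80 % identical to any sequence already kept. The greedy strategy, the function names, and the toy identity function (which assumes equal-length, pre-aligned sequences) are all assumptions for illustration; the actual process used a database search with an E() threshold and may differ in its details.

```python
def toy_identity(a, b):
    """Percent identity for equal-length, pre-aligned sequences (toy stand-in
    for a real alignment-based identity)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def select_nonredundant(entries, threshold=80.0, identity_fn=toy_identity):
    """Greedy filter over (name, sequence) pairs.

    A sequence is dropped when it is MORE than `threshold` percent identical
    to a sequence kept earlier, so the result depends on input order.
    """
    kept = []
    for name, seq in entries:
        if all(identity_fn(seq, kept_seq) <= threshold for _, kept_seq in kept):
            kept.append((name, seq))
    return kept

# Invented sequences, for illustration only.
entries = [
    ("a", "AAAAAAAAAA"),
    ("b", "AAAAAAAAAC"),  # 90 % identical to "a"
    ("c", "AAAAAAAACC"),  # exactly 80 % identical to "a"
    ("d", "CCCCCCCCCC"),
]
print([name for name, _ in select_nonredundant(entries)])
```

Setting `threshold=99.999` with the same function would approximate a 100 %-identity filter analogous to pdb.nr100, again under the assumptions stated above.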

Acknowledgements

This work was supported by a grant from the National Library of Medicine (LM04961). We thank Bob Kretsinger for his careful reading of the manuscript and for helpful suggestions and Frank Harrell for his statistical advice.

References (36)

  • P.J. Thomas et al. Defective protein folding as a basis of human disease. Trends Biochem. Sci. (1995)
  • N.N. Alexandrov. SARFing the PDB. Protein Eng. (1996)
  • S.F. Altschul et al. Issues in searching molecular sequence databases. Nature Genet. (1994)
  • S.F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. (1997)
  • A. Bairoch et al. The Swiss-Prot protein sequence data bank and its new supplement TrEMBL. Nucl. Acids Res. (1996)
  • J.U. Bowie et al. Deciphering the message in protein sequences: tolerance to amino acid substitutions. Science (1990)
  • S.E. Brenner et al. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA (1998)
  • C. Chothia et al. The relation between the divergence of sequence and structure in proteins. EMBO J. (1986)

¹ Edited by J. M. Thornton
