The ConSurf Server
Server for the Identification of Functional Regions in Proteins
Overview

Table of contents

1. Introduction

2. Methodology

3. Advanced Options and Details

a. PDB 3D structure

b. Searching for homologous sequences

c. Generating the phylogenetic tree

d. Calculating the amino acid conservation scores

e. Analyzing nucleic acid sequences

f. Confidence interval of the inferred conservation scores

g. Model of substitution for proteins

h. Conservation Scores

i. Coloring Scheme

j. Dealing with unreliable positions

4. Output

5. Example

6. References

1.   Introduction

The ConSurf server [1] is a bioinformatics tool for estimating the evolutionary conservation of amino/nucleic acid positions in a protein/DNA/RNA molecule based on the phylogenetic relations between homologous sequences. The degree to which an amino (or nucleic) acid position is evolutionarily conserved is strongly dependent on its structural and functional importance; rapidly evolving positions are variable while slowly evolving positions are conserved. Thus, conservation analysis of positions among members of the same family can often reveal the importance of each position for the structure or function of the protein (or nucleic acid). In ConSurf, the evolutionary rate is estimated based on the evolutionary relatedness between the protein (DNA/RNA) and its homologues, taking into account the similarity between amino (nucleic) acids as reflected in the substitution matrix [2,3]. One of the advantages of ConSurf in comparison to other methods is the accurate computation of the evolutionary rate by using either an empirical Bayesian method or a maximum likelihood (ML) method [3].

2.   Methodology

Given the amino or nucleic acid sequence (which can be extracted from the 3D structure), ConSurf carries out a search for close homologous sequences using BLAST (or PSI-BLAST) [4,5]. The user may select one of several databases and specify criteria for defining homologues. The user may also select the desired sequences from the BLAST results. The sequences are clustered and highly similar sequences are removed using CD-HIT [6]. A multiple sequence alignment (MSA) of the homologous sequences is constructed using MAFFT, PRANK, T-COFFEE, MUSCLE (default) or CLUSTALW. The MSA is then used to build a phylogenetic tree using the neighbor-joining algorithm as implemented in the Rate4Site program [7]. Position-specific conservation scores are computed using the empirical Bayesian or ML algorithms [2,3]. The continuous conservation scores are divided into a discrete scale of nine grades for visualization, from the most variable positions (grade 1) colored turquoise, through intermediately conserved positions (grade 5) colored white, to the most conserved positions (grade 9) colored maroon. The conservation scores are projected onto the protein/nucleotide sequence and onto the MSA. A flowchart of ConSurf is shown in Figure 1 and detailed below.

Figure 1: A flowchart of ConSurf protocol.

3.   Advanced Options and Details

a.      PDB 3D structure
Three-dimensional structures of biological macromolecules are available in the Protein Data Bank (PDB) [8], a copy of which is accessible from the ConSurf server. The PSI-BLAST [5] search for homologues is done using the target chain sequence extracted from the SEQRES record of the PDB file as the input query. If the SEQRES record does not exist, ConSurf extracts the sequence from the ATOM record. Thus, a user-provided PDB file must include the ATOM record.

 

b.     Searching for homologous sequences

The server uses the PSI-BLAST [5] heuristic algorithm with default parameters to collect homologous sequences of a single polypeptide chain of known 3D-structure. The search can be carried out using the following databases:

1.     SWISS-PROT - a curated protein sequence database which strives to provide a high level of annotation (default).

2.     UniProt - the universal protein resource, a central repository of protein data created by combining Swiss-Prot, TrEMBL and PIR.

3.     Clean UniProt - a modified version of the UniProt database aimed to screen the more reliable sequences.

4.      UniRef90 - clusters sequences and sub-fragments of 11 or more residues that share at least 90% sequence identity (from any organism) into a single UniRef entry, displaying the sequence of a representative.

5.      NR database - 'non-redundant' database (i.e. with duplicated sequences removed). NR contains non-redundant sequences from GenBank together with sequences from other databanks (Refseq, PDB, SwissProt, PIR and PRF).

The user can also set the maximum number of homologs to collect and the number of PSI-BLAST iterations. The default run is a single PSI-BLAST iteration with a maximum of 150 homologs and an E-value cutoff of 0.001. (The E-value describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. The higher the E-value cutoff, the more hits will be collected, but the pairwise distance between them and the query sequence will increase.) The user can also control the maximal percentage identity between sequences, which sets the level of redundancy at which sequences are removed. The sequences found are clustered by their level of identity using CD-HIT [6] with the cutoff specified by the user (default: 95% identity). The minimal percentage identity for homologs is set by default to 35%, which is the upper bound of the 'twilight zone' for protein structures [9]. See Figure 2 for the MSA parameters screen on the web server.

 

Figure 2: Screenshot of the parameters window used to build the Multiple Sequence Alignment (MSA) on the ConSurf web server.

Note: The best performance is obtained when single domains are used. Using a polypeptide chain that contains several domains often results in finding proteins that are homologous to one of the domains but not to the others. Therefore it is advisable to analyze multi-domain proteins one domain at a time.
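The default homolog-selection criteria described above can be illustrated with a minimal Python sketch. It is not the server's code: the `Hit` record, the `select_homologs` function and the example hits are hypothetical, identity is assumed to be measured against the query, and the CD-HIT redundancy-removal step is not modeled.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    name: str
    evalue: float
    identity: float  # percent identity to the query (assumption), 0-100

def select_homologs(hits, max_evalue=0.001, min_identity=35.0, max_homologs=150):
    """Keep hits that pass the E-value and minimal-identity cutoffs.

    Defaults mirror the values quoted in the text; hits are ranked by
    E-value so that the cap on the number of homologs keeps the best ones.
    """
    kept = [h for h in hits
            if h.evalue <= max_evalue and h.identity >= min_identity]
    kept.sort(key=lambda h: h.evalue)
    return kept[:max_homologs]

# Made-up hits for illustration only.
hits = [
    Hit("seqA", 1e-50, 88.0),
    Hit("seqB", 1e-5, 30.0),   # below the 35% minimal identity
    Hit("seqC", 0.5, 60.0),    # E-value above the 0.001 cutoff
    Hit("seqE", 1e-30, 47.5),
]
selected = select_homologs(hits)
print([h.name for h in selected])
```

In this toy input only seqA and seqE pass both cutoffs.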

c.      Generating the phylogenetic tree

The server constructs a phylogenetic tree that is consistent with the available MSA, using the neighbor joining (NJ) algorithm [10], as implemented in the Rate4Site program [7]. The server can also accept a user-provided phylogenetic tree (in Newick format). This option can be used only if a corresponding MSA is provided.

 

d.     Calculating the amino acid conservation scores 

The conservation score at a site corresponds to the site's evolutionary rate. The rate of evolution is not constant among amino (nucleic) acid sites: some positions evolve slowly and are commonly referred to as "conserved", while others evolve rapidly and are referred to as "variable". The rate variations correspond to different levels of purifying selection acting on these sites. The purifying selection can be the result of geometrical constraints on the folding of the protein into its 3D structure, constraints at amino acid sites involved in enzymatic activity or in ligand binding or, alternatively, at amino acid sites that take part in protein-protein interactions. 

 

In ConSurf, the rate of evolution at each site is calculated using either the empirical Bayesian [11] or the Maximum Likelihood [12] paradigm. In both methods, the stochastic process underlying sequence evolution and the phylogenetic tree are explicitly taken into account. The Bayesian method was shown to significantly improve the accuracy of conservation score estimates over the Maximum Likelihood method, in particular when a small number of sequences is used for the calculations [11]. An additional advantage of the Bayesian method is that a confidence interval is assigned to each of the inferred evolutionary conservation scores.
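The difference between the two paradigms can be sketched in a few lines of Python. This is a toy illustration, not Rate4Site's implementation: it assumes the prior over rates is approximated by a handful of discrete rate categories, and the per-category site likelihoods (which in practice are computed on the phylogenetic tree) are made-up numbers. The function names are hypothetical.

```python
def posterior_rate(rates, prior, likelihoods):
    """Empirical Bayesian estimate: posterior mean of the rate at one site.

    rates       -- candidate rate values (discrete categories, assumed)
    prior       -- prior probability of each category
    likelihoods -- P(site data | rate) for each category
    """
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    posterior = [j / total for j in joint]
    mean = sum(r * w for r, w in zip(rates, posterior))
    return mean, posterior

def ml_rate(rates, likelihoods):
    """Maximum likelihood estimate: the rate with the highest likelihood."""
    return max(zip(likelihoods, rates))[1]

# Hypothetical values for a single alignment column.
rates = [0.25, 1.0, 2.75]
prior = [1 / 3, 1 / 3, 1 / 3]
likelihoods = [0.6, 0.3, 0.1]

eb_rate, posterior = posterior_rate(rates, prior, likelihoods)
print(round(eb_rate, 3), ml_rate(rates, likelihoods))
```

The Bayesian estimate averages over all categories weighted by their posterior probability, whereas the ML estimate picks a single best rate; the posterior distribution is also what makes a confidence interval available.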


e.      Analyzing nucleic acid sequences

Despite increasing interest in the non-coding fraction of transcriptomes, the number, the level of conservation, and the functions, if any, of many non-protein-coding transcripts remain to be discovered. However, it has already been shown that many of the non-coding sequences are connected to regulatory processes. The new version of ConSurf offers estimations of the evolutionary rate for each position of nucleic acid sequences in the same manner used for amino acid residues. For that purpose, four evolutionary models were implemented in the Rate4Site program: (i) the Jukes and Cantor 69 model (JC69), which assumes equal base frequencies and equal substitution rates [13]; (ii) the Tamura 92 model, which uses only one parameter that captures variation in G-C content (18); (iii) the HKY85 model, which distinguishes between transitions and transversions and allows unequal base frequencies [14]; and (iv) the General Time Reversible (GTR) model, which is the most general time-reversible model. The GTR parameters consist of an equilibrium base frequency vector, giving the frequency at which each base occurs at each site, and the rate matrix [15]. When enough data (i.e. sequences) are available, the GTR model is superior to the simpler Tamura 92 model. However, the Tamura 92 model is recommended in cases in which the data are not sufficient for reliable estimation of the model parameters, and it is therefore the default option for analyzing nucleic acid sequences in ConSurf.
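To make the relationship between these models concrete, the following Python sketch builds the standard (unnormalized) HKY85 instantaneous rate matrix and shows that JC69 is the special case with equal base frequencies and no transition/transversion bias. The function name and the example frequencies are illustrative assumptions; this is not code taken from Rate4Site.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(x, y):
    """A<->G and C<->T are transitions; all other changes are transversions."""
    return (x in PURINES) == (y in PURINES)

def hky85_rate_matrix(freqs, kappa):
    """Return the 4x4 HKY85 rate matrix Q in A, C, G, T order.

    Off-diagonal entry q[i][j] is freq(j) * kappa for transitions and
    freq(j) for transversions; each diagonal makes its row sum to zero.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                q[i][j] = freqs[y] * (kappa if is_transition(x, y) else 1.0)
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the sum is taken
    return q

# JC69: equal base frequencies, equal substitution rates (kappa = 1).
jc69 = hky85_rate_matrix({b: 0.25 for b in BASES}, kappa=1.0)

# HKY85 with hypothetical unequal frequencies and a transition bias.
hky = hky85_rate_matrix({"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}, kappa=2.0)
```

GTR generalizes this further by giving each of the six base pairs its own exchangeability parameter, which is why it needs more data to estimate reliably.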

f.        Confidence interval of the inferred conservation scores

Positions in the MSA that exhibit too little variation, caused by too few sequences or too little diversity among sequence homologs, can render evolutionary analysis meaningless [16]. When the Bayesian method is used to calculate evolutionary conservation, confidence intervals for the conservation score estimates are obtained [17]. When the number of sequences is small, the confidence interval tends to be large, meaning a low level of support for the inferred conservation score. When the number of sequences increases, confidence intervals become smaller, and the point score estimates are more assured. In ConSurf, a confidence interval is assigned to each of the inferred evolutionary conservation scores. The confidence interval is defined by the lower and upper quartiles (the 25th and 75th percentiles of the inferred evolutionary rate distribution, respectively). This measure gives the 50% confidence interval and also indicates the dispersion of each of the estimated scores. Amino acid positions that are assigned confidence intervals that are too large to be trustworthy are marked in the output files of the server.
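As an illustration of the quartile-based interval, the sketch below reads the 25th and 75th percentiles off a discrete posterior distribution over rates. The discrete representation, the step-function definition of a quantile, and the numbers are assumptions made for the example; the text does not specify how Rate4Site computes the percentiles internally.

```python
def quantile(rates, posterior, q):
    """Smallest rate whose cumulative posterior probability reaches q."""
    cumulative = 0.0
    for rate, weight in sorted(zip(rates, posterior)):
        cumulative += weight
        if cumulative >= q - 1e-12:  # small tolerance for rounding error
            return rate
    return max(rates)

def confidence_interval(rates, posterior):
    """Lower and upper quartiles, i.e. the 50% confidence interval."""
    return quantile(rates, posterior, 0.25), quantile(rates, posterior, 0.75)

# Hypothetical posterior for one site.
rates = [0.1, 0.5, 1.0, 2.0]
posterior = [0.1, 0.3, 0.4, 0.2]
low, high = confidence_interval(rates, posterior)
print(low, high)
```

A posterior concentrated on one or two neighbouring rates yields a narrow interval, while a flat posterior (typical of alignments with few sequences) yields a wide one.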

 

g.     Model of substitution for proteins

The inference of evolutionary conservation relies on a specified probabilistic model of amino acid replacements [7]. The server supports several substitution models for nuclear DNA-encoded proteins as well as models for non-nuclear DNA-encoded proteins. The model can be chosen from the "Evolutionary Substitution Model" drop-down list (see Figure 3). The JTT [17], Dayhoff [18] and WAG [19] matrices are suited for nuclear DNA-encoded proteins. The WAG matrix was inferred from a large database of sequences comprising a broad range of protein families and is thus suited for distantly related amino acid sequences [19]. The mtREV [20] and cpREV [21] matrices are suitable for mitochondrial and chloroplast DNA-encoded proteins, respectively. A more recent substitution matrix, LG [22], incorporates the variability of evolutionary rates across sites in the matrix estimation and was shown to outperform other substitution matrices for proteins. The LG matrix was added to Rate4Site and is offered in the new version of ConSurf in addition to the previous substitution models: JTT [17], Dayhoff [18], WAG [19], mtREV [20] and cpREV [21].


Figure 3: Screenshot of the Evolutionary Substitution Model parameters on the ConSurf web server.

h.     Conservation Scores

The conservation scores calculated by ConSurf appear in the SCORE column in the "Amino Acid Conservation Score" output file. The scores are normalized so that the average score over all residues is zero and the standard deviation is one. The conservation scores are a relative measure of evolutionary conservation at each sequence site of the target chain. The lowest score represents the most conserved position in a protein (DNA/RNA). It does not necessarily indicate 100% conservation (i.e. no mutations at all), but rather indicates that this position is the most conserved in this specific protein (DNA/RNA), as calculated using a specific MSA.
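The normalization described here is a standard z-score transformation, sketched below in Python. The input rates are invented, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not say which is used.

```python
from statistics import mean, pstdev

def normalize(rates):
    """Shift and scale rates to zero mean and unit standard deviation."""
    mu = mean(rates)
    sigma = pstdev(rates)  # population standard deviation (assumption)
    return [(r - mu) / sigma for r in rates]

# Hypothetical raw evolutionary rates for four positions.
raw_rates = [0.2, 0.5, 1.0, 2.3]
scores = normalize(raw_rates)

# The slowest-evolving position receives the lowest (most negative) score.
most_conserved = scores.index(min(scores))
```

Because the scores are relative to the mean and spread of this particular run, the same residue can receive a different score when a different MSA is used.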

i.        Coloring Scheme

The continuous conservation scores are partitioned into a discrete scale of 9 bins for visualization, such that bin 9 contains the most conserved positions and bin 1 contains the most variable positions. The color grades (1-9) are assigned as follows: the conservation scores below the average (negative values, which are indicative of slowly evolving, conserved sites) are divided into 4.5 equal intervals. The same interval width is used to divide the scores above the average (positive values, which are indicative of rapidly evolving, variable sites) into 4.5 intervals. Thus, 9 equally sized categories of conservation are obtained. Because the conservation distribution is asymmetrical around the average, the range of grade 1 is extended to include the most variable scores. Colors are then assigned to the 9 grades for graphic visualization. With this procedure, the width of each color grade varies between polypeptide chains. That is, the coloring results of a ConSurf run do not indicate the absolute magnitudes of evolutionary distances, but rather the relative degree of conservation of each amino acid position. The ConSurf scaling procedure does not guarantee that grades 1-8 will always be occupied, although grade 9 is always occupied by at least one residue.
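One way to read this binning procedure is sketched below. It is an interpretation, not the server's code: the interval width is taken as the magnitude of the lowest (most conserved) score divided by 4.5, grade 5 straddles the average, and everything beyond the last boundary on the variable side falls into the extended grade 1. The function name `assign_grades` is hypothetical.

```python
import math

def assign_grades(scores):
    """Map normalized conservation scores to color grades 9 (conserved) to 1."""
    lowest = min(scores)
    if lowest >= 0:
        raise ValueError("expected normalized scores with a negative minimum")
    width = -lowest / 4.5  # 4.5 equal intervals below the average
    grades = []
    for s in scores:
        steps = math.floor((s - lowest) / width)  # intervals above the minimum
        grades.append(max(1, 9 - steps))          # grade 1 absorbs the long tail
    return grades

# Made-up normalized scores.
print(assign_grades([-1.8, -1.1, -0.1, 0.3, 0.9, 2.5]))  # [9, 8, 5, 4, 3, 1]
```

Under this reading the lowest score always lands in grade 9, which is consistent with the statement that grade 9 is never empty, while intermediate grades may remain unoccupied.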

j.        Dealing with unreliable positions:

Conservation scores obtained for positions in the alignment that have fewer than 6 un-gapped amino acids are considered unreliable. When the Bayesian method [11] is used to calculate the conservation scores, confidence intervals around the estimated rates are computed. The high and low values of each interval are assigned color grades according to the 1-9 coloring scheme. If the interval at a specific position spans 4 or more color grades, the score is considered unreliable. Such positions are colored light yellow in the graphic visualization output.
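The two reliability criteria can be combined into a single check, as in the following sketch. The function and parameter names are hypothetical, gaps are assumed to be written as "-", and "spans 4 or more color grades" is interpreted here as a difference of at least 4 between the grades of the interval's two ends; the text leaves the exact counting convention open.

```python
def is_unreliable(column, grade_low, grade_high, min_ungapped=6, min_grade_span=4):
    """Flag an alignment position whose conservation score should not be trusted.

    column     -- the characters of one MSA column, gaps written as '-'
    grade_low  -- color grade assigned to the lower end of the interval
    grade_high -- color grade assigned to the upper end of the interval
    """
    ungapped = sum(1 for c in column if c != "-")
    if ungapped < min_ungapped:
        return True  # too few residues observed at this position
    # Assumed convention: span = difference between the two end grades.
    return abs(grade_high - grade_low) >= min_grade_span

print(is_unreliable("AAAA--", 8, 9))   # only 4 un-gapped residues
print(is_unreliable("ACDEFGH", 3, 7))  # interval covers grades 3 through 7
print(is_unreliable("ACDEFGH", 5, 7))  # enough data and a narrow interval
```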

4.   Output

If a 3D structure of the protein (DNA/RNA) is provided:

1.      The nine-color conservation scores are projected onto the 3D structure of the query protein and the colored protein structure is shown by FirstGlance in Jmol (http://firstglance.jmol.org).

2.      Scripts for visualizing the protein colored with ConSurf scores are generated for PyMol (http://www.pymol.org; [23]), Chimera [24], Jmol (http://www.jmol.org/; [25]) and RasMol [26].

For all cases, ConSurf creates the following outputs:

1.      The sequence and MSA colored by ConSurf conservation scores.

2.      A text file that summarizes, for each position, the normalized score calculated, the assigned color, the reliability estimation (for the Bayesian method) and the amino acids/nucleotides observed in the respective MSA column.

3.      The sequences selected for the MSA and the MSA constructed (unless those files were uploaded by the user).

4.      A file with the frequency of each amino acid/nucleotide observed in each column of the MSA.

5.      The evolutionary tree, which was calculated by the server or uploaded by the user, is shown using an interactive Java applet written for that purpose.

For proteins in which the 3D structure was not provided by the user, an up-to-date version of the Protein Data Bank [8] is searched for relevant homologues. If a structure of at least one homologous protein is available, the user may map the conservation scores onto that structure. This option should ease the procedure for non-expert users, who may be unfamiliar with the 3D structure.
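For readers who want to reproduce the kind of per-column frequency table listed among the outputs above from their own alignment, a minimal Python sketch follows. The toy alignment is invented, and excluding gap characters from the totals is an assumption; the format of ConSurf's actual frequency file is not described here.

```python
from collections import Counter

def column_frequencies(msa):
    """Return, for every MSA column, the frequency of each observed residue.

    msa -- list of aligned sequences of equal length, gaps written as '-'
    Gaps are ignored when computing frequencies (assumption).
    """
    result = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values())
        result.append({res: n / total for res, n in counts.items()} if total else {})
    return result

# Toy alignment of four sequences and three columns.
msa = ["MKC", "MRC", "M-C", "LKC"]
freqs = column_frequencies(msa)
print(freqs[0])
```

In this example the first column is 75% M and 25% L, and the third column is fully conserved.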

5.   Example

As an example, we provide the main output of a ConSurf run for the N-terminal region of the GAL4 transcription factor in yeast (PDB ID: 3COQ, chains A and B) in complex with its DNA recognition site (Figure 4). The analysis revealed, as expected, that the functional regions of this protein are highly conserved. For example, all the cysteines that form the Zn(2)-C6 DNA binding domain (CYS11, CYS14, CYS21, CYS28, CYS31, CYS38; [27]) were assigned the highest conservation scores. Likewise, PRO26, which is known to be central for DNA binding [28], is also highly conserved according to our analysis. In addition, other amino acid residues that are in contact with the DNA (i.e. GLN9, LYS17, LYS18, LYS20, ARG15, LYS23; [29]) are relatively conserved. ConSurf was also applied to nucleic acid sequences from yeast, which are the known binding sites of GAL4 and their adjacent neighborhood (Figure 4). As anticipated, the analysis revealed that the consensus pattern CGG-N11-CCG, typical of the GAL4 binding site, is highly conserved. An extended full ConSurf analysis of this example is available in the 'GALLERY' section on the ConSurf web site.

 
Figure 4. A ConSurf analysis of the GAL4 transcription factor and its DNA binding site. The 3D structure of the N-terminal region of the GAL4 transcription factor in yeast bound to the DNA is presented using a space-filled model. The amino acids and the nucleotides are colored by their conservation grades using the color-coding bar, with turquoise-through-maroon indicating variable-through-conserved. Positions for which the inferred conservation level was assigned with low confidence are marked in light yellow. The figure reveals that the functionally important regions on both the DNA and the protein are highly conserved. The run was carried out using PDB code 3COQ and the figure was generated using the PyMol [23] script output by ConSurf.

6.   References

1.       Glaser, F., Pupko, T., Paz, I., Bell, R.E., Bechor-Shental, D., Martz, E. and Ben-Tal, N. (2003) Bioinformatics, 19, 163-164.

2.       Mayrose, I., Graur, D., Ben-Tal, N. and Pupko, T. (2004) Mol. Biol. Evol., 21, 1781-1791.

3.       Pupko, T., Bell, R.E., Mayrose, I., Glaser, F. and Ben-Tal, N. (2002) Bioinformatics, 18, S71-S77.

4.       Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schaffer, A.A. and Yu, Y.K. (2005) FEBS J., 272, 5101-5109.

5.       Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Nucleic Acids Res., 25, 3389-3402.

6.       Li, W. and Godzik, A. (2006) Bioinformatics, 22, 1658-1659.

7.       Pupko, T., Bell, R.E., Mayrose, I., Glaser, F. and Ben-Tal, N. (2002) Bioinformatics, 18, S71-S77.

8.       Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) Nucleic Acids Res., 28, 235-242.

9.       Rost, B. (1999) Protein Eng., 12, 85-94.

10.    Saitou, N. and Nei, M. (1987) Mol. Biol. Evol., 4, 406-425.

11.   Mayrose, I., Graur, D., Ben-Tal, N. and Pupko, T. (2004) Mol. Biol. Evol., 21, 1781-1791.

12.   Martz, E. (2002) Trends Biochem. Sci., 27, 107-109.

13.   Tamura, K. (1992) Mol. Biol. Evol., 9, 678-687.

14.   Hasegawa, M., Kishino, H. and Yano, T. (1985) J. Mol. Evol., 22, 160-174.

15.   Tavare, S. (1986) Lect. Math. Life Sci., 17, 57-86.

16.   Thornton, J.M., Todd, A.E., Milburn, D., Borkakoti, N. and Orengo, C.A. (2000) Nat. Struct. Biol., 7, 991-994.

17.   Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992) Comput. Appl. Biosci., 8, 275-282.

18.   Dayhoff, M.O., Hunt, L.T., Barker, W.C., Schwartz, R.M., Orcutt, B.C. and Young, C.L. (eds.) (1978) Atlas of Protein Sequence and Structure.

19.   Whelan, S. and Goldman, N. (2001) Mol. Biol. Evol., 18, 691-699.

20.   Adachi, J. and Hasegawa, M. (1996) J. Mol. Evol., 42, 459-468.

21.   Adachi, J., Waddell, P.J., Martin, W. and Hasegawa, M. (2000) J. Mol. Evol., 50, 348-358.

22.   Le, S.Q. and Gascuel, O. (2008) Mol. Biol. Evol., 25, 1307-1320.

23.   DeLano, W.L. (2008) DeLano Scientific LLC, Palo Alto, CA, USA.

24.   Pettersen, E.F., Goddard, T.D., Huang, C.C., Couch, G.S., Greenblatt, D.M., Meng, E.C. and Ferrin, T.E. (2004) J. Comput. Chem., 25, 1605-1612.

25.   Herráez, A. (2006) Biochem. Mol. Biol. Educ., 34, 255-261.

26.   Sayle, R.A. and Milner-White, E.J. (1995) Trends Biochem. Sci., 20, 374.

27.   Pan, T. and Coleman, J.E. (1990) Proc. Natl Acad. Sci. USA, 87, 2077-2081.

28.   Johnston, M. (1987) Nature, 328, 353-355.

29.   Marmorstein, R., Carey, M., Ptashne, M. and Harrison, S.C. (1992) Nature, 356, 408-414.