ConSurf Overview


Introduction

    Mutual interactions between proteins and peptides, nucleic acids or ligands play a vital role in every biological process. Thus, a detailed understanding of the mechanism of these processes requires the identification of functionally important amino acids at the protein surface that are responsible for these interactions. The ConSurf server (1) is a useful and user-friendly tool that enables the identification of functionally important regions on the surface of a protein or domain, of known three-dimensional (3D) structure, based on the phylogenetic relations between its close sequence homologues.
 

Methodology

Given the 3D structure of a protein or a domain as input, the server extracts the sequence from the PDB (2) file and automatically searches for close homologues of the protein using PSI-BLAST (3). It then aligns the sequences using MUSCLE (5) by default; the user can choose CLUSTALW (4) instead. The server uses the multiple sequence alignment (MSA) to build a phylogenetic tree with the neighbor joining (NJ) algorithm (6), as implemented in the Rate4Site program (7), and calculates the conservation scores using either an empirical Bayesian (8) or the Maximum Likelihood (7) method. Alternatively, a user-provided MSA and phylogenetic tree can be processed. In this case, the PSI-BLAST (3) search, the alignment by MUSCLE (5) or CLUSTALW (4), and the reconstruction of the tree are skipped. Finally, the protein, with the conservation scores color-coded onto its surface, can be visualized on-line using FirstGlance in Jmol.

 
PDB 3D-structure
    Three-dimensional structures of biological macromolecules are available in the Protein Data Bank (PDB) (2), a copy of which is accessible from the ConSurf server. The PSI-BLAST (3) search for homologues uses the target chain sequence extracted from the SEQRES records of the PDB file as the input query. If the SEQRES records do not exist, ConSurf extracts the sequence from the ATOM records. Thus, a user-provided PDB file must include the ATOM records.
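
    The SEQRES-then-ATOM fallback can be illustrated with a minimal sketch. This is not the ConSurf code: the function name chain_sequence and the choice to write non-standard residues as "X" are assumptions made for the example, and only the fixed column layout of SEQRES and ATOM records comes from the PDB format.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequence(pdb_text, chain_id):
    """Return the one-letter sequence of a chain, preferring SEQRES over ATOM.

    Hypothetical helper for illustration only.
    """
    seqres, atom, seen = [], [], set()
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES") and line[11:12] == chain_id:
            # Residue names occupy columns 20-70 of a SEQRES record.
            seqres.extend(line[19:70].split())
        elif line.startswith("ATOM") and line[21:22] == chain_id:
            # Residue number plus insertion code (columns 23-27) identify a residue.
            key = line[22:27]
            if key not in seen:
                seen.add(key)
                atom.append(line[17:20].strip())
    names = seqres if seqres else atom
    # Assumption of this sketch: non-standard residues are written as "X".
    return "".join(THREE_TO_ONE.get(name, "X") for name in names)
```

    Note that a sequence taken from ATOM records contains only the residues that were resolved in the structure, so it can be shorter than the SEQRES sequence.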

    More information about standard PDB files is available in the PDB File Format Contents Guide.
     

    Searching for homologous sequences

    The server uses the PSI-BLAST (3) heuristic algorithm with default parameters to collect homologous sequences of a single polypeptide chain of known 3D structure. The search is carried out against the SWISS-PROT database (10) or the full UniProt knowledgebase (SWISS-PROT + TrEMBL), with a default of a single PSI-BLAST iteration and an E-value cutoff of 0.001. The E-value is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. The higher the E-value cutoff, the more hits are collected, but the greater the pairwise distance between them and the query sequence tends to be. Both the number of iterations and the E-value cutoff can be changed from the server Home Page.

    The homologue names extracted from the PSI-BLAST output file are the original SWISS-PROT key names found by PSI-BLAST (e.g. XXXX_YYY). If more than one match is found for the same sequence, the first one keeps its original SWISS-PROT name while the others receive a sequential number (e.g. XXXX_YYY_1).
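
    The naming rule can be sketched as follows. The function name unique_names is hypothetical, and continuing the numbering as _2, _3 for further matches is an assumption, since the text only gives the _1 example.

```python
def unique_names(names):
    """Give repeated homologue names a sequential suffix (_1, _2, ...).

    Hypothetical helper for illustration; not taken from the ConSurf code.
    """
    counts = {}
    result = []
    for name in names:
        seen = counts.get(name, 0)
        # The first occurrence keeps its name; later ones get _1, _2, ...
        result.append(name if seen == 0 else f"{name}_{seen}")
        counts[name] = seen + 1
    return result
```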

    The best performance is obtained when single domains are used. Using a polypeptide chain that contains several domains often results in finding proteins that are homologous to one of the domains but not to the others. Therefore it is advisable to analyze multi-domain proteins one domain at a time. 
     

    Generating the multiple sequence alignment 

    The server uses MUSCLE (5) with default parameters to align the homologues extracted from the PSI-BLAST output file. MUSCLE (5) was shown to achieve the highest scores on four alignment accuracy benchmarks (as of 2004) (5). Thus, in ConSurf version 3, MUSCLE (5) replaced CLUSTALW (4) as the default algorithm for computing multiple sequence alignments. The user can also choose to perform the alignment using CLUSTALW (4) with default parameters. The server also accepts a user-provided multiple sequence alignment in any of the 7 formats supported by CLUSTALW (4): NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF and RSF. For more information on these formats, see The Format Converter.


    Generating the phylogenetic tree 

    The server constructs a phylogenetic tree that is consistent with the available MSA, using the neighbor joining (NJ) algorithm (6), as implemented in the Rate4Site program (7). The server can also accept a user-provided phylogenetic tree (in Newick format). This option can be used only if a corresponding MSA is provided.
     

    Calculating the amino acid conservation scores 

    The conservation score at a site corresponds to the site's evolutionary rate. The rate of evolution is not constant among amino acid sites: some positions evolve slowly and are commonly referred to as "conserved", while others evolve rapidly and are referred to as "variable". The rate variations correspond to different levels of purifying selection acting on these sites. The purifying selection can be the result of geometrical constraints on the folding of the protein into its 3D structure, constraints at amino acid sites involved in enzymatic activity or in ligand binding or, alternatively, at amino acid sites that take part in protein-protein interactions.
    In ConSurf, the rate of evolution at each site is calculated using either the empirical Bayesian (8) or the Maximum Likelihood (7) paradigm. In both methods, the stochastic process underlying the sequence evolution and the phylogenetic tree are explicitly taken into account. The Bayesian method was shown to significantly improve the accuracy of conservation score estimates over the Maximum Likelihood method, in particular when a small number of sequences is used for the calculations (8). An additional advantage of the Bayesian method is that a confidence interval is assigned to each of the inferred evolutionary conservation scores (11). A detailed description of the Bayesian methodology is provided in (8) (PDF). A detailed description of the Maximum Likelihood methodology is provided in (7) (PDF). Rate4Site, a stand-alone implementation of this algorithm, is available at the URL: Rate4Site.


      Confidence interval of the inferred conservation scores

      Positions in the MSA that exhibit too little variation, because of too few sequences or too little diversity among the homologues, can render evolutionary analysis meaningless (12). When the Bayesian method is used to calculate evolutionary conservation, confidence intervals for the conservation score estimates are obtained (11). When the number of sequences is small, the confidence interval tends to be large, meaning low support for the inferred conservation score. As the number of sequences increases, the confidence intervals become smaller and the point estimates more reliable. In ConSurf, a confidence interval is assigned to each of the inferred evolutionary conservation scores. The confidence interval is defined by the lower and upper quartiles (the 25th and 75th percentiles of the inferred evolutionary rate distribution, respectively). This measure gives the 50% confidence interval and also indicates the dispersion of each estimated score. Amino acid positions whose confidence intervals are too large to be trustworthy are marked in the output files of the server.
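
      As an illustration of the quartile-based interval, the sketch below assumes that the inferred rate distribution at a site is available as a set of discrete rate values with posterior probabilities. That representation, the function name quartile_interval and the step-function quantile rule are assumptions of the example, not a description of the Rate4Site implementation.

```python
def quartile_interval(rates, posterior):
    """Return the (25th, 75th) percentiles of a discrete rate distribution.

    rates: candidate evolutionary rates at one site (hypothetical input).
    posterior: probability assigned to each rate; normalized here if needed.
    """
    pairs = sorted(zip(rates, posterior))
    total = sum(p for _, p in pairs)

    def quantile(q):
        # Smallest rate whose cumulative probability reaches q.
        cumulative = 0.0
        for rate, p in pairs:
            cumulative += p / total
            if cumulative >= q:
                return rate
        return pairs[-1][0]

    return quantile(0.25), quantile(0.75)
```

      A broad posterior gives a wide interval, while a posterior concentrated on one rate collapses the interval to a single value.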

      Model of substitution for proteins

      The inference of evolutionary conservation relies on a specified probabilistic model of amino-acid replacements (7). The server supports several models of substitution for nuclear DNA-encoded proteins as well as models for non-nuclear DNA-encoded proteins. The model of substitution can be chosen from the "Model of substitution for proteins" drop-down list, which is available in the Advanced Options on the server Home Page. The JTT (13), Dayhoff (14) and WAG (15) matrices are suited for nuclear DNA-encoded proteins. The WAG matrix has been inferred from a large database of sequences comprising a broad range of protein families and is thus suited for distantly related amino acid sequences (15). The mtREV (16) and cpREV (17) matrices are suitable for mitochondrial and chloroplast DNA-encoded proteins, respectively. Examples of using the mtREV matrix for a mitochondrial protein, or the cpREV matrix for a chloroplast protein, versus using the JTT matrix, are displayed in Figures 1 and 2, respectively. It is evident from the pictures that the differences between ConSurf calculations using different matrices tend to be small but not negligible.

      [Figure 1: panels A and B, with color-coding bar]

      Figure 1: The conservation pattern obtained using ConSurf for the dihydrolipoamide dehydrogenase of glycine decarboxylase from Pisum sativum (PDB code 1dxl (19)). The dihydrolipoamide dehydrogenase is presented as a space-filled model and colored according to the conservation scores. The color-coding bar shows the coloring scheme. A flavin-adenine dinucleotide, located at the active site, is colored green. The Bayesian method (8) was applied to calculate the conservation scores, using the JTT matrix (13) (A) or the mtREV matrix (16) (B) as the model of substitution for proteins.


      [Figure 2: panels A and B, with color-coding bar]

       Figure 2: The conservation pattern obtained using ConSurf for the ferredoxin reductase (PDB code 1bx0 (20)). The reductase is presented as a space-filled model and colored according to the conservation scores. The color-coding bar shows the coloring scheme. Sites for which the inferred conservation level was assigned with low confidence are colored light yellow. A flavin-adenine dinucleotide is colored green, and phosphate and sulphate ions are colored dark green. The Bayesian method (8) was applied to calculate the conservation scores, using the JTT matrix (13) (A) or the cpREV matrix (17) (B) as the model of substitution for proteins.

       Conservation Scores

      The conservation scores calculated by ConSurf appear in the SCORE column in the "Amino Acid Conservation Score" output file. The scores are normalized so that the average score for all residues is zero and the standard deviation is one. The conservation scores calculated by ConSurf are a relative measure of evolutionary conservation at each sequence site of the target chain. The lowest score represents the most conserved position in a protein. It does not necessarily indicate 100% conservation (i.e. no mutations at all), but rather that this position is the most conserved in this specific protein, as calculated using a specific MSA.
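
      The normalization described above is a standard z-score transformation. A minimal sketch, in which the function name normalize and the use of the population standard deviation are assumptions:

```python
from statistics import mean, pstdev


def normalize(rates):
    """Rescale per-site rates to mean zero and standard deviation one.

    Assumption of this sketch: the population standard deviation is used.
    """
    mu = mean(rates)
    sigma = pstdev(rates)
    if sigma == 0:
        raise ValueError("all sites have the same rate; cannot normalize")
    return [(r - mu) / sigma for r in rates]
```

      After this transformation the slowest-evolving site has the lowest (most negative) score, which is why the lowest score marks the most conserved position.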
       

      Coloring Scheme

      The continuous conservation scores are partitioned into a discrete scale of 9 bins for visualization, such that bin 9 contains the most conserved positions and bin 1 contains the most variable positions. The color grades (1-9) are assigned as follows: 
      The conservation scores below the average (negative values, which are indicative of slowly evolving, conserved sites) are divided into 4.5 equal intervals. The same 4.5 intervals are used for the scores above the average (positive values, which are indicative of rapidly evolving, variable sites). Thus, 9 equally sized categories of conservation are obtained. Because the conservation distribution is asymmetrical around the average, the range of grade 1 is extended to include the most variable positions. Colors are then assigned to the 9 grades for graphic visualization.
      The width of each color grade varies between polypeptide chains under this procedure. That is, the coloring results of a ConSurf run do not indicate the absolute magnitudes of evolutionary distances, but rather the relative degree of conservation of each amino acid position. The ConSurf scaling procedure does not guarantee that grades 1-8 will always be occupied, although grade 9 is always occupied by at least one residue.
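
      One way to read this binning procedure is sketched below. It is an interpretation, not the server code: the function name color_grades is hypothetical, and the interval width is assumed to be the distance from the lowest score to the average (zero) divided by 4.5, with everything beyond the last interval assigned to grade 1.

```python
import math


def color_grades(scores):
    """Map normalized conservation scores to color grades 9 (conserved) to 1.

    Interpretation for illustration: interval width = |lowest score| / 4.5,
    counted upward from the lowest score; grade 1 is open-ended.
    """
    lowest = min(scores)
    if lowest >= 0:
        raise ValueError("expected normalized scores with a negative minimum")
    width = -lowest / 4.5
    grades = []
    for s in scores:
        # Number of whole intervals between this score and the lowest score.
        steps = math.floor((s - lowest) / width)
        # Scores far above the average all fall into the extended grade 1.
        grades.append(max(1, 9 - steps))
    return grades
```

      Under this reading the lowest score always receives grade 9, grade 5 straddles the average, and the grade boundaries depend on the lowest score of the chain being analyzed, which matches the statement that the width of each grade varies between chains.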

      Dealing with unreliable positions: Conservation scores obtained for alignment positions that have fewer than 6 un-gapped amino acids are considered unreliable. When the Bayesian method (8) is used to calculate the conservation scores, confidence intervals around the estimated rates are computed. The high and low values of each interval are assigned color grades according to the 1-9 coloring scheme. If the interval at a specific position spans 4 or more color grades, the score is considered unreliable. Such positions are colored light yellow in the graphic visualization output.
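
      The two reliability criteria can be combined in a small check. The function name is_unreliable and its parameters are hypothetical, and "spans 4 or more color grades" is read here as a difference of at least 4 between the grades of the two interval ends, which is an assumption about how the span is counted.

```python
def is_unreliable(ungapped_count, grade_low, grade_high,
                  min_residues=6, max_span=4):
    """Flag a position whose conservation score should not be trusted.

    ungapped_count: number of un-gapped amino acids in the alignment column.
    grade_low, grade_high: color grades of the two confidence interval ends.
    Assumption: the span is the absolute difference between the two grades.
    """
    if ungapped_count < min_residues:
        return True
    return abs(grade_high - grade_low) >= max_span
```

      For example, a column with 30 un-gapped residues whose interval ends fall in grades 8 and 3 would be flagged, while one whose ends fall in grades 9 and 7 would not.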



Outputs

In each run, ConSurf produces an output page called "ConSurf Job Status Page". This page is automatically updated every 30 seconds, showing messages about the different stages of the server activity. When the calculation finishes, several links appear:

"View ConSurf Results" with FirstGlance in Jmol or with Protein Explorer
This is the main link of this page. It leads to the graphic visualization of the color-coded molecule through the FirstGlance in Jmol interface.

"Amino Acid Conservation Scores, Confidence Intervals and Conservation Colors"
This link includes the conservation scores obtained for each amino acid position of the target sequence. The output file also includes the color grades for each amino acid site, and other useful data regarding the MSA of the current run. When the Bayesian method is used to calculate the conservation scores, positions whose scores were assigned with low confidence are marked with asterisks.

"PSI-BLAST output"
This links to the output file generated by PSI-BLAST, which includes the sequences, their pairwise alignment with the query sequence, etc.

"Unique Sequences Used"
This links to the file including the homologous sequences and their SWISS-PROT code names extracted from the PSI-BLAST output. 

"Multiple Sequence Alignment (in Clustal format)"
This links to the alignment produced by MUSCLE or CLUSTALW, or to a file containing the user-provided MSA, in CLUSTAL format.

"View Phylogenetic Tree"
A graphical representation of the phylogenetic tree generated by ConSurf, using the Hickory Java applet; the applet provides an interactive tool for viewing and manipulating the phylogenetic tree.

"Phylogenetic Tree (in Newick format)"
This links to the textual data of the phylogenetic tree.

"RasMol coloring script source" link
This link includes the Chime script commands for coloring the protein according to the conservation grades. This file can be downloaded and used locally with RasMol (18) or FirstGlance in Jmol, producing the same color-coded CPK scheme generated by the server. Similar script commands that color the protein according to the conservation scores without the low confidence cutoff are also available under the link "RasMol coloring script source without low confidence cutoff".
 

Graphic Visualization

The target protein chain is represented as a space-filling model with the conservation grades color-coded onto each amino acid van-der-Waals surface. All other chains in the PDB file are displayed in backbone representation and all ligands are presented in ball-and-stick representation.

More on this topic can be read here.

 

Comparison with other servers
 

In this section we compare the ConSurf server results for the anti-apoptotic Bcl-XL protein with those of two other available web servers: Evolutionary Trace and MSA3D.

 
MSA3D server
Protein Explorer offers the MSA3D service, which accepts a user-provided multiple protein sequence alignment, and uses it to color the 3D protein structure. 

MSA3D is based on a consensus approach, which divides the positions of the protein into 3 basic categories: 'identical', 'similar', and 'different'. The user can determine the consensus threshold. A fourth category, 'No info', occurs when gaps appear in the alignment for so many sequences that the consensus level cannot be achieved for the remaining sequences. 
 

Evolutionary Trace (ET) server
The ET server automates the algorithmic ideas of the Evolutionary Trace (20,21) analysis.

In this method, a phylogenetic tree is generated and split into evenly distributed partitions. For each partition the sequences connected by a common node are clustered together. Next, a consensus sequence is generated for each cluster. Then, the consensus sequences for all clusters are compared. 
A position is defined as 'conserved' if all consensus sequences have an invariant residue at that position. A position is 'class-specific' if it is invariant for each cluster, and varies between them. A position is 'neutral' if it is variable in at least one cluster.
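
The three-way classification above can be sketched for a single alignment position. This is an illustration of the rule as stated, not the ET server code: the function name classify_position and its input layout are hypothetical, and the residues observed within each cluster are used directly in place of an explicit consensus sequence. Gaps are not treated specially.

```python
def classify_position(clusters):
    """Classify one alignment position from the residues seen in each cluster.

    clusters: one list of residues per cluster, e.g. [["A", "A"], ["G"]].
    Hypothetical helper; gaps are handled like any other character.
    """
    cluster_residues = [set(residues) for residues in clusters]
    # Variable within at least one cluster: no invariant residue there.
    if any(len(residues) > 1 for residues in cluster_residues):
        return "neutral"
    distinct = set().union(*cluster_residues)
    # Invariant in every cluster: the same residue everywhere, or not.
    return "conserved" if len(distinct) == 1 else "class-specific"
```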

 

FIGURE 3: PDB structure (PDB ID 1bxl) of a complex of Bcl-XL (chain A) and a Bak peptide fragment (chain B, residues 572-587)

In this example we used a file of 53 homologues obtained from the ProtoNet database (formerly "ProtoMap").


 
Panels: A - ConSurf server; B - MSA3D server; C - ET server

Color Scale A (ConSurf server): grades 1 to 9, from variable (1) to conserved (9)
Color Scale B (MSA3D server): Identical, Similar, Different, Mismatch, No info
Color Scale C (ET server): Buried, class-specific; Buried, conserved; Exposed, class-specific; Exposed, conserved; Neutral


 
    A - ConSurf server
    The ConSurf server was able to identify both the Bak binding region and the BH4 binding region using the 53 ProtoMap input sequences.

    B - MSA3D server
    The MSA3D server (9) was run using the same MSA as input, with consensus percentages of 10%, 20%, 50% and 100%. The best results were obtained for 50%. At this consensus level the Bak binding groove was detected as conserved, but the BH4 region was not. This is presumably because the amino acid sites that correspond to the BH4 region include many gaps, which weaken the conservation signal and make it harder to detect.

    C - ET server
    The same MSA was used as an input for the Evolutionary Trace server (20,21), with the default number of partitions (11). This run yielded a 'null' result; each and every site appeared to be 'neutral' (white color). This is due to the great divergence of the aligned sequences. The MSA includes many gaps, which make it difficult for the ET server to find a meaningful consensus for each amino acid site. A second analysis was carried out that used only a subset of the sequences, for which the MSA contains fewer gaps. Using this MSA, the ET server successfully identified the Bak binding groove. However, the BH4 region could not be recognized. Similar results were obtained for 2, 5 and 20 partitions (data not shown).  
    In conclusion, the ConSurf server was the only one that identified both the Bak peptide binding groove and the BH4 homology region.
     
     

Note: A detailed comparison between the Maximum Likelihood and Evolutionary Trace methods is provided in (7) (PDF).
 
 

References


