Overview
1. Introduction
2. Methodology
3. Advanced Options and Details
a. PDB 3D structure
b. Searching for homologous sequences
c. Generating the phylogenetic tree
d. Calculating the amino acid conservation scores
e. Analyzing nucleic acid sequences
f. Confidence interval of the inferred conservation scores
g. Model of substitution for proteins
h. Conservation Scores
i. Coloring Scheme
j. Dealing with unreliable positions
4. Output
5. Example
6. References
1. Introduction
The ConSurf server [1] is a bioinformatics tool for
estimating the evolutionary conservation of amino/nucleic acid positions in a
protein/DNA/RNA molecule based on the phylogenetic relations between homologous
sequences. The degree to which an amino (or nucleic) acid position is
evolutionarily conserved depends strongly on its structural and functional
importance: rapidly evolving positions are variable, while slowly evolving
positions are conserved. Thus, conservation analysis of positions among members
of the same family can often reveal the importance of each position for the
structure or function of the protein (or nucleic acid). In ConSurf, the
evolutionary rate is estimated based on the evolutionary relatedness between the
protein (DNA/RNA) and its homologues, taking into account the similarity between
amino (nucleic) acids as reflected in the substitution matrix [2,3]. One
advantage of ConSurf over other methods is the accurate computation of the
evolutionary rate using either an empirical Bayesian method or a maximum
likelihood (ML) method [3].
2. Methodology
Given the amino or nucleic acid sequence (which can be extracted from the 3D
structure), ConSurf searches for close homologous sequences using BLAST (or
PSI-BLAST) [4,5]. The user may select one of several databases and specify
criteria for defining homologues. The user may also select the desired sequences
from the BLAST results. The sequences are clustered and highly similar sequences
are removed using CD-HIT [6]. A multiple sequence alignment (MSA) of the
homologous sequences is constructed using MAFFT, PRANK, T-COFFEE, MUSCLE
(default) or CLUSTALW. The MSA is then used to build a phylogenetic tree using
the neighbor-joining algorithm as implemented in the Rate4Site program [7].
Position-specific conservation scores are computed using the empirical Bayesian
or ML algorithms [2,3]. The continuous conservation scores are divided into a
discrete scale of nine grades for visualization, from the most variable
positions (grade 1), colored turquoise, through intermediately conserved
positions (grade 5), colored white, to the most conserved positions (grade 9),
colored maroon. The conservation scores are projected onto the
protein/nucleotide sequence and onto the MSA. A flowchart of ConSurf is shown in
Figure 1, and the steps are detailed below.

Figure 1: A flowchart of the ConSurf protocol.
3. Advanced Options and Details
a. PDB 3D structure
Three-dimensional structures of biological macromolecules are available in the
Protein Data Bank (PDB) [8], a copy of which is accessible from the ConSurf
server. The PSI-BLAST [5] search for homologues uses the target chain sequence
extracted from the SEQRES records of the PDB file as the input query. If the
SEQRES records do not exist, ConSurf extracts the sequence from the ATOM
records. Thus, a user-provided PDB file must include the ATOM records.
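The SEQRES-first, ATOM-fallback rule described above can be illustrated with a minimal sketch. This is not ConSurf's own parser: the function name chain_sequence and the choice to write unknown or modified residues as "X" are assumptions made for illustration. The column positions follow the fixed-width PDB format.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequence(pdb_lines, chain_id):
    """Return the one-letter sequence of a chain, preferring SEQRES over ATOM."""
    seqres, atom, seen = [], [], set()
    for line in pdb_lines:
        if line.startswith("SEQRES") and line[11:12] == chain_id:
            # Residue names occupy columns 20-70 of a SEQRES record.
            seqres.extend(line[19:70].split())
        elif line.startswith("ATOM") and line[21:22] == chain_id:
            # Residue number plus insertion code identifies one residue.
            key = line[22:27]
            if key not in seen:
                seen.add(key)
                atom.append(line[17:20].strip())
    names = seqres if seqres else atom
    # Assumption: anything outside the 20 standard residues becomes "X".
    return "".join(THREE_TO_ONE.get(name, "X") for name in names)
```

Calling chain_sequence(open("structure.pdb").read().splitlines(), "A") on a hypothetical file would return the SEQRES sequence of chain A when SEQRES records are present, and the sequence of residues seen in the ATOM records otherwise.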
b. Searching for homologous sequences
The server uses the PSI-BLAST [5] heuristic algorithm with default parameters to
collect homologous sequences of a single polypeptide chain of known 3D
structure. The search can be carried out using the following databases:
1. Swiss-Prot - a curated protein sequence database that strives to provide a
high level of annotation (default).
2. UniProt - the universal protein resource, a central repository of protein
data created by combining Swiss-Prot, TrEMBL and PIR.
3. Clean UniProt - a modified version of the UniProt database that aims to
retain only the more reliable sequences.
4. UniRef90 - clusters sequences and sub-fragments of 11 or more residues that
have at least 90% sequence identity with each other (from any organism) into a
single UniRef entry, displaying the sequence of a representative.
5. NR database - the 'non-redundant' database (i.e. with duplicated sequences
removed). NR contains non-redundant sequences from GenBank together with
sequences from other databanks (RefSeq, PDB, Swiss-Prot, PIR and PRF).
The user can also set the maximum number of homologs to collect and the number
of PSI-BLAST iterations. The default run is a single iteration of PSI-BLAST with
a maximum of 150 homologs and an E-value cutoff of 0.001. (The E-value is the
number of hits one can "expect" to see just by chance when searching a database
of a particular size. The higher the E-value cutoff, the more hits are
collected, but the pairwise distance between them and the query sequence
increases.) The user can also control the maximal percentage identity between
sequences, which sets the level at which redundant sequences are removed. The
sequences found are clustered by their level of identity using CD-HIT [6] and
the cutoff specified by the user (default level is 95% identity). The minimal
percentage identity for homologs is set by default to 35%, which is the upper
bound of the 'twilight zone' for protein structures [9]. See Figure 2 for the
MSA parameters screen on the web server.

Figure 2: Screenshot of the parameters window for building
the Multiple Sequence Alignment (MSA) on the ConSurf web server.
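The default selection criteria described in this section (E-value cutoff of 0.001, minimal identity of 35%, at most 150 homologs) can be sketched as a simple filter. The Hit record and the select_homologs function are hypothetical names, and keeping the hits with the lowest E-values when more than the maximum are found is an assumption; the CD-HIT redundancy removal step is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One search hit (hypothetical record, not a BLAST output format)."""
    name: str
    e_value: float
    identity: float  # percent identity to the query sequence


def select_homologs(hits, max_e_value=0.001, min_identity=35.0, max_homologs=150):
    """Keep hits that pass the E-value and identity cutoffs, up to a maximum."""
    kept = [
        h for h in hits
        if h.e_value <= max_e_value and h.identity >= min_identity
    ]
    # Assumption: when too many hits pass, prefer the most significant ones.
    kept.sort(key=lambda h: h.e_value)
    return kept[:max_homologs]
```

Raising max_e_value admits more distant hits, which matches the trade-off noted above between the number of hits and their distance from the query.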
Note: The best
performance is obtained when single domains are used. Using a polypeptide chain
that contains several domains often results in finding proteins that are
homologous to one of the domains but not to the others. Therefore it is
advisable to analyze multi-domain proteins one domain at a time.
c. Generating the phylogenetic tree
The server constructs a phylogenetic tree that is consistent with the available
MSA, using the neighbor-joining (NJ) algorithm [10] as implemented in the
Rate4Site program [7]. The server can also accept a user-provided phylogenetic
tree (in Newick format). This option can be used only if a corresponding MSA is
provided.
d. Calculating the amino acid conservation scores
The conservation score at a site corresponds to the site's evolutionary rate.
The rate of evolution is not constant among amino (nucleic) acid sites: some
positions evolve slowly and are commonly referred to as "conserved", while
others evolve rapidly and are referred to as "variable". The rate variations
correspond to different levels of purifying selection acting on these sites.
The purifying selection can result from geometrical constraints on the folding
of the protein into its 3D structure, from constraints at amino acid sites
involved in enzymatic activity or ligand binding, or from constraints at amino
acid sites that take part in protein-protein interactions.
In ConSurf, the rate of evolution at each site is calculated using either the
empirical Bayesian [11] or the Maximum Likelihood [12] paradigm. In both
methods, the stochastic process underlying the sequence evolution and the
phylogenetic tree are explicitly taken into account. The Bayesian method was
shown to significantly improve the accuracy of conservation score estimates
over the Maximum Likelihood method, in particular when a small number of
sequences is used for the calculations [11]. An additional advantage of the
Bayesian method is that a confidence interval is assigned to each of the
inferred evolutionary conservation scores.
e. Analyzing nucleic acid sequences
Despite increasing interest in the non-coding fraction of transcriptomes, the
number, level of conservation, and functions, if any, of many
non-protein-coding transcripts remain to be discovered. However, it has already
been shown that many non-coding sequences are connected to regulatory
processes. The new version of ConSurf estimates the evolutionary rate for each
position of a nucleic acid sequence in the same manner used for amino acid
residues. For that purpose, four evolutionary models were implemented in the
Rate4Site program: (i) the Jukes and Cantor 69 model (JC69), which assumes equal
base frequencies and equal substitution rates [13]; (ii) the Tamura 92 model,
which uses only one parameter, capturing variation in G-C content (18); (iii)
the HKY85 model, which distinguishes between transitions and transversions and
allows unequal base frequencies [14]; and (iv) the General Time Reversible (GTR)
model, which is the most general time-reversible model. The GTR parameters
consist of an equilibrium base frequency vector, giving the frequency at which
each base occurs at each site, and the rate matrix [15]. When enough data (i.e.
sequences) are available, the GTR model is superior to the simpler Tamura 92
model. However, the Tamura 92 model is recommended when the data are not
sufficient for reliable estimation of the model parameters, and it is therefore
the default option for analyzing nucleic acid sequences in ConSurf.
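To make the difference between these models concrete, the following sketch builds the instantaneous rate matrix of the HKY85 model in its standard textbook form: the rate from base x to base y is kappa times the frequency of y for transitions and the frequency of y for transversions. The function name hky85_rate_matrix is hypothetical, the matrix is not rescaled to a mean rate of one, and this is not the Rate4Site implementation.

```python
BASES = "ACGT"
# Transitions are purine-purine (A<->G) or pyrimidine-pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky85_rate_matrix(base_freqs, kappa):
    """Return the 4x4 HKY85 rate matrix as a list of rows, in ACGT order.

    base_freqs maps each base to its equilibrium frequency;
    kappa is the transition/transversion rate ratio.
    """
    matrix = []
    for i, x in enumerate(BASES):
        row = []
        for j, y in enumerate(BASES):
            if i == j:
                row.append(0.0)  # placeholder, filled in below
            elif (x, y) in TRANSITIONS:
                row.append(kappa * base_freqs[y])
            else:
                row.append(base_freqs[y])
        # Each row of a rate matrix sums to zero.
        row[i] = -sum(row)
        matrix.append(row)
    return matrix
```

With equal base frequencies and kappa set to 1, all off-diagonal rates become equal, which is the JC69 assumption described above.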
f. Confidence interval of the inferred conservation scores
Positions in the MSA that exhibit too little variation, whether because of too
few sequences or too little diversity among the sequence homologs, can render
evolutionary analysis meaningless [16]. When the Bayesian method is used to
calculate evolutionary conservation, confidence intervals for the conservation
score estimates are obtained [17]. When the number of sequences is small, the
confidence interval tends to be large, meaning a low level of support for the
inferred conservation score. As the number of sequences increases, confidence
intervals become smaller and the point estimates of the scores become more
reliable. In ConSurf, a confidence interval is assigned to each of the inferred
evolutionary conservation scores. The confidence interval is defined by the
lower and upper quartiles (the 25th and 75th percentiles of the inferred
evolutionary rate distribution, respectively). This measure gives the 50%
confidence interval and also indicates the dispersion of each estimated score.
Amino acid positions whose confidence intervals are too large to be trustworthy
are marked in the output files of the server.
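The quartile-based interval can be illustrated with a small sketch. It assumes that the inferred rate distribution at a site is available as a discrete set of candidate rates with posterior probabilities; that representation, and the function names, are assumptions made for illustration rather than a description of the server's internals.

```python
def posterior_quantile(rates, probs, q):
    """Smallest rate whose cumulative posterior probability reaches q."""
    pairs = sorted(zip(rates, probs))
    cumulative = 0.0
    for rate, prob in pairs:
        cumulative += prob
        # Small tolerance guards against floating-point round-off.
        if cumulative >= q - 1e-12:
            return rate
    return pairs[-1][0]


def confidence_interval(rates, probs):
    """Return the (25th percentile, 75th percentile) of the rate distribution."""
    lower = posterior_quantile(rates, probs, 0.25)
    upper = posterior_quantile(rates, probs, 0.75)
    return lower, upper
```

A posterior concentrated on a single rate yields an interval of zero width, whereas a flat posterior, as expected with few sequences, yields a wide one.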
g. Model of substitution for proteins
The inference of evolutionary conservation relies on a specified probabilistic
model of amino-acid replacements (7). The server supports several substitution
models for nuclear DNA-encoded proteins as well as models for non-nuclear
DNA-encoded proteins. The model can be chosen from the "Evolutionary
Substitution Model" drop-down list (see Figure 3). The JTT [17], Dayhoff [18]
and WAG [19] matrices are suited for nuclear DNA-encoded proteins. The WAG
matrix was inferred from a large database of sequences comprising a broad range
of protein families and is thus suited for distantly related amino acid
sequences [19]. The mtREV [20] and cpREV [21] matrices are suitable for
mitochondrial and chloroplast DNA-encoded proteins, respectively. A more recent
substitution matrix, LG [22], incorporates the variability of evolutionary rates
across sites in the matrix estimation and was shown to outperform other
substitution matrices for proteins. The LG matrix was added to Rate4Site and is
offered in the new version of ConSurf in addition to the previous substitution
models listed above.

Figure 3: Screenshot of the Evolutionary Substitution Model parameters on
the ConSurf web server.
h. Conservation Scores
The conservation scores calculated by ConSurf appear in the SCORE column of the
"Amino Acid Conservation Score" output file. The scores are normalized so that
the average score over all residues is zero and the standard deviation is one.
The conservation scores are a relative measure of evolutionary conservation at
each sequence site of the target chain. The lowest score represents the most
conserved position in a protein (DNA/RNA). It does not necessarily indicate
100% conservation (i.e. no mutations at all), but rather indicates that this
position is the most conserved in this specific protein (DNA/RNA), as
calculated using a specific MSA.
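The normalization described above is a standard z-score transformation, sketched below. The function name normalize_scores is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption.

```python
from statistics import mean, pstdev


def normalize_scores(raw_rates):
    """Rescale per-site rates to mean zero and standard deviation one."""
    mu = mean(raw_rates)
    sigma = pstdev(raw_rates)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all sites have the same rate; scores are undefined")
    return [(rate - mu) / sigma for rate in raw_rates]
```

After this step the slowest-evolving site has the lowest (most negative) score, which is why the lowest score marks the most conserved position.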
i. Coloring Scheme
The continuous conservation scores are partitioned into a discrete scale of 9
bins for visualization, such that bin 9 contains the most conserved positions
and bin 1 contains the most variable positions. The color grades (1-9) are
assigned as follows. The conservation scores below the average (negative
values, which indicate slowly evolving, conserved sites) are divided into 4.5
equal intervals. The same interval width is used for 4.5 intervals above the
average (positive values, which indicate rapidly evolving, variable sites).
Thus, 9 equally sized categories of conservation are obtained. Because the
conservation distribution is asymmetrical around the average, the range of
grade 1 is extended to include the most variable positions. Colors are then
assigned to the 9 grades for graphic visualization. With this procedure, the
width of each color grade varies between polypeptide chains. That is, the
coloring results of a ConSurf run do not indicate the absolute magnitudes of
evolutionary distances, but rather the relative degree of conservation of each
amino acid position. The ConSurf scaling procedure does not guarantee that
grades 1-8 will always be occupied, although grade 9 is always occupied by at
least one residue.
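One way to read this binning procedure is sketched below. The interval width is taken as the magnitude of the lowest (most conserved) score divided by 4.5, grades are counted down from 9 starting at that lowest score, and everything beyond the ninth interval is folded into grade 1. The function name color_grades and this exact treatment of interval boundaries are assumptions; the server's own code may differ in detail.

```python
import math


def color_grades(scores):
    """Map normalized conservation scores to grades 9 (conserved) .. 1 (variable)."""
    lowest = min(scores)
    if lowest >= 0:
        raise ValueError("expected normalized scores with a negative minimum")
    # 4.5 equal intervals span the range from the lowest score up to the average (0).
    width = -lowest / 4.5
    grades = []
    for score in scores:
        steps = math.floor((score - lowest) / width)
        # Grade 1 is extended to absorb all scores beyond the ninth interval.
        grades.append(max(1, min(9, 9 - steps)))
    return grades
```

Under this reading the position with the lowest score always receives grade 9, while the occupancy of the other grades depends on how the scores are distributed.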
j. Dealing with unreliable positions
Conservation scores obtained for alignment positions that have fewer than 6
un-gapped amino acids are considered unreliable. When the Bayesian method [11]
is used to calculate the conservation scores, confidence intervals around the
estimated rates are computed. The high and low values of each interval are
assigned color grades according to the 1-9 coloring scheme. If the interval at
a specific position spans 4 or more color grades, the score is considered
unreliable. Such positions are colored light yellow in the graphic
visualization output.
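The two reliability criteria can be combined in a short check, sketched below. The function names are hypothetical, and interpreting "spans 4 or more color grades" as a difference of at least 4 between the grades of the two interval ends is an assumption, as is treating only "-" as the gap character.

```python
def count_ungapped(column):
    """Number of residues in an MSA column, ignoring gap characters."""
    return sum(1 for residue in column if residue != "-")


def is_unreliable(ungapped_count, low_grade, high_grade,
                  min_ungapped=6, max_span=4):
    """Flag a position with too few residues or too wide a confidence interval.

    low_grade and high_grade are the color grades (1-9) assigned to the
    two ends of the position's confidence interval.
    """
    too_few = ungapped_count < min_ungapped
    # Assumption: the span is the absolute difference between the two grades.
    too_wide = abs(high_grade - low_grade) >= max_span
    return too_few or too_wide
```

Positions for which is_unreliable returns True would be the ones shown in light yellow.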
4. Output
If a 3D structure of the protein (DNA/RNA) is provided:
1. The nine-color conservation scores are projected onto the 3D structure of the
query protein, and the colored protein structure is shown by FirstGlance in Jmol
(http://firstglance.jmol.org).
2. Scripts for visualizing the protein colored with ConSurf scores are generated
for PyMol (http://www.pymol.org; [23]), Chimera [24], Jmol
(http://www.jmol.org/; [25]) and RasMol [26].
In all cases, ConSurf creates the following outputs:
1. The sequence and MSA colored by ConSurf conservation scores.
2. A text file that summarizes, for each position, the normalized score
calculated, the assigned color, the reliability estimation (for the Bayesian
method) and the amino acids/nucleotides observed in the respective MSA column.
3. The sequences selected for the MSA and the MSA constructed (unless those
files were uploaded by the user).
4. A file with the frequency of each amino acid/nucleotide observed in each
column of the MSA.
5. The evolutionary tree, which was calculated by the server or uploaded by the
user, shown using an interactive Java applet written for that purpose.
For proteins for which the 3D structure was not provided by the user, an
up-to-date version of the Protein Data Bank [8] is searched for relevant
homologues. If a structure of at least one homologous protein is available, the
user may map the conservation scores onto that structure. This option should
ease the procedure for non-expert users, who may be unfamiliar with the 3D
structure.
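As an illustration of the per-column frequency output listed above, the following sketch computes residue frequencies for each column of an alignment. The input representation (a list of equal-length aligned strings), the function name column_frequencies and the exclusion of gaps from the totals are assumptions; the layout of the actual output file is not reproduced.

```python
from collections import Counter


def column_frequencies(msa):
    """Return, for each MSA column, a dict mapping residue -> frequency."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for i in range(length):
        # Assumption: gaps ("-") are excluded from the frequency totals.
        counts = Counter(seq[i] for seq in msa if seq[i] != "-")
        total = sum(counts.values())
        result.append({res: n / total for res, n in counts.items()} if total else {})
    return result
```

The same pass over the columns also yields the number of un-gapped residues per position, which is the quantity used in the reliability criterion of section j.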
5. Example
As an example, we provide the main output of a ConSurf run for the N-terminal
region of the GAL4 transcription factor in yeast (PDB ID: 3COQ, chains A and B)
in complex with its DNA recognition site (Figure 4). The analysis revealed, as
expected, that the functional regions of this protein are highly conserved. For
example, all the cysteines that form the Zn(2)-C6 DNA binding domain (CYS11,
CYS14, CYS21, CYS28, CYS31, CYS38; [27]) were assigned the highest conservation
scores. Likewise, PRO26, which is known to be central for DNA binding [28], is
also highly conserved according to our analysis. In addition, other amino acid
residues that are in contact with the DNA (i.e. GLN9, LYS17, LYS18, LYS20,
ARG15, LYS23; [29]) are relatively conserved. ConSurf was also applied to
nucleic acid sequences from yeast that are the known binding sites of GAL4,
together with their adjacent neighborhood (Figure 4). As anticipated, the
analysis revealed that the consensus pattern CGG-N11-CCG typical of the GAL4
binding site is highly conserved. An extended full ConSurf analysis of this
example is available in the 'GALLERY' section on the ConSurf web site.

Figure 4. A ConSurf analysis for the GAL4 transcription factor and its
DNA binding site. The 3D structure of the N-terminal region of the GAL4
transcription factor in yeast bound to the DNA is presented using a
space-filled model. The amino acids and the nucleotides are colored by their
conservation grades using the color-coding bar, with turquoise-through-maroon
indicating variable-through-conserved. Positions for which the inferred
conservation level was assigned with low confidence are marked in light
yellow. The figure reveals that the functionally important regions on both the
DNA and the protein are highly conserved. The run was carried out using PDB
code 3COQ and the figure was generated using the PyMol [23] script output by
ConSurf.
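The consensus pattern CGG-N11-CCG mentioned in the example (CGG, any 11 nucleotides, then CCG) can be located in a DNA sequence with a regular expression, as in the sketch below. The function name find_gal4_sites is hypothetical and the search is a plain pattern match on one strand, not part of the ConSurf analysis itself.

```python
import re

# CGG, then any 11 nucleotides, then CCG. The lookahead lets
# overlapping occurrences be reported as well.
GAL4_SITE = re.compile(r"(?=(CGG[ACGT]{11}CCG))")


def find_gal4_sites(dna):
    """Return (start index, matched 17-mer) for each CGG-N11-CCG occurrence."""
    return [(m.start(), m.group(1)) for m in GAL4_SITE.finditer(dna.upper())]
```

Start indices are zero-based, and lower-case input is accepted because the sequence is upper-cased before matching.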
6. References
1. Glaser, F., Pupko, T., Paz, I., Bell, R.E., Bechor-Shental, D., Martz, E. and Ben-Tal, N. (2003) Bioinformatics, 19, 163-164.
2. Mayrose, I., Graur, D., Ben-Tal, N. and Pupko, T. (2004) Mol. Biol. Evol., 21, 1781-1791.
3. Pupko, T., Bell, R.E., Mayrose, I., Glaser, F. and Ben-Tal, N. (2002) Bioinformatics, 18, S71-S77.
4. Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schaffer, A.A. and Yu, Y.K. (2005) FEBS J., 272, 5101-5109.
5. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Nucleic Acids Res., 25, 3389-3402.
6. Li, W. and Godzik, A. (2006) Bioinformatics, 22, 1658-1659.
7. Pupko, T., Bell, R.E., Mayrose, I., Glaser, F. and Ben-Tal, N. (2002) Bioinformatics, 18, S71-S77.
8. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) Nucleic Acids Res., 28, 235-242.
9. Rost, B. (1999) Protein Eng., 12, 85-94.
10. Saitou, N. and Nei, M. (1987) Mol. Biol. Evol., 4, 406-425.
11. Mayrose, I., Graur, D., Ben-Tal, N. and Pupko, T. (2004) Mol. Biol. Evol., 21, 1781-1791.
12. Martz, E. (2002) Trends Biochem. Sci., 27, 107-109.
13. Tamura, K. (1992) Mol. Biol. Evol., 9, 678-687.
14. Hasegawa, M., Kishino, H. and Yano, T. (1985) J. Mol. Evol., 22, 160-174.
15. Tavare, S. (1986) Lect. Math. Life Sci., 17, 57-86.
16. Thornton, J.M., Todd, A.E., Milburn, D., Borkakoti, N. and Orengo, C.A. (2000) Nat. Struct. Biol., 7, 991-994.
17. Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992) Comput. Appl. Biosci., 8, 275-282.
18. Dayhoff, M.O., Hunt, L.T., Barker, W.C., Schwartz, R.M., Orcutt, B.C. and Young, C.L. (eds.) (1978) Atlas of Protein Sequence and Structure.
19. Whelan, S. and Goldman, N. (2001) Mol. Biol. Evol., 18, 691-699.
20. Adachi, J. and Hasegawa, M. (1996) J. Mol. Evol., 42, 459-468.
21. Adachi, J., Waddell, P.J., Martin, W. and Hasegawa, M. (2000) J. Mol. Evol., 50, 348-358.
22. Le, S.Q. and Gascuel, O. (2008) Mol. Biol. Evol., 25, 1307-1320.
23. DeLano, W.L. (2008) DeLano Scientific LLC, Palo Alto, CA, USA.
24. Pettersen, E.F., Goddard, T.D., Huang, C.C., Couch, G.S., Greenblatt, D.M., Meng, E.C. and Ferrin, T.E. (2004) J. Comput. Chem., 25, 1605-1612.
25. Herráez, A. (2006) Biochem. Mol. Biol. Educ., 34, 255-261.
26. Sayle, R.A. and Milner-White, E.J. (1995) Trends Biochem. Sci., 20, 374.
27. Pan, T. and Coleman, J.E. (1990) Proc. Natl Acad. Sci. USA, 87, 2077-2081.
28. Johnston, M. (1987) Nature, 328, 353-355.
29. Marmorstein, R., Carey, M., Ptashne, M. and Harrison, S.C. (1992) Nature, 356, 408-414.