Mutual interactions
between proteins and peptides, nucleic acids or ligands play a vital
role in every biological process. Thus, a detailed understanding of the
mechanism of these processes requires the identification of
functionally important amino acids at the protein surface that are
responsible for these interactions. The ConSurf server (1)
is a useful
and user-friendly tool that enables the identification of functionally
important regions on the surface of a protein or domain, of known
three-dimensional (3D) structure, based on the phylogenetic relations
between its close sequence homologues.
Methodology
Given the 3D structure of a protein or a domain as input, the server extracts the sequence from the PDB (2) file and automatically searches for close homologous sequences using PSI-BLAST (3). It then aligns them using MUSCLE (5) by default; the user can choose to perform the multiple sequence alignment using CLUSTALW (4) instead. The server then builds a phylogenetic tree consistent with the multiple sequence alignment (MSA) using the neighbor joining (NJ) algorithm (6), as implemented in the Rate4Site program (7), and calculates the conservation scores using either an empirical Bayesian (8) or the Maximum Likelihood (7) method. Alternatively, a user-provided MSA and a phylogenetic tree can be processed. In this case, the PSI-BLAST (3) search, the alignment computed by MUSCLE (5) or CLUSTALW (4), and the reconstruction of the tree are skipped. Finally, the protein, with the conservation scores color-coded onto its surface, can be visualized online using FirstGlance in Jmol.
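The order of these stages, and which of them are skipped when the user supplies an MSA or a tree, can be sketched as follows. This is only an illustration of the flow described above: the function `plan_stages` and its stage labels are hypothetical names, not part of the ConSurf server.

```python
def plan_stages(user_msa=False, user_tree=False, aligner="MUSCLE", method="Bayesian"):
    """Return the ordered stage names for one run (hypothetical helper)."""
    # A user-provided tree is only accepted together with a matching MSA.
    if user_tree and not user_msa:
        raise ValueError("a user-provided tree requires a user-provided MSA")

    stages = ["extract sequence from PDB file"]
    if not user_msa:
        # Without a user MSA, homologues are searched for and then aligned.
        stages.append("PSI-BLAST homologue search")
        stages.append(f"{aligner} alignment")
    if not user_tree:
        # Without a user tree, one is rebuilt from the MSA.
        stages.append("neighbor-joining tree (Rate4Site)")
    stages.append(f"{method} conservation scores")
    stages.append("color-code scores onto the 3D structure")
    return stages


default_run = plan_stages()
custom_run = plan_stages(user_msa=True, user_tree=True)
print(default_run)
print(custom_run)
```

With both an MSA and a tree supplied, the search, alignment and tree-building stages drop out and only the scoring and visualization stages remain.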
More information about standard PDB files is available in the PDB File Format Contents Guide.
Searching for homologous sequences
The server uses the PSI-BLAST (3)
heuristic algorithm with default parameters
to collect homologous sequences of a single polypeptide chain of known
3D structure. The search is carried out against the SWISS-PROT
database (10) or the full UNI-PROT
knowledgebase (SWISS-PROT + TrEMBL),
using by default a single iteration of PSI-BLAST
with an E-value
cutoff of 0.001. The E-value is a parameter that describes the number of
hits one can "expect" to see just by chance when searching a database
of a particular size. The higher the E-value cutoff, the more hits are
collected, but the more distant from the query sequence they tend to
be. Both the number of iterations and the E-value cutoff can
be changed from the server Home Page. The homologue names extracted from
the PSI-BLAST output file are the original SWISS-PROT key names found by
PSI-BLAST (e.g. XXXX_YYY). If more than one match is found for the same
sequence, the first one keeps its original SWISS-PROT name while
the others receive a sequential number (e.g. XXXX_YYY_1).
The best performance is obtained when
single domains are used. Using a polypeptide chain that contains
several domains often results in finding proteins that are homologous
to one of the domains but not to the others. Therefore it is advisable
to analyze multi-domain proteins one domain at a time.
Generating the multiple sequence alignment
The server uses MUSCLE (5)
with default parameters to align the homologues extracted from
the PSI-BLAST output file. MUSCLE
(5) was shown to achieve the highest
scores on four alignment accuracy benchmarks (as of 2004) (5).
Thus, in ConSurf version 3, MUSCLE (5) replaced CLUSTALW
(4) as the default algorithm to
compute the multiple sequence
alignments. The user can also choose to perform the alignment using CLUSTALW
(4) with default parameters. The
server can also
accept a user-provided multiple sequence alignment in any of the seven formats
supported by CLUSTALW
(4): NBRF/PIR,
EMBL/SwissProt, Pearson
(Fasta), GDE, Clustal, GCG/MSF and RSF. For more information on
these formats, see The
Format Converter.
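The search conventions that determine which sequences reach the alignment can be sketched as follows: hits are filtered by the E-value cutoff, and repeated names are made unique with a sequential suffix. The function names and the hit list are hypothetical. The `expected_hits` formula is the standard BLAST relation E = m * n * 2^(-S') between bit score S', query length m and database length n; it is not taken from the ConSurf documentation and ignores BLAST's length corrections.

```python
def expected_hits(bit_score, query_len, db_len):
    """Standard BLAST relation E = m * n * 2**(-S'), without length corrections."""
    return query_len * db_len * 2.0 ** (-bit_score)


def filter_hits(hits, e_cutoff=0.001):
    """Keep (name, e_value) pairs at or below the cutoff (assumed inclusive)."""
    return [(name, e) for name, e in hits if e <= e_cutoff]


def unique_names(names):
    """First occurrence keeps its name; later ones get _1, _2, ... appended."""
    seen = {}
    result = []
    for name in names:
        count = seen.get(name, 0)
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return result


# Hypothetical hits: (SWISS-PROT style key name, E-value).
hits = [("XXXX_YYY", 1e-30), ("XXXX_YYY", 1e-12), ("AAAA_BBB", 0.5), ("CCCC_DDD", 1e-4)]
kept = filter_hits(hits)
names = unique_names([name for name, _ in kept])
print(names)

# For a fixed bit score, the expected number of chance hits grows with database size.
print(expected_hits(40, 250, 1e8), expected_hits(40, 250, 1e9))
```

Raising `e_cutoff` keeps more hits, including the weaker match `AAAA_BBB` in this toy list, which is the trade-off between the number of homologues and their distance from the query.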
The server constructs a phylogenetic
tree that is consistent with the available MSA, using the neighbor joining
(NJ) algorithm (6), as implemented in the Rate4Site
program (7).
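A single step of the textbook neighbor joining algorithm is sketched below to illustrate how the tree is grown from pairwise distances. This is not the Rate4Site implementation; the function name `nj_pick_pair` and the five-taxon distance matrix are illustrative assumptions. The step computes Q(i,j) = (n-2)·d(i,j) - Σ_k d(i,k) - Σ_k d(j,k), joins the pair with the smallest Q, and derives the two branch lengths to the new internal node.

```python
def nj_pick_pair(names, dist):
    """Return (name_i, name_j, q, len_i, len_j) for the pair joined first.

    `dist` is a symmetric matrix with zeros on the diagonal; needs n > 2 taxa.
    """
    n = len(names)
    totals = [sum(row) for row in dist]  # sum of distances from each taxon

    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)

    q, i, j = best
    # Branch lengths from the joined taxa to their new parent node.
    len_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    return names[i], names[j], q, len_i, len_j


# Illustrative distances between five hypothetical sequences.
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
first_join = nj_pick_pair(names, dist)
print(first_join)  # ('a', 'b', -50, 2.0, 3.0)
```

A full run repeats this step, replacing the joined pair with the new node and updating the distances, until the tree is complete.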
The server can also accept a user-provided phylogenetic tree (in Newick
format). This option can be used only if a corresponding MSA is
provided.
Calculating the amino acid conservation scores
The
conservation score at a site corresponds to the site's evolutionary
rate. The rate of evolution is not constant among amino acid sites:
some positions evolve slowly and are commonly referred to as
"conserved", while others evolve rapidly and are referred to as
"variable". The rate variations correspond to different levels of
purifying selection acting on these sites. The purifying selection can
be the result of geometrical constraints on the folding of the protein
into its 3D structure, constraints at amino acid sites involved in
enzymatic activity or in ligand binding or, alternatively, at amino
acid sites that take part in protein-protein interactions.
Positions in the
MSA that exhibit too little variation, because there are too few sequences or too
little diversity among the sequence homologues, can render evolutionary
analysis meaningless (12). When the Bayesian method is used
to calculate
evolutionary conservation, confidence intervals for the conservation
score estimates are obtained (11). When the
number of sequences is
small, the confidence interval tends to be large, indicating a low level
of support for the inferred conservation score. As the number of
sequences increases, confidence intervals become smaller, and the point
estimates become more reliable. In ConSurf, a confidence interval is
assigned to each of the inferred evolutionary conservation scores. The
confidence interval is defined by the lower and upper quartiles (the
25th and 75th percentiles of the inferred evolutionary rate
distribution, respectively). This measure gives the 50% confidence
interval and also indicates the dispersion of each of the estimated
scores. Amino acid positions whose confidence intervals
are too large to be trustworthy are marked in the output files of
the server.
The inference of
evolutionary conservation relies on a specified probabilistic model of
amino-acid replacements (7). The server supports a
few models of
substitution for nuclear DNA-encoded proteins as well as models for
non-nuclear DNA-encoded proteins. The model of substitution can be
chosen from the "Model of substitution for proteins" drop-down list,
which is available under the Advance Options on the server Home Page. The
JTT (13), Dayhoff (14) and WAG (15) matrices are suited for nuclear
DNA-encoded proteins. The WAG matrix has been inferred from a large
database of sequences comprising a broad range of protein families and
is thus suited for distantly related amino acid sequences (15).
The
mtREV (16) and cpREV (17)
matrices are suitable for mitochondrial, and
chloroplast DNA-encoded proteins, respectively. Examples of using the
mtREV matrix for a mitochondrial protein, or the cpREV matrix for a
chloroplast protein, versus using the JTT matrix, are displayed in
Figures 1 and 2, respectively. It is evident from the pictures that the differences
between ConSurf calculations using different matrices tend to be small
but not negligible.
Figure 1: The conservation pattern obtained using ConSurf for the
dihydrolipoamide dehydrogenase of glycine decarboxylase from Pisum
sativum (PDB code 1dxl (19)). The dihydrolipoamide
dehydrogenase is
presented as a space-filled model, and colored according to the
conservation scores. The color-coding bar shows the coloring scheme. A
flavin-adenine dinucleotide, located at the active site, is colored
green. The Bayesian method (8) was applied for the
calculations of the
conservation scores using the JTT matrix (13) (A) or
the mtREV matrix (16) (B) as the model of
substitution for
proteins.
Figure 2: The conservation
pattern obtained using ConSurf for the ferredoxin reductase (PDB code
1bx0 (20)). The reductase is presented as a
space-filled model, and
colored according to the conservation scores. The color-coding bar
shows the coloring scheme. Sites whose inferred conservation
level was assigned with low confidence are colored light yellow. A
flavin-adenine dinucleotide is colored green, and phosphate and
sulphate ions are colored dark green. The Bayesian method (8)
was
applied for the calculations of the conservation scores using the JTT
matrix (13) (A) or the cpREV matrix (17)
(B) as the model of
substitution for proteins.
The conservation scores
calculated by ConSurf appear in the SCORE column in the "Amino Acid Conservation
Score" output file. The scores are normalized, so that the average
score for all residues is zero, and the standard deviation is one. The
conservation scores calculated by ConSurf are a relative measure of
evolutionary conservation at each sequence site of the target chain.
The lowest score represents the most conserved position in a protein.
It does not necessarily indicate 100% conservation (i.e. no mutations
at all), but rather indicates that this position is the most conserved
in this specific protein calculated using a specific MSA.
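The normalization described here, together with the quartile-based confidence interval described earlier, can be sketched as follows. This is a minimal illustration and not the server's code: the function names are hypothetical, the use of the population standard deviation is an assumption, and the interval is computed from a plain list of sampled rates rather than from the rate distribution inferred by the Bayesian method.

```python
import statistics


def normalize(rates):
    """Shift and scale rates so that their mean is 0 and standard deviation is 1."""
    mean = statistics.fmean(rates)
    sd = statistics.pstdev(rates)  # population SD; an assumption of this sketch
    return [(r - mean) / sd for r in rates]


def quartile_interval(rate_samples):
    """Return (25th percentile, 75th percentile) of one site's rate samples."""
    q1, _median, q3 = statistics.quantiles(rate_samples, n=4)
    return q1, q3


# Hypothetical per-site evolutionary rates for a five-residue chain.
rates = [0.2, 0.5, 1.0, 1.8, 3.5]
scores = normalize(rates)

# The slowest-evolving site gets the lowest (most conserved) score.
most_conserved_site = scores.index(min(scores))
print(most_conserved_site, [round(s, 2) for s in scores])

# Hypothetical rate samples for one site; a wide interval means low confidence.
interval = quartile_interval(list(range(1, 100)))
print(interval)  # (25.0, 75.0)
```

Because the scores are relative to the mean and spread of one protein under one MSA, they are not comparable across different proteins or alignments.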
Coloring Scheme
The continuous conservation scores are partitioned into
a discrete scale of 9 bins for visualization, such that bin 9 contains
the most conserved positions and bin 1 contains the most variable
positions. The color grades (1-9) are assigned as follows:
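As a rough illustration only, one simple way to partition normalized scores into nine grades is sketched below. It assumes equal-width bins between the lowest and highest score, which is an assumption of this sketch; the server's actual bin boundaries are not given in this text and may differ. The function name `color_grades` is hypothetical.

```python
def color_grades(scores, n_bins=9):
    """Map scores to grades n_bins..1; the lowest score (most conserved) gets n_bins."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # No variation at all: place every site in the middle grade.
        return [(n_bins + 1) // 2] * len(scores)

    width = (hi - lo) / n_bins
    grades = []
    for s in scores:
        # Bin index 0 holds the lowest scores; clamp the maximum into the last bin.
        idx = min(int((s - lo) / width), n_bins - 1)
        grades.append(n_bins - idx)
    return grades


# Nine hypothetical scores, evenly spaced from most conserved to most variable.
example = color_grades([float(k) for k in range(9)])
print(example)  # [9, 8, 7, 6, 5, 4, 3, 2, 1]
```

The inversion `n_bins - idx` reflects the convention stated above: low scores mean slow evolution, so they receive the highest grade.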
In each run, ConSurf produces an output file called "ConSurf Job Status Page". This file is automatically updated every 30 seconds, showing messages about the different stages of the server activity. When the calculation finishes, several links appear, including "View ConSurf Results" with FirstGlance in Jmol or with Protein Explorer.
Comparison with other servers
In this section we compare the ConSurf server results for the anti-apoptotic Bcl-XL protein with those of two other available web servers: Evolutionary Trace and MSA3D.
Protein Explorer offers the MSA3D service, which accepts a user-provided multiple protein sequence alignment and uses it to color the 3D protein structure.
Evolutionary Trace (ET) server
The ET server automates the algorithmic ideas of the Evolutionary Trace (20,21) analysis.
The ConSurf server was able to identify both the Bak binding region and the BH4 binding region using the 53 Protomap input sequences.
B - MSA3D server
C - ET server