ConSurf Quick Help
- PDB ID
- User-provided PDB file
- Chain identifier
- Find sequence homologues
- Minimal identity (automatic removal of remote homologues)
- Maximal identity between sequences (removing redundant sequences)
- Maximum number of homologues
- Select sequences manually
- User-provided multiple sequence alignment (MSA)
- User-provided phylogenetic tree file
- Graphic visualization
- Reliability of the results
- Minimal requirements for a successful ConSurf run
PDB ID
Each structure in the PDB is represented by a 4-character alphanumeric identifier, assigned upon its deposition. For example, 1bxl and 1d66 are the identification codes of the PDB entries for the Bcl-Xl / Bak complex and for Gal4 (residues 1 - 65) in complex with a 19mer DNA, respectively.
For more information check the PDB home page.
User-provided PDB file
The ATOM records are essential for a ConSurf run and must therefore be included in the user-provided PDB file. The CSI-BLAST or PSI-BLAST search for homologues uses the sequence extracted from the SEQRES records, or from the ATOM records if the SEQRES records are missing.
Chain identifier
To run ConSurf you should specify a chain identifier (A, B, C, etc., as given in the corresponding field of the PDB file). If no chain is specified in the PDB file, please type 'none' in the chain field.
One way to get the chain identifier is to display the molecule in FirstGlance in Jmol and click on the chain of interest. The chain identifier is reported in the message box at the lower left frame.
You can also find the chain identifiers of a standard PDB file on the third field (column 12) of the SEQRES records or the sixth field (column 22) of the ATOM records. The chain identifier may be any single legal character, including a blank character, which is used if there is only one chain.
For more information check the PDB File Format Contents Guide.
Find sequence homologues
For a protein sequence, homologous sequences are collected using a CSI-BLAST, PSI-BLAST or BLAST search against the selected database, according to the user-specified criteria for defining homologues described below.
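The column positions mentioned above can be read directly from a PDB file. The following is a minimal sketch, not part of ConSurf; the function name `chain_ids` is hypothetical, and it assumes a standard fixed-column PDB file.

```python
def chain_ids(pdb_text):
    """Collect chain identifiers in order of first appearance.

    Reads column 12 of SEQRES records and column 22 of ATOM records
    (1-based columns, as in the fixed-column PDB format).
    Returns (seqres_chains, atom_chains).
    """
    seqres_chains, atom_chains = [], []
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES") and len(line) >= 12:
            cid = line[11]  # column 12 is index 11
            if cid not in seqres_chains:
                seqres_chains.append(cid)
        elif line.startswith("ATOM") and len(line) >= 22:
            cid = line[21]  # column 22 is index 21
            if cid not in atom_chains:
                atom_chains.append(cid)
    return seqres_chains, atom_chains
```

A blank character in the returned lists corresponds to a file with no chain specified, in which case 'none' should be typed in the chain field.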
These fields are irrelevant for a user-provided MSA file.
Several databases can be searched for homologous sequences:
SWISS-PROT - a curated protein sequence database which strives to provide a high level of annotation
Clean UniProt - a modified version of the UniProt database, filtered to keep the more reliable sequences.
UniRef90 - a database that clusters sequences and sub-fragments of 11 or more residues that have at least 90% sequence identity with each other (from any organism) into a single UniRef entry, displaying the sequence of a representative.
UniProt is the universal protein resource, a central repository of protein data created by combining SWISS-PROT, TrEMBL and PIR.
NCBI nr is all non-redundant GenBank CDS translations + PDB + SWISS-PROT + PIR + PRF.
For a nucleic acid sequence, the search is performed using the BLAST algorithm with default parameters to collect homologous sequences from the NCBI NT database.
The Expectation value (E-value) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E-value describes the random background noise that exists for matches between sequences. The lower the E-value, the more significant the score is. The Expectation value is used as a convenient way to create a significance threshold for reporting results. For example, an E-value of 1 assigned to a hit means that in a database of the current size one might expect to see 1 match with a similar score simply by chance. Using a higher E-value cutoff will generally yield more hits, but the additional hits will tend to be more distant from the query sequence.
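The exponential dependence of the E-value on the score can be illustrated with the Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The sketch below is illustrative only: the function name is hypothetical, and the default values of lambda and K are placeholder assumptions, since the real values depend on the scoring matrix and gap penalties used by the search.

```python
import math

def expected_hits(score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul estimate of the number of chance hits.

    E = K * m * n * exp(-lambda * S)
    lam and k are illustrative assumptions, not ConSurf settings.
    """
    return k * query_len * db_len * math.exp(-lam * score)
```

With the other parameters fixed, raising the score lowers E exponentially, while doubling the database size doubles E. This is why the same alignment can be significant in a small database and not in a large one.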
Minimal identity (automatic removal of remote homologues)
The user can control the minimal level of sequence identity at which a hit sequence is still considered a homologue. Filtering by the sequence identity between each hit and the sequence of interest removes sequences that produce a significant alignment with the sequence of interest but might have a different function or structure. For proteins, the default level is set to 35% identity, which is the upper bound of the 'twilight zone' for protein structures (Rost, 1999).
Maximal identity between sequences (removing redundant sequences)
The user can specify the level of redundant sequences for removal. The sequences found are clustered by their level of identity using CD-HIT and the cutoff specified by the user (default is 95% identity). Only one sequence from each cluster is used for the analysis.
Maximum number of homologues
The maximum number of homologues to be included in the calculation, after applying all the filters described above. To include all the homologues that remain after those filters, replace the default value with the word "all".
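The three filters described above (minimal identity, removal of redundant sequences, maximum number of homologues) can be sketched as a single pass over the hits. This is a simplified illustration, not the ConSurf implementation: it assumes the hits are already aligned to the query and sorted best first, it replaces CD-HIT with a simple greedy redundancy check, and all function and parameter names are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity between two aligned, equal-length sequences.

    Columns that are gaps in both sequences are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * matches / len(pairs)

def filter_homologues(query, hits, min_id=35.0, max_id=95.0, max_n=None):
    """Apply the three filters to hits, a list of (name, aligned_seq).

    max_n=None plays the role of the word "all" in the web form.
    """
    kept = []
    for name, seq in hits:
        # Minimal identity: drop remote homologues.
        if percent_identity(query, seq) < min_id:
            continue
        # Maximal identity: drop hits redundant with one already kept
        # (greedy stand-in for CD-HIT clustering).
        if any(percent_identity(seq, other) >= max_id for _, other in kept):
            continue
        kept.append((name, seq))
    # Maximum number of homologues.
    return kept if max_n is None else kept[:max_n]
```

Keeping the first member of each redundant group mirrors the idea that only one sequence per cluster enters the analysis, although CD-HIT itself chooses cluster representatives differently.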
Manual selection of sequences for the analysis
After searching for homologous sequences, the user can manually select the relevant sequences to be included in the analysis using a simple form that provides all the relevant data for the sequences found, together with links to external web resources.
User-provided multiple sequence alignment (MSA)
ConSurf accepts external MSAs in the 7 formats supported by BioPerl.
These are: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF and RSF format. For additional information see Format Converter (use "View in browser" choice) links.
If you provide an external MSA file, you are required to fill in the "Query sequence name in MSA file" field in the ConSurf form. This is the name of the sequence in the provided MSA file that corresponds to the query chain of the selected PDB structure.
In case you provide an external MSA file in Fasta format, please use the "-" sign as the only gap symbol, as this is the only standard gap sign that ConSurf accepts.
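Before uploading a Fasta MSA, the two requirements above (the query sequence name must be present, and "-" must be the only gap symbol) can be checked locally. The following is a minimal sketch with hypothetical function names; it assumes that the sequence name is the first word of each header line and that residues are written as letters.

```python
def read_fasta_msa(text):
    """Parse Fasta text into a dict mapping sequence name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the first word of the header.
            name = (line[1:].split() or [""])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def check_msa(records, query_name):
    """Return a list of problems found; an empty list means none."""
    problems = []
    if query_name not in records:
        problems.append("query sequence name not found: " + query_name)
    if len({len(seq) for seq in records.values()}) > 1:
        problems.append("aligned sequences differ in length")
    for name, seq in records.items():
        # Anything that is neither a letter nor "-" is reported.
        bad = sorted({c for c in seq if not (c.isalpha() or c == "-")})
        if bad:
            problems.append(name + ": unexpected symbols " + "".join(bad))
    return problems
```

A file that uses "." or "~" for gaps would be reported by this check, and the symbols can then be replaced with "-" before submission.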
User-provided phylogenetic tree file
If you provide an external multiple sequence alignment (MSA) file, ConSurf can also accept a corresponding external phylogenetic tree in Newick (Phylip) format.
The names of the sequences in the tree file must be identical to the names of the sequences in the MSA file.
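The name-matching requirement above can be checked before submission. The sketch below extracts leaf names from a Newick string with a regular expression and compares them with the MSA names. It is a simplification with hypothetical function names: it assumes unquoted leaf names that contain no whitespace, parentheses, colons, commas or semicolons.

```python
import re

def newick_leaf_names(newick):
    """Leaf names are the labels that directly follow '(' or ','."""
    return set(re.findall(r"[(,]\s*([^\s():,;]+)", newick))

def compare_names(newick, msa_names):
    """Return (names only in the MSA, names only in the tree), sorted."""
    leaves = newick_leaf_names(newick)
    msa = set(msa_names)
    return sorted(msa - leaves), sorted(leaves - msa)
```

Both returned lists should be empty for a tree and an MSA that describe the same set of sequences.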
Graphic visualization
The target protein chain is represented as a space-filling model with the conservation grades color-coded onto the van der Waals surface of each amino acid. All other chains in the PDB file are displayed in backbone representation, and all ligands are presented in ball-and-stick representation.
More on this topic can be found here.
Reliability of the results
The quality of the results depends on many factors, but the two most important ones are the quality and the number of the homologues. If the input list of homologues does not include a minimal number of sufficiently close homologues, the quality of the multiple sequence alignment (MSA) decreases, which in turn affects the quality of the phylogenetic tree and the rest of the calculation. On the other hand, if homologues that are too remote are used, the conservation signal may weaken due to background noise added to the MSA. We therefore recommend performing several runs with an increasing number of homologues (10, 30, 50, 100, etc.) and comparing the conserved regions in the protein. Generally, reliable conserved regions will progressively cluster. In any case, we recommend a minimum of 10 homologues for any ConSurf run.
Position specific quality
PSI-BLAST is a local alignment engine, which means that it finds homologues that are generally shorter than the query. For this reason the multiple sequence alignment (MSA) usually includes many non-informative regions, or gaps. The quality of the calculation for each amino acid position, i.e., each column in the MSA, depends on the total number of informative, non-gapped residues in that column. We report the number of sequences that are available for computation at each position, referred to as "MSA DATA", in the output file "Amino Acid Conservation Score". When there are fewer than 6 non-gapped residues at a certain position, we consider the conservation score at this position unreliable, and it is marked in the output files. When the Bayesian method is used to calculate the conservation scores, a confidence interval is obtained for each score estimate. The high and low values of the interval are assigned color grades according to the 1-9 coloring scheme. If the interval is equal to or larger than 4 color grades, the conservation score is regarded as unreliable and is also marked in the output files.
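The two reliability criteria described above can be sketched as follows. This is an illustration with hypothetical function names, not the ConSurf code. It assumes the MSA is given as equal-length rows with "-" as the gap symbol, and it measures the interval width as the difference between the high and low color grades, which is one possible reading of "4 color grades".

```python
def informative_counts(msa_rows):
    """Number of non-gapped residues in each MSA column."""
    return [sum(1 for c in column if c != "-") for column in zip(*msa_rows)]

def unreliable_positions(msa_rows, grade_intervals=None, min_data=6, max_span=4):
    """Flag each column as unreliable (True) or not (False).

    grade_intervals: optional list of (low_grade, high_grade) pairs on
    the 1-9 scale, one per column, from the Bayesian method.
    """
    flags = [n < min_data for n in informative_counts(msa_rows)]
    if grade_intervals is not None:
        for i, (low, high) in enumerate(grade_intervals):
            # Assumption: width = difference between the two grades.
            if abs(high - low) >= max_span:
                flags[i] = True
    return flags
```

A position is flagged if either criterion applies, so a well-populated column can still be marked unreliable when its confidence interval is wide.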
Minimal requirements for a successful ConSurf run
When using a 3D structure:
A protein structure in PDB format and the chain identifier.
The allowed length difference between the SEQRES-derived sequence (NSEQRES) or the MSA-extracted sequence (NMSA) and the ATOM-derived sequence (NATOM):
When NSEQRES or NMSA < NATOM: The maximal difference allowed is 10%. For example, if there are 100 residues in the ATOM list, the job is rejected if the SEQRES has fewer than 90 residues.
When NSEQRES or NMSA > NATOM: The maximal difference allowed is 80%. For example, if SEQRES lists 100 residues, there must be at least 20 residues in the ATOM records, or else the job is rejected.
Identity of at least 60% between the ATOM-derived sequence and the SEQRES- or MSA-extracted sequence (as calculated by CLUSTALW).
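The length rules above can be written as a small check. This is a sketch that follows the two worked examples in the text, where the difference is taken relative to the longer of the two sequences; the function name is hypothetical, and the 60% identity requirement is not covered here.

```python
def length_difference_ok(n_atom, n_other):
    """Check the allowed length difference.

    n_atom:  number of residues in the ATOM-derived sequence (NATOM)
    n_other: number of residues in the SEQRES- or MSA-derived
             sequence (NSEQRES or NMSA)
    Integer arithmetic avoids floating-point rounding at the limits.
    """
    if n_atom <= 0 or n_other <= 0:
        return False
    if n_other < n_atom:
        # At most 10% shorter than the ATOM-derived sequence.
        return 10 * (n_atom - n_other) <= n_atom
    # ATOM-derived sequence may be at most 80% shorter.
    return 10 * (n_other - n_atom) <= 8 * n_other
```

For example, `length_difference_ok(100, 90)` and `length_difference_ok(20, 100)` both pass, matching the limits given in the two examples, while 89 and 19 residues respectively would be rejected.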
For all ConSurf runs:
At least 5 homologous sequences. By default the homologues are found automatically (obtained from PSI-BLAST). Please note that a position-specific conservation score is automatically considered unreliable when it is based on fewer than 6 homologues.