ConSurf Frequently Asked Questions (FAQ)

1. What is the minimum number of sequences required to get reliable results?

2. What if I get only a few homologous sequences for my query protein?

3. What are the advantages of using a phylogenetic tree?

4. Should I use the Bayesian or the Maximum Likelihood method?

5. Is it possible for the extreme grades 1 and 9 to be unoccupied? Which conditions give this result?

6. What should I do if I find problems uploading external multiple sequence alignment (MSA) files?

7. What is the maximal duration of a ConSurf run?

8. How can I produce a ConSurf picture?

9. Are there pre-calculated conservation grades available? What is ConSurf-HSSP?

10. What if I do not have a structure?

11. How reliable are the conservation scores?

12. What is the best way to collect homologous sequences in order to construct an MSA?
Which protein databases should be used?
How does ConSurf do it?

  1. What is the minimum number of sequences required to get reliable results?

    There is no exact answer to this question, because it depends on the variability of the sequences. Nevertheless, as a rule of thumb we recommend a minimum of 10 homologues. If PSI-BLAST finds fewer than that, you can try raising the E-value cutoff, for example from the default 0.001 to 0.01. However, the additional sequences found may be phylogenetically distant from the query sequence, which can reduce the quality of the generated multiple sequence alignment. In addition, remote homologues add background noise to the alignment, which may weaken the conservation signal. You can control the number and diversity of the input sequences by providing your own alignment file.
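
    As a rough illustration of the rule of thumb above, the following Python sketch counts the records in a FASTA-formatted alignment and checks them against the 10-homologue minimum. It is not part of ConSurf. The function names are hypothetical, and it assumes that the file is in FASTA format and that the query itself is one of the records, so it is not counted as a homologue.

```python
MIN_HOMOLOGUES = 10  # rule-of-thumb minimum quoted in the answer above


def count_fasta_sequences(text: str) -> int:
    """Count records in FASTA text (one '>' header line per sequence)."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def enough_homologues(text: str, minimum: int = MIN_HOMOLOGUES) -> bool:
    """Return True if the alignment holds the query plus at least `minimum` homologues.

    Assumption: the query sequence is itself one of the records,
    so it is subtracted from the total before comparing.
    """
    return count_fasta_sequences(text) - 1 >= minimum


if __name__ == "__main__":
    example = ">query\nACDEFG\n>homologue_1\nACDEFA\n"
    print(count_fasta_sequences(example), enough_homologues(example))
```

    If the check fails, the options described above apply: raise the E-value cutoff, or supply your own alignment.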

  2. What if I get only a few homologous sequences for my query protein?

    One of the issues in retrieving homologous sequences is the choice of database for the search. For example, certain organisms are represented mostly in the TrEMBL database and not in SwissProt. The UniProt database contains sequences from both SwissProt and TrEMBL, so a possible solution is to run ConSurf with the UniProt database.
    Another option is to raise the "PSI-BLAST E-value Cutoff" parameter. The results will then be less reliable and should be examined more carefully.

  3. What are the advantages of using a phylogenetic tree?

    • All databases over-represent certain families or species to some degree (HIV, human, etc.). Phylogenetic trees deal better with redundancy than methods that analyze the multiple sequence alignment directly, because clusters of closely related sequences are weighted differently. This clustering diminishes the influence of redundant sequences.

    • The phylogenetic tree together with the model of sequence evolution describe the evolutionary processes that generated the multiple sequence alignment. By taking the tree explicitly into account when computing conservation scores, it is possible to better identify the amino-acid replacements that could have occurred in the history of a family of homologous sequences, thus increasing the accuracy of the calculation.

    • The phylogenetic tree is a prerequisite for probabilistic approaches such as the maximum-likelihood and Bayesian methods. The branch lengths of the tree, for example, are needed for computing the probabilities of amino-acid replacements. These methods increase the statistical reliability of the calculations and, in the Bayesian method, allow us to compute confidence intervals around the estimated conservation scores.

  4. Should I use the Bayesian or the Maximum Likelihood method?

    The Bayesian method was shown to significantly improve the accuracy of conservation score estimates over the Maximum Likelihood method. This is particularly important when a small number of sequences is used for the calculations. An additional advantage of the Bayesian method is that a confidence interval is assigned to each of the inferred evolutionary conservation scores. A detailed description of the Bayesian methodology is provided in Mol. Biol. Evol., 21, 1781-1791; 2004. A detailed description of the Maximum Likelihood methodology is provided in Bioinformatics, 18, S71-77; 2002. Rate4Site, a stand-alone application that implements both algorithms, is available on the Rate4Site page. We recommend running the target protein with both methods, which provides a double check of your results.

  5. Is it possible for the extreme grades 1 and 9 to be unoccupied? Which conditions give this result?

    Grades 1-8 can be unoccupied, although this will occur rarely, such as when ConSurf finds few homologues. Grade 9 is always occupied by at least one residue.

  6. What should I do if I find problems uploading external multiple sequence alignment (MSA) files?

    • Check your MSA file in a simple text editor (e.g. Notepad on Windows). MSA files downloaded from the web very commonly contain unnecessary characters. Remove them, and save your file as text only.
    • The text formats of PC / Unix and Mac machines are not fully compatible. If you are running the ConSurf server from a Mac and you get repeated error messages, we recommend that you save your file using Word as an "MS-Dos" text file. This format should be compatible with Dos and Unix text files.
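
    The two checks above can also be done with a short script. The sketch below is an illustration only, not a ConSurf tool. The function name is hypothetical, and it assumes that the alignment should contain only printable ASCII characters, which is a guess about what counts as "unnecessary". It converts Mac and Dos line endings to Unix ones and drops everything else that is not printable.

```python
def clean_msa_text(raw: bytes) -> str:
    """Return the alignment as plain ASCII text with Unix line endings.

    Assumption: a valid MSA file needs only printable ASCII, tabs and
    newlines; anything else is treated as debris and removed.
    """
    # Drop non-ASCII bytes such as byte-order marks or "smart" quotes.
    text = raw.decode("ascii", errors="ignore")
    # Normalize Dos (\r\n) and old Mac (\r) line endings to Unix (\n).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Keep printable ASCII plus tab and newline; remove control characters.
    cleaned = "".join(
        ch for ch in text if ch in "\n\t" or 32 <= ord(ch) < 127
    )
    # Strip trailing whitespace on each line and end with one newline.
    lines = [line.rstrip() for line in cleaned.split("\n")]
    return "\n".join(lines).strip("\n") + "\n"


if __name__ == "__main__":
    messy = b"\xef\xbb\xbf>seq1\r\nAC-GT\x00\r>seq2\rAC-GA  \r\n"
    print(clean_msa_text(messy), end="")
```

    Inspect the cleaned file in a text editor before uploading it, since an aggressive filter can also remove characters you meant to keep.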

  7. What is the maximal duration of a ConSurf run?

    A ConSurf run will be automatically terminated after 96 hours.

  8. How can I produce a ConSurf picture?

    • Copy the image directly from Chime.

      The method depends upon whether you want the saved ConSurf views to be rotatable with the mouse, or simply a static snapshot. Here are instructions for presenting rotatable images. Instructions for copying and pasting static snapshots are below: 

      Start a Protein Explorer (PE) session and set the window to the size you desire. After you get the desired image of your molecule, click on the small MDL icon below the molecular image to pop up Chime's menu. Click on Edit, Copy. This copies the image to the clipboard. Now go to the program into which you wish to paste the image, and paste it in from the clipboard.

      See the "Saving images from PE" link for more information and other methods of saving images from PE and the "Printing publication-quality images" link to obtain high quality pictures from PE.

      When publishing figures from ConSurf, we recommend including our visual color key.

    • Producing a high-resolution picture using PyMOL.

      On your output results page, follow the link "Create a high resolution PyMOL picture".

    • To produce a final high-quality picture we recommend one of the "spheres" or "surface" representations:

      "pymol>>show spheres, xxx"
      "pymol>>show surface, xxx"

      and then ray-trace the image with 'ray' and save the picture:

      "pymol>>ray"
      "pymol>>png xxx.png"

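    The PyMOL commands above can be collected into a small command script so that the same figure can be regenerated later. The Python sketch below only writes out the text of such a script; it does not call PyMOL itself. The function name and the file name figure.pml are hypothetical, and "xxx" stands for your own selection name, as in the commands above.

```python
def build_pymol_script(selection: str, representation: str, png_path: str) -> str:
    """Return PyMOL commands that show, ray-trace and save a picture.

    Only the two representations recommended above are accepted.
    """
    if representation not in ("spheres", "surface"):
        raise ValueError("representation must be 'spheres' or 'surface'")
    commands = [
        f"show {representation}, {selection}",  # choose the representation
        "ray",                                  # ray-trace the current view
        f"png {png_path}",                      # save the rendered image
    ]
    return "\n".join(commands) + "\n"


if __name__ == "__main__":
    script = build_pymol_script("xxx", "surface", "xxx.png")
    with open("figure.pml", "w") as handle:  # hypothetical output file name
        handle.write(script)
    print(script, end="")
```

    With the structure already loaded, the resulting file can then be run from the PyMOL prompt with "@figure.pml".
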
  9. Are there pre-calculated conservation grades available? What is ConSurf-HSSP?

    The ConSurf-HSSP database provides pre-calculated results. A description of the database can be found in the paper "The ConSurf-HSSP database: The mapping of evolutionary conservation among homologs onto PDB structures", PROTEINS: Structure, Function, and Bioinformatics 58:610-617.

  10. What if I do not have a structure?

    In such a case you can use our ConSeq server (Bioinformatics. 2004. 20: 1322-1324). Alternatively, you can run ConSurf with a model structure.

  11. How reliable are the conservation scores?

    The computation of conservation scores has been tested in several ways. First, we tested it using simulations:
    Mayrose, I., Graur, D., Ben-Tal, N., and Pupko, T. 2004. Comparison of site-specific rate-inference methods: Bayesian methods are superior. Mol. Biol. Evol. 21(9): 1781-1791.
    The inferred rates appear to converge to the true values as the number of sequences increases.

    We also evaluated the impact of a wrong tree topology on rate estimates: Mayrose, I., Mitchell, A., and Pupko, T. 2005. Site-specific evolutionary rate inference: taking phylogenetic uncertainty into account. J. Mol. Evol. 60(3):345-353.

    The rates inferred by the Rate4Site algorithm are also usually highly correlated with Ka/Ks scores computed from the coding DNA sequences.

    Finally, and perhaps most importantly, the method seems to give biologically reasonable results.

    On the other hand, one can never be sure that the code (which is very large in Rate4Site) is free of bugs.

  12. What is the best way to collect homologous sequences in order to construct an MSA?
    Which protein databases should be used?
    How does ConSurf do it?

    The answer to this question is not straightforward; it depends on the information we want to obtain from the multiple sequence alignment (MSA). For a general conservation analysis of a certain protein family/superfamily, where the aim is to point out the major active site or structural elements that are important for determination of the overall fold, it is advised to collect as many homologues as possible. In contrast, when searching for a functional site that is specific for a certain family/sub-family, the alignment should be smaller and include only close homologues that share the function. The caveat is that, in reality, we might not know which of the proteins share the same function.
    Generally, there are two kinds of homologues to collect:

    1. Orthologs: the same protein from different species. These proteins perform the same function and therefore share high sequence similarity. Orthologs should be used to analyze specific functions that are shared by a protein family/sub-family.
    2. Paralogs: homologous proteins found within the same species. These proteins diverged from a common ancestor, but their function may have changed to some degree during evolution. Thus, they share lower sequence similarity, and should be used to explore the major structural and functional characteristics of the protein family/superfamily.

    In our experience, the SwissProt database is the better choice when searching for close homologues. It yields a small collection of highly annotated sequences from a small number of species. For a broader analysis (more species and more sequences, which are less well annotated), the UniProt database should be used.
    The choice of database should also depend on the query protein. If it belongs to a large family such as the kinases or proteases, SwissProt, which provides hundreds of sequences with a low BLAST E-value, is sufficient. If it is a less abundant protein, UniProt is recommended.

    The issue of redundancy is not crucial when using ConSurf, since the algorithm that calculates the conservation scores takes the phylogenetic relationships within the alignment into account. This is one of the strongest qualities of ConSurf. It means that conservation of a position among proteins from species that are more evolutionarily distant is more significant, and vice versa. For example, a position conserved between human and chimpanzee is not necessarily noteworthy, since the two species diverged recently. On the other hand, if the position is conserved in human and in flies or worms, this may indicate importance for structure or function. Nevertheless, if a certain family is over-represented in the alignment (namely, it has more orthologs in the database), this could affect the construction of the alignment, although the effect is usually not crucial.
    To remove redundancy, one may use the alignment to create a sequence identity matrix between each pair of sequences and remove sequences that are more than 85%-90% identical to another sequence. This would probably remove most close orthologs, especially when analyzing a protein domain.
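
    The redundancy filter described above can be sketched in a few lines of Python. This is one possible reading, not the procedure used by ConSurf. The function names are hypothetical, identity is computed only over columns where neither sequence has a gap, and sequences are filtered greedily in input order, so the query should be listed first to guarantee that it is kept.

```python
from typing import Dict, List


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip gapped columns (an assumption of this sketch)
        compared += 1
        if x.upper() == y.upper():
            identical += 1
    return identical / compared if compared else 0.0


def remove_redundant(alignment: Dict[str, str], threshold: float = 0.90) -> List[str]:
    """Greedy filter: keep a sequence only if its identity to every
    sequence already kept is at most `threshold`."""
    kept: List[str] = []
    for name, seq in alignment.items():
        if all(pairwise_identity(seq, alignment[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    aln = {
        "query": "ACDEFGHIKL",
        "near": "ACDEFGHIKV",
        "far": "ACDQQQHIKL",
        "dup": "ACDEFGHIKL",
    }
    print(remove_redundant(aln, threshold=0.85))
```

    The threshold is given as a fraction, so 0.85 and 0.90 correspond to the 85%-90% range mentioned above. The rows must come from the same alignment so that all sequences have equal length.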

    Finally, it is worth mentioning that there are databases of ready-made multiple sequence alignments that one could use, for example HSSP, PFAM, HOMSTRAD and PROTOMAP. A multiple sequence alignment from one of these databases may be used as is, or taken as a basis for constructing a more suitable one, e.g., by eliminating unwanted sequences.

