Tools for tracing protein evolution


Proteins evolve structurally from other proteins. Almost all proteins share one or more structural features with proteins that are not their direct homologues. These structural similarities are often reflected in the amino acid sequences of the proteins, and can be detected by sequence analysis programs. The following are some of the programs that yield useful information about common protein motifs, along with a "recipe" for their use. You can follow the instructions in this frame and conduct your searches and analyses in the right-hand frame.


Getting the protein sequence

A number of sequence analysis programs let the user enter an accession number to retrieve the sequence to be analysed, but in general it is best to have the sequence you want to analyse in hand. One of the best places to get it is the National Center for Biotechnology Information (NCBI) web page.
1. Go to the Entrez section.
2. Click on Proteins.
3. Enter the name of the protein you want in the text field and click Search. You can also limit your search by selecting a field to search, e.g. setting Search Field to Protein Name restricts hits to sequences of your protein and excludes records that merely mention its name.
4. Click on Retrieve Sequences at the next screen.
5. Browse the results to find your desired protein sequence. If you need more information about a sequence, click on GenPept Report.
6. Once you have chosen a sequence, save it in FASTA format, which is the format accepted by most analysis programs. You can access it by clicking FASTA in the entry for your sequence in the list of hits.
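
A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. If you want to check a saved file before pasting it into an analysis program, the following is a minimal sketch of a FASTA reader. The function name parse_fasta and the example record are hypothetical, made up for illustration; they are not part of any NCBI tool.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated into one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record, for illustration only.
example = """>example_protein hypothetical record
MKTAYIAKQR
QISFVKSHFS
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

The sequence string returned here (with the header stripped) is what you would paste into the text boxes of the programs described below, although most of them also accept the full FASTA record.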



Finding known conserved domains

Known domains in a protein sequence have usually been well studied and characterized. Before you begin to look for new evolutionary relationships, it is very useful to examine the known ones and set up a general framework of how your protein evolved. The following procedures will help you with this:

A. ScanProsite domain search

ScanProsite is part of a suite of sequence analysis programs at the Swiss Institute of Bioinformatics. It searches a given protein sequence for occurrences of the patterns and domains listed in the Prosite database.
1. Go to the ScanProsite submission page.
2. Copy and paste your protein sequence into the text box.
3. Check the "Exclude patterns with a high probability of occurrence" box. This will reduce hits based on short amino acid patterns, which often match purely by chance. You might want to uncheck the box if you are looking for a short pattern in your sequence.
4. Examine the results. Each hit has a link to more information about the pattern, so explore these links to learn more about the patterns found and about the other proteins that contain them. Note that the page describing a certain pattern or signature has a field marked Consensus Pattern. You can use this pattern for PHI-BLAST searches, described in the next section.
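
To see what a Consensus Pattern means, it can help to translate it into a regular expression and scan a sequence yourself. The following is a rough sketch, not ScanProsite's own implementation. It assumes the common Prosite pattern syntax: elements separated by "-", "x" for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, and "<" or ">" anchoring the pattern to the start or end of the sequence. Rarer constructs are not handled. The function names and the test sequences are hypothetical.

```python
import re

# One pattern element: a residue spec plus an optional repeat count.
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    prefix = suffix = ""
    if pattern.startswith("<"):      # anchored to the N-terminus
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):        # anchored to the C-terminus
        suffix, pattern = "$", pattern[:-1]
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: %r" % element)
        residue, lo, hi = m.groups()
        if residue == "x":
            core = "."                          # any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"   # excluded residues
        else:
            core = residue                      # literal or [..] class
        if lo and hi:
            core += "{" + lo + "," + hi + "}"
        elif lo:
            core += "{" + lo + "}"
        parts.append(core)
    return prefix + "".join(parts) + suffix


def scan(pattern, sequence):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# N-glycosylation site pattern, scanned against a made-up sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(scan("N-{P}-[ST]-{P}.", "AANGTLNPSA"))
# P-loop pattern, scanned against a short illustrative sequence.
print(scan("[AG]-x(4)-G-K-[ST].", "MTEYKLVVVGAGGVGKSALT"))
```

A quick scan like this is also a way to confirm that a pattern really occurs in your sequence before you use it in a PHI-BLAST search.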

B. PHI-BLAST at NCBI

PHI-BLAST (Pattern-Hit Initiated BLAST) is a program that lets you specify a pattern that occurs in your protein sequence. PHI-BLAST then searches the NCBI sequence databases for other proteins that contain the pattern and are also similar to your sequence in the region around it. PHI-BLAST thus helps you find proteins that share a pattern/domain with your protein sequence of interest.
1. Go to the PHI-BLAST page.
2. Copy and paste your amino acid sequence into the box marked Enter here your amino acid sequence.
3. Paste the Consensus Pattern that you found in the ScanProsite search into the Pattern for use in PHI-BLAST box.
4. Click Submit Query and examine the list of hits you get. Note any groups of proteins that are homologous to one another but not to your protein -- these are good indications of a family related to your protein through your pattern.

C. PSI-BLAST at NCBI

Section under construction.


Finding unknown conserved domains

Finding previously unknown domains that your protein shares with others is of course the most exciting part of the process. It is also the hardest; it is almost impossible to decide where to start. Some suggestions are given below:
1. Look at the three-dimensional structure of the protein using RasMol (a how-to for this should be up shortly). Find the functionally important parts -- this requires a good knowledge of the biology of the protein. Then use the corresponding sections of the protein sequence in BLAST and PSI-BLAST searches.
2. Use the program Divide-and-BLAST. More information about the program and its use can be found here.
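
For the first suggestion, you need the part of the sequence that corresponds to the region you picked out in the structure. The following is a small sketch that cuts a residue range out of a sequence and formats it as a FASTA record ready to paste into a BLAST form. It assumes the residue numbers in your structure match positions in your sequence, which is not always true, so check the numbering first. The function name region_to_fasta and the example values are hypothetical. This is not how Divide-and-BLAST itself works.

```python
def region_to_fasta(header, sequence, start, end, width=60):
    """Return residues start..end (1-based, inclusive) as a FASTA record."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("residue range is outside the sequence")
    fragment = sequence[start - 1:end]
    # Record the range in the header so hits can be traced back to it.
    lines = [">%s residues %d-%d" % (header, start, end)]
    # Wrap the sequence at a fixed line width, as FASTA files usually do.
    lines += [fragment[i:i + width] for i in range(0, len(fragment), width)]
    return "\n".join(lines)


# Illustrative values only.
record = region_to_fasta("example_protein", "MTEYKLVVVGAGGVGKSALT", 10, 17)
print(record)
```

Very short fragments tend to give many chance hits, so it is usually worth including some flanking residues on either side of the region of interest.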

Presenting your results

Section under construction.


Comments? Questions? Contact rakarnik@davidson.edu.