The first part of the study investigated the IDH1 protein sequence to clarify its evolutionary history, while simultaneously standardizing a procedure for investigating other proteins. IDH1 is a good candidate for such a study, since it is found in a wide variety of organisms and has been generally conserved.
Using publicly available tools (BLAST and its derivatives, and ScanPROSITE), I found that IDH1 is closely related not only to the other two types of IDH, but also to isopropylmalate dehydrogenase and tartrate dehydrogenase. Unfortunately, the BLAST family of tools did not reveal which part of IDH1 is conserved with each class of related protein. ScanPROSITE does indicate the position of patterns, but it found only one pattern; moreover, it searches for known conserved amino acid sequences and cannot be used for de novo searches for conserved regions.
Divide-and-BLAST (DAB), on the other hand, localized matches to very small regions of the input protein sequence, and can thus be used to narrow down the conserved region to a particular part of that sequence. For example, there are six hits for IDH3 genes (mitochondrial, NAD dependent) in the DAB results for the E. coli protein. The six matches come from widely varying organisms, and therefore indicate a strong possibility that this region of the IDH1 gene is conserved between NADP dependent and NAD dependent IDHs. Although a similar pattern of matches is not seen in the DAB output for the other IDH1 sequences or the consensus sequence, we must remember that IDH1 is the only kind of IDH found in E. coli (and other prokaryotes); therefore any similarity to IDH3 localized to a particular region of the E. coli IDH1 gene probably represents an evolutionary relationship. DAB results based on the other sequences might not show these matches because the eukaryotic IDH genes are too closely related to one another: the IDH3 hits show up as matches with the full sequence, and are therefore subtracted from the sub-sequence match list. A point worth noting is that using the most "primitive" sequence is sometimes better than using a consensus sequence to represent the whole family, because generating the consensus sequence blurs evolutionary time and could result in lost information. This probably applies more to clarifying relationships between closely related proteins than to finding remote similarities.
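The subtraction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DAB's actual code: the function names, the fixed piece length, and the use of plain strings as hit identifiers are all hypothetical, and the BLAST searches themselves are assumed to have been run separately.

```python
def split_into_subsequences(sequence, length):
    """Cut a protein sequence into consecutive, non-overlapping pieces.

    The fixed piece length is an assumption made for this sketch; the
    last piece may be shorter than the others.
    """
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


def unique_subsequence_hits(full_hits, hits_by_subsequence):
    """Keep only the sub-sequence hits that the full-sequence search missed.

    full_hits: identifiers matched by the complete sequence.
    hits_by_subsequence: dict mapping a sub-sequence number to the list of
        identifiers matched by that sub-sequence alone.
    """
    already_found = set(full_hits)
    return {
        number: [hit for hit in hits if hit not in already_found]
        for number, hits in hits_by_subsequence.items()
    }
```

With this filter, a protein that already matches the full sequence (as IDH3 does for the eukaryotic IDH1 sequences) never appears in the sub-sequence list, which is consistent with the IDH3 hits being visible only in the E. coli results.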
DAB not only localizes known evolutionary relationships, but may indicate the presence of hitherto unknown ones as well. There are several matches for heat shock proteins (HSPs) in subsequence 10 of the E. coli protein. Heat shock proteins are known to have ATPase activity, and IDH1 binds to NADP, a molecule closely related to ATP. Perhaps the binding of both IDH1 and HSPs to adenosine phosphates is the basis for an evolutionary relationship between HSPs and IDH1, but until structural and functional evidence corroborates DAB's results, the hypothesis can be neither accepted nor dismissed outright. The lack of any matches with HSPs in any of the other IDH1 sequences tends to lower the probability of a relationship.
The DAB results for the E. coli gene also illustrate another important point: spurious results can sometimes look very legitimate. For instance, note the matches listed for epidermal growth factor receptor in subsequence 9. These results all represent a single Drosophila gene with several database listings, and do not truly represent an evolutionary relationship.
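One way to guard against this kind of artifact is to collapse database listings that refer to the same gene before counting matches. The sketch below assumes each hit has already been reduced to an (accession, organism, gene) tuple; that record format and the function name are hypothetical and are not part of DAB's output.

```python
from collections import defaultdict


def group_listings_by_gene(hits):
    """Group database listings that describe the same gene in the same organism.

    hits: iterable of (accession, organism, gene) tuples (assumed format).
    Returns a dict mapping (organism, gene) to the list of accessions, so
    that several listings of one gene count as a single piece of evidence.
    """
    grouped = defaultdict(list)
    for accession, organism, gene in hits:
        grouped[(organism, gene)].append(accession)
    return dict(grouped)
```

If a set of matches collapses to a single (organism, gene) key, as the epidermal growth factor receptor matches would, the apparent pattern rests on one gene rather than on several independent sequences.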
Some other results obtained with DAB merit following up. The human and mouse sequences had a large number of matches with phosphoglycerate kinase, another ATP binding enzyme. No such matches were seen with the other three sequences or with the consensus sequence, which detracts from their validity. Still, an examination of the tertiary structure of the two proteins in the matching region might show commonality of function, if not a true evolutionary relationship.
Several matches were also observed with proteins that bind to ADP, ATP, DNA or RNA. Since ADP and ATP are closely related to NADP by structure, the similarities could indicate a general adenosine phosphate binding motif. The matches occurred in different parts of the IDH1 sequence, and perhaps this indicates that the motif involves two or more parts of IDH1 that are close to the NADP binding site. Similarly, a few matches were observed with reductases and dehydrogenases from various species. These hits might indicate a conserved dehydrogenase domain, and should be examined further.
BLAST indicated that the caspases most closely related to caspase 3 were caspases 7 and 6, in that order. Other caspases in the list of matches were caspases 2, 8, 9 and 1. Most of the alignments, especially the lower ranking ones, began after the first 40 amino acids of caspase 3, suggesting that these first 40 amino acids are poorly conserved.
ScanPROSITE results showed that caspase 3 shared its two active site sequences with other caspases, and using the active site patterns for PHI-BLAST produced results almost identical to the BLAST results. This suggests that any non-caspase proteins evolutionarily related to caspase 3 do not share the active site region of the sequence.
Looking at the DAB results for the different caspase 3 sequences, we see a consistent pattern of matches with caspase 1 (also called interleukin-1 beta converting enzyme) around the 160th amino acid, e.g. in the G. gallus caspase 3 results. This probably relates to the fact that the cysteine active site of caspase 3 is located in that region, and caspase 3 shares this active site sequence with almost all the caspases. None of the other caspases match in this area, probably because only caspase 1 has so many different sequences in the NCBI database. Most caspase sequence matches occur in the full sequence BLAST results, and so are not seen in the unique match list. Since caspase 1 has so many different entries in the database, some of those that do not appear in the full sequence BLAST results show up in the subsequence match list and thus in the DAB output.
Another similarity found by DAB that may be indicative of a true homology was the cathepsin S match in the R. norvegicus results. Cathepsin S is a cysteine protease like caspase 3, and therefore might conceivably be evolutionarily related to it.
No other consistent pattern was seen across all the caspase 3 sequences, but there were some interesting results in individual sequences. Caspase 3 from X. laevis matched a capsid coat protein in three different caliciviruses. Perhaps this indicates that the calicivirus coat protein mimics the caspase 3 protein of its host, although the hosts of these caliciviruses were all mammals, whereas X. laevis is an amphibian.
Another curious result was the group of matches of the consensus sequence with vitamin D receptor; the completely different functions of caspase 3 and the receptor make homology unlikely, yet we cannot rule out some sort of duplication-insertion event, or a loss of function mutation in either protein.
We had hoped that using DAB would help to find prokaryotic ancestors of caspase 3. Several hits were observed with bacterial and archaeal proteins of unknown function, and perhaps these are the evolutionary precursors of eukaryotic caspases. When functions for these proteins are discovered, we will be in a better position to comment on these results. The 20S proteasome subunit also matched more than one caspase 3 sequence, and in different parts of the protein as well; this similarity should be investigated further.
In the course of our use of DAB, we thought of several possible improvements to the program. One convenient feature would be the addition of hyperlinks to alignment information alongside each match. Since the BLAST program does generate this information, it might not be too hard to implement, although the increased complexity might detract from DAB's ease of use.
Another useful feature would be a graphical interface to the program, using Java or the Perl/Gtk toolkit. The command line is unfamiliar to many modern computer users, and quite alien to Macintosh users, for example. A graphical front-end for entering information and setting the various options would make DAB easier to use and accessible to a wider variety of users and operating systems.
The objective of DAB was to facilitate the detection of remote similarities between proteins, which might suggest evolutionary relationships and/or functional similarities. In this respect, DAB did succeed, providing information that was difficult to find when using the NCBI BLAST tools. DAB results provided important information for both the detection of remote similarities and the localization of close ones.
Unfortunately, DAB cannot and does not yield conclusive results. This limitation is due mainly to the similarity search programs it relies on. When similarities are weak, the BLAST programs have trouble distinguishing a true evolutionary relationship from a similarity occurring purely by chance. Part of the difficulty lies in the reliance on the primary structure of the protein, and the programs could be improved by the incorporation of secondary (and in the future, tertiary) structure prediction. Initial efforts have already begun, with the development of the Vector Alignment Search Tool (VAST) by NCBI (Bryant and Hogue, 1996). However, VAST depends upon the protein structure having been determined by X-ray crystallography, and thus searches only a small subset of the proteins in the NCBI databases. Until the protein folding problem has a satisfactory computational solution, similarity detection will remain most effective for closely related sequences, with the detection of more remote relationships dependent upon a combination of computational approaches (like DAB) and laboratory approaches.
The usefulness of DAB will increase with the growth and development of the genomic sequence databases. As the databases increase in size, the amount of information that a researcher has to examine will increase, and DAB can help narrow down the search. In addition, as more is learned about the thousands of proteins that are already in the databases but have no assigned function, DAB results will grow more meaningful.
DAB is not the end of a long road towards tracing the evolution of a protein; rather, it is the first exploratory step. It suggests avenues of investigation, which can be explored with more powerful computational tools or conventional laboratory techniques. DAB is also a productivity tool, automating a process that could take a few hours manually. Until the perfect similarity search program is developed, programs like DAB provide an interim method to tailor the existing search programs for specific applications.