Canadarm is a team of graduate students working and competing to introduce and test an efficient data mining method for the protein characterization problem. The team members are Koosha Golmohammadi (me) and my friend Brendan Crowley.
What is the project about?
This project is a team-based competition among graduate students of the Electrical and Computer Engineering Department of the University of Alberta.
The goal of the project is to find and test the most efficient method for building an expert system that predicts protein type. The system will predict the type of any given protein based on 2000 protein instances whose types are known.
- Studying related work
- Making a list of features
- Trying to find a relation between features, revising and grouping them
- Making a relational database to represent the known space
- Testing different data mining methods
First two weeks (13-27 February 2007)
Brendan and I (Koosha) started by going through two recent papers that represent the latest research in the project's field.
- Using stacked generalization to predict membrane protein types based on pseudo-amino acid composition, by S. Wang, J. Yang, K. Chou
- Using ensemble classifier to identify membrane protein types, by H. Shen, K. Chou
Here is a short report on each of the papers above, followed by our conclusion and the next step.
The authors of the first paper use a method called "stacked generalization", or "stacking", to predict the type of cell membrane proteins. The method uses a high-level model to combine lower-level models in order to achieve greater predictive accuracy. Base classifiers first predict the class values; these predictions are then fed into a meta (higher-level) classifier, which uses them to make the final prediction. As base classifiers the authors chose a support vector machine (SVM) and instance-based learning (IBL); the meta classifier was the C4.5 decision tree. The results of the paper show that the stacking approach performed remarkably well in the jackknife and independent cross-validation tests, which makes it a good choice for a protein type prediction model.
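To make the two-level idea concrete, here is a minimal sketch of stacking in plain Python. It is not the setup from the paper: a nearest-centroid rule stands in for the SVM, a 1-nearest-neighbour rule plays the instance-based learner, and a simple lookup table (majority label for each combination of base predictions) stands in for the C4.5 meta classifier. The toy data and all function names are made up for illustration.

```python
from collections import Counter

def nearest_neighbour(train, x):
    """Instance-based learner: label of the closest training point (1-NN)."""
    closest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return closest[1]

def nearest_centroid(train, x):
    """Simple stand-in for the SVM: label of the closest class mean."""
    sums, counts = {}, Counter()
    for features, label in train:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value

    def distance(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sorted(sums), key=distance)

BASE_LEARNERS = [nearest_neighbour, nearest_centroid]

def fit_stacking(train):
    """Train the meta level on leave-one-out predictions of the base learners."""
    table = {}
    for i, (features, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]  # hide the instance being predicted
        key = tuple(learner(rest, features) for learner in BASE_LEARNERS)
        table.setdefault(key, Counter())[label] += 1
    return table

def predict_stacking(train, table, x):
    """Base learners predict first; the meta table makes the final decision."""
    key = tuple(learner(train, x) for learner in BASE_LEARNERS)
    if key in table:
        return table[key].most_common(1)[0][0]
    return key[0]  # unseen combination: fall back to the first base learner

# Hypothetical toy data: two features per instance, two protein "types".
toy_train = [
    ([0.0, 0.0], "A"), ([0.1, 0.2], "A"), ([0.2, 0.1], "A"),
    ([1.0, 1.0], "B"), ([0.9, 1.1], "B"), ([1.1, 0.9], "B"),
]
meta_table = fit_stacking(toy_train)
print(predict_stacking(toy_train, meta_table, [0.05, 0.05]))
```

The important detail is that the meta level is trained on predictions made for instances the base learners did not see (leave-one-out here), so it learns how far to trust each base learner rather than memorizing the training labels.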
Conclusion and future work
Both papers essentially use the sequence number of amino acids (AAs) to predict the protein type, so we started with this feature: we wrote a simple C++ program that extracts the sequence number of AAs from the training dataset and generates an input file in Weka format.
The C++ code is now available, and so is the result file that the program produces from the training dataset. We tested the generated file in Weka and it worked properly; however, we cannot rely on the Weka results based on this file, because it contains only one feature.
As our next step, we are focusing on finding new features to add to our system and on extracting new information from the training dataset.
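For readers who have not seen Weka's input format, the extraction step can be sketched roughly as follows. This is not our C++ program. It is a small Python illustration that assumes the feature is the number of occurrences of each of the 20 standard amino acids in a sequence (one possible reading of "sequence number of AAs", treated as a single feature group). The relation name, attribute names, sequences and type labels are all invented.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def composition(sequence):
    """Count how often each standard amino acid occurs in the sequence."""
    seq = sequence.upper()
    return [seq.count(aa) for aa in AMINO_ACIDS]

def to_arff(records, relation="membrane_proteins"):
    """Render (sequence, type) pairs as the text of a Weka ARFF file."""
    classes = sorted({label for _, label in records})
    lines = ["@relation " + relation, ""]
    # One numeric attribute per amino acid (hypothetical attribute names).
    lines += ["@attribute count_%s numeric" % aa for aa in AMINO_ACIDS]
    # The class attribute is nominal: its allowed values are listed in braces.
    lines.append("@attribute type {%s}" % ",".join(classes))
    lines += ["", "@data"]
    for sequence, label in records:
        counts = ",".join(str(n) for n in composition(sequence))
        lines.append(counts + "," + label)
    return "\n".join(lines) + "\n"

# Made-up sequences and labels, only to show the shape of the output.
example = [("MKKA", "type1"), ("ACCA", "type2")]
print(to_arff(example))
```

Writing the returned text to a file with the .arff extension gives something Weka can open directly. The class attribute is placed last, which is where Weka looks for it by default.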
Second two weeks (27 February-13 March 2007)
We divided the search for features into two parts. Brendan is going to search the internet, and I (Koosha) am going to search the papers on membrane proteins.