Research on data mining modeling theories and their applications in bioinformatics
Shen Hong-Bin
ABSTRACT
In the past decades, large amounts of data have been accumulated through the rapid development of science, the economy, and society. How to find the valuable knowledge and rules hidden in these data is a critical problem and an active research topic in both theory and practice. At the same time, biological data have grown exponentially with the development of various biological instruments. Under such conditions, handling data of this size by conventional biological experiments alone is both very expensive and time-consuming. Bridging the gap between the volume of newly generated data and the understanding of the knowledge they contain has become a major challenge. Bioinformatics is a very young research field that tries to find the knowledge and rules behind biological data by combining information science, computer science, physics, and the life sciences; this knowledge can then be used to explain biological phenomena. Bioinformatics research is expected to speed up both life science research and drug discovery. This paper focuses on theoretical and practical research in data mining and bioinformatics.
Clustering analysis is one of the most important research areas in data mining. In the real world, we often have to deal with high-dimensional datasets in which, in most cases, different attributes contribute differently to each cluster. To address this problem, an attribute-weighted fuzzy kernel clustering algorithm is proposed. The new kernel clustering algorithm properly reflects the importance of each attribute for each cluster and hence can yield much higher clustering accuracy than conventional clustering algorithms. Another situation often encountered in the real world is that a dataset is independent of other datasets yet also cooperates with them. Based on such cooperative constraints, a new information-based collaborative clustering algorithm is proposed. This collaborative clustering algorithm takes the influence of the other datasets into account, so the corresponding clustering results are more flexible.
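The abstract does not give the update equations of the attribute-weighted fuzzy kernel clustering algorithm, so the following is only a minimal sketch of the general idea, not the author's method. It assumes a Gaussian kernel whose exponent uses per-cluster attribute weights, and the standard fuzzy c-means membership update with fuzzifier m. The function names, the parameter sigma, and the example weights are all hypothetical.

```python
import math

def kernel_distance(x, v, weights, sigma=1.0):
    """Kernel-induced squared distance 2 * (1 - K(x, v)).

    K is a Gaussian kernel whose exponent weights each attribute
    separately (assumed form, not taken from the source).
    """
    s = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, v))
    return 2.0 * (1.0 - math.exp(-s / (2.0 * sigma ** 2)))

def memberships(x, centers, cluster_weights, m=2.0, sigma=1.0):
    """Fuzzy memberships of point x in each cluster (standard FCM update, m > 1)."""
    d = [kernel_distance(x, v, w, sigma) for v, w in zip(centers, cluster_weights)]
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [i for i, di in enumerate(d) if di == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(d))]
    inv = [di ** (-1.0 / (m - 1.0)) for di in d]
    total = sum(inv)
    return [value / total for value in inv]

# Two clusters in two dimensions; cluster 0 is assumed to depend only on attribute 0.
centers = [[0.0, 0.0], [4.0, 4.0]]
x = [0.1, 3.0]
u_weighted = memberships(x, centers, [[1.0, 0.0], [0.5, 0.5]])
u_uniform = memberships(x, centers, [[0.5, 0.5], [0.5, 0.5]])
```

With the per-cluster weights, the point is assigned almost entirely to cluster 0 because the attribute on which it differs from that center is given zero weight; with uniform weights the assignment is far less decisive.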
Prediction of protein folding patterns is one level deeper than prediction of protein structural classes, and hence is much more complicated and difficult. To deal with this challenging problem, an ensemble classifier was introduced. It is formed from a set of basic classifiers, each trained in a different parameter system, such as predicted secondary structure, hydrophobicity, van der Waals volume, polarity, polarizability, and different dimensions of pseudo amino acid composition, extracted from a training dataset. Their outcomes are combined through weighted voting to give the final classification of a query protein. The recognition task is to find the true fold among 27 possible patterns. The overall success rate thus obtained was 62% on a testing dataset in which most of the proteins have less than 25% sequence identity with the proteins used to train the classifier. This rate is 6-21% higher than the corresponding rates obtained by various existing NN (neural network) and SVM (support vector machine) approaches, implying that the ensemble classifier is very promising and might become a useful vehicle in protein science, as well as in proteomics and bioinformatics.
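The weighted voting step described above can be illustrated with a minimal sketch. The abstract does not say how the weights are chosen or how ties are broken, so the weights here are simply given as inputs and a tie goes to the label that was voted for first; the function name and the example values are hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions[i] is the fold label proposed by basic classifier i,
    and weights[i] is the (assumed, externally supplied) weight of
    that classifier.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per basic classifier")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # max() keeps the first label seen when two totals are equal.
    return max(scores, key=scores.get)

# Four basic classifiers vote for fold labels 3, 7, 7 and 12.
fold = weighted_vote([3, 7, 7, 12], [0.9, 0.3, 0.4, 0.2])  # -> 3
```

In this example the single high-weight classifier outvotes the two lower-weight classifiers that agree with each other, which is the behaviour that distinguishes weighted voting from simple majority voting.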
The structural class is an important attribute used to characterize the overall folding type of a protein. Proteins often have quite similar or identical folding patterns even if they consist of very different sequences or have different biological functions. In view of this, Levitt and Chothia classified proteins into the following four structural classes: (1) all-α, (2) all-β, (3) α/β, and (4) α+β. Prediction of the protein structural class from the sequence is both an important and an attractive topic in protein science, not only because the knowledge thus obtained can provide useful information about the overall structure of a query protein, but also because the practice itself can stimulate the development of novel predictors that may be directly applied to many other relevant areas. In this paper, a novel approach, the so-called “supervised fuzzy clustering approach”, is introduced; its key feature is that it uses the class label information during the training process. Based on this approach, a set of “if-then” fuzzy rules for predicting protein structural classes is extracted from a training dataset. Tests on three different working datasets demonstrate that the overall prediction success rates obtained by the supervised fuzzy clustering approach are all higher than those obtained by the unsupervised fuzzy c-means approach introduced in previous work. It is anticipated that the current predictor may play an important complementary role to other existing predictors in this area, further strengthening the power to predict the structural classes of proteins and their other characteristic attributes.
As a “building block of life”, the cell is deemed the most basic structural and functional unit of all living organisms. According to cellular anatomy, it is highly organized, with many functional units or organelles. Most of these units are “enveloped” by one or more membranes, which are the structural basis for many important biological functions. Membrane proteins are a special group among the protein families: they account for ~30% of all proteins, yet solved membrane protein structures represent <1% of known protein structures to date. This class of proteins constitutes the majority of ion channels, transporters, and receptors in living organisms; for example, phospholamban is an integral membrane protein that regulates the Ca2+ pump in the heart. Because of their importance, membrane proteins are the targets of approximately 80% of the drugs on the market. Hence, solving the structures of membrane proteins plays a key role in modern life science research. Owing to the intrinsic structural plasticity of many of these proteins, the chance of obtaining crystals suitable for X-ray or electron diffraction studies is small. Although helical membrane proteins pose a higher degree of experimental difficulty, their conformation is, in a number of ways, more predictable than that of water-soluble proteins. In this paper, we propose a novel discrete model of protein sequences, PsePSSM, and an ensemble classifier framework to predict the topology of membrane proteins in the cell membrane. Experimental results on a stringent dataset show that the prediction accuracy for membrane protein topology over the 8 classes is more than 85%, which is about 30% higher than that of conventional methods.
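The abstract names PsePSSM without defining it. The sketch below assumes one formulation commonly described in the literature: the column averages of a normalized position-specific scoring matrix (PSSM), followed by, for each lag, the mean squared difference within each column between residues separated by that lag. The function name, the parameter max_lag, and the toy matrix are hypothetical, and a real PSSM would have 20 columns and would be normalized first.

```python
def pse_pssm(pssm, max_lag=1):
    """Map an L x n PSSM (list of rows) to a vector of n * (1 + max_lag) values.

    The output length does not depend on the sequence length L, which is
    what makes this a discrete model usable by standard classifiers.
    """
    length = len(pssm)
    if length <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_cols = len(pssm[0])
    # First n components: average score of each column over the sequence.
    features = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    # Remaining components: sequence-order correlation at each lag.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - pssm[i + lag][j]) ** 2 for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features

# Toy 3-residue, 2-column matrix, used only to keep the arithmetic checkable.
toy = [[1.0, 0.0], [3.0, 0.0], [5.0, 3.0]]
vector = pse_pssm(toy, max_lag=1)  # -> [3.0, 1.0, 4.0, 4.5]
```

The lag terms retain some sequence-order information that the plain column averages discard, while still producing a fixed-length vector for proteins of any length.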
Knowledge of where a protein is located in the cell is closely related to its function. Even when the functional characteristics of a protein are known, it is still critical to know where in the cell the protein functions. One of the fundamental goals in molecular cell biology and proteomics is to identify the subcellular locations or environments of proteins, because the function of a protein and its role in a cell are closely correlated with the compartment or organelle in which it resides. The number of known protein sequences has also grown rapidly: in 1986 the SWISS-PROT databank contained only 3,939 protein sequence entries, whereas the version released in June 2006 at http://www.ebi.ac.uk/swissprot/ contained 223,100, more than 56 times the 1986 number. With the avalanche of protein sequences generated in the post-genomic era, it is highly desirable to develop an automated method for fast and reliable annotation of the subcellular locations of uncharacterized proteins. The knowledge thus obtained can help us make timely use of these newly found protein sequences for both basic research and drug discovery. In this paper, (a) we propose, for the first time in the literature, a prediction algorithm that handles the dynamic feature that proteins may simultaneously exist at, or move between, two or more different subcellular locations, i.e. a model that can deal with proteins with multiple subcellular location sites; (b) we propose, for the first time, a model for the protein sub-subcellular location problem, i.e. predicting protein subnuclear locations; (c) we extend, for the first time, the prediction scope to cover 22 subcellular locations, which greatly improves the practical value of the computational models. At the same time, we also propose a novel combination of “high-level” gene ontology features with “ab initio” sequence features to predict protein subcellular locations, and we introduce the “organism-specific” idea into the development of protein subcellular location prediction models. Experimental results on stringent datasets show that the performance of the new models proposed in this paper is 35% higher than that of conventional methods. All of this work has been accepted and used by other researchers internationally.
During this research, we constructed 15 online bioinformatics servers at http://www.csbio.sjtu.edu.cn/bioinf/, to which biologists all over the world can easily submit their biological data and obtain an immediate response. According to the usage statistics, these web servers have been accessed and used more than 1,100,000 times, indicating that they are genuinely useful in life science research. Furthermore, many computed outputs from these web servers have already been published by other biologists. We believe that such user-friendly online web servers will play important roles in modern life science research and drug discovery.
Key words: Data mining, Clustering analysis, Bioinformatics, Machine learning, Information theory, Evidence theory, Ensemble classifier, Protein structure prediction, Protein subcellular location prediction, Membrane protein type recognition, Cellular network, Protein evolution theory