2009 National Excellent Doctoral Dissertation: Research on Data Mining Modeling and Its Applications in Bioinformatics

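Author: Shen Hong-Bin
Dissertation title: Research on Data Mining Modeling and Its Applications in Bioinformatics
About the author: Shen Hong-Bin, male, born in August 1979. He began his doctoral studies under Professor Yang Jie at Shanghai Jiao Tong University in April 2004 and received his Ph.D. in March 2007.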

Abstract (translated from Chinese)
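With the rapid development of science and technology, the economy and society have advanced greatly, and every field now produces large volumes of data. How to discover valuable knowledge and regularities in these data has become both a focus and a difficulty of theoretical and applied research. Meanwhile, rapid progress in the life sciences has also generated large amounts of biological data. Traditional biological experiments alone can hardly process so much data quickly and comprehensively, which inevitably holds back the life sciences and pharmaceutical engineering. Bioinformatics emerged in response. It is a young discipline at the intersection of biology and information science that applies the theories and methods of informatics, physics, chemistry, mathematics, computer science, and systems science to study the information content and information flow of biological systems and processes. It seeks regularities and knowledge in existing data, uses them to guide and interpret biological experiments and life phenomena, and thereby speeds up our understanding of the essential features of life. This dissertation carries out an in-depth study of the theory and methods of data mining and bioinformatics.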
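Cluster analysis is an important part of data mining research and has become an important tool across disciplines. In practice, however, one often has to handle high-dimensional datasets, and in most cases the attributes of such datasets are unbalanced across clusters, that is, they matter differently to different clusters. To address this, the dissertation proposes an attribute-weighted kernel clustering algorithm that operates in the kernel feature space. Experiments show that the new algorithm reflects the importance of each attribute to each cluster well and therefore gives better results than traditional clustering algorithms. Traditional clustering algorithms are usually applied to a single, isolated dataset, yet in many situations a dataset is related to other datasets. Based on information theory, the dissertation proposes, for the first time, a collaborative clustering algorithm that captures the interaction between datasets; the results show that the clustering outcome is influenced by the other datasets. The convergence of both algorithms is also proved theoretically.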
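Protein fold is a deeper level of knowledge than protein three-dimensional structure and is therefore a harder research problem. At the same time, predicting the fold type of a protein from its sequence provides very valuable information for predicting its three-dimensional structure. Starting from the complexity of biological systems, the dissertation proposes a protein fold prediction system built on an ensemble classifier framework. The system fuses sequence information sources and features from several biological perspectives to reach a combined decision. The results show that the ensemble system is very effective, raising fold prediction accuracy by 6-21%.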
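The three-dimensional structure of a protein is an important attribute that characterizes its overall folding type. Even when proteins differ in sequence or in function, their folding types or structural types may be similar. In view of this, Levitt and Chothia classified proteins into four structural classes: (1) all-α, (2) all-β, (3) α/β, and (4) α+β. Predicting the structural class of a protein from its sequence is an important research topic in protein science. This dissertation is the first to combine a supervised clustering algorithm with fuzzy-system learning for predicting protein tertiary structure, which improves prediction accuracy. The work introduces fuzzy-system learning into protein structure prediction for the first time and opens up new directions for further bioinformatics research.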
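Membrane proteins are a very important group of proteins. They account for about one third of all human proteins, yet membrane protein structures solved so far account for only about 1%. One of the main functions of membrane proteins is to act as ion channels; cognition, sensation, and emotion all arise from the continual opening and closing of these channels, so the importance of membrane proteins to the human body is self-evident. For example, the ion-channel protein phospholamban plays an important role in heart function. The vast majority of diseases are caused by a deficiency of a particular membrane protein, and 80% of the drugs currently on the market target membrane proteins. Studying the sequence features and three-dimensional structures of membrane proteins is therefore important for understanding their function and has become a hot topic in structural biology. At the same time, because membrane proteins are insoluble in water, determining their structures experimentally is very difficult, which poses both a challenge and a new problem: predicting membrane protein topology from sequence by computational methods. This dissertation proposes a novel discrete model, PsePSSM, based on an ensemble classifier model and the evolutionary information of protein sequences, together with a protein attribute prediction framework that fuses sequence functional-domain features with PsePSSM features. The framework was applied successfully to predicting membrane protein topology and enzyme functional families. The new model reaches an accuracy above 85% on eight classes of membrane protein topology, about 30% higher than traditional methods.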
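Where a protein is located in the cell is closely related to its function; even when a protein's function is known, it is still important to know where in the cell it carries out that function. For example, the nucleus contains the cell's genetic material, DNA, and controls the cell's activities. With the successful completion of the Human Genome Project, the number of newly discovered proteins has grown exponentially. According to statistics from the international protein database UniProtKB/Swiss-Prot, the number of protein entries reached 223,100 in June 2006, more than 56 times the 1986 figure. At this rate of growth, determining subcellular locations by experiment alone is nearly impossible, and there is an urgent need for bioinformatics methods that build on existing knowledge to predict the subcellular locations of new proteins, in support of faster life science research and pharmaceutical engineering. This dissertation is the first internationally to propose and explore (a) prediction models for proteins that occur at multiple locations in the cell, and (b) prediction models for the location of proteins within the nucleus, i.e. "sub-subcellular location prediction models", which have been recognized by the international academic community. (c) It is also the first to extend subcellular localization prediction to cover 22 subcellular locations, which greatly increases the practical value of the prediction models. It further proposes a prediction method that fuses high-level Gene Ontology features of protein sequences with amino acid features of the sequences themselves, as well as a new approach to organism-specific subcellular localization prediction. The results show that, on stringent datasets, the new algorithms achieve prediction accuracy more than 35% higher than traditional algorithms, and the resulting tools have been widely used in biological experiments.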
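To promote the application of these theoretical results, we also built 15 online bioinformatics web servers at http://www.csbio.sjtu.edu.cn/bioinf/. Biologists in related fields anywhere in the world can submit biological data over the Internet and receive results computed immediately by the site. By incomplete count, the servers have been used more than 1,100,000 times, which has greatly advanced the practical application of bioinformatics theory. Many biologists worldwide have used data computed with our platforms in their published papers to validate their experimental results, and the platforms have been well received.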

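Keywords: data mining, cluster analysis, bioinformatics, machine learning, information theory, evidence theory, ensemble classifier, protein structure prediction, protein subcellular location prediction, membrane protein recognition, cellular network, protein evolution theory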

nanafly

Level: Chief moderator

First reply, posted: 2009-10-10

Research on Data Mining Modeling Theories and Their Applications in Bioinformatics
Shen Hong-Bin
ABSTRACT
Over the past decades, rapid development in science, the economy, and society has produced large amounts of data. How to find valuable knowledge and rules in these data is a critical problem and a hot topic in both theoretical and applied research. At the same time, biological data have grown exponentially with the development of various biological instruments. Handling data on this scale with conventional biological experiments alone is both very expensive and time-consuming, and bridging the gap between the volume of newly generated data and our understanding of the knowledge they contain has become a major challenge. Bioinformatics is a young research field that tries to find the knowledge and rules behind biological data by combining information science, computer science, and physics with knowledge from the life sciences; these findings can in turn be used to explain life phenomena. Bioinformatics research is expected to speed up both life science research and drug discovery. This dissertation focuses on theoretical and applied research in data mining and bioinformatics.
Cluster analysis is one of the most important research areas in data mining. In practice we often have to deal with high-dimensional datasets in which, in most cases, different attributes contribute differently to each cluster. To address this problem, an attribute-weighted fuzzy kernel clustering algorithm is proposed. The new kernel clustering algorithm properly reflects the importance of each attribute to each cluster and can therefore yield much higher clustering accuracy than conventional clustering algorithms. Another situation we often encounter is that a dataset is independent of other datasets yet also cooperates with them. Based on such cooperative constraints, a new information-theory-based collaborative clustering algorithm is proposed. It takes the influence of the other datasets into account, so the corresponding clustering results are more flexible.
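To make the idea of per-cluster attribute weights concrete, here is a minimal sketch. It is not the dissertation's algorithm: it works in the input space rather than a kernel feature space, and the objective, the update rules, the exponent `beta`, and the function name `attribute_weighted_fcm` are all assumptions chosen for illustration. Each cluster keeps its own weight vector over the attributes, and attributes with small within-cluster scatter receive larger weights.

```python
import numpy as np

def attribute_weighted_fcm(x, init_centers, m=2.0, beta=2.0, n_iter=50):
    """Fuzzy c-means with one attribute-weight vector per cluster (illustrative).

    Minimizes sum_i sum_k u_ik^m * sum_j w_kj^beta * (x_ij - v_kj)^2
    subject to sum_k u_ik = 1 and sum_j w_kj = 1 (requires m > 1, beta > 1).
    """
    x = np.asarray(x, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    n_clusters, n_attrs = centers.shape
    weights = np.full((n_clusters, n_attrs), 1.0 / n_attrs)
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # Weighted squared distance of every sample to every center: shape (n, c).
        sq = (x[:, None, :] - centers[None, :, :]) ** 2
        dist = (sq * weights[None, :, :] ** beta).sum(axis=2) + eps
        # Membership update (standard fuzzy c-means form).
        inv = dist ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        um = u ** m
        # Center update: membership-weighted mean of the samples.
        centers = (um.T @ x) / um.sum(axis=0)[:, None]
        # Attribute-weight update: low within-cluster scatter gives high weight.
        sq = (x[:, None, :] - centers[None, :, :]) ** 2
        scatter = (um[:, :, None] * sq).sum(axis=0) + eps
        inv_s = scatter ** (-1.0 / (beta - 1.0))
        weights = inv_s / inv_s.sum(axis=1, keepdims=True)
    return u, centers, weights
```

On data whose clusters separate along one attribute while another attribute is mostly noise, the returned `weights` should concentrate on the separating attribute. The initial centers are supplied by the caller because this kind of alternating update can settle in poor local minima from a bad start.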
Prediction of protein folding patterns is one level deeper than prediction of protein structural classes, and hence is much more complicated and difficult. To deal with this challenging problem, an ensemble classifier was introduced. It is formed from a set of basic classifiers, each trained on a different parameter system extracted from a training dataset, such as predicted secondary structure, hydrophobicity, van der Waals volume, polarity, polarizability, and pseudo amino acid composition of different dimensions. Their outcomes are combined through weighted voting to give the final classification of a query protein. The task was to identify the true fold among 27 possible patterns. The overall success rate was 62% on a testing dataset in which most proteins have less than 25% sequence identity with the proteins used to train the classifier. This rate is 6-21% higher than the corresponding rates obtained by various existing NN (neural network) and SVM (support vector machine) approaches, which suggests that the ensemble classifier is very promising and may become a useful vehicle in protein science, as well as in proteomics and bioinformatics.
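The weighted-voting step can be sketched as follows. This is only an illustration of the general technique: the abstract does not say how the weights were chosen, so the weights, the fold labels, and the function name `weighted_vote` below are hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine class labels from several base classifiers by weighted voting."""
    if len(predictions) != len(weights):
        raise ValueError("one weight per base classifier is required")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # The label with the highest total weight wins; ties go to the smaller label
    # so that the result is deterministic.
    return min(scores, key=lambda label: (-scores[label], label))

# Hypothetical example: five base classifiers voting among fold labels 1..27.
votes = [3, 7, 3, 12, 7]
weights = [0.30, 0.25, 0.15, 0.20, 0.10]
print(weighted_vote(votes, weights))
```

Because votes are weighted, a single trusted classifier can outvote several weaker ones, which is the main difference from plain majority voting.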
The structural class is an important attribute used to characterize the overall folding type of a protein. Proteins often have quite similar or identical folding patterns even if they consist of very different sequences or have different biological functions. In view of this, Levitt and Chothia classified proteins into the following four structural classes: (1) all-α, (2) all-β, (3) α/β, and (4) α+β. Predicting the structural class of a protein from its sequence is both an important and a tempting topic in protein science, not only because the knowledge obtained provides useful information about the overall structure of a query protein, but also because the practice itself can stimulate the development of novel predictors that may be applied directly to many other relevant areas. In this dissertation, a novel approach, the so-called "supervised fuzzy clustering approach", is introduced; its distinguishing feature is that it uses the class label information during training. Based on this approach, a set of "if-then" fuzzy rules for predicting protein structural classes is extracted from a training dataset. Tests on three different working datasets show that the overall prediction success rates of the supervised fuzzy clustering approach are all higher than those of the unsupervised fuzzy c-means introduced by a previous investigator. The current predictor is expected to play an important complementary role to other existing predictors in this area, further strengthening the prediction of protein structural classes and other characteristic attributes.
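The following sketch illustrates what label-aware "if-then" fuzzy rules can look like. It is a much simpler stand-in for the supervised fuzzy clustering approach described above, not a reproduction of it: it builds one Gaussian rule per class from the labeled training samples, whereas the actual method and its rule-extraction procedure are not specified in the abstract. The feature values and the function names are hypothetical.

```python
import numpy as np

def fit_fuzzy_rules(x, y):
    """Build one rule per class: IF x is near `center` THEN class is `label`."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    rules = []
    for label in np.unique(y):
        members = x[y == label]               # class labels guide the grouping
        center = members.mean(axis=0)
        width = members.std(axis=0) + 1e-6    # avoid zero-width memberships
        rules.append((label, center, width))
    return rules

def rule_memberships(rules, sample):
    """Gaussian membership (firing strength) of one sample in every rule."""
    sample = np.asarray(sample, dtype=float)
    return {
        label: float(np.exp(-0.5 * np.sum(((sample - center) / width) ** 2)))
        for label, center, width in rules
    }

def predict(rules, sample):
    """Return the class of the rule with the highest firing strength."""
    strengths = rule_memberships(rules, sample)
    return max(strengths, key=strengths.get)

# Hypothetical two-feature inputs, e.g. predicted helix and strand fractions.
train_x = [[0.6, 0.1], [0.7, 0.2], [0.1, 0.6], [0.2, 0.7]]
train_y = ["all-alpha", "all-alpha", "all-beta", "all-beta"]
rules = fit_fuzzy_rules(train_x, train_y)
print(predict(rules, [0.65, 0.15]))
```

The firing strengths returned by `rule_memberships` are graded rather than all-or-nothing, which is what distinguishes fuzzy rules from crisp decision rules.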
As a "building block of life", the cell is deemed the most basic structural and functional unit of all living organisms. It is highly organized, with many functional units or organelles according to the cellular anatomy. Most of these units are "enveloped" by one or more membranes, which are the structural basis for many important biological functions. Membrane proteins are a special group of proteins that account for ~30% of all proteins, yet solved membrane protein structures represent <1% of known protein structures to date. This class of proteins constitutes the majority of ion channels, transporters, and receptors in living organisms; for example, phospholamban is an integral membrane protein that regulates the Ca2+ pump in the heart. Because of their importance, membrane proteins are the targets of approximately 80% of the drugs on the market. Hence, solving the structures of membrane proteins plays a key role in modern life science research. Owing to the intrinsic structural plasticity of many of these proteins, the chance of obtaining crystals suitable for X-ray or electron diffraction studies is small. Although helical membrane proteins pose a higher degree of experimental difficulty, their conformation is, in a number of ways, more predictable than that of water-soluble proteins. In this dissertation, we propose a novel discrete model of protein sequences, PsePSSM, and an ensemble classifier framework to predict the topology of membrane proteins in the cell membrane. Experimental results on a stringent dataset show that the prediction accuracy over the 8 classes of membrane protein topology is more than 85%, about 30% higher than that of conventional methods.
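The abstract does not give the PsePSSM formula, so the sketch below follows the form in which the descriptor is commonly described in the literature and should be read as an assumption: each row of an L x 20 position-specific scoring matrix (PSSM) is standardized over its 20 scores, the column means give 20 features, and for each lag up to `max_lag` the mean squared difference between rows that are `lag` positions apart gives 20 more. The function name `pse_pssm` and the parameter `max_lag` are illustrative. The point of the construction is that proteins of any length map to a vector of the same size, 20 + 20 * `max_lag`.

```python
import numpy as np

def pse_pssm(pssm, max_lag):
    """Map an L x 20 PSSM to a fixed-length vector of size 20 + 20 * max_lag."""
    pssm = np.asarray(pssm, dtype=float)
    if pssm.ndim != 2 or pssm.shape[1] != 20:
        raise ValueError("expected an L x 20 matrix")
    length = pssm.shape[0]
    if not 0 <= max_lag < length:
        raise ValueError("max_lag must be non-negative and smaller than L")
    # Standardize each row (sequence position) over its 20 amino-acid scores.
    row_std = pssm.std(axis=1, keepdims=True)
    row_std[row_std == 0] = 1.0
    norm = (pssm - pssm.mean(axis=1, keepdims=True)) / row_std
    # First 20 features: average score of each amino-acid column.
    features = [norm.mean(axis=0)]
    # Next 20 per lag: mean squared difference between positions i and i + lag,
    # which keeps some sequence-order information.
    for lag in range(1, max_lag + 1):
        diff = norm[:-lag] - norm[lag:]
        features.append((diff ** 2).mean(axis=0))
    return np.concatenate(features)
```

A fixed-length vector of this kind can be fed to any standard classifier, which is what makes it usable as an input to an ensemble of base classifiers.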
The location of a protein in the cell is closely related to its function. Even if the functional characteristics of a protein are known, it is still critical to know where in the cell the protein functions. One of the fundamental goals in molecular cell biology and proteomics is to identify the subcellular locations or environments of proteins, because the function of a protein and its role in a cell are closely correlated with the compartment or organelle in which it resides. The number of known protein sequences is growing rapidly: in 1986 the SWISS-PROT databank contained only 3,939 protein sequence entries, whereas the release of June 2006 at http://www.ebi.ac.uk/swissprot/ contained 223,100, more than 56 times the 1986 figure. With the avalanche of protein sequences generated in the post-genomic era, an automated method for fast and reliable annotation of the subcellular locations of uncharacterized proteins is highly desirable. The knowledge thus obtained can help us make timely use of these newly found protein sequences for both basic research and drug discovery. In this dissertation, (a) we propose, for the first time in the literature, a prediction algorithm that handles proteins that may simultaneously exist at, or move between, two or more subcellular locations, i.e. a model that can deal with proteins with multiple subcellular location sites; (b) we propose the first model for the protein sub-subcellular location problem, i.e. predicting protein subnuclear locations; (c) we extend the prediction scope, for the first time, to cover 22 subcellular locations, which greatly improves the practical value of the computational models. We also propose combining "high-level" gene ontology features with "ab initio" sequence features to predict protein subcellular locations.
We further propose "organism-specific" protein subcellular location prediction models. Experimental results on stringent datasets show that the performance of the new models proposed in this dissertation is 35% higher than that of conventional methods. All of this work has been accepted and used by researchers internationally.
In the course of this research, we constructed 15 online bioinformatics servers at http://www.csbio.sjtu.edu.cn/bioinf/. Biologists all over the world can easily submit their biological data to these servers and obtain an immediate response. According to the usage statistics, the web servers have been accessed and used more than 1,100,000 times, indicating that they are genuinely useful in life science research. Furthermore, many results calculated by these web servers have already been published by other biologists. We believe that such user-friendly online web servers will play an important role in modern life science research and drug discovery.


Key words: Data mining, Cluster analysis, Bioinformatics, Machine learning, Information theory, Evidence theory, Ensemble classifier, Protein structure prediction, Protein subcellular location prediction, Membrane protein type recognition, Cellular network, Protein evolution theory
