Letters in Organic Chemistry - Volume 16, Issue 4, 2019
Volume 16, Issue 4, 2019
-
-
Application of Machine Learning Techniques to Predict Protein Phosphorylation Sites
Authors: Shengli Zhang, Xian Li, Chengcheng Fan, Zhehui Wu and Qian LiuProtein phosphorylation is one of the most important post-translational modifications of proteins. Almost all processes that regulate the life activities of an organism as well as almost all physiological and pathological processes are involved in protein phosphorylation. In this paper, we summarize specific implementation and application of the methods used in protein phosphorylation site prediction such as the support vector machine algorithm, random forest, Jensen-Shannon divergence combined with quadratic discriminant analysis, Adaboost algorithm, increment of diversity with quadratic discriminant analysis, modified CKSAAP algorithm, Bayes classifier combined with phosphorylation sequences enrichment analysis, least absolute shrinkage and selection operator, stochastic search variable selection, partial least squares and deep learning. On the basis of this prediction, we use k-nearest neighbor algorithm with BLOSUM80 matrix method to predict phosphorylation sites. Firstly, we construct dataset and remove the redundant set of positive and negative samples, that is, removal of protein sequences with similarity of more than 30%. Next, the proposed method is evaluated by sensitivity (Sn), specificity (Sp), accuracy (ACC) and Mathew’s correlation coefficient (MCC) these four metrics. Finally, tenfold cross-validation is employed to evaluate this method. The result, which is verified by tenfold cross-validation, shows that the average values of Sn, Sp, ACC and MCC of three types of amino acid (serine, threonine, and tyrosine) are 90.44%, 86.95%, 88.74% and 0.7742, respectively. A comparison with the predictive performance of PhosphoSVM and Musite reveals that the prediction performance of the proposed method is better, and it has the advantages of simplicity, practicality and low time complexity in classification.
-
-
-
Identification of Mitochondrial Proteins of Malaria Parasite Adding the New Parameter
Authors: Feng Yonge and Xie WeixiaMalaria has been one of the serious infectious diseases caused by Plasmodium falciparum (P. falciparum). Mitochondrial proteins of P. falciparum are regarded as effective drug targets against malaria. Thus, it is necessary to accurately identify mitochondrial proteins of malaria parasite. Many algorithms have been proposed for the prediction of mitochondrial proteins of malaria parasite and yielded the better results. However, the parameters used by these methods were primarily based on amino acid sequences. In this study, we added a novel parameter for predicting mitochondrial proteins of malaria parasite based on protein secondary structure. Firstly, we extracted three feature parameters, namely, three kinds of protein secondary structures compositions (3PSS), 20 amino acid compositions (20AAC) and 400 dipeptide compositions (400DC), and used the analysis of variance (ANOVA) to screen 400 dipeptides. Secondly, we adopted these features to predict mitochondrial proteins of malaria parasite by using support vector machine (SVM). Finally, we found that 1) adding the feature of protein secondary structure (3PSS) can indeed improve the prediction accuracy. This result demonstrated that the parameter of protein secondary structure is a valid feature in the prediction of mitochondrial proteins of malaria parasite; 2) feature combination can improve the prediction’s results; feature selection can reduce the dimension and simplify the calculation. We achieved the sensitivity (Sn) of 98.16%, the specificity (Sp) of 97.64% and overall accuracy (Acc) of 97.88% with 0.957 of Mathew’s correlation coefficient (MCC) by using 3PSS+ 20AAC+ 34DC as a feature in 15-fold cross-validation. This result is compared with that of the similar work in the same dataset, showing the superiority of our work.
-
-
-
Prediction of Protein-Protein Interaction Based on Weighted Feature Fusion
Authors: Chunhua Zhang, Sijia Guo, Jingbo Zhang, Xizi Jin, Yanwen Li, Ning Du, Pingping Sun and Baohua JiangProtein-protein interactions play an important role in biological and cellular processes. Biochemistry experiment is the most reliable approach identifying protein-protein interactions, but it is time-consuming and expensive. It is one of the important reasons why there is only a little fraction of complete protein-protein interactions networks available by far. Hence, accurate computational methods are in a great need to predict protein-protein interactions. In this work, we proposed a new weighted feature fusion algorithm for protein-protein interactions prediction, which extracts both protein sequence feature and evolutionary feature, for the purpose to use both global and local information to identify protein-protein interactions. The method employs maximum margin criterion for feature selection and support vector machine for classification. Experimental results on 11188 protein pairs showed that our method had better performance and robustness. Performed on the independent database of Helicobacter pylori, the method achieved 99.59% sensitivity and 93.66% prediction accuracy, while the maximum margin criterion is 88.03%. The results indicated that our method was more efficient in predicting protein-protein interaction compared with other six state-of-the-art peer methods.
-
-
-
Prediction of Acetylation and Succinylation in Proteins Based on Multilabel Learning RankSVM
Authors: Yan Xu, Yingxi Yang, Zu Wang and Yuanhai ShaoIn vivo, one of the most efficient biological mechanisms for expanding the genetic code and regulating cellular physiology is protein post-translational modification (PTM). Because PTM can provide very useful information for both basic research and drug development, identification of PTM sites in proteins has become a very important topic in bioinformatics. Lysine residue in protein can be subjected to many types of PTMs, such as acetylation, succinylation, methylation and propionylation and so on. In order to deal with the huge protein sequences, the present study is devoted to developing computational techniques that can be used to predict the multiple K-type modifications of any uncharacterized protein timely and effectively. In this work, we proposed a method which could deal with the acetylation and succinylation prediction in a multilabel learning. Three feature constructions including sequences and physicochemical properties have been applied. The multilabel learning algorithm RankSVM has been first used in PTMs. In 10-fold cross-validation the predictor with physicochemical properties encoding got accuracy 73.86%, abslute-true 64.70%, respectively. They were better than the other feature constructions. We compared with other multilabel algorithms and the existing predictor iPTM-Lys. The results of our predictor were better than other methods. Meanwhile we also analyzed the acetylation and succinylation peptides which could illustrate the results.
-
-
-
Prediction of Nitrosocysteine Sites Using Position and Composition Variant Features
Authors: Yaser D. Khan, Aroosa Batool, Nouman Rasool, Sher Afzal Khan and Kuo-Chen ChouS-nitrosylation is one of the most prominent posttranslational modification among proteins. It involves the addition of nitrogen oxide group to cysteine thiols forming S-nitrosocysteine. Evidence suggests that S-nitrosylation plays a foremost role in numerous human diseases and disorders. The incorporation of techniques for robust identification of S-nitrosylated proteins is highly anticipated in biological research and drug discovery. The proposed system endeavors a novel strategy based on a statistical and computational intelligent methods for the identification of S-nitrosocystiene sites within a given primary protein sequence. For this purpose, 5-step rule was approached comprising of benchmark dataset creation, mathematical modelling, prediction, evaluation and web-server development. For position relative feature extraction, statistical moments were used and a multilayer neural network was trained adapting Gradient Descent and Adaptive Learning algorithms. The results were comparatively analyzed with existing techniques using benchmark datasets. It is inferred through conclusive experimentation that the proposed scheme is very propitious, accurate and exceptionally effective for the prediction of S-nitrosocystiene in protein sequences.
-
-
-
iAFP-gap-SMOTE: An Efficient Feature Extraction Scheme Gapped Dipeptide Composition is Coupled with an Oversampling Technique for Identification of Antifreeze Proteins
Authors: Shahid Akbar, Maqsood Hayat, Muhammad Kabir and Muhammad IqbalAntifreeze proteins (AFPs) perform distinguishable roles in maintaining homeostatic conditions of living organisms and protect their cell and body from freezing in extremely cold conditions. Owing to high diversity in protein sequences and structures, the discrimination of AFPs from non- AFPs through experimental approaches is expensive and lengthy. It is, therefore, vastly desirable to propose a computational intelligent and high throughput model that truly reflects AFPs quickly and accurately. In a sequel, a new predictor called “iAFP-gap-SMOTE” is proposed for the identification of AFPs. Protein sequences are expressed by adopting three numerical feature extraction schemes namely; Split Amino Acid Composition, G-gap di-peptide Composition and Reduce Amino Acid alphabet composition. Usually, classification hypothesis biased towards majority class in case of the imbalanced dataset. Oversampling technique Synthetic Minority Over-sampling Technique is employed in order to increase the instances of the lower class and control the biasness. 10-fold cross-validation test is applied to appraise the success rates of “iAFP-gap-SMOTE” model. After the empirical investigation, “iAFP-gap-SMOTE” model obtained 95.02% accuracy. The comparison suggested that the accuracy of” iAFP-gap-SMOTE” model is higher than that of the present techniques in the literature so far. It is greatly recommended that our proposed model “iAFP-gap-SMOTE” might be helpful for the research community and academia.
-
-
-
An Epidemic Avian Influenza Prediction Model Based on Google Trends
Authors: Yi Lu, Shuo Wang, Jianying Wang, Guangya Zhou, Qiang Zhang, Xiang Zhou, Bing Niu, Qin Chen and Kuo-Chen ChouThe occurrence of epidemic avian influenza (EAI) not only hinders the development of a country's agricultural economy, but also seriously affects human beings’ life. Recently, the information collected from Google Trends has been increasingly used to predict various epidemics. In this study, using the relevant keywords in Google Trends as well as the multiple linear regression approach, a model was developed to predict the occurrence of epidemic avian influenza. It was demonstrated by rigorous cross-validations that the success rates achieved by the new model were quite high, indicating the predictor will become a very useful tool for hospitals and health providers.
-
-
-
Quantitative Structure-activity Relationship of Acetylcholinesterase Inhibitors based on mRMR Combined with Support Vector Regression
Authors: Jiaxiang Wu, Guozhao Mai, Bowen Deng, Jeong Younseo, Dongsu Du, Fuxue Chen and Qiaorong MaIn this work, support vector regression (SVR), an effective machine learning method, proposed by Vapnik was applied to establish QSAR model for a series of AchEI. Fourteen descriptors were selected for constructing the SVR mode by using mRMR-Forward feature selection method. The parameters (, C) were adjusted by leave-one-out cross validation (LOOCV) method which was used to judge the predictive power of different models. After optimization, one optimal SVR-QSAR model was attained, and the mean relative errors (MRE) of LOOCV by using SVR is 1.72%. As a result, LogP negatively affected the activity, Refractivity and Water Accessible Surface Area positively affected the activity.
-
-
-
Protein Structural Class Prediction Based on Distance-related Statistical Features from Graphical Representation of Predicted Secondary Structure
Authors: Liang Kong, Lichao Zhang, Xiaodong Han and Jinfeng LvProtein structural class prediction is beneficial to protein structure and function analysis. Exploring good feature representation is a key step for this prediction task. Prior works have demonstrated the effectiveness of the secondary structure based feature extraction methods especially for lowsimilarity protein sequences. However, the prediction accuracies still remain limited. To explore the potential of secondary structure information, a novel feature extraction method based on a generalized chaos game representation of predicted secondary structure is proposed. Each protein sequence is converted into a 20-dimensional distance-related statistical feature vector to characterize the distribution of secondary structure elements and segments. The feature vectors are then fed into a support vector machine classifier to predict the protein structural class. Our experiments on three widely used lowsimilarity benchmark datasets (25PDB, 1189 and 640) show that the proposed method achieves superior performance to the state-of-the-art methods. It is anticipated that our method could be extended to other graphical representations of protein sequence and be helpful in future protein research.
-
-
-
Combining Support Vector Machine with Dual g-gap Dipeptides to Discriminate between Acidic and Alkaline Enzymes
Authors: Xianfang Wang, Hongfei Li, Peng Gao, Yifeng Liu and Wenjing ZengThe catalytic activity of the enzyme is different from that of the inorganic catalyst. In a high-temperature, over-acid or over-alkaline environment, the structure of the enzyme is destroyed and then loses its activity. Although the biochemistry experiments can measure the optimal PH environment of the enzyme, these methods are inefficient and costly. In order to solve these problems, computational model could be established to determine the optimal acidic or alkaline environment of the enzyme. Firstly, in this paper, we introduced a new feature called dual g-gap dipeptide composition to formulate enzyme samples. Subsequently, the best feature was selected by using the F value calculated from analysis of variance. Finally, support vector machine was utilized to build prediction model for distinguishing acidic from alkaline enzyme. The overall accuracy of 95.9% was achieved with Jackknife cross-validation, which indicates that our method is professional and efficient in terms of acid and alkaline enzyme predictions. The feature proposed in this paper could also be applied in other fields of bioinformatics.
-
-
-
Identification of Phage Virion Proteins by Using the g-gap Tripeptide Composition
Authors: Liangwei Yang, Hui Gao, Zhen Liu and Lixia TangPhages are widely distributed in locations populated by bacterial hosts. Phage proteins can be divided into two main categories, that is, virion and non-virion proteins with different functions. In practice, people mainly use phage virion proteins to clarify the lysis mechanism of bacterial cells and develop new antibacterial drugs. Accurate identification of phage virion proteins is therefore essential to understanding the phage lysis mechanism. Although some computational methods have been focused on identifying virion proteins, the result is not satisfying which gives more room for improvement. In this study, a new sequence-based method was proposed to identify phage virion proteins using g-gap tripeptide composition. In this approach, the protein features were firstly extracted from the ggap tripeptide composition. Subsequently, we obtained an optimal feature subset by performing incremental feature selection (IFS) with information gain. Finally, the support vector machine (SVM) was used as the classifier to discriminate virion proteins from non-virion proteins. In 10-fold crossvalidation test, our proposed method achieved an accuracy of 97.40% with AUC of 0.9958, which outperforms state-of-the-art methods. The result reveals that our proposed method could be a promising method in the work of phage virion proteins identification.
-
-
-
Multidimensional Integration Analysis of Autophagy-related Modules in Colorectal Cancer
Authors: Yang Zhang, Zheng Zhang, Dong Wang, Jianzhen Xu, Yanhui Li, Hong Wang, Jin Li, Shaowen Mo, Yuncong Zhang, Yunqing Lin, Xiuzhao Fan, Enmin Li, Jian Huang, Huihui Fan and Ying YiColorectal cancer (CRC) is a common malignant tumor of the digestive tract occurring in the colon, which mainly divided into adenocarcinoma, mucinous adenocarcinoma, and undifferentiated carcinoma. However, autophagy is related to the occurrence and development of various kinds of human diseases such as cancer. There is little research on the relationship between CRC and autophagy. Hence, we performed multidimensional integration analysis to systematically explore potential relationship between autophagy and CRC. Based on gene expression datasets of colon adenocarcinoma (COAD) and protein-protein interactions (PPIs), we first identified 12 autophagy-related modules in COAD using WGCNA. Then, 9 module pairs which with significantly crosstalk were deciphered, a total of 6 functional modules. Autophagy-related genes in these modules were closely related with CRC, emphasizing that the important role of autophagy-related genes in CRC, including PPP2CA and EIF4E, etc. In addition to, by integrating transcription factor (TF)-target and RNA-associated interactions, a regulation network was constructed, in which 42 TFs (including SMAD3 and TP53, etc.) and 20 miRNAs (including miR-20 and miR-30a, etc.) were identified as pivot regulators. Pivot TFs were mainly involved in cell cycle, cell proliferation and pathways in cancer. And pivot miRNAs were demonstrated associated with CRC. It suggests that these pivot regulators might be have an effect on the development of CRC by regulating autophagy. In a word, our results suggested that multidimensional integration strategy provides a novel approach to discover potential relationships between autophagy and CRC, and further improves our understanding of autophagy and tumor in human.
-
-
-
iAI-DSAE: A Computational Method for Adenosine to Inosine Editing Site Prediction
Authors: Zhao-Chun Xu, Xuan Xiao, Wang-Ren Qiu, Peng Wang and Xin-Zhu FangAs an important post-transcriptional modification, adenosine-to-inosine RNA editing generally occurs in both coding and noncoding RNA transcripts in which adenosines are converted to inosines. Accordingly, the diversification of the transcriptome can be resulted in by this modification. It is significant to accurately identify adenosine-to-inosine editing sites for further understanding their biological functions. Currently, the adenosine-to-inosine editing sites would be determined by experimental methods, unfortunately, it may be costly and time consuming. Furthermore, there are only a few existing computational prediction models in this field. Therefore, the work in this study is starting to develop other computational methods to address these problems. Given an uncharacterized RNA sequence that contains many adenosine resides, can we identify which one of them can be converted to inosine, and which one cannot? To deal with this problem, a novel predictor called iAI-DSAE is proposed in the current study. In fact, there are two key issues to address: one is ‘what feature extraction methods should be adopted to formulate the given sample sequence?’ The other is ‘what classification algorithms should be used to construct the classification model?’ For the former, a 540-dimensional feature vector is extracted to formulate the sample sequence by dinucleotide-based auto-cross covariance, pseudo dinucleotide composition, and nucleotide density methods. For the latter, we use the present more popular method i.e. deep spare autoencoder to construct the classification model. Generally, ACC and MCC are considered as the two of the most important performance indicators of a predictor. In this study, in comparison with those of predictor PAI, they are up 2.46% and 4.14%, respectively. The two other indicators, Sn and Sp, rise at certain degree also. This indicates that our predictor can be as an important complementary tool to identify adenosine-toinosine RNA editing sites. For the convenience of most experimental scientists, an easy-to-use web-server for identifying adenosine-to-inosine editing sites has been established at: http://www.jci-bioinfo.cn/iAI-DSAE, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. It is important to identify adenosine-to-inosine editing sites in RNA sequences for the intensive study on RNA function and the development of new medicine. In current study, a novel predictor, called iAI-DSAE, was proposed by using three feature extraction methods including dinucleotidebased auto-cross covariance, pseudo dinucleotide composition and nucleotide density. The jackknife test results of the iAI-DSAE predictor based on deep spare auto-encoder model show that our predictor is more stable and reliable. It has not escaped our notice that the methods proposed in the current paper can be used to solve many other problems in genome analysis.
-
Volumes & issues
-
Volume 22 (2025)
-
Volume 21 (2024)
-
Volume 20 (2023)
-
Volume 19 (2022)
-
Volume 18 (2021)
-
Volume 17 (2020)
-
Volume 16 (2019)
-
Volume 15 (2018)
-
Volume 14 (2017)
-
Volume 13 (2016)
-
Volume 12 (2015)
-
Volume 11 (2014)
-
Volume 10 (2013)
-
Volume 9 (2012)
-
Volume 8 (2011)
-
Volume 7 (2010)
-
Volume 6 (2009)
-
Volume 5 (2008)
-
Volume 4 (2007)
-
Volume 3 (2006)
-
Volume 2 (2005)
-
Volume 1 (2004)
Most Read This Month
