Volume 14, Issue 9

Letters in Organic Chemistry - Volume 14, Issue 9, 2017


- Editorial: Development and Application of Feature Selection Techniques in Protein Data Analysis and Prediction
  
  By Hao Lin
  
  https://doi.org/10.2174/157017861409170929151809

- Identification of Secretory Proteins of Malaria Parasite by Feature Selection Technique
  
  Authors: Hua Tang, Chunmei Zhang, Rong Chen, Po Huang, Chenggang Duan and Ping Zou
  
  https://doi.org/10.2174/1570178614666170329155502
  
  Background: Malaria is one of the major infectious diseases caused by Plasmodium falciparum (P. falciparum). Proteins secreted by the malaria parasite play important roles in antimalarial drug design, so it is very important to identify them accurately. Although biochemical experiments can solve the issue, they are both time-consuming and costly. Computational methods provide an important tool for fast and correct identification of the proteins secreted by the malaria parasite. Method: The aim of this letter is to design a powerful prediction model to identify the secretory proteins of the malaria parasite. In this model, the physicochemical properties of residues were incorporated into the traditional pseudo amino acid composition to discretely formulate the secretory protein samples. Subsequently, the optimal feature subset was obtained by analysis of variance (ANOVA). Finally, a support vector machine was used to perform classification. Results: In the 5-fold cross-validation test, the overall accuracy reached 91.3%. Comparison with other methods shows that the proposed method is powerful and robust. Conclusion: This study demonstrates that the novel properties are important features for secretory protein prediction.
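
The ANOVA-based feature ranking mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes the standard one-way ANOVA F statistic (between-class variance over within-class variance) is computed per feature; the function names `anova_f_score` and `rank_features` are hypothetical, and the authors' exact feature encoding and classifier settings are not given in the abstract.

```python
from statistics import mean


def anova_f_score(groups):
    """One-way ANOVA F statistic for one feature.

    `groups` is a list of per-class value lists for that feature.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of classes
    n = len(all_values)   # number of samples
    # Between-class sum of squares: distance of class means from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-class sum of squares: spread of samples around their class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within if ms_within > 0 else float("inf")


def rank_features(X, y):
    """Return feature indices sorted by decreasing F score (hypothetical helper)."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        groups = [[row[j] for row, label in zip(X, y) if label == c] for c in classes]
        scores.append(anova_f_score(groups))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Features with the highest F scores would typically be kept first, and the subset size chosen by cross-validated accuracy.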
  

- Identify Protein 8-Class Secondary Structure with Quadratic Discriminant Algorithm based on the Feature Combination
  
  Authors: Zhao Wei and Feng Yonge
  
  https://doi.org/10.2174/1570178614666170609105326
  
  Background: Research on protein structure is one of the most important subjects of the 21st century, and the prediction of protein secondary structure is a key step in the prediction of protein three-dimensional structure. Eight-class secondary structure (SS) prediction has received less attention than three-class SS prediction, which has been the focus of past work. Method: We introduced a model for the prediction of protein eight-class secondary structure using the quadratic discriminant algorithm (QDA) based on a feature combination. We combined chemical shifts with the measure of diversity as features. The measure of diversity is based on the hydrophilic-hydrophobic residues and their dipeptides, respectively. Firstly, we extracted the chemical shifts in proteins as features. Then, we performed eight-class secondary structure prediction using these chemical shifts as features. In order to improve the accuracy, we constructed the measure of diversity based on the hydrophilic-hydrophobic residues. Finally, we combined chemical shifts with the measure of diversity to predict protein eight-class secondary structures. Results: We achieved a best eight-class accuracy (Q8) of 80.7% in seven-fold cross-validation by combining chemical shifts with the measure of diversity. On the same data set, we performed the prediction with the C8-Scorpion server, a support vector machine (SVM) and a random forest (RF), and the results showed that our prediction model is superior to these algorithms in terms of accuracy. Conclusion: The findings suggest that our model is effective for the prediction of protein eight-class secondary structures.
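
For readers unfamiliar with the measure of diversity, a minimal sketch is given below. It assumes the commonly used definition D(X) = N ln N - sum(n_i ln n_i) over a vector of counts, and the increment of diversity ID(X, Y) = D(X + Y) - D(X) - D(Y). How the authors build the count vectors from hydrophilic-hydrophobic residues and dipeptides is not stated in the abstract, so the count vectors here are purely illustrative.

```python
from math import log


def diversity(counts):
    """Measure of diversity D(X) = N ln N - sum(n_i ln n_i), with 0 ln 0 taken as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * log(total) - sum(n * log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); small values mean similar count profiles."""
    combined = [a + b for a, b in zip(x, y)]
    return diversity(combined) - diversity(x) - diversity(y)
```

Two identical count profiles give an increment of zero, while profiles concentrated in different categories give a large increment, which is what makes the quantity usable as a similarity feature.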
  

- Improved Identification of Cytokines Using Feature Selection Techniques
  
  Authors: Limin Jiang, Zhijun Liao, Ran Su and Leyi Wei
  
  https://doi.org/10.2174/1570178614666170227143434
  
  Background: Cytokines are small signaling proteins that play critical roles in biological functions and are closely related to human diseases. Accurate identification of cytokines is the first step toward understanding the relationship between cytokines and human diseases. In recent years, much research effort has been devoted to developing computational methods, especially machine learning based methods, to identify cytokines quickly and accurately. Currently, a major challenge for existing machine learning based methods is to improve the performance of cytokine identification. Method: In this study, we attempt to enhance the performance of cytokine identification methods through two factors: (1) feature representation and (2) classifier selection. For feature representation, we fuse multiple types of features that perform well in distinguishing cytokines from non-cytokines, and employ two feature selection techniques, Max-Relevance-Max-Distance (MRMD) and Principal Component Analysis (PCA), to yield the optimal feature representations. For classifier selection, various powerful classifiers are evaluated, and the one with the highest performance is chosen to build the classification model for our method. Results: Our analysis shows that our feature sets stably maintain high performance with any of the classifiers we used. The overall performances of the combinations ranked from best to worst as follows: 473D+LIBSVM, MRMD+LIBD3C, and PCA+LIBSVM. Conclusion: Comparative studies demonstrate that our proposed strategy is effective in improving the identification of cytokines.
  

- Application of Feature Selection Technology Based on Incremental of Diversity in Prediction of Flexible Regions from Protein Sequences
  
  Authors: Suqing Yang, Shisai Hu, Ying Zhang and Jun Lv
  
  https://doi.org/10.2174/1570178614666170221145333
  
  Background: The flexibility of a protein structure is often related to the function of the protein. Feature selection (FS) is critical to many machine learning applications that deal with small samples and high-dimensional data. To predict flexible regions from protein sequences, it is important to build a machine learning methodology based on an effective feature selection technology. This may also provide new knowledge for understanding the protein folding process. Method: Firstly, the frequencies of the k-spaced amino acid pairs are taken as a representation of the local sequences. Secondly, these representations are processed by feature selection based on incremental of diversity (FSID) to reduce the dimensionality. Finally, logistic regression is applied to integrate the selected features into a scheme that discriminates flexible from rigid regions (referred to as FSID_FRP). Results: 74 features are selected from the set of 66 sequences, comprising 26 flexible patterns and 48 rigid patterns. Most of the flexible patterns are associated with glycine or proline, and the rigid patterns are associated with leucine or valine. We obtained 79.41% accuracy and an MCC of 0.51 using the FSID_FRP method, which applies logistic regression to the representation of the 74 features. The results of the FSID_FRP method are comparable to those of the FlexRP method, which uses 95 features. Conclusion: The simple feature selection method FSID is shown to be very efficient in the prediction of the flexible/rigid regions of protein sequences. This method is more appropriate for small-sample classification than the entropy-based feature selection method. The proposed FSID_FRP method achieved about 80% prediction accuracy and stronger generalization ability.
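
The k-spaced amino acid pair representation described in this abstract can be illustrated with the sketch below. It assumes that a k-spaced pair is two residues separated by exactly k intervening positions and that counts are normalized to frequencies over the 400 possible pairs; the function name and the normalization are illustrative assumptions, not the authors' stated implementation.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_spaced_pair_frequencies(seq, k):
    """Frequencies of residue pairs separated by k intervening residues."""
    pairs = Counter()
    for i in range(len(seq) - k - 1):
        a, b = seq[i], seq[i + k + 1]
        # Skip pairs containing non-standard residue symbols.
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            pairs[a + b] += 1
    total = sum(pairs.values())
    # Always return all 400 pairs so every sequence maps to the same feature space.
    return {
        a + b: (pairs[a + b] / total if total else 0.0)
        for a, b in product(AMINO_ACIDS, repeat=2)
    }
```

Concatenating the dictionaries for several values of k gives a high-dimensional vector, which is the kind of input a feature selection step would then reduce.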
  

- Prediction of Protein Folding Rates from the Amino Acid Sequence-predicted Backbone Torsion Angles
  
  Authors: Hui Liang, Lingling Wang, Ying Zhang, Changjiang Ding and Jun Lv
  
  https://doi.org/10.2174/1570178614666170608130848
  
  Background: The prediction of protein folding rates is a necessary step towards understanding the principles of protein folding. Prediction of protein folding rates from 3D structures is more common and more accurate, but there are few methods that accurately predict folding rates from sequences. Therefore, it is important to develop an accurate method of predicting protein folding rates from the sequences of proteins with unknown structures. Objective: We propose a highly accurate sequence-based method to predict the rate of in-water protein folding directly from the primary structure, without any information about the 3D fold. Method: The method uses ANGLOR to predict real-valued protein backbone torsion angles from amino acid sequences and then calculates the cumulative backbone torsion angles (CBTA). Our estimate is based on the Pearson correlation coefficient between the folding rate and the natural logarithm of the predicted CBTA. Results: The method achieves 79% correlation with experiment over all 100 “two-state” and “multistate” proteins (including two artificial peptides) studied up to now. This is better than the results of existing sequence-based prediction methods, which include the effective length of the folding chain (Leff) and the number of predicted long-range contacts (LROpred). Conclusion: We found a new parameter for protein folding rates, i.e., the cumulative backbone torsion angles, and present a highly accurate sequence-based method of predicting folding rates. On the one hand, the CBTA is a coarse-grained description of the distribution of protein backbone torsion angles, which determines the basic topology of the protein; on the other hand, the CBTA is proportional to protein length. Therefore, a strong correlation exists between the CBTA and the folding rate. This is why the folding rates can be successfully predicted from the amino acid sequence-predicted backbone torsion angles.
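
The statistic this abstract relies on, the Pearson correlation between folding rates and the natural logarithm of a per-protein quantity, can be sketched as below. How CBTA is computed from ANGLOR output is not given in the abstract, so the sketch takes precomputed positive values as input; `correlation_with_log` is a hypothetical helper name.

```python
from math import log, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_with_log(values, rates):
    """Correlate ln(value) with the measured rate (values must be positive)."""
    return pearson([log(v) for v in values], rates)
```

A coefficient near -1 or +1 indicates that the logarithm of the quantity is close to a linear predictor of the rate.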
  

- iDHSs-PseTNC: Identifying DNase I Hypersensitive Sites with Pseudo Trinucleotide Component by Deep Sparse Auto-encoder
  
  Authors: Zhao-Chun Xu, Shi-Yu Jiang, Wang-Ren Qiu, Ying-Chun Liu and Xuan Xiao
  
  https://doi.org/10.2174/1570178614666170213102455
  
  Background: DNase I hypersensitive sites (DHSs) are important markers of DNA regulatory regions. Their identification in DNA sequences is significant for both biomedical research and the discovery of new drugs. The existing experimental methods to achieve this, however, are time-consuming and laborious, so new computational means are called for. Method: To this end, a novel predictive model, called iDHSs-PseTNC, was constructed by integrating the sequence-order information and the physicochemical properties of trinucleotides into the pseudo trinucleotide composition (PseTNC). In the model, a deep sparse auto-encoder was used to reconstruct the input in order to obtain a good representation of the input characteristics, and a softmax classifier was added on top of the auto-encoder coding layer. The deep sparse auto-encoder model obtained the best classification result, with each member of the training set correctly classified. Five-fold cross-validation test results indicated that the new predictor remarkably outperformed the existing prediction methods for the same purpose. Results: The ACC of iDHSs-PseTNC is slightly (0.3%) lower than that of iDHS-EL constructed by Liu et al., but its MCC is 3.45% higher than that of iDHS-EL. The predictor iDHSs-PseTNC also achieves the highest success rates in both Pt and Py among the existing predictors. To help experimental scientists obtain the results they need directly, an easy-to-use web server for identifying DHSs has been established for free access at: http://www.jcibioinfo.cn/iDHSs-PseTNC, which allows for fast and accurate computation. Conclusion: The timely identification of DHSs in DNA sequences is significant for the intensive study of DNA function and the development of new drugs. In this article, we proposed a novel method for predicting the DHSs of DNA by incorporating physicochemical properties of trinucleotides into the pseudo trinucleotide composition via a deep sparse auto-encoder. The results were promising enough for our predictor to be used as an analytic solution to more genomic problems.
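
Since this abstract compares predictors by ACC and MCC, a short sketch of both metrics may help. It assumes the usual binary-classification definitions in terms of true/false positives and negatives; the confusion-matrix values in the usage note are made up for illustration and are not results from the paper.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly (ACC)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, `accuracy(40, 45, 5, 10)` is 0.85. MCC ranges from -1 to 1 and is less sensitive to class imbalance than ACC, which is why one predictor can have a slightly lower ACC but a higher MCC than another.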
  

- Predicting S-sulfenylation Sites Using Physicochemical Properties Differences
  
  Authors: Guo-Cheng Lei, Jijun Tang and Pu-Feng Du
  
  https://doi.org/10.2174/1570178614666170421164731
  
  Background: Protein S-sulfenylation plays a critical role in pathology and physiology. Detecting S-sulfenylated proteins in cells is of great value in medical and life sciences. Several computational methods have been developed to predict S-sulfenylation sites. However, their prediction performance is still not ideal. Method: We developed a computational method to predict S-sulfenylation sites by utilizing physicochemical property differences to represent sequence segments around S-sulfenylation sites. By using a clustering method to partition the training set, we developed a novel prediction method based on an ensemble classifier. Results: Our method achieves an overall accuracy of 69.88% on the benchmarking dataset. We compared our method with other state-of-the-art methods, and it performs better than all existing methods. Conclusion: We proposed a computational method to predict S-sulfenylated sites, which outperforms other state-of-the-art methods.
  

- Predicting Protein Structural Class for Low-Similarity Sequences via Novel Evolutionary Modes of PseAAC and Recursive Feature Elimination
  
  Authors: Liang Kong, Lingfu Kong, Changwu Wang, Rong Jing and Lichao Zhang
  
  https://doi.org/10.2174/1570178614666170511165837
  
  Background and Objective: Protein structural class prediction is a first and key step in protein structure prediction and has become an active research area in biochemistry and bioinformatics. An important aspect of this prediction task is exploring good feature representations. Prior works have demonstrated the effectiveness of PSI-BLAST profile based feature extraction methods, especially for low-similarity protein sequences. However, the prediction accuracies still remain limited. This highlights the need to further explore the potential of evolutionary information. Method: In this study, three novel sequence evolutionary modes of pseudo amino acid composition (PseAAC) are proposed and optimized by a two-stage feature selection process based on a recursive feature elimination strategy. The selected top-ranking features are then fed into a linear kernel support vector machine classifier to predict the protein structural class. To evaluate the performance of the proposed method, jackknife tests are performed on three widely used low-similarity benchmark datasets (25PDB, 1189 and 640). Results: In a comprehensive comparison with the current state-of-the-art methods, the proposed method achieves superior performance. The overall accuracies on the 25PDB, 1189 and 640 datasets are 96.2%, 97.9% and 99.5%, which are 1.9%, 1.5% and 2.3% higher than those of the previous best-performing method. Conclusion: The satisfactory prediction accuracies achieved by the proposed method are attributed to the specially designed sequence evolutionary modes of PseAAC and the effective feature selection strategy, which cover more discriminative sequence-order information. It is anticipated that our method will be helpful in other prediction problems in protein research.
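
The jackknife test used for evaluation here is leave-one-out cross-validation: each sample is held out once and predicted by a model built from all the others. The sketch below illustrates the procedure only. It substitutes a 1-nearest-neighbor rule for the linear-kernel SVM used in the paper (an assumption made to keep the example dependency-free), and all function names are hypothetical.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Stand-in classifier: label of the closest training sample (squared Euclidean)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def jackknife_accuracy(X, y, predict=nearest_neighbor_predict):
    """Leave-one-out accuracy: hold out each sample once and predict it from the rest."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)
```

Because every sample is tested exactly once and the split is deterministic, the jackknife gives a single reproducible accuracy for a given dataset, at the cost of training one model per sample.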
  

- Predicting the Types of Plant Heat Shock Proteins
  
  Authors: Jing Ye, Wei Chen and Dianchuan Jin
  
  https://doi.org/10.2174/1570178614666170221144023
  
  Background: Heat shock proteins (HSPs) are ubiquitously expressed in both prokaryotes and eukaryotes. According to their molecular mass and function, HSPs are classified into different families, which are structurally different and play distinct roles in biological processes. Although some efforts have been made to identify the types of HSPs, there is no method available for identifying the types of HSPs in plants. Method: The amino acid distributions in the different types of HSPs are analyzed. HSPs are encoded using the reduced amino acid alphabet (RAAA). By comparing the predictive capability of models based on the composition of RAAAs of different sizes, the optimal feature vector was obtained. A support vector machine based model was developed to identify the types of HSPs using the optimal feature vector. Results: The amino acid distributions differ among the families of HSPs. In the rigorous jackknife test, the proposed method obtained an accuracy of 93.65% for identifying the five families of HSPs in plants. Conclusions: We hope the proposed method will become a useful tool to identify the types of HSPs in plants.
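
Encoding a sequence with a reduced amino acid alphabet amounts to mapping each residue to a group and computing the group composition. The sketch below shows the idea under a loudly stated assumption: the six-group clustering in `HYPOTHETICAL_GROUPS` is invented for illustration and is not the alphabet the authors selected, which the abstract does not specify.

```python
# Hypothetical six-group clustering of the 20 standard residues (illustration only).
HYPOTHETICAL_GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}
RESIDUE_TO_GROUP = {
    residue: name
    for name, members in HYPOTHETICAL_GROUPS.items()
    for residue in members
}


def reduced_composition(seq):
    """Fraction of residues falling in each group, in the order of HYPOTHETICAL_GROUPS."""
    counts = {name: 0 for name in HYPOTHETICAL_GROUPS}
    for residue in seq:
        group = RESIDUE_TO_GROUP.get(residue)
        if group is not None:  # ignore non-standard symbols
            counts[group] += 1
    total = sum(counts.values())
    return [counts[name] / total if total else 0.0 for name in HYPOTHETICAL_GROUPS]
```

Comparing classifiers trained on compositions from alphabets of different sizes, as the abstract describes, would then amount to swapping in different group dictionaries.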
  

- BRAda: A Robust Method for Identification of Pre-microRNAs by Combining Adaboost Framework with BP and RF
  
  Authors: Ningyi Zhang, Ying Zhang, Tianyi Zhao, Jun Ren, Yangmei Cheng and Yang Hu
  
  https://doi.org/10.2174/1570178614666170221144619
  
  Background: MicroRNAs (miRNAs) are a set of non-coding, short (approximately 21 nt) RNAs that play an important regulatory role in biological processes in cells. The identification and discovery of pre-miRNAs are beneficial for understanding the regulatory process, the functions of miRNAs and other genes, and biological evolution. Methods: Machine learning has been a powerful technology for distinguishing real pre-miRNAs from other hairpin-like sequences (pseudo pre-miRNAs). However, most of the commonly used classifiers do not perform well on independent testing data sets. To overcome this, we proposed a novel BRAda algorithm integrating a BP neural network and a random forest classifier based on two balanced training sets. By assigning weights to these classifiers and to the proposed 98-dimensional features, we obtained a strong classifier with high accuracy and stability. Furthermore, two independent testing sets (undated human and non-human pre-miRNAs) were employed to evaluate the prediction performance of the proposed classifier. Results: The BRAda algorithm significantly outperformed the other methods in identifying both human and non-human pre-miRNAs. Conclusion: The novel algorithm integrates a BP neural network and a random forest classifier based on two balanced training sets. Compared with other state-of-the-art machine learning methods, the performance of BRAda was very high (ACC over 99%) according to the validation. In addition, although the algorithm was trained on human gene sets, the prediction performance on non-human testing sets was also excellent (average ACC over 97%), which indicates that the method is both stable and robust. Through experiments and validation, the authors showed that the method is an effective tool for pre-miRNA identification.
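
The idea of assigning weights to base classifiers can be illustrated with the textbook AdaBoost rule, in which a classifier with weighted error rate e receives the weight 0.5 * ln((1 - e) / e). This is a sketch of standard AdaBoost only; the abstract does not give BRAda's exact weighting scheme, and the error rates in the usage note are invented for illustration.

```python
from math import log


def adaboost_alpha(error_rate):
    """Textbook AdaBoost classifier weight: 0.5 * ln((1 - e) / e)."""
    # Clamp to avoid division by zero or log of zero for perfect/useless classifiers.
    eps = min(max(error_rate, 1e-10), 1 - 1e-10)
    return 0.5 * log((1 - eps) / eps)


def weighted_vote(predictions, alphas):
    """Combine base predictions in {-1, +1} by their weights; ties resolve to +1."""
    score = sum(alpha * pred for alpha, pred in zip(alphas, predictions))
    return 1 if score >= 0 else -1
```

For instance, if one base classifier with a 10% error votes +1 and another with a 30% error votes -1, `weighted_vote([1, -1], [adaboost_alpha(0.1), adaboost_alpha(0.3)])` returns 1, because the more accurate classifier carries the larger weight.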
  

- Prediction of Protein Structural Class Based on ReliefF-SVM
  
  Authors: Xianfang Wang, Yue Zhang and Junmei Wang
  
  https://doi.org/10.2174/1570178614666170725151750
  
  Background: Knowledge of a protein's structural class plays an important role in understanding its tertiary structure. Globular protein domains, whose fold types are surprisingly similar despite being complex and irregular under natural conditions, can be mainly divided into four classes according to secondary structural content: all-α, all-β, α/β, and α+β. Significant efforts have been made to predict protein structural classes. However, the protein sequence representations used in these approaches may contain redundancy. Method: The ReliefF-SVM classification model was proposed to predict protein structural class. First, pseudo amino acid composition (PseAA) features were extracted from each protein in the dataset, where feature redundancy exists. Then, we used the ReliefF feature selection method to reduce redundancy. Next, the optimized samples were given as input to the SVM. As the parameters were difficult to determine, the Simulated Annealing Particle Swarm Optimization (SAPSO) algorithm was embedded into the SVM. Results: After the features were selected by the ReliefF algorithm, the dimension of the features was reduced from 420 to 292. The experiment time fell from 372.32 s to 195.58 s, a reduction of nearly half. We compared our method with other existing methods to evaluate it objectively. For the C204 dataset, the overall classification accuracy obtained using our method was 95.4%, which was 14.5% higher than that of the covariant matrix algorithm. Compared with the previous SVM, our method improved by 10.1%. With consistent feature data, the proposed method showed a 4.6% improvement over IDQD. The overall accuracy of the proposed method on the Z277 dataset reached 96.5%, higher than those of other methods. Conclusion: The results of this study further support the description of protein sequences reported by Lin, and our method reduces the time consumption by 47%. The accuracy of the prediction classification is also greatly improved, which demonstrates the effectiveness of our method.
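
To give a feel for how Relief-style feature weighting works, the sketch below implements the basic Relief algorithm: for each sample, a feature gains weight if it differs on the nearest sample of another class (the miss) and loses weight if it differs on the nearest sample of the same class (the hit). This is a deliberately simplified stand-in, not ReliefF itself, which extends the idea to k neighbors and handles multi-class and noisy data; the function name is hypothetical.

```python
def relief_weights(X, y):
    """Basic Relief feature weights (simplified; assumes >= 2 samples per class)."""
    n, d = len(X), len(X[0])
    # Per-feature value range, used to scale differences into [0, 1].
    spans = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            # Reward separation from the nearest miss, penalize spread from the nearest hit.
            weights[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / n
    return weights
```

Features would then be ranked by weight and the lowest-ranked ones dropped, which is the kind of reduction the abstract reports (420 features down to 292).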
  

- iRSpotH-TNCPseAAC: Identifying Recombination Spots in Human by Using Pseudo Trinucleotide Composition With an Ensemble of Support Vector Machine Classifiers
  
  Authors: Zhao-Chun Xu, Wang-Ren Qiu and Xuan Xiao
  
  https://doi.org/10.2174/1570178614666170608125909
  
  Background: Meiotic recombination is crucial for the formation of human gametes. As a defining event in the formation of human sperm and eggs, it also plays an important role in generating genetic diversity. However, recombination does not occur randomly across a genome: it occurs with higher probability in some genomic regions, the so-called “hotspots”, and with lower probability in the so-called “coldspots”. Research has shown that recombination can provide new combinations of genetic variations. Information about coldspots and hotspots would therefore provide useful insights for in-depth study of the genome evolution process and the mechanism of recombination. Currently, recombination regions are determined by experiments, which are tedious, generally require expensive instruments, and take a long time. This study therefore addresses the problem with computational prediction models. Method: In this paper, a new predictor called ‘iRSpotH-TNCPseAAC’ was developed to identify human recombination coldspots and hotspots. In the new discrete predictive model, a feature vector called ‘pseudo trinucleotide composition’ (PseTNC) is proposed to formulate a given DNA segment while retaining as much of its sequence-order information as possible. Results: In the rigorous jackknife test, the overall success rate obtained by iRSpotH-TNCPseAAC is higher than 93% in identifying human recombination spots, with a mean success rate of 76.07% over the 18 chromosomes concerned. This means that our predictor can become a useful complementary tool in this area. Moreover, the PseTNC method can be used to further explore many other DNA-related problems. Finally, a web server called iRSpotH-TNCPseAAC, which is easy to operate and convenient to use, has been built and is freely accessible at http://www.jci-bioinfo.cn/iRSpotH-TNCPseAAC. Conclusion: Timely acquisition of information on recombination spots in DNA sequences is very significant for in-depth study of epigenetic inheritance and for the analysis of human diseases. Furthermore, it will facilitate drug development. We conclude that the iRSpotH-TNCPseAAC predictor may become a very practical high-throughput online tool for identifying recombination spots.
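
The first 64 components of a pseudo trinucleotide composition are usually the plain trinucleotide frequencies of the DNA segment, which the sketch below computes. The additional sequence-order correlation terms that make the composition "pseudo" depend on physicochemical property tables and parameters not given in the abstract, so they are intentionally left out; the function name is hypothetical.

```python
from itertools import product

# All 64 trinucleotides in a fixed lexicographic order (AAA, AAC, ..., TTT).
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_composition(dna):
    """64-dimensional vector of overlapping trinucleotide frequencies."""
    dna = dna.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(dna) - 2):
        tri = dna[i:i + 3]
        if tri in counts:  # skip windows containing ambiguous bases such as N
            counts[tri] += 1
            total += 1
    return [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]
```

A full PseTNC vector would append the correlation terms to this list, giving every DNA segment a fixed-length representation regardless of its length.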
  

- FledFold: A Novel Software for RNA Secondary Structure Prediction
  
  Authors: Qi Zhao, Yuanning Liu, Yunna Duan, Tao Dai, Rui Xu, Hao Guo, Daiming Fan, Yongzhan Nie and Hao Zhang
  
  https://doi.org/10.2174/1570178614666170419122621
  
  Background: Knowledge of RNA secondary structure is essential for understanding the mechanisms of RNAs. Method: In this paper, FledFold, a novel software package for RNA secondary structure prediction, is introduced. It combines both thermodynamic and kinetic factors of RNA secondary structures and can predict RNA secondary structures from their primary sequences on local personal computers. Results: FledFold is implemented in C++ under Windows 7 and runs on Windows 7 or later versions with at least 2 GB of RAM. FledFold is user friendly and can output results in multiple formats. Conclusion: FledFold will be a valuable tool for RNA research and can be downloaded freely from http://www.jlucomputer.com/fledfold.php
  

- Prediction of Protein Subcellular Localization by Using λ-Order Factor and Principal Component Analysis
  
  Authors: Shengli Zhang and Jin Jin
  
  https://doi.org/10.2174/1570178614666170227142225
  
  Background: Protein subcellular localization is closely related to protein function, and the highly ordered organization of the cell guarantees the normal operation of the system. Studies of protein subcellular localization are very helpful for understanding the properties and functions of proteins, the interactions between proteins and their regulation mechanisms, and the pathogenesis of some diseases, as well as for developing new drugs. However, traditional biological experiments are both time-consuming and costly. Therefore, the development of fast and effective machine learning methods for predicting protein subcellular localization is very necessary. Method: We propose a new feature extraction method based on pseudo amino acid composition, called the λ-order factor method, and combine it with principal component analysis. Thus, not only the physicochemical properties of protein sequences are considered, but also the sequence-order information of sub-sequences. This measure also eliminates duplicate information and reduces the dimension of the feature vectors. Finally, the SVM and the 10-fold cross-validation test are employed to predict and evaluate the method on three benchmark datasets: ZD98, ZW225 and CL317. Results: In a comprehensive comparison with the current state-of-the-art methods, the proposed method achieves superior performance. The overall success rates on the ZD98, ZW225 and CL317 datasets are 90.8%, 85.3% and 89.6%, respectively. The results show that our method has better classification performance than the others. Conclusion: The numerical results show that our model successfully extracts the physicochemical information and sequence-order information of protein sequences based on pseudo amino acid composition (PseAAC), and provides a reliable PseAAC-based method as a potential candidate for apoptosis protein subcellular localization prediction.
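
The 10-fold cross-validation protocol mentioned in this abstract can be sketched as an index-splitting routine. The shuffling seed, the round-robin fold assignment, and the function name `k_fold_indices` are illustrative assumptions; the abstract does not say how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample is used for testing exactly once, and the reported success rate is typically the accuracy pooled or averaged over the k test folds.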
  
