International Journal of Biological Sciences

Impact factor
4.057

ISSN 1449-2288

Int J Biol Sci 2018; 14(8):863-871. doi:10.7150/ijbs.24588

Research Paper

Identification of Inhibitors of MMPS Enzymes via a Novel Computational Approach

Jian Song1,2,4, Jijun Tang1,2,3, Fei Guo1,2 Corresponding address

1. School of Computer Science and Technology, Tianjin University, Tianjin 300350, China;
2. Tianjin University Institute of Computational Biology, Tianjin University, Tianjin 300350, China;
3. Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA;
4. School of Chemical Engineering and Technology, Tianjin University, Tianjin 300350, China.

This is an open access article distributed under the terms of the Creative Commons Attribution (CC BY-NC) license (https://creativecommons.org/licenses/by-nc/4.0/). See http://ivyspring.com/terms for full terms and conditions.
How to cite this article:
Song J, Tang J, Guo F. Identification of Inhibitors of MMPS Enzymes via a Novel Computational Approach. Int J Biol Sci 2018; 14(8):863-871. doi:10.7150/ijbs.24588. Available from http://www.ijbs.com/v14p0863.htm

Abstract

Matrix metalloproteases (MMPs) are a family of zinc-dependent proteinases that play complex and diverse roles in metabolism and are vital for physiological development. In this paper, we present a novel method to identify peptides binding to seven matrix metalloproteases. First, we propose a novel sampling criterion for constructing a training set for each new peptide motif. Then, we select nine physicochemical properties of amino acids and compute their auto-cross covariance to extract features for both natural and non-natural amino acids. Finally, we adopt random forest to predict the binding value of each peptide motif with each of the seven MMPs. We validate our method on 1300 known peptide motifs binding to seven MMPs; it achieves strong Pearson product-moment correlation coefficient (PCC) and root mean squared error (RMSE) values on all seven MMPs, notably 0.9181 and 9.3827 on MMP-7. We predict binding values of 4000 peptide motifs and identify peptides that preferentially bind to MMP-2 and MMP-7. We report 4 novel inhibitor candidates (Asp-Ile-Phe, Asp-Ile-Tyr, Asp-Ile-Lys and Hser-Gly-Phe) predicted to bind MMP-2 with high potency and selectivity, as well as 6 novel inhibitor candidates (Chg-Ile-Ile, Chg-Ile-Leu, Chg-Ile-Glu, Chg-Ile-Met, Chg-Val-Ile and Chg-Val-Leu) predicted to bind MMP-7 selectively. Our findings facilitate the identification of inhibitors with good potency as well as desirable selectivity, providing insights into candidate inhibitor drugs.

Keywords: MMPs, peptide inhibitors, auto-cross covariance, random forest

Introduction

Matrix metalloproteases (MMPs) are a family of zinc-dependent proteinases that play complex and diverse roles in metabolism and are vital for physiological development. MMP-2 and MMP-7 have been shown to directly accelerate tumorigenesis, which makes these enzymes vital disease targets [1]. On the other hand, some members of the MMP family confer protective effects in various human diseases, improving host resistance to cancer and other abnormalities [2]. For example, knocking out certain MMPs (MMP-3, -8 and -9) has been directly linked to tumor proliferation in animal models of several cancers, emphasizing the positive roles mediated by selected members of the MMP family [3]. Hence, there has been intense interest in developing effective small-molecule drugs with strong selectivity against the specific harmful members of this class of enzymes.

Nevertheless, MMPs have highly conserved mechanisms and share some active sites. Several MMP inhibitors that were initially selected and optimized for potency entered extensive phase III clinical trials, only to prove ineffective because of problems arising from a lack of selectivity [4]. This creates both a major impetus and a major challenge: to develop compounds with not only good potency but also high selectivity [5]. Ideally, such inhibitors should inhibit only the target MMPs (MMP-2 and MMP-7) responsible for the relevant disease, while minimally affecting the anti-target MMPs (MMP-3, MMP-8 and MMP-9), which may be beneficial to humans.

Several experimental strategies have been proposed to address these pressing challenges. Rao et al. proposed a well-accepted strategy that grafts short peptide chains onto zinc binding groups (ZBG) [6]. Yao and colleagues presented an effective experimental strategy to generate clustered enzyme “fingerprints” through high-throughput screening of focused inhibitor libraries [7]. They adopted the hydroxamate (CONH-O-) group, which chelates strongly to the metal center at the enzyme active site, and permuted across the P1′, P2′ and P3′ positions, creating a diverse repertoire of 1400 individual inhibitor scaffolds with the split-pool directed sorting synthesis method. In the library, the P1′ position consists of 6 natural amino acids and 5 non-natural amino acids (CPA3, CHG, HPE, a residue with a sulfone side chain, and HSER) [8] (assigned the single-letter codes U, B, Z and J, with the sulfone residue carrying its own symbol), which are made of substituted succinyl hydroxamate ZBG (highlighted in pink) as shown in Scheme 1. The P2′ and P3′ positions each consist of 20 natural amino acids. As a result, they reported a data set acquired by screening a comprehensive panel of 1400 peptide hydroxamates against seven different MMPs, providing unique insights into inhibitor design and preference within this important group of enzymes. However, variation at the three positions of the hydroxamate peptide gives 4400 possible sequences with different binding affinities. The binding values of a large part of these sequences, which could provide nontrivial insights and assistance for inhibitor design, are still missing. Obtaining all the missing values experimentally is expensive, time-consuming and labor-intensive. Hence, we construct a computational model to predict MMP-specific binders from experimental data.

In this paper, we propose the first computational method to identify and analyze the binding specificity of MMP hydroxamate peptides. First, we propose a sampling criterion to construct a training set for each new peptide motif. Then, we select nine physicochemical properties of amino acids, which effectively describe the differences among amino acids and can also be obtained for non-natural amino acids. We also propose auto-cross covariance features [9, 10], which extract correlated properties of the amino acids at any two positions. Finally, we adopt random forest to predict the binding value of each peptide motif with each of the seven MMPs. On MMP-7, our method achieves overall Pearson product-moment correlation coefficient (PCC) and root mean squared error (RMSE) values of 0.9181 and 9.3827. The high PCC and low RMSE values of our method on all seven MMPs support the rationality and effectiveness of our computational method. Finally, we find 4 novel peptides that selectively bind to MMP-2: Asp-Ile-Phe, Asp-Ile-Tyr, Asp-Ile-Lys and Hser-Gly-Phe. We also identify 6 novel peptides with high selectivity for MMP-7 (Chg-Ile-Ile, Chg-Ile-Leu, Chg-Ile-Glu, Chg-Ile-Met, Chg-Val-Ile and Chg-Val-Leu), providing instructive insights for further experiment design and for the detection of highly selective MMP inhibitors.

Methods

We present the first computational method for identifying MMP peptide-binding specificity. For each MMP, we have 1400 peptide motifs with experimental binding affinity values, which are treated as known in this study. To identify the binding values of the 4400 peptide sequences with the seven MMPs, we first propose a sampling criterion to construct an affinity-based training set for each peptide motif. Then we select 9 physicochemical properties of amino acids to describe each peptide motif. We also use auto-cross covariance to extract correlated properties of the amino acids at any two positions. Finally, we use random forest to predict the affinity values of peptide motifs. The method is shown in Figure 1.

Data set

Yao and co-workers proposed a small but highly diversified 1400-member peptide library [7]. The library was prepared in two parts. First, a 400-member sub-library containing a leucine side chain at the P1′ position (represented by the single-letter code L) was constructed with permutations of all 20 natural amino acids across the P2′ and P3′ positions. Second, an additional 1000-member set was constructed with the remaining 10 amino acids at the P1′ position, containing side chains of both natural and non-natural amino acids (Scheme 1). The P2′ and P3′ positions in this set were systematically permuted with 10 proteinogenic amino acids, specifically nonpolar (Ala, Leu, Phe, Trp), charged polar (Glu, Lys, His) and uncharged polar (Gln, Ser, Tyr) amino acids. Yao and co-workers assayed the 1400 peptides and obtained their binding values with each of the seven MMPs to identify selective peptides [7].

For each MMP, there are 11 × 20 × 20 = 4400 possible inhibitor peptides in total. There could still be a significant number of peptides with high potency and selectivity among the remaining 3000 untested peptides. Hence, we construct a regression model to predict the binding values of non-experimental peptides and find effective peptides with high potency and selectivity. For each MMP isoform, we have 1400 peptide motifs with experimental binding affinity values. However, because the physicochemical properties of the amino acid with the sulfone side chain are unobtainable, the peptide motifs containing this residue cannot be effectively described or used as training data. We therefore omit the sulfone residue and the peptides that contain it. As a result, we use 1300 peptide motifs as training samples to predict the non-experimental peptides. The non-experimental peptides that can be effectively predicted are likewise those without the sulfone residue at the P1′ position. Hence, the binding values of 10 × 20 × 20 = 4000 peptides are predicted by our regression model.
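The library sizes quoted above follow from simple counting. The short sketch below reproduces that arithmetic; the variable names are illustrative and not taken from the paper.

```python
# Counting sketch for the library sizes quoted in the text.
# All names are illustrative; only the counts come from the paper.
N_FIRST_ALL = 11      # side chains at the first variable position (leucine + 10 others)
N_FIRST_USABLE = 10   # after omitting the sulfone residue
N_NATURAL = 20        # natural amino acids allowed at each of the other two positions
N_SUBSET = 10         # proteinogenic subset used in the 1000-member set

full_library = N_FIRST_ALL * N_NATURAL * N_NATURAL            # every possible peptide
predictable = N_FIRST_USABLE * N_NATURAL * N_NATURAL          # peptides without the sulfone residue
tested = N_NATURAL * N_NATURAL + 10 * N_SUBSET * N_SUBSET     # 400-member + 1000-member sets
tested_usable = tested - N_SUBSET * N_SUBSET                  # drop tested sulfone peptides
untested = full_library - tested

print(full_library, predictable, tested, tested_usable, untested)
```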

 Scheme 1 

The optional non-natural and natural amino acids for the three positions. The P1′ position consists of 11 non-natural and natural amino acids made of substituted succinyl hydroxamate ZBG (highlighted in pink). Each was assigned a unique single-letter code (inset).

 Figure 1 

The overall method flow.


Sampling Criteria

We propose a sampling criterion to build a predictor for each new peptide motif. If all 1300 peptide motifs were used to construct a regression model, the predictor would be confused by the many irrelevant peptide sequences it imports. Here, we use a similarity-based sampling approach. All 20 natural amino acids and 5 non-natural amino acids are divided into 5 categories [11, 12]: amino acids with positively charged side chains, amino acids with negatively charged side chains, amino acids with polar uncharged side chains, amino acids with hydrophobic side chains, and special cases. The details are shown in Table 1.

 Table 1 

Five categories of 20 natural amino acids and 5 non-natural amino acids

| Category | Amino acids |
| --- | --- |
| Amino acids with positively charged side chains | R, H, K, B |
| Amino acids with negatively charged side chains | D, E, J, Z |
| Amino acids with polar uncharged side chains | S, T, N, Q |
| Amino acids with hydrophobic side chains | A, I, L, M, F, W, Y, V |
| Special cases | C, G, P, U, sulfone-side-chain residue |

We evaluate the similarity between two peptide samples from the similarity $s$ of their amino acid categories. The similarity between a training peptide $p$ and the target peptide $q$ is calculated as follows:

$$S(p, q) = \sum_{i=1}^{3} s(p_i, q_i)$$
$$s(p_i, q_i) = \begin{cases} 1, & \text{if } p_i \text{ and } q_i \text{ belong to the same category} \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

where $p_i$ and $q_i$ denote the amino acids at the i-th position of the training and target peptide, respectively; $s(p_i, q_i)$ is the amino acid similarity: if the two amino acids belong to the same category, the similarity $s$ at this position is 1, otherwise it is 0.

For each target sequence motif, we choose the samples with a similarity value of at least 1 (similarity ≥ 1), which means that each chosen sample has at least one position whose amino acid belongs to the same category as the amino acid at the corresponding position of the target peptide. Compared with random sampling approaches, the similarity-based sampling strategy takes similarity into account and hence filters out irrelevant samples.
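The selection rule above can be sketched in a few lines. The category table is copied from Table 1 with the sulfone residue left out; the function names and the example peptides are hypothetical and only illustrate the rule, they are not the authors' code.

```python
# Category assignment copied from Table 1 (single-letter codes).
# The sulfone residue is omitted, as in the paper's usable data.
CATEGORIES = {
    "positive": "RHKB",
    "negative": "DEJZ",
    "polar_uncharged": "STNQ",
    "hydrophobic": "AILMFWYV",
    "special": "CGPU",
}
CATEGORY_OF = {aa: name for name, letters in CATEGORIES.items() for aa in letters}


def similarity(train, target):
    """Count the positions whose residues fall in the same Table 1 category."""
    return sum(CATEGORY_OF[a] == CATEGORY_OF[b] for a, b in zip(train, target))


def select_training(samples, target, threshold=1):
    """Keep the training peptides whose similarity to the target is at least `threshold`."""
    return [p for p in samples if similarity(p, target) >= threshold]


# Hypothetical three-residue peptides written with single-letter codes.
pool = ["LAW", "DIF", "BKE", "JGF", "RSD"]
target = "DIY"
chosen = select_training(pool, target)
print(chosen)
```

In this toy pool, "BKE" and "RSD" share no category with the target at any position, so they are the ones filtered out.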

Feature Extraction

Computational methods have been widely used for classifying peptides or predicting the binding values of small molecules containing natural amino acids [13, 14, 15, 16, 17]. However, applying computational methods to peptides containing non-natural amino acids has been challenging, because it is hard to extract effective features that describe and differentiate non-natural amino acids. We propose two kinds of features in this study to describe both natural and non-natural amino acids: the first takes nine physicochemical properties for each position, which produces 27 features; the second uses auto-cross covariance to extract the correlation of the amino acids at any two positions, with nine features for every pair of positions, which leads to another 27 features.

Physicochemical Properties

We compute 9 physicochemical properties [18] of all 20 natural amino acids and 4 non-natural amino acids (B, J, Z, U) to describe each peptide motif, using the E-dragon [19] and MOE [20] programs. The amino acid with the sulfone side chain has been omitted because its physicochemical properties could not be computed. The 9 physicochemical properties are Molecular Weight (MW), Sum of Atomic Van Der Waals Volumes (SV), Sanderson Electronegativity (SE), Polarizability (P), Number of hydrogen bonds (HB), Eccentric Connectivity Index (CSI), Eccentricity (ECC), Sphericity (SPH) and Hydrophilic factor (HY). Details are shown in Table 2. These nine physicochemical properties are normalized to zero mean and unit standard deviation [21, 22, 23]. The first kind of 27 features is extracted from these normalized properties as follows:

$$P'_{i,j} = \frac{P_{i,j} - \bar{P}_j}{S_j} \quad (2)$$

where $\bar{P}_j$ represents the mean of the j-th property, $P_{i,j}$ is the j-th property of the i-th amino acid, and $S_j$ is the corresponding standard deviation.
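As a minimal sketch of this normalization step, the snippet below z-scores each property column and concatenates the normalized properties of the three residues of a peptide. It uses only three of the nine properties (MW, SV, HB) and four residues from Table 2 to stay short; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

# Three of the nine Table 2 properties (MW, SV, HB) for four residues.
RAW = {
    "A": [90.12, 7.11, 2],
    "G": [75.08, 5.21, 2],
    "L": [132.21, 11.90, 2],
    "D": [134.13, 9.13, 4],
}


def zscore_columns(table):
    """Normalize each property column to zero mean and unit standard deviation."""
    names = list(table)
    n_props = len(table[names[0]])
    cols = [[table[n][j] for n in names] for j in range(n_props)]
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]  # assumption: population standard deviation
    return {n: [(table[n][j] - mus[j]) / sds[j] for j in range(n_props)] for n in names}


def position_features(peptide, normalized):
    """Concatenate the normalized properties of each residue, position by position."""
    feats = []
    for aa in peptide:
        feats.extend(normalized[aa])
    return feats


NORMALIZED = zscore_columns(RAW)
features = position_features("LAD", NORMALIZED)
print(len(features))  # 3 positions x 3 properties here; 3 x 9 = 27 with all nine properties
```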

 Table 2 

Nine physicochemical properties of 20 natural amino acids and 4 non-natural amino acids.

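| AA | Code | MW | SV | SE | P | HB | CSI | ECC | SPH | HY |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ALA | A | 90.12 | 7.11 | 14.35 | 7.58 | 2 | 24 | 16 | 0.576 | 3.794 |
| GLY | G | 75.08 | 5.21 | 10.52 | 5.44 | 2 | 19 | 13 | 0.828 | 2.870 |
| ILE | I | 132.21 | 11.90 | 23.00 | 12.86 | 2 | 61 | 37 | 0.557 | 2.926 |
| LEU | L | 132.21 | 11.90 | 23.00 | 12.86 | 2 | 63 | 38 | 0.500 | 2.926 |
| PRO | P | 116.16 | 9.71 | 18.23 | 10.34 | 2 | 51 | 27 | 0.651 | 2.001 |
| VAL | V | 118.18 | 10.31 | 20.12 | 11.10 | 2 | 43 | 27 | 0.410 | 3.150 |
| PHE | F | 166.22 | 14.31 | 24.12 | 15.10 | 2 | 131 | 68 | 0.831 | 2.456 |
| TRP | W | 205.26 | 17.30 | 28.22 | 18.11 | 3 | 195 | 95 | 0.852 | 3.198 |
| TYR | Y | 182.22 | 14.82 | 25.44 | 15.56 | 3 | 157 | 82 | 0.787 | 3.446 |
| ASP | D | 134.13 | 9.13 | 18.00 | 9.49 | 4 | 63 | 38 | 0.708 | 4.320 |
| GLU | E | 148.16 | 10.73 | 20.89 | 11.25 | 4 | 85 | 50 | 0.770 | 4.068 |
| ARG | R | 176.26 | 14.59 | 28.36 | 15.50 | 4 | 139 | 79 | 0.863 | 8.560 |
| HIS | H | 156.19 | 12.10 | 21.55 | 12.59 | 4 | 106 | 55 | 0.644 | 3.857 |
| LYS | K | 148.24 | 13.20 | 26.04 | 14.25 | 2 | 98 | 57 | 0.836 | 6.438 |
| SER | S | 106.12 | 7.62 | 15.68 | 8.03 | 4 | 36 | 23 | 0.652 | 4.875 |
| THR | T | 120.15 | 9.22 | 18.56 | 9.80 | 4 | 43 | 27 | 0.450 | 4.508 |
| CYS | C | 122.19 | 8.20 | 15.43 | 9.23 | 2 | 36 | 23 | 0.668 | 4.875 |
| MET | M | 150.25 | 11.39 | 21.19 | 12.75 | 2 | 74 | 44 | 0.810 | 3.032 |
| ASN | N | 133.15 | 9.62 | 18.78 | 10.04 | 4 | 63 | 38 | 0.734 | 5.574 |
| GLN | Q | 147.18 | 11.21 | 21.66 | 11.80 | 4 | 85 | 50 | 0.787 | 5.271 |
| CHG | B | 158.25 | 14.50 | 26.88 | 15.63 | 2 | 101 | 53 | 0.477 | 2.587 |
| HSER | J | 120.15 | 9.22 | 18.56 | 9.80 | 4 | 54 | 33 | 0.735 | 4.508 |
| HPE | Z | 180.25 | 15.90 | 27.00 | 16.86 | 2 | 163 | 84 | 0.762 | 2.342 |
| CPA3 | U | 157.24 | 14.20 | 25.94 | 15.24 | 2 | 106 | 55 | 0.595 | 1.574 |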

MW, Molecular Weight; SV, Sum of Atomic Van Der Waals Volumes; SE, Sanderson Electronegativity; P, Polarizability; HB, Number of hydrogen bonds; CSI, Eccentric Connectivity Index; ECC, Eccentricity; SPH, Sphericity; HY, Hydrophilic factor.

Auto-Cross Covariance

We also use auto-cross covariance to extract the correlation of the amino acids at any two positions. Auto-cross covariance (ACC) yields two kinds of variables: auto covariance (AC) between the same descriptor, and cross covariance (CC) between two different descriptors. In this study, we use only the AC variables in order to avoid generating too many variables. We modify the AC variables to obtain the correlation of the amino acids at any two positions as follows:

$$AC(m, n, j) = \left(P_{m,j} - \frac{1}{3}\sum_{i=1}^{3} P_{i,j}\right)\left(P_{n,j} - \frac{1}{3}\sum_{i=1}^{3} P_{i,j}\right) \quad (3)$$

where m and n are two different positions of a peptide, j indexes the j-th property of the residues, and $P_{i,j}$ is the j-th property of the residue at the i-th position.
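A minimal sketch of pairwise AC features follows. It assumes the AC term for positions m and n is the product of the two property values after subtracting the mean of that property over the three positions of the peptide; this is one common form of auto covariance and may differ in detail from the authors' formula. The function name and the toy input are hypothetical.

```python
from itertools import combinations


def ac_features(position_props):
    """Return one AC value per (position pair, property).

    position_props holds one list of property values per peptide position.
    Assumed form: (P[m][j] - mean_j) * (P[n][j] - mean_j), where mean_j is the
    average of property j over the positions of this peptide.
    """
    n_pos = len(position_props)
    n_props = len(position_props[0])
    means = [sum(p[j] for p in position_props) / n_pos for j in range(n_props)]
    feats = []
    for m, n in combinations(range(n_pos), 2):
        for j in range(n_props):
            feats.append((position_props[m][j] - means[j]) * (position_props[n][j] - means[j]))
    return feats


# Toy peptide with three positions and two already-normalized properties.
toy = [[1.0, 0.0], [0.0, 0.0], [-1.0, 3.0]]
toy_features = ac_features(toy)
print(toy_features)
```

With three positions there are three position pairs, so nine properties would give 3 × 9 = 27 AC features, matching the count stated in the text.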

Random Forest

The training algorithm for random forest [24, 25, 26] applies the general technique of bagging. Given a training set $X = x_1, \ldots, x_n$ with responses $Y = y_1, \ldots, y_n$, bagging repeatedly (B times) selects a random sample of the training set with replacement and fits trees to these samples:

For b = 1, 2, …, B:

  • Sample, with replacement, n training examples from X, Y; call these $X_b$, $Y_b$.
  • Train a regression tree $f_b$ on $X_b$, $Y_b$.

After training, the prediction for an unseen sample $x'$ is obtained by averaging the predictions of all the individual regression trees on $x'$:

$$\hat{f}(x') = \frac{1}{B}\sum_{b=1}^{B} f_b(x') \quad (4)$$

We use five-fold cross-validation, varying the number of regression trees from 100 to 5000 in steps of 1, to find the optimal number of trees for the random forest regression model. We create an ensemble of 1500 regression trees for predicting the binding values of non-experimental peptides.
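The bagging procedure described above can be illustrated without any external library. The sketch below is deliberately simplified: it uses one-split regression stumps on a single input feature in place of full regression trees, and it omits the random feature selection that a real random forest adds. All names and the toy data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on 1-D inputs by minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; the largest x is excluded
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all inputs identical: fall back to predicting the mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr


def bagged_predictor(xs, ys, n_trees=50, seed=0):
    """Train n_trees stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n examples with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / n_trees


# Toy data: low responses for small x, high responses for large x.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [10.0] * 4 + [50.0] * 4
predict = bagged_predictor(xs, ys)
print(predict(0), predict(7))
```

In practice a library implementation such as scikit-learn's `RandomForestRegressor` would be the usual choice; the paper does not state which implementation was used.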

Results

The P1′ position contains 5 non-natural amino acids, one of which has a sulfone side chain. Among the 1400 tested samples, there are 10 × 10 = 100 peptides with the sulfone residue at the P1′ position and 10 proteinogenic amino acids permuted at the P2′ and P3′ positions. Among the 4400 peptides in the full library, there are 20 × 20 = 400 peptides with the sulfone residue at the P1′ position and 20 natural amino acids permuted at the P2′ and P3′ positions. Because the physicochemical properties of the sulfone residue are unavailable, the total number of experimental samples that can be used is 1300, and the total number of sequences that the regression model can predict is 4000. In this section, we carry out three kinds of experiments. First, we validate our method on the 1300 known peptide motifs binding to seven distinct but highly homologous MMPs. Second, we apply our method to 4000 peptide sequences to predict their binding affinity values. Third, we identify peptides that preferentially bind to MMP-2 and MMP-7 over other MMPs.

Effectiveness of the regression model

To test the effectiveness of our method, we evaluate the 1300 peptide motifs binding to each of the seven MMPs with leave-one-out validation and two-fold cross-validation (to avoid overfitting), combined with a 1500-tree random forest regression model. The Pearson product-moment correlation coefficient (PCC) and the root mean squared error (RMSE) are used to evaluate performance:

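$$PCC = \frac{\sum_{i \in D}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i \in D}(y_i - \bar{y})^2}\,\sqrt{\sum_{i \in D}(\hat{y}_i - \bar{\hat{y}})^2}} \quad (5)$$
$$RMSE = \sqrt{\frac{1}{|D|}\sum_{i \in D}(y_i - \hat{y}_i)^2} \quad (6)$$

where D contains all relevant binding motifs, $\bar{y}$ is the average binding affinity (and $\bar{\hat{y}}$ the corresponding average of the predicted values), $y_i$ denotes the experimental binding affinity value of the i-th peptide sequence, and $\hat{y}_i$ denotes the predicted affinity value of the i-th peptide sequence. A perfectly accurate predictor gives PCC = 1 and RMSE = 0.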
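Both metrics are straightforward to compute. The sketch below uses the standard definitions of the Pearson correlation and the root mean squared error; the function names and the small example vectors are illustrative and are not data from the paper.

```python
from math import sqrt


def pcc(y_true, y_pred):
    """Pearson product-moment correlation between experimental and predicted values."""
    mt = sum(y_true) / len(y_true)
    mp = sum(y_pred) / len(y_pred)
    cov = sum((a - mt) * (b - mp) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mt) ** 2 for a in y_true)
    var_p = sum((b - mp) ** 2 for b in y_pred)
    return cov / sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean squared error between experimental and predicted values."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Made-up values for illustration only.
experimental = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(pcc(experimental, predicted), rmse(experimental, predicted))
```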

In leave-one-out validation, for each predicted peptide we use the other 1299 peptide motifs with experimental binding affinity values as training data, removing the predicted one. In two-fold cross-validation, we split the 1300 peptide motifs into two folds and use each fold in turn as the training set, with the other fold as the validation set. The validation results for identifying peptide motifs binding to MMPs are shown in Table 3. On all seven MMP isoforms, our method achieves strong PCC and RMSE values. The performance of two-fold cross-validation is slightly lower than that of leave-one-out validation, but it is still satisfactory and supports the effectiveness of our regression model.
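The two validation schemes differ only in how the sample indices are split. A minimal sketch, with hypothetical function names and an assumed random shuffle for the two-fold split (the text does not say how the folds were drawn):

```python
import random


def leave_one_out(n):
    """Yield (train_indices, test_indices) with exactly one held-out sample per round."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def two_fold(n, seed=0):
    """Split indices into two halves; each half serves once as training and once as validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # assumption: folds are drawn at random
    half = n // 2
    a, b = idx[:half], idx[half:]
    return [(a, b), (b, a)]


loo_rounds = list(leave_one_out(5))   # small n for display
folds = two_fold(1300)
print(loo_rounds[0], len(folds[0][0]), len(folds[0][1]))
```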

Effectiveness of the Sampling Criteria

With the sampling criterion, we select only the relevant samples for building the predictor. The average number of relevant samples for each peptide motif is 1069, so each prediction uses around 1069 samples as its training set. On average, around 230 samples are irrelevant and are excluded by the sampling criterion. To test the effectiveness of the sampling criterion, we evaluate the 1300 peptide motifs binding to the seven MMPs with a 1500-tree random forest regression model trained, in turn, with the relevant samples and with the irrelevant samples. The validation results in Table 4 show the effectiveness of our sampling criterion.

Comparison to Computational Methods

In this study, we use Random Forest as the regression model, which gives better results and costs less time than other techniques. The quantitative comparison with other techniques, namely a Neural Network with one hidden layer and 100 nets, Lasso Regression and Kernel Ridge Regression, is shown in Table 5.

On the MMP-2 isoform, Random Forest achieves overall PCC and RMSE values of 0.8212 and 17.7916; Lasso Regression has PCC and RMSE values of 0.5547 and 25.9458; Ridge Regression with a Gaussian kernel has PCC and RMSE values of 0.7240 and 21.5097; the Neural Network with one hidden layer has an RMSE value of 33.6099. For all seven MMPs, our method using Random Forest outperforms the other regression techniques.

Comparison to Experimental Methods

We produce a position-specific scoring histogram [27] from the top 50 binding-value motifs against each individual MMP isoform to reflect the specificity of each position, as shown in Figure 2. For each MMP protein, we select the binding peptides with the top 50 binding values predicted by our regression model. Then we analyze how frequently each amino acid appears at each position among the top 50 predicted peptides of the specific MMP. The x axis denotes the nominal positions of a binding peptide from P1′ to P3′. The y axis and the height of a letter denote its frequency at this position, indicating its contribution to the binding value at the position. In the 1300 samples, for peptides binding strongly to MMP-2, P1′ typically holds the conserved amino acids Leu and Hpe (an amino acid with a long Phe-like aromatic side chain), while Trp is typical at a downstream position; for peptides binding strongly to MMP-7, P1′ typically holds the conserved amino acid CPA3, which has a unique hydrophobic side chain. As Figure 2 shows, peptide motifs with Leu at the P1′ position consistently have high binding values with all seven MMPs, which confirms the pattern that high-potency inhibitors of the MMP family are highly homogeneous.
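The frequency analysis behind such a histogram amounts to ranking peptides by predicted value and counting residues per position among the top entries. A minimal sketch follows; the function name is hypothetical and the scores are made up for illustration, not predictions from the paper.

```python
from collections import Counter


def position_frequencies(scored_peptides, top_n):
    """Count residue occurrences at each position among the top_n highest-scoring peptides.

    scored_peptides is a list of (peptide, value) pairs, where each peptide is a
    string of single-letter residue codes.
    """
    top = sorted(scored_peptides, key=lambda item: item[1], reverse=True)[:top_n]
    n_pos = len(top[0][0])
    return [Counter(pep[i] for pep, _ in top) for i in range(n_pos)]


# Made-up scores for illustration only.
scored = [("LAW", 90.0), ("ZAW", 85.0), ("LFW", 70.0), ("DIF", 20.0)]
freqs = position_frequencies(scored, top_n=3)
print(freqs)
```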

To better visualize the detailed contributions of the different positions, the potencies of each of the P1′, P2′ and P3′ side chains in the inhibitor library are averaged and presented graphically in Figure 3. Each panel represents one position of an MMP isoform. The y axis denotes the average binding value of the peptides that have the specific amino acid type at this position. The x axis denotes the amino acid type at this position. We select the top 10 amino acids with the highest mean binding values at the P2′ and P3′ positions. The top 3 amino acids with the highest mean binding values at each position are listed in Table 6.
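The per-position averaging can be sketched as a group-by over the residue at one position. The function name is hypothetical and the scores are made up for illustration.

```python
from collections import defaultdict


def mean_by_residue(scored_peptides, position):
    """Average the binding values of all peptides sharing a residue at `position`."""
    totals = defaultdict(lambda: [0.0, 0])  # residue -> [sum of values, count]
    for pep, value in scored_peptides:
        totals[pep[position]][0] += value
        totals[pep[position]][1] += 1
    return {aa: s / n for aa, (s, n) in totals.items()}


# Made-up scores for illustration only.
scored_example = [("LAW", 90.0), ("ZAW", 85.0), ("LFW", 70.0), ("DIF", 20.0)]
means_first = mean_by_residue(scored_example, position=0)
best_first = max(means_first, key=means_first.get)
print(means_first, best_first)
```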

We compare our method with the experimental method of Yao [7], who also identified the amino acids with the highest averaged binding values, as shown in Table 6. At the P1′ position, they averaged all 1400 peptides to obtain 11 mean values, one for each amino acid type. However, at the P2′ and P3′ positions, they averaged only the binding values of the 10 kinds of amino acids that were permuted across the 11 kinds at P1′. They therefore obtained values with a relatively similar trend across the 10 kinds of amino acids within P2′ and P3′. We averaged all 24 kinds of amino acids of the 1400 samples in Figure 3. As shown in Figure 3 and Table 6, at the P1′ position our computational results are consistent with the previous experimental work on MMP-binding peptide motifs, which supports the reliability of our method. At the P2′ and P3′ positions, our mean values for the 10 kinds of amino acids (Ala, Leu, Phe, Trp, Glu, Lys, His, Gln, Ser, Tyr) are consistent with the experimental method, although some of them are omitted from Figure 3 because of their relatively low mean values. Our computational results also show that Gly is a conserved amino acid at a downstream position when Leu occupies the P1′ position.

Prediction on 4000 peptide sequences

In the predicted library of 4000 peptides, we produce a position-specific scoring histogram from the top 100 binding-value motifs against each individual MMP isoform to reflect the specificity of each position, as shown in Figure 4. For each MMP protein, we select the binding peptides with the top 100 binding values predicted by our regression model. Then we analyze how frequently each amino acid appears at each position among the top 100 predicted peptides of the specific MMP. In the 4000 samples, for peptides binding strongly to MMP-2, P1′ typically holds the conserved amino acids Leu and Hpe, which is consistent with the analysis of the 1300 samples; for peptides binding strongly to MMP-7, P1′ typically holds the conserved amino acid Leu. Our predictions for the 4000 sequences show that peptides with Hpe or Leu at P1′ consistently have high binding values with all seven MMPs. However, these peptides cannot be used as inhibitors, because they would also inhibit beneficial MMPs. In the next part, we therefore identify peptides with high selectivity.

Specificity of MMP-2 and MMP-7 binding peptide motifs

From the analysis of Figure 4, we can only identify which amino acid type at each position has the highest potency against each MMP isoform, and these are highly conserved. What is really useful is to identify peptides with not only high potency but also high selectivity, which bind strongly to specific MMPs, namely MMP-2 and MMP-7, while showing low binding values against the other MMPs. From the binding values of the 4000 peptide motifs predicted by our computational method, we therefore identify peptides that selectively bind MMP-2 and MMP-7, respectively.

 Table 3 

Validation on 1300 peptide motifs using random forest with 1500 trees.

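| MMP | Leave-one-out PCC | Leave-one-out RMSE | Two-fold CV PCC | Two-fold CV RMSE |
| --- | --- | --- | --- | --- |
| MMP-2 | 0.8212 | 17.7916 | 0.7836 | 19.3735 |
| MMP-3 | 0.7682 | 17.2379 | 0.7117 | 18.9166 |
| MMP-7 | 0.9181 | 9.3827 | 0.9053 | 10.0559 |
| MMP-8 | 0.8910 | 12.4845 | 0.8756 | 13.2831 |
| MMP-9 | 0.9124 | 12.0189 | 0.8893 | 13.4283 |
| MMP-13 | 0.8708 | 16.2411 | 0.8448 | 17.6808 |
| MMP-14 | 0.7247 | 16.7657 | 0.6910 | 17.5881 |
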
 Table 4 

Validation on 1300 peptide motifs using random forest with 1500 trees with Sampling Criteria.

| MMP | PCC (relevant samples) | RMSE (relevant samples) | RMSE (irrelevant samples) |
| --- | --- | --- | --- |
| MMP-2 | 0.8195 | 17.8608 | 35.1576 |
| MMP-3 | 0.7680 | 17.2457 | 29.8699 |
| MMP-7 | 0.9181 | 9.3803 | 29.5143 |
| MMP-8 | 0.8908 | 12.4952 | 33.2383 |
| MMP-9 | 0.9095 | 12.2094 | 37.6389 |
| MMP-13 | 0.8680 | 16.4052 | 44.0620 |
| MMP-14 | 0.7246 | 16.7671 | 25.9060 |

 Table 5 

Validation of binding values of inhibitors of MMP-2 with distinct computational methods.

| Method | PCC | RMSE |
| --- | --- | --- |
| Lasso Regression | 0.5547 | 25.9458 |
| Neural Network | - | 33.6099 |
| Ridge Regression with Gaussian Kernel | 0.7240 | 21.5097 |
| Random Forest | 0.8212 | 17.7916 |

 Table 6 

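Comparison with the experimental method: amino acids with the top average binding values at the P1′ position

| MMP | Yao | Ours |
| --- | --- | --- |
| MMP-2 | Z | Z |
| MMP-3 | Sf, U, Z | U, Z |
| MMP-7 | U | U |
| MMP-8 | Z, L | Z |
| MMP-9 | Z | Z |
| MMP-13 | Z | Z |
| MMP-14 | L | L |
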
 Figure 2 

Position-specific scoring histogram of the top 50 binding-value motifs among the 1300 samples against seven MMPs. For each MMP protein, we select the binding peptides with the top 50 predicted binding values in the 1300-member library. Each bar represents the frequency of each amino acid type at each position among the top 50 predicted binding peptides of the specific MMP. The x axis denotes the nominal positions of a binding peptide from P1′ to P3′. The y axis and the height of a letter denote its frequency at this position, indicating its contribution to the binding value at the position.

 Figure 3 

Averaged inhibition contributions across the permuted P1′, P2′ and P3′ positions. Each bar represents the averaged inhibition value of the relevant residue across the 1300-member library. The asterisk (*) highlights the residue contributing the highest inhibition average in each graph.


We filter for peptides with high selectivity for MMP-2, defined as a binding value with MMP-2 higher than 60 and binding values with the other MMPs lower than 20. We obtain 5 inhibitor candidates, Hser-Leu-His, Asp-Ile-Phe, Asp-Ile-Tyr, Asp-Ile-Lys and Hser-Gly-Phe, as shown in Table 7. Detailed binding values are shown in Table 8. We also filter for peptides with high selectivity for MMP-7, defined as a binding value with MMP-7 higher than 60 and binding values with the other MMPs lower than 30. We obtain 6 inhibitor candidates, Chg-Ile-Ile, Chg-Ile-Leu, Chg-Ile-Glu, Chg-Ile-Met, Chg-Val-Ile and Chg-Val-Leu, as shown in Table 7. Detailed binding values are shown in Table 9.
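The MMP-2 filter described above can be written as a simple threshold check. In the sketch below, the two real rows are the predicted values for Asp-Ile-Phe and Hser-Gly-Phe from Table 8; the third row and all names are hypothetical and added only to show a peptide that the filter rejects.

```python
MMPS = ["MMP-2", "MMP-3", "MMP-7", "MMP-8", "MMP-9", "MMP-13", "MMP-14"]


def is_selective(values, target, high=60.0, low=20.0):
    """True if the target MMP value exceeds `high` and every other MMP value is below `low`."""
    return values[target] > high and all(v < low for m, v in values.items() if m != target)


predicted_values = {
    # Rows taken from Table 8.
    "Asp-Ile-Phe": dict(zip(MMPS, [65.9515, 16.5326, 18.1990, 0.39054, 15.2081, 1.7737, 11.3050])),
    "Hser-Gly-Phe": dict(zip(MMPS, [64.4088, 17.8065, 16.4288, 2.24950, 13.1266, 5.9196, 13.8823])),
    # Hypothetical potent but non-selective peptide, not from the paper.
    "hypothetical-nonselective": dict(zip(MMPS, [70.0, 55.0, 40.0, 35.0, 60.0, 45.0, 30.0])),
}

mmp2_hits = [name for name, vals in predicted_values.items() if is_selective(vals, "MMP-2")]
print(mmp2_hits)
```

The same function covers the MMP-7 filter by passing `target="MMP-7"` and `low=30.0`.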

Among the inhibitors we find with high selectivity for MMP-2, only one peptide, Hser-Leu-His, labeled with an asterisk (*), was identified by the experimental method. Our computational method finds 4 novel peptides with high selectivity toward MMP-2, and 6 novel peptides with high selectivity toward MMP-7, a known target in pancreatic cancer and intestinal adenoma. As Table 7 shows, our inhibitor candidates for MMP-2 confirm the conclusion of the experimental method that P1′ side chains containing Asp (D) and Hser (J) give inhibitors with strong selectivity for MMP-2 [7]. For peptides with high selectivity toward MMP-7, the Chg amino acid at the P1′ position shows strong selectivity. At the P2′ position, Val and Ile, which both have hydrophobic alkyl side chains, also show a preference for binding to MMP-7. Our findings, although by no means exhaustive, facilitate the identification of inhibitors with good potency as well as desirable selectivity, providing insights into candidate inhibitor drugs.

 Table 7 

Inhibitors predicted by computational method with high potency and selectivity with MMP-2 and MMP-7

No.   MMP-2            MMP-7
1     HSER-LEU-HIS *   CHG-ILE-ILE
2     ASP-ILE-PHE      CHG-ILE-LEU
3     ASP-ILE-TYR      CHG-ILE-GLU
4     ASP-ILE-LYS      CHG-ILE-MET
5     HSER-GLY-PHE     CHG-VAL-ILE
6                      CHG-VAL-LEU

 Table 8 

The binding values against MMPs of inhibitors with high selectivity of MMP-2 predicted by computational method

Peptide        MMP-2     MMP-3     MMP-7     MMP-8     MMP-9     MMP-13    MMP-14
Hser-Leu-His   60.5067   19.7844   16.7174   1.22158   9.5357    1.6049    8.3165
Asp-Ile-Phe    65.9515   16.5326   18.1990   0.39054   15.2081   1.7737    11.3050
Asp-Ile-Tyr    66.5144   18.0813   17.9488   0.10264   15.7835   0.9748    8.8563
Asp-Ile-Lys    61.4983   14.9883   17.0719   0.51577   10.3562   3.1173    14.9769
Hser-Gly-Phe   64.4088   17.8065   16.4288   2.24950   13.1266   5.9196    13.8823

 Table 9 

The binding values against MMPs of inhibitors with high selectivity of MMP-7 predicted by computational method

Peptide       MMP-2     MMP-3     MMP-7     MMP-8     MMP-9     MMP-13    MMP-14
Chg-Ile-Ile   24.7546   21.6090   64.3935   8.0101    24.6581   21.9287   24.5430
Chg-Ile-Leu   19.2488   18.2610   62.8014   6.1657    24.1854   24.0525   22.9100
Chg-Ile-Glu   63.9129   27.2415   60.7542   18.0563   17.8181   2.32008   25.0958
Chg-Ile-Met   48.4514   29.6585   63.2750   15.3248   21.0132   11.3537   22.0503
Chg-Val-Ile   22.8662   24.8929   61.1206   7.2634    19.4140   19.4965   21.4045
Chg-Val-Leu   18.8152   23.2073   60.0980   5.5433    18.4461   21.3667   19.1914

 Figure 4 

Position-specific scoring histogram of the top 100 motifs by predicted binding value, out of 4000 samples, against seven MMPs. For each MMP, we select the 100 peptides with the highest predicted binding values from the 4000-member library. Each bar represents the frequency with which each amino acid type appears at each position among the top 100 predicted binding peptides of that MMP. The x axis denotes the nominal positions of a binding peptide, from the first to the last. The y axis and the height of a letter denote the frequency of appearance of that residue at the position, indicating its contribution to the binding value at that position.
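The counting behind such a position-specific histogram can be sketched as follows. This is an illustrative sketch only: the helper names `top_n` and `position_frequencies` are hypothetical, and instead of the top 100 of 4000 peptides it ranks the six MMP-7 candidates of Table 9 and keeps the top four.

```python
from collections import Counter

# Predicted MMP-7 binding values copied from Table 9.
MMP7_PREDICTED = {
    "Chg-Ile-Ile": 64.3935,
    "Chg-Ile-Leu": 62.8014,
    "Chg-Ile-Glu": 60.7542,
    "Chg-Ile-Met": 63.2750,
    "Chg-Val-Ile": 61.1206,
    "Chg-Val-Leu": 60.0980,
}


def top_n(predicted, n):
    """Hypothetical helper: the n peptides with the highest predicted value."""
    ranked = sorted(predicted.items(), key=lambda item: item[1], reverse=True)
    return [peptide for peptide, _ in ranked[:n]]


def position_frequencies(peptides):
    """For each position, map every residue to the fraction of peptides
    that carry it at that position."""
    split = [peptide.split("-") for peptide in peptides]
    length = len(split[0])
    freqs = []
    for pos in range(length):
        counts = Counter(residues[pos] for residues in split)
        freqs.append({res: c / len(split) for res, c in counts.items()})
    return freqs


top4 = top_n(MMP7_PREDICTED, 4)
freqs = position_frequencies(top4)
for pos, table in enumerate(freqs, start=1):
    print(pos, table)
```

In this small example every top-ranked peptide carries Chg at the first position, which is the kind of enrichment a tall letter in the histogram would indicate.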

Future Work

From the binding values predicted by our computational method, we identify 4 novel peptides with high selectivity toward MMP-2: Asp-Ile-Phe, Asp-Ile-Tyr, Asp-Ile-Lys and Hser-Gly-Phe. We also identify 6 novel peptides with high selectivity toward MMP-7: Chg-Ile-Ile, Chg-Ile-Leu, Chg-Ile-Glu, Chg-Ile-Met, Chg-Val-Ile and Chg-Val-Leu. Future work will experimentally measure the binding values of these 10 inhibitors to verify their potency and selectivity.

Abbreviations

MMPs: matrix metalloproteases; PCC: Pearson product-moment correlation coefficient; RMSE: root mean squared error; ZBG: zinc binding groups; MW: molecular weight; SV: sum of atomic van der Waals volumes; SE: Sanderson electronegativity; P: polarizability; HB: number of hydrogen bonds; CSI: connectivity index; ECC: eccentricity; SPH: sphericity; HY: hydrophilic factor; ACC: auto-cross covariance.

Acknowledgements

This research and this article's publication expenditures are supported by a grant from the National Science Foundation of China (NSFC 61772362) and the Tianjin Research Program of Application Foundation and Advanced Technology (16JCQNJC00200).

Author Contributions

All of the authors listed made substantial contributions to the manuscript and qualify for authorship, and no authors have been omitted. Conception and design: Jian Song, Fei Guo; analysis and interpretation of data: Jian Song, Fei Guo; writing and revision of the manuscript: Jian Song, Fei Guo.

Competing Interests

The authors have declared that no competing interest exists.

References

1. Overall C M, Kleifeld O. Validating matrix metalloproteinases as drug targets and anti-targets for cancer therapy. Nature Reviews Cancer. 2006;6(3):227-239

2. Chen W, Ding H, Feng P. et al. iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget. 2016;7(13):16895

3. Overall C M, Kleifeld O. Towards third generation matrix metalloproteinase inhibitors for cancer therapy. British journal of cancer. 2006;94(7):941-946

4. Overall C M, Kleifeld O. Towards third generation matrix metalloproteinase inhibitors for cancer therapy. British journal of cancer. 2006;94(7):941-946

5. Cuniasse P, Devel L, Makaritis A. et al. Future challenges facing the development of specific active-site-directed synthetic inhibitors of MMPs. Biochimie. 2005;87(3):393-402

6. Rao B G. Recent developments in the design of specific matrix metalloproteinase inhibitors aided by structural and computational studies. Current pharmaceutical design. 2005;11(3):295-322

7. Uttamchandani M, Wang J, Li J. et al. Inhibitor fingerprinting of matrix metalloproteases using a combinatorial peptide hydroxamate library. Journal of the American Chemical Society. 2007;129(25):7848-7858

8. Gfeller D, Michielin O, Zoete V. SwissSidechain: a molecular and structural database of non-natural sidechains. Nucleic acids research. 2012;41(D1):D327-D332

9. Guo Y, Yu L, Wen Z. et al. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic acids research. 2008;36(9):3025-3030

10. Mathura V S, Kolippakkam D. APDbase: Amino acid physicochemical properties database. Bioinformation. 2005;1(1):2

11. Wagner I, Musso H. New naturally occurring amino acids. Angewandte Chemie International Edition. 1983;22(11):816-828

12. Feng P, Ding H, Lin H. et al. AOD: the antioxidant protein database. Scientific reports. 2017;7(1):7449

13. Li Z, Tang J, Guo F. Identification of 14-3-3 proteins phosphopeptide-binding specificity using an affinity-based computational approach. PloS one. 2016;11(2):e0147467

14. Li Z, Tang J, Guo F. Learning from real imbalanced data of 14-3-3 proteins binding specificity. Neurocomputing. 2016;217:83-91

15. Wei L, Tang J, Zou Q. SkipCPP-Pred: an improved and promising sequence-based predictor for predicting cell-penetrating peptides. BMC genomics. 2017;18(7):1

16. Tang H, Zou P, Zhang C. et al. Identification of apolipoprotein using feature selection technique. Scientific reports. 2016:6

17. Lai H Y, Chen X X, Chen W. et al. Sequence-based predictive modeling to identify cancerlectins. Oncotarget. 2017;8(17):28169

18. Zeng X, Liao Y, Liu Y. et al. Prediction and validation of disease genes using HeteSim Scores. IEEE/ACM transactions on computational biology and bioinformatics. 2017;14(3):687-695

19. Tetko I V, Gasteiger J, Todeschini R. et al. Virtual computational chemistry laboratory-design and description. Journal of computer-aided molecular design. 2005;19(6):453-463

20. Cheng G, Li G, Xue H. et al. Zwitterionic carboxybetaine polymer surfaces and their resistance to long-term biofilm formation. Biomaterials. 2009;30(28):5234-5240

21. Guo Y, Yu L, Wen Z. et al. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic acids research. 2008;36(9):3025-3030

22. You Z H, Lei Y K, Zhu L. et al. Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC bioinformatics. 2013;14(8):S10

23. Su W, Liao X, Lu Y. et al. Multiple Sequence Alignment Based on a Suffix Tree and Center-Star Strategy: A Linear Method for Multiple Nucleotide Sequence Alignment on Spark Parallel Framework. Journal of Computational Biology. 2017

24. Aslam J A, Popa R A, Rivest R L. On Estimating the Size and Confidence of a Statistical Audit. EVT. 2007;7:8

25. Wei L, Xing P W, Su R. et al. CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides and their uptake efficiency. Journal of Proteome Research. 2017;16(5):2044-2053

26. Sahu A, Runger G, Apley D. Image denoising with a multi-phase kernel principal component approach and an ensemble version. IEEE Applied Imagery Pattern Recognition Workshop. 2011:1-7

27. Crooks G E, Hon G, Chandonia J M. et al. WebLogo: a sequence logo generator. Genome research. 2004;14(6):1188-1190

Author contact

Corresponding author: Fei Guo: fguoedu.cn


Received 2017-12-27
Accepted 2018-2-28
Published 2018-5-22