Free Software for pKa Calculation
Do you have a large curated set of experimentally measured pKa values? Our development team is happy to collaborate with you to expand the applicability domain of our algorithms. Software installations with a graphical user interface are available for individual computers.
Full physicochemical, ADME, and toxicity calculator modules are available with training capabilities, including the PhysChem Profiler bundle. Tens of thousands of compounds can be screened with minimal user intervention; the software is compatible with Microsoft Windows and Linux operating systems. It can plug in to corporate intranets or workflow tools such as Pipeline Pilot, and KNIME integration components are available. This application note discusses the importance of using the training capabilities of predictive machine learning models to improve accuracy.
The acid dissociation constant (Ka), also known as the acid-ionization constant, is a quantitative measure of the strength of an acid in solution. A strong acid dissociates completely in water (the equilibrium favors the right-hand side of the dissociation equation), while a weak acid does not.
Ka is the equilibrium constant describing the dissociation of a molecule, expressed as a ratio of the concentrations of the species present, and it varies by orders of magnitude across chemicals. For classification, the selected descriptors were applied to an SVM classifier as well as to a k-nearest neighbors (kNN) approach, based on the majority vote of the nearest neighbors, to fit a classification model.
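The dissociation equilibrium and the constants above can be written compactly in standard notation (a textbook summary, not an equation reproduced from this work):

```latex
\mathrm{HA} \;\rightleftharpoons\; \mathrm{H}^{+} + \mathrm{A}^{-},
\qquad
K_a \;=\; \frac{[\mathrm{H}^{+}]\,[\mathrm{A}^{-}]}{[\mathrm{HA}]},
\qquad
\mathrm{p}K_a \;=\; -\log_{10} K_a .
```

The logarithmic pKa scale is what makes the "orders of magnitude" variation in Ka tractable for modeling.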
The global AD is a Boolean index based on the leverage approach for the whole training set, while the local AD is a continuous index ranging from zero to one based on the most similar chemical structures in the training set [46]. Since binary fingerprints were employed to build the predictive models, the Jaccard-Tanimoto dissimilarity index was used as the distance metric to assess the AD and accuracy estimates.
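As a minimal illustration (not the paper's actual implementation), the Jaccard-Tanimoto dissimilarity between two binary fingerprints is one minus the ratio of shared on-bits to the union of on-bits:

```python
def tanimoto_dissimilarity(fp_a, fp_b):
    """Jaccard-Tanimoto dissimilarity between two binary fingerprints
    (equal-length sequences of 0/1). Returns a value in [0, 1];
    0 means the on-bit patterns are identical."""
    on_a = {i for i, bit in enumerate(fp_a) if bit}
    on_b = {i for i, bit in enumerate(fp_b) if bit}
    union = on_a | on_b
    if not union:  # two all-zero fingerprints: define the distance as 0
        return 0.0
    return 1.0 - len(on_a & on_b) / len(union)
```

A distance of this form is what allows the local AD index to rank training-set neighbors of a query structure.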
The continuous molecular descriptors, as well as the binary fingerprints and fragment counts, were generated with version 2 of the descriptor software, and the LibSVM3 library was used for the SVM models. Gradient boosting is a machine learning technique for regression and classification problems. It produces a prediction model that is an ensemble of weak prediction models, typically decision trees. Gradient boosting builds the weak models in a stage-wise fashion and generalizes them by allowing optimization of an arbitrary differentiable loss function.
XGB is an extension of gradient boosting that prevents overfitting by using an improved cost function [48, 49, 50]. Importantly, the caret implementation performs model tuning and calculates variable importance [52, 53]. R version 3 was used. Root-mean-squared error (RMSE) was optimized using the training data with fivefold cross validation repeated five times. The acidic and basic data sets were modeled separately.
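The stage-wise logic of gradient boosting described above can be sketched with a toy, from-scratch version using one-dimensional decision stumps and squared loss. This is illustrative only; the models in this work used XGB via caret, not this code:

```python
def fit_stump(x, residuals):
    """Find the threshold split on a 1-D feature that minimizes squared
    error when predicting the mean residual on each side of the split."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    return best[1:]  # (threshold, left_value, right_value)

def gradient_boost(x, y, n_stages=20, lr=0.3):
    """Stage-wise boosting for squared loss: each stump is fit to the
    current residuals (the negative gradient) and added with shrinkage lr."""
    f0 = sum(y) / len(y)           # stage 0: constant model (mean of y)
    preds = [f0] * len(y)
    stumps = []
    for _ in range(n_stages):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        t, ml, mr = fit_stump(x, residuals)
        stumps.append((t, ml, mr))
        preds = [p + lr * (ml if xi <= t else mr) for xi, p in zip(x, preds)]
    return f0, stumps

def predict(model, xi, lr=0.3):
    """Sum the constant stage and the shrunken stump contributions
    (use the same lr that was passed to gradient_boost)."""
    f0, stumps = model
    return f0 + sum(lr * (ml if xi <= t else mr) for t, ml, mr in stumps)
```

XGB adds a regularized cost function and other refinements on top of this basic stage-wise scheme.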
Each of the three data sets (Options 1-3) was examined, and performance was assessed for the testing data sets using RMSE and the coefficient of determination (R2). In addition, three feature-reduction techniques were examined to assess their impact on model performance: (1) deleting feature columns of all zeros and all ones, (2) as in (1) but with highly correlated features also removed, and (3) as in (2) but with low-variance features also removed.
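The three reduction steps can be illustrated for binary features as follows; the variance threshold and the use of duplicate columns as a stand-in for "highly correlated" are assumptions made for illustration, not the paper's exact procedure:

```python
def reduce_features(rows, var_threshold=0.01):
    """Column-wise reduction of a binary feature matrix (list of rows).
    Step 1: drop constant columns (all zeros or all ones).
    Step 2: drop columns identical to an earlier kept column (a simple
            stand-in for 'highly correlated' on binary data).
    Step 3: drop low-variance columns: p*(1-p) below var_threshold,
            where p is the fraction of ones in the column.
    Returns the kept column indices and the reduced matrix."""
    n = len(rows)
    cols = list(zip(*rows))
    keep, seen = [], set()
    for j, col in enumerate(cols):
        p = sum(col) / n
        if p in (0.0, 1.0):               # step 1: constant column
            continue
        if col in seen:                    # step 2: duplicate column
            continue
        if p * (1 - p) < var_threshold:    # step 3: low variance
            continue
        seen.add(col)
        keep.append(j)
    return keep, [[row[j] for j in keep] for row in rows]
```

Keeping the surviving column indices lets the identical reduction be replayed on the test set.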
The RData file can be loaded into the R workspace to quickly access all models and variables. The RData environment and performance metrics are available in [54]. DNN learning has been used extensively in computational biology [55, 56, 57] and computational chemistry [58, 59, 60].
A DNN learning model consists of artificial neural networks with multiple layers between the input and the output. One significant advantage of DNN learning is that it maximizes model accuracy by mapping features through a series of nonlinear functions stitched together in a combinatorial fashion. The DNN learning models were built using the open-source deep learning library Keras (version 2).
The open-source Scikit-learn Python library was used for feature vector processing, fivefold cross validation, and final metric computations [63], under Python 3. Fivefold cross validation was used to construct a model from the training data by optimizing RMSE. A variety of parameters were examined and optimized, including the algorithm, weight initialization, hidden-layer activation function, L2 regularization, dropout regularization, number of hidden layers, number of nodes in the hidden layers, and the learning rate.
All feature vectors with continuous variables were scaled to their minimum and maximum values prior to training. The final tuned model had three hidden layers, each followed by a batch normalization layer and a dropout layer. The overall architecture is shown in the figure "DNN learning model for pKa prediction": the model comprised a four-layer neural network with one input layer (K features), three hidden layers, and one output layer (the pKa value).
Each hidden layer was followed by a batch normalization layer and a dropout layer (not shown). Connections existed between neurons across layers, but not within a layer. To further validate the three models and assess their predictivity, a large external data set that was not used during the modeling process would be ideal. However, no large, well-annotated pKa datasets were found in the literature. Thus, in lieu of experimental data, the possibility of benchmarking the models using predictions that could be verified to be consistent with DataWarrior was tested.
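The feature scaling mentioned above can be sketched as a simple min-max transform. This is an illustrative version; the exact scaling used for the DNN models may differ:

```python
def min_max_scale(rows):
    """Scale each continuous feature column to [0, 1] using the column's
    training minimum and maximum. Returns the scaled rows plus the
    (min, max) pairs so the same transform can be applied to test data.
    Constant columns are mapped to 0.0 to avoid division by zero."""
    cols = list(zip(*rows))
    bounds = [(min(c), max(c)) for c in cols]
    scaled = [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]
    return scaled, bounds
```

Reusing the training-set bounds on new data is what keeps the train and test feature spaces consistent.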
For ChemAxon, the strongest acidic and basic pKa values were considered. This tested the hypothesis that predictions generated by the two commercial tools were concordant enough, either separately or in combination with the experimental DataWarrior data set, to be used as benchmarks for the three models. The concordance metrics were the number of chemicals commonly predicted to have acidic and basic pKas, together with the statistical parameters R2 (coefficient of determination), r2 (squared coefficient of correlation), and RMSE.
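These concordance statistics can be sketched in a few lines. The helper below is illustrative, not the authors' code; the `window` parameter reflects the 2 pKa unit cutoff used later in the analysis:

```python
def concordance(pred_a, pred_b, window=2.0):
    """For two aligned prediction lists, return RMSE, the squared Pearson
    correlation (r^2), and the fraction of pairs agreeing within
    `window` pKa units."""
    n = len(pred_a)
    rmse = (sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / n) ** 0.5
    ma, mb = sum(pred_a) / n, sum(pred_b) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(pred_a, pred_b))
    va = sum((a - ma) ** 2 for a in pred_a)
    vb = sum((b - mb) ** 2 for b in pred_b)
    r2 = cov * cov / (va * vb) if va and vb else float("nan")
    within = sum(abs(a - b) <= window for a, b in zip(pred_a, pred_b)) / n
    return rmse, r2, within
```

Note that a high r2 does not imply a low RMSE: two predictors can correlate well yet differ by a systematic offset, which is exactly why the within-window fraction is reported separately.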
This concordance analysis used data Option 3, which includes amphoteric chemicals, mean pKa values for replicates, and the strongest acidic pKa (smallest value) or strongest basic pKa (greatest value). All predictions in this analysis were based on QSAR-ready structures generated using the previously mentioned structure standardization procedure.
The datasets described above from Options 1-3 were modeled using the SVM algorithm, and the results are shown in Table 1. The acidic and basic datasets were modeled separately using continuous descriptors, binary fingerprints, fragment counts, and combined binary fingerprints plus fragment counts.
The acidic dataset from Option 1 with fingerprints and fragment counts showed the best performance on the test set. In general, the basic pKa models performed better than the acidic pKa models across the three data options. Since pKa value prediction must be combined with a decision algorithm that chooses whether to apply the acidic model, the basic model, or both, the classification modeling described above was used.
First, the GA identified 15 continuous descriptors relevant to differentiating acidic, basic, and amphoteric chemicals (Table 2). Then, these descriptors were used to calibrate a three-class kNN categorical model. To challenge the kNN model based on the 15 GA-selected continuous descriptors, its performance was compared to that of SVM models based on the same descriptors as well as on fingerprints and fragment counts. The results, summarized in Table 3, confirmed that the kNN model based on the best 15 descriptors is more robust and stable than the other models.
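A minimal majority-vote kNN classifier of the kind described (three classes, continuous descriptor vectors) might look like the sketch below; the Euclidean distance metric and the value of k are illustrative assumptions, not the calibrated model's settings:

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Three-class kNN: label the query with the majority class among its
    k nearest training points (Euclidean distance on descriptor vectors)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)) ** 0.5, label)
        for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

In the OPERA workflow this classification step decides which regression model (acidic, basic, or both) is applied downstream.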
The kNN classification model based on the 15 GA-selected descriptors is used to select the appropriate SVM model, which is then applied to predict the pKa values. The OPERA pKa predictor is also equipped with an ionization checker based on the hydrogen donor and acceptor sites, such that pKa predictions are only made for ionizable chemicals.
Three feature-reduction techniques were applied to the binary fingerprint and fragment count descriptors, as described above. Model performance and variable importance for all feature sets are available in Additional file 2. The performance of the five best models for the acidic and basic data sets is summarized in Table 4.
The best models for the acidic and basic data sets had equivalent performance, with similar RMSEs. In addition to modeling all eight binary fingerprints separately, another data set was created that combined the eight binary fingerprints, and the best performance was obtained with these combined fingerprints. This is not surprising, because the combined fingerprint data set allows the most informative features of any binary fingerprint to be used in the model. Variable importance plots and observed vs. predicted plots were also generated for these models.
The R workspace environment was saved for all models, so the code does not have to be executed to examine them; the user can simply load the R workspace into the current session. The results in Table 5 show that the model for chemicals with a single acidic pKa had the best performance, followed by the model for chemicals with a single basic pKa, and finally by the model for chemicals with both a single acidic and a single basic pKa.
Performance was measured using the RMSE for the test data. Models using data Options 1 and 2 outperformed models using data Option 3. In all cases, models constructed using a combination of features outperformed models using a single fingerprint set. It is not clear why the DNN model for chemicals with an acidic pKa performed so well, as DNNs are notoriously difficult to interpret [64]. While DNNs have shown remarkable performance in many areas, in many cases they remain a black box [65].
For example, even our relatively small data set yielded a large number of trainable parameters, which illustrates the complexity of many DNN models. One important difference among the models is that the SVM models are coupled with a categorical model that indicates whether a molecule has an acidic pKa, a basic pKa, or both (amphoteric).
This leads to automatic selection of the model to use (acidic, basic, or both), for ionizable chemicals only, by the OPERA models. The entire DataWarrior list (Option 3) was used as input to the two commercial tools to predict whether a chemical would have an acidic or basic pKa and to predict numeric pKa values.
These tools can also provide multiple acidic and basic pKa values for a single chemical. The predictions of both tools are provided in Additional file 4. The goal here was not to assess the predictive performance of the commercial tools.
Table 7 summarizes the total number of chemicals predicted to have acidic or basic pKas by the two commercial tools for the DataWarrior chemicals (Option 3). As shown in Table 7, the commercial tools provided pKa values for the overwhelming majority of the DataWarrior chemicals; only about 3% were left without predictions. These numbers are substantially higher than the number of acidic and basic pKa values available from DataWarrior. The summary data presented in Table 7 suggest that the two commercial tools employ different algorithms to determine ionization sites and to classify the pKas of the chemicals as acidic or basic.
However, the two tools also show a high number of chemicals predicted in both the acidic and basic categories (third row of Table 7). The results of this analysis are shown in Table 8 and Fig. 5.
However, as mentioned above, it is important to note that the two commercial tools predict a higher number of amphoteric chemicals than is indicated by the DataWarrior experimental data. Figure 5 plots the pKa predictions of the two commercial tools against the DataWarrior acidic and basic pKa data sets for the chemicals in common (Table 8). The concordance statistics for these chemicals are also provided in the figure insets as R2, r2, and RMSE.
The data show moderate r2 correlations between the tools. However, Fig. 5 also reveals clear divergence for some chemicals; the dotted lines in the figure delimit an error window of 2 pKa units, within which most predictions fall. This is confirmed in Table 8, which also shows that the two commercial tools are highly concordant with DataWarrior in terms of the number of predictions within a 2 pKa unit error. This means that, for the most part, the two predictors are reasonably concordant with each other, as well as with DataWarrior, based on the 2 pKa unit cutoff.
Thus, it seems that the differences between the two programs are multifaceted, with potential sources of variation for both commercial tools and DataWarrior including the prediction algorithms, data sources, and curation processes. This holds for both acidic and basic pKas. Figure 7 shows good concordance between the averaged predictions and the acidic and basic pKa values of DataWarrior. The DataWarrior data set included some of these chemicals, which were removed before further analysis.
The total number of chemicals predicted by the two commercial tools, and the overlap between them, are also summarized in tabular form; all predictions for this dataset are provided in Additional file 5. The divergence was greatest for chemicals with a basic pKa. Neither the XGB nor the DNN model predicts whether a chemical will have an acidic or basic pKa, as shown in Table 11, so all chemicals were predicted using both the acidic and basic models.
Figure 9 shows reasonable concordance between the three models and the two benchmark datasets, and the concordance with the basic benchmark data set was higher than with the acidic dataset. For the benchmark datasets, which include only predictions within 2 pKa units of each other, the basic dataset again showed better concordance with the OPERA, XGB, and DNN models.
As shown in Tables 7 and 8, the number of overlapping predictions between the two tools was higher than the number of pKa values in DataWarrior, although not all of the DataWarrior acidic and basic values were predicted as such by the two tools. As expected, concordance for predictions outside the AD was much lower than for predictions inside the AD; excluding the predictions outside the AD therefore improved the model statistics, since predictions within the AD can be considered more accurate than those outside it.
The other reason for the lower concordance between the models developed in this work and the benchmark dataset is the high number of discordant predictions at both extremes of the benchmark acidic pKa range.
The pKa range where the two tools are most concordant is [0, 14], which is also the range covering most of the DataWarrior acidic pKa values.
Thus, the benchmark acidic dataset can be restricted to the range of DataWarrior acidic pKa values ([0, 14]) that was used to train the three models developed in this work. Excluding the extreme acidic pKa values reduced the size of the acidic benchmark dataset; the basic benchmark dataset was reduced in the same way.
The concordance statistics between the three models and the reduced benchmark datasets are summarized in a further table. As expected, excluding the extreme values, which are the source of divergence between the commercial tools and are absent from DataWarrior, increased the overall concordance between the benchmark datasets and the three models.
This increase is most evident for the acidic dataset after removing the extreme pKa values, while only 42 pKa values were removed from the basic dataset. This also explains why the chemicals outside of the AD had lower concordance with the benchmark dataset. Removing the extreme values from the acidic benchmark dataset also decreased the differences in RMSE between the three models and the benchmark dataset.
This benchmark analysis and comparison revealed many differences among all models with respect to the predictions of the pKa values and how chemicals are predicted to have an acidic or basic pKa.
Differences were noted among the models developed in this work as well as between the commercial tools, and this applied to both analyses: the one based on DataWarrior and the one based on the benchmark dataset.
Thus, while OPERA can be applied directly to large numbers of chemicals in batch mode, first identifying the ionizable chemicals and then predicting their relevant acidic and basic pKas, the DNN and XGB models give users the flexibility to select ionizable chemicals manually, applying expert judgment when dealing with a limited number of chemicals, or to plug in external ionization algorithms.
Since the three models resulting from this work are QSAR models trained on a dataset containing only the strongest acidic and basic pKas, they do not provide pKas for all ionization sites of multiprotic compounds. All OPERA predictions are provided with AD and accuracy estimates, as well as experimental and predicted values for the nearest neighboring chemicals, as shown in the EPA Dashboard prediction reports and explained in Mansouri et al.