## Thursday, October 20, 2016

### Prediction of amine pKa values of drug-like molecules using semiempirical QM methods - take 2

I screwed up some of the calculations describes in this post so I am starting fresh.

In an earlier study we showed that pKa values could be computed fairly accurately using PM6-based methods using isodesmic reactions and appropriate reference compounds.  Amines were especially challenging.  The study used small molecules.  The next question, which I address here, is how this approach performs for actual drug molecules containing amine functional groups.  These are preliminary results and may contain errors.

The Molecules
I tool about 50 drug-like molecules with experimentally measured amine pKa values from Table 3 in the paper by Eckert and Klamt.  I moved some of the smaller molecules such as 2-methylbenzylamine, since they would differ very little from the corresponding reference molecules. The reference molecules are chosen to match the chemical environment of the nitrogen within a 2 bond radius as much as practically possible and I try match the ring-size if the nitrogen sits in a ring. I end up with 35 reference molecules.  I compute most of the reference pKa values using the ACE JChem pKa predictor.

Many of the molecules contain more than one ionizable group. Only the pKa values of the amine indicated in Eckert and Klamt's Table 3 are computed and the protonation states are prepared according standard pKa values.  For example, for phenylalanine the carboxyl group is deprotonated because the "standard" pKa values of a carboxyl group (e.g. in acetic acid) is lower than the standard pKa values of a primary amine (e.g. ethylamine). Notice that the cyanoguanidine group in cimetidine has a pKa value of about 0 and is therefore deprotonated when the imidazole group titrates.  Eckert and Klamt characterised the histamine pKa value of 9.7 as an amidine pKa and the thenyldiamine pKa as a pyridine pKa. This is corrected to a primary amine and tertiary amine, respectively.  Also, the experimental pKa values of morphine and niacin are changed to 8.2 and 4.2, respectively, while the remaining experimental pKa values are taken from Eckert and Klamt.

The Methods
I test the following methods: PM6-DH+, PM6, PM3 and AM1 with the COSMO solvation method using MOPAC, and PM3, AM1, and DFTB3 with the SMD solvation method in GAMESS.  Jimmy's getting quite close to finishing the PM6 implementation for heavy elements in GAMESS, but he still needs to interface it with PCM, so PM6/SMD calculations are not yet possible.  I use RDKit to generate 20 starting geometries for each protonation state and I optimise with the solvation method turned on.  I don't include RRHO free energy contributions and I use only the lowest free energy structure for the pKa calculations.

The Results
The results are shown in the figure above (based on "Table 1 soln").  The negative outlier seen for the COSMO-based method is cefradoxil and is due to proton transfer in the zwitterionic protonation state. Cefadroxil is also the negative outlier for DFTB3/SMD although the proton doesn't transfer. Proton transfer in zwitterions is also a common problem for DFT/continuum calculations and is due to deficiencies in the continuum solvent method, not the electronic structure method.  The good performance observed for PM3/SMD is thus due to fortuitous cancellation of error.  For the three other zwitterions among the molecules, niacin, phenylalanine, and tryptophan, no proton transfer is observed and the error is relatively small.

The AM1- and PM3-based methods perform best with RMSE values of 1.4-1.6 ± 0.3-0.4, that are statistically identical.* The null model has an RMSE of 1.8 ± 0.4 which, given the error bars, are statistically no worse than the AM1- and PM3-based method.

If the cefadroxil outlier is removed, the RMSE values for PM3/COSMO and AM1/COSMO drop to 1.0 ± 0.2 and 1.1 ± 0.3, while PM3/SMD and AM1/SMD are remain at 1.5 ± 0.4 and 1.6 ± 0.4.  So for this subset, the COSMO-based predictions can be said to outperform the SMD-based predictions, as well as the null model.

One of the main uses of pKa values is the prediction of the correct protonation state at physiological pH (7.4), i.e. is the predicted pKa above or below 7.4?  Here PM3/COSMO is performs best by getting it right 94% of the time, compared to 91%, 77%, and 91% for AM1/COSMO, PM3/SMD, and the null model.

Summary and Outlook
Overall the best method for pKa prediction of drug-like molecules is either PM3/SMD, AM1/SMD, or the null model with RMSE values of 1.5 ± 0.4, 1.6 ± 0.4, and 1.8 ± 0.4.  The corresponding RMSE values for PM3/COSMO and AM1/COSMO are very similar, but in general they can fail for certain types of zwitterions.

For a set of molecules that does not include such zwitterions then the best method is either PM3/COSMO and AM1/COSMO which deliver RMSE values of 1.0 ± 0.2 and 1.1 ± 0.3, respectively.  In this case, using gas phase geometries lead to slightly larger , but statistically identical, RMSE values of 1.4 ± 0.3 and 1.3 ± 0.3, respectively.

In this study I made sure that the suitable reference molecules where available for all molecules.  This will be difficult in the general case and it will be interesting to see how the accuracy is for these molecules.  Cases where no good reference molecule can be found can be flagged based on similarity scores.

I will now start working on a manuscript draft (you can follow along here if you're interested)

*The RMSE uncertainties are computed using equation 14 in this paper and the two methods are taken to be statistically different if their difference in RMSE is larger than the composite error described on page 105 of this paper.

## Monday, October 10, 2016

### Prediction of amine pKa values of drug-like molecules using semiempirical QM methods

2016.10.16 update: I used the wrong keyword in the MOPAC calculations so these numbers are not right.

In an earlier study we showed that pKa values could be computed fairly accurately using PM6-based methods using isodesmic reactions and appropriate reference compounds.  Amines were especially challenging.  The study used small molecules.  The next question, which I address here, is how this approach performs for actual drug molecules containing amine functional groups.  These are preliminary results and may contain errors.

The Molecules
I tool about 50 drug-like molecules with experimentally measured amine pKa values from Table 3 in the paper by Eckert and Klamt.  I moved some of the smaller molecules such as 2-methylbenzylamine, since they would differ very little from the corresponding reference molecules. The reference molecules are chosen to match the chemical environment of the nitrogen within a 2 bond radius as much as practically possible and I try match the ring-size if the nitrogen sits in a ring. I end up with 35 reference molecules.  I compute most of the reference pKa values using the ACE JChem pKa predictor.

Many of the molecules contain more than one ionizable group. Only the pKa values of the amine indicated in Eckert and Klamt's Table 3 are computed and the protonation states are prepared according standard pKa values.  For example, for Phenylalanine the carboxyl group is deprotonated because the "standard" pKa values of a carboxyl group (e.g. in acetic acid) is lower than the standard pKa values of a primary amine (e.g. ethylamine).  When preparing the protonation states I noticed that Eckert and Klamt mischaracterised the Histamine pKa value of 9.7 as an amidine pKa. This is corrected to a primary amine.

The Methods
I test the following methods: PM6-DH+, PM6, PM3 and AM1 with the COSMO solvation method using MOPAC, and PM3, AM1, and DFTB3 with the SMD solvation method in GAMESS.  Jimmy's getting quite close to finishing the PM6 implementation for heavy elements in GAMESS, but he still needs to interface it with PCM, so PM6/SMD calculations are not yet possible.  I use RDKit to generate 20 starting geometries for each protonation state and I optimise with the solvation method turned on.  I don't include RRHO free energy contributions and I use only the lowest free energy structure for the pKa calculations.

The Results
You can see the results here (Table 1). The COSMO-based predictions are very bad with MAD values of 13.1 - 15.4 pH units and the SMD-based predictions are OK with MAD values of 1.5 - 2.0 pH units. If I simply use the reference pKa values the MAD is 1.4 pH units.  The corresponding maximum ADs are 71.3 - 136.1, 7.3 - 9.9, and 6.5 pH units.

Inspection of the molecules with large AM1/SMD and PM3/SMD errors suggests that the two of the molecules (Cimetidine, Niacin) were prepared with incorrect protonation states.  The ACE JChem pKa predictor predicts that the pKa of the carboxyl group in Niacin is higher than that of the pyridine, while the pKa value of the nitrile-substituted guanidine group in Cimetidine is lower than at that of the imididazole. Similarly, the ACE JChem pKa predictor predicts that titrating amine in Thenyldiamine with pKa 8.9 is not the pyridine as indicated by Eckert and Klamt, but the tertiary amine.

After making these changes the MADs are 10.4 - 12.4, 1.3 - 1.7, and 1.4 pH values and the max ADs are 71.3 - 136.1, 3.3 - 9.9, and 6.5 pH units.

One of the main uses of pKa values is the prediction of the correct protonation state at physiological pH (7.4), i.e. is the predicted pKa above or below 7.4?  The COSMO-based predictions get this right 74 - 77% of the time, while the SMD-predictions get it right 75 - 87% of the time.  In this regard PM3/SMD is considerably worse than AM1/SMD and DFTB3/SMD despite the fact that the MAD for PM3/SMD is 0.1 pH units lower than AM1/SMD and 0.4 pH units lower than DFTB/SMD. Simply using the reference values gets the protonation state right 91% of the time.

Why are COSMO-based predictions so bad?
The largest error occurs for Cefadroxil where the proton transfers in the zwitterionic state for the COSMO method.  The remaining large errors involve molecules that have a +2 charge where the pKa is much too low.  It appears that the COSMO method severely underestimates the solvation energy of +2 ions.

Outlook
In this study I made sure that the suitable reference molecules where available for all molecules.  This will be difficult in the general case and it will be interesting to see how the accuracy is for these molecules.  Cases where no good reference molecule can be found can be flagged based on similarity scores.

I will now start working on a manuscript draft (you can follow along here if you're interested)

## Sunday, September 18, 2016

### Why is there no standard state temperature?

The standard state pressure is 1 bar, why is there no standard temperature?

The standard state pressure is not an experimental condition, while the temperature is.

The main reason the standard state is defined is because it leads to this very useful equation
$$K_p = e^{-\Delta G^\circ/RT}$$
Say you have this reaction: $A \rightleftharpoons B + C$ One way to use this equation is to compute the free energy of 1 mol of $A$, $B$, and $C$ at 1 bar using equations derived for an ideal gas, compute $\Delta G^\circ = G^\circ (B) + G^\circ (C) - G^\circ (A)$, and use that value to predict $K_p$.
If the gasses behave like ideal gasses "in real life" then the measured  $K_p$ will match the $K_p$ computed from $\Delta G^\circ$.  You can do the measurement at any pressure you want, not just at 1 bar.* The standard state refers to the pressure you use when computing $\Delta G^\circ$.  The only thing it has to do with the experimental measurement is that it defines the units you should use for your partial pressures when computing $K_p$

$\Delta G^\circ$ does also depend on temperature, but the temperature you chose should be the same as the experimental conditions.  So the temperature is not part of the standard state definition.

But what about "Standard temperature and pressure (STP)?"
Standard temperature and pressure (STP) refers to STP conditions under which $K_p$ is measured, not the pressure used to compute $\Delta G^\circ$. I know, they couldn't have made it more confusing if they tried when they named these things.

*Of course if you do the measurements at very high pressures or low temperatures, then the assumption that the gasses behave ideally will be less valid and the measured $K_p$ will differ more from the $K_p$ computed from Equation 1.  However, that is a separate issue unrelated to the standard state because the $K_p$ in Equation 1 refers to the $K_p$ you would measure if the gasses behaved ideally at the pressure and temperature used in the experiment.

## Saturday, September 17, 2016

### Why I tweet and blog

Update: here is the audio.  Something wen't wrong with Google Hangouts so the slides are missing. Despite the fact that I did a few practice runs yesterday ... and have a PhD in theoretical quantum chemistry.  WTH Google!

Update 2: the trick (in addition to this tip) is to start the Powerpoint slideshow before you start streaming.

On Tuesday I am giving presentations on tweeting and blogging in a Scientific Writing course.  Here are my slides and the message to the students

Dear Scientific Writing students

On Tuesday I will give two presentations: one on tweeting and one on blogging.  You can find the slides below.

You'll also do some writing so please bring a laptop and make sure you can get on Eduroam.

In preparation for Tuesday, please find one science related blogpost and twitter account you think looks interesting and share them on the discussion forum I created on Absalon.

Finally, I may try to live broadcast my talk using Google's Hangout On Air.  I've never tried it, so I am not sure if I can get it to work by Tuesday.  If you are uncomfortable with this, just send me an email and I won't do it.

See you Tuesday!

Slides

## Monday, August 22, 2016

### Finding the reference molecule for a pKa calculation using RDKit

This is prototype code related to this project.  I use Histamine as an example which has two ionizable sites: a primary amine and an imidazole ring.

The code figures out that imidazole is the best reference for the imidazole ring, while ethylamine is the best reference for the primary amine.  The code does this by figuring out which atom is being deprotonated, computes the Morgan fingerprint around this atom, and compares it to the Morgan fingerprints of imizadole and ethylamine.

## Thursday, August 11, 2016

### Drug design: My latest paper explained without the jargon

Our latest paper has just appeared in the open access journal PeerJ. It's ultimately related to making better drugs so first some background.

Background
Designing new drugs currently involves a lot of trial-and-error, so you have to pay a lot of smart scientists a lot of money for a long time to design new drugs - a cost that is ultimately passed on to you and I as consumers.  There are many, many reasons why drug design is so difficult. One of them is that we often don't know fundamental properties of drug-candidates such as the charge of the molecule at a given pH. Obviously, it is hard to figure out whether or how a drug-candidate interact with the body if you don't even know whether it is postive, negative or neutral.

It is not too difficult to measure the charge at a given pH, but modern day drug design involves the screening of hundreds of thousands of molecules and it is simply not feasible to measure them all. Besides, you have to make the molecules to do the measurement, which may be a waster of time if it turn out to have the wrong charge. There are several computer programs that can predict the charge at a given pH very quickly but they have been known to fail quite badly from time to time.  The main problem it that these programs rely on a database of experimental data and if the molecule of interest doesn't resemble anything in the database this approach will fail. The paper that just got published is a first step towards coming up with an alternative.

The New Study
We present a "new" method for predicting the charge of a molecule that relies less on experimental data but it fast enough to be of practical use in drug design. The paper shows that the basic approach works reasonably well for small prototypical molecules and we even test one drug-like molecule where one of the commercial programs fail and show that our new method performs better (but not great).  However, we have to test this new method for a lot more molecules and in order to do this we need to automate the prediction process, which currently requires some "manual" labor, so this is what we're working on now.