Friday, May 6, 2016

A cheminformatics problem: protonate with SMILES and InChI

I am teaching a python programming course and this is one of the projects I want to try this year.

The over all goal of the project is to write one or more programs that generates protonation states for a list of nitrogen containing molecules specified by name. The project uses SMILES, and maybe InChI, which you can read more about here.

Getting started
1. Write code that from this list molecules = ["CCN", "CNC", "CN(C)C"] generates this output

CCN
CC[NH3+]

C[N@H]C
C[N@@H]C
C[NH2+]C

C[N@](C)C
C[N@@](C)C
C[N@H+](C)C
C[N@@H+](C)C

2. Write code that from this list  molecules = ["C(C(=O)O)N"] generates this output (the order is not important)

C(C(=O)O)N
C(C(=O)O)[NH3+]
C(C(=O)[O-])N
C(C(=O)[O-])[NH3+]

3. C(C(=O)O)N is the amino acid glycine.  Extend this program to work for alanine, asparagine, aspartate, and lysine.  Use this site to get SMILES strings for these amino acids. Find a picture of asparagine and make sure you're treating the side-chain correctly.

The project
4. (optional) Figure out how to generate a file containing SMILES strings from a file containing names. The best way is probably bash.  Get inspiration here, here, and here.

5. Generate all possible protonation state SMILES for the molecules in Table 2 in this paper. If you completed step 4 you can use tools like https://pdftables.com/ to generate a file with the names.

6. Repeat for Table 1 and 3

7. (optional) The neutral form of the amino acids histidine and arginine side chain groups have tautomers. Generate SMILES for all tautomers (InChI might help you identify tautomers).

8. (optional) Do any of the molecules in step 5 and 6 have tautomers? If so generate SMILES for all tautomers.

Some code snippets to get you started



This work is licensed under a Creative Commons Attribution 4.0

Tuesday, May 3, 2016

Enzyme design: My latest paper explained without the jargon

Our latest paper has just appeared in the open access journal PeerJ.  It's ultimately related to making better enzymes so first some background.

Background
Enzymes are proteins that make certain chemical reactions go faster and nearly every complex molecule in your body is made by, or broken down by, enzymes.  But people have also started using enzymes in commercial products, for example in washing powder to break down oily stains at lower temperatures. This saves money on heating the water and, being proteins, the enzymes are biodegradable. So there is a lot of interest in designing new enzymes that build or break down new molecules efficiently. For example, there is a rather large company (Novozymes) near Copenhagen that does nothing but design, produce, and sell enzymes on an industrial scale.

Designing new enzymes currently involves a lot of trial-and-error, so you have to pay a lot of smart scientists a lot of money for a long time to design new enzymes - a cost that is ultimately passed on to you and I as consumers. My long-term goal is to reduce the amount of trial-and-error by writing a computer program that can predict what changes you have to make to improve the enzyme before you ever make it in the lab.

I've had some modest success with a prototype program some years ago (you can find the papers here and here) for one enzyme.  But one of the many things we don't know is whether the method we base our approach on will work at all for other types of enzymes. The paper that just got published is a first small step in figuring this out.

The New Study
We've collected data for five other enzymes from other published papers that we trust reasonably well and tested two methods that are fast enough to design enzymes - one is the same method we used a few years ago and the other is a newer one that wasn't available to us before now.  The conclusion of our study is that the methods seem to work well enough for all but one system, and this system is different for the two methods.  This suggests that we can't just base future work on one method. We have to have both ready in case one of them fail.  We need to repeat the study for many other types of enzymes - I would say at least 15-20 more - and we need to improve the quality of the data so that we trust it completely, rather than "reasonably well".  In the paper we have extended an open invitation to other scientists to contribute to this effort.




This work is licensed under a Creative Commons Attribution 4.0 

Saturday, April 16, 2016

Computing pKa values for molecules with several ionizable groups

We're working on pKa prediction using semiempirical methods and need to compute pKa values  for molecules with several ionizable groups. Here are my current thoughts so far.

Background: one ionizable group
If there is only one tritrateable site
$$ \mathrm{BH \rightleftharpoons B + H^+} \ \ \  K=\mathrm{\frac{[B][H^+]}{[BH]}} $$
then the fraction of $\mathrm{BH}$ molecules $f_{\mathrm{BH}}$ is $$
\begin{split}
 f_{\mathrm{BH}} & =\mathrm{\frac{[BH]}{[B]+[BH]} } \\
& = \mathrm{\frac{[B]}{[B]}\frac{[BH]/[B]}{1+[BH]/[B]} } \\
& = \mathrm{\frac{[H^+]/K}{1+[H^+]/K} } \\
& = \mathrm{\frac{10^{p\textit{K}-pH}}{1+10^{p\textit{K}-pH}} }
\end{split}
$$ where $$ \mathrm{pH = -log[H^+] \implies [H^+] = 10^{-pH}} $$
and similarly for $K$.

From this we can see that the pK value is the pH value for which $f_{\mathrm{BH}}$ = 1/2. So you compute the pK value from the standard free energy difference
$$\text{p}K =\left( \Delta G^\circ(\mathrm{B})+\Delta G^\circ(\mathrm{H^+})- \Delta G^\circ(\mathrm{BH})\right)/RT\ln(10) $$
and you're done.

Two ionizable groups
For a molecule with two titrateable groups ($ \mathrm{HB_\alpha B_\beta H}$) and the following equilibrium constants $$ \mathrm{HBBH \rightleftharpoons BBH + H^+} \ \ K_{\alpha1}$$ $$\mathrm{HBBH \rightleftharpoons HBB + H^+} \ \ K_{\beta1}$$ $$ \mathrm{HBB \rightleftharpoons BB + H^+} \ \ K_{\alpha0}$$ $$\mathrm{BBH \rightleftharpoons BB + H^+} \ \ K_{\beta0}$$ The probability of, for example, $\mathrm{BBH}$ is $$ f_{\mathrm{BBH}} =\mathrm{\frac{[BBH]}{[BB]+[BBH]+[HBB]+[HBBH]}= \frac{[BBH]}{\textit{P}}} $$ $f_{\mathrm{BBH}}$ can be rewritten in terms of pK values $$f_{\mathrm{BBH}} = \mathrm{\frac{[BBH]/[BB]}{\textit{P}/[BB]} = \frac{10^{p\textit{K}_{\beta0}-pH}}{\textit{P}/[BB]}} $$ where $$ \mathrm{ \textit{P}/[BB] = 1+10^{p\textit{K}_{\alpha0}-pH}+10^{p\textit{K}_{\beta0}-pH}+ 10^{p\textit{K}_{\alpha0}+p\textit{K}_{\beta1}-2pH}}  $$ Similarly, $$ f_{\mathrm{HBB}} = \mathrm{\frac{10^{p\textit{K}_{\alpha0}-pH}}{\textit{P}/[BB]}} $$ and $$ f_{\mathrm{HBBH}} = \mathrm{\frac{10^{p\textit{K}_{\alpha0}+p\textit{K}_{\beta1}-2pH}}{\textit{P}/[BB]}} $$  The apparent pK value of the $\alpha$ group ($\mathrm{p}K_{\alpha}$) is the pH at which its protonation probability $$f_\alpha =f_{\mathrm{HBB}} + f_{\mathrm{HBBH}} $$ is 1/2 and similarly for the $\beta$ group.  So compute the microscopic pK values (Eq 4-7), then $f_\alpha$ and $f_\beta$ as a function of pH, and then $\mathrm{p}K_{\alpha}$ and $\mathrm{p}K_{\beta}$

If one of the groups (say $\alpha$) titrates at a significantly lower pH than the other ($\mathrm{p}K_{\alpha1} << \mathrm{p}K_{\beta1}$) then $\mathrm{p}K_{\alpha}=\mathrm{p}K_{\alpha1}$ and $\mathrm{p}K_{\beta}=\mathrm{p}K_{\beta0}$ and it is not necessary to compute the free energy of $\mathrm{HBB}$, but it can be hard to determine this in advance.  Similarly, if there is no significant interaction between the sites then $\mathrm{p}K_{\alpha}=\mathrm{p}K_{\alpha1}=\mathrm{p}K_{\alpha0}$ and $\mathrm{p}K_{\beta}=\mathrm{p}K_{\beta1}=\mathrm{p}K_{\beta0}$ and one can skip one of the protonation states.

For $N$ ionizable groups one has to determine $2^N$ microscopic pKa values, which quickly gets out of hand if one has to do a conformational search for each protonation state and the molecule is large.

Related post
Generating protonation states and conformations


This work is licensed under a Creative Commons Attribution 4.0

Wednesday, April 13, 2016

Reviewing for PeerJ: it's the little (and the not so little) things

I just did my first review for PeerJ and it was a real pleasure because there are a lot of "little things" that make your reviewing life easier:

1. Figures/tables are in the text and, get this, the captions are immediately above/below the corresponding figure/table.  Some other journals also do this, but not enough.

2. I annotate the mss in a pdf reader and usually this is a frustrating experience since the publisher generated pdf has all sorts of "quirks" that make highlighting and copying text hit and miss.  The previous pdf I reviewed turned every page with a figure into an image!  Annotating/copying in the PeerJ pdf worked flawlessly.

3. The pdf contained a 3 front pages with the due date, a summary of the review criteria, a link to the page with the supplementary material, and a link to the page where I should submit my review.  No hunting around for the email with the link! I teared up a little bit when I saw that.

Other "little things" include stuff like not having to rank the perceived importance or impact of the work on some bogus 1-10 scale, a strict policy on making the raw data available, and a button to click to make my review non-anonymous.




This work is licensed under a Creative Commons Attribution 4.0

Thursday, April 7, 2016

Why I use Twitter: one scientist's perspective



Tomorrow I am giving a short talk on why and how I use Twitter to public relations people at the University.  I have 20 min + 10  min questions, but am aiming for a 10 min talk + 20 min questions.

Comments welcome




This work is licensed under a Creative Commons Attribution 4.0

Tuesday, April 5, 2016

ACS Omega is too expensive

Disclaimer: I applied to be a co-editor of this journal and was not selected.

ACS Omega has just announced its APCs: \$1500 (\$2000) under the ACS Authors Choice (CC-BY or CC-BY-NC-ND) license for members and an additional \$500 for non-members (since full membership costs $162, this is the real additional cost for one yearly publication).

The ACS Authors Choice license is not open access: under the ACS Authors Choice license you assign copyright to the ACS.  While
For non-commercial research and education purposes only, users may access, download, copy, display and redistribute articles as well as adapt, translate, text and data mine content contained in articles, ... 
you still can't, for example, use a figure from such an article in a book chapter without the ACS permission.  Also, the ACS can, for example, sell your article or your figures.

So the cost for to publish OA is ACS Omega is $2000.  That's more expensive than other impact neutral OA journals: PLoS One (\$1495), Scientific Reports (\$1495), F1000Research (\$1000), Rio Journal (\$850), PeerJ (\$100/author, Bio-only), and Royal Society Open Science ($0, for now).

Since all journals are impact neutral and provide quick review (AFAIK) price is the main considerations and I see no reason to pay more for to publish in ACS Omega.




This work is licensed under a Creative Commons Attribution 4.0

Thursday, March 31, 2016

Reviews for Towards a barrier height benchmark set for biologically relevant systems

2016.04.06: our rebuttal can be found here

The reviews our latest PeerJ submission (editor assigned February 25) came back last evening with the alway welcome verdict "minor revision".  The preprint has already received over 400 views and 85 downloads.

Reviewer Comments

Reviewer 1 (Xabier Lopez)

Basic reporting
No Comments

Experimental design
No Comments

Validity of the findings
No COmments

Comments for the author
The present papers is a first step towards the generation of a reaction database for biological systems. To do so the authors provide with highly accurate DFT and ab-initio structural and thermodynamic data for a set of five enzyme reactions. The results are compared with various PM# type semi-empirical methods and DFTB3. The work sets up an standard in terms of semi-empirical method validation, that hopefully, it will cristalize in a joint effort with the rest of the theoretical chemist community to create a database of biologically relevant reactions. In this sense, I find this paper with a high potential and should be publised.

Although the set of reactions is small as to draw general conclusions, I understand that the goal of the authors is to set up an standard so that other theoretical chemistry groups worldwide will contribute to this joint effort. I think that this is a very good and necessary idea. I would suggest that it would be useful in the github.com/jensengroup/db-enzymes dataset that the authors would consider to provide with standard inputs for the programs that should be considered as standard for other theoreticians as well. In this way, the desirable growth of this dataset by contributions from other groups will not jeopardize its necessary coherency.

Reviewer 2 (Anonymous)
Basic reporting
The authors present a compilation of model systems for five, real-life enzymatic reactions based on previously published structures, reaction barrier heights and reaction energies that had been obtained with the B3LYP density functional theory (DFT) approximation. These DFT structures and numbers are used as a reference to examine various cost-efficient semi-empirical methods. The authors consider both geometry optimisations, as well as single-point energy calculations. Furthermore, they also comment on the effect of a solvation model on the results.

The main motivation of this study is clearly outlined and it is important to note that the authors point out that their study is only the first step in a long-term undertaking that aims at having a diverse set with more accurate reference energies. The language is clear and professional. Raw data is supplied through a link to figshare and can be easily accessed and examined by the readers of this paper. With regards to cited literature and the figures, I have the following suggestions to make:

a) I think the cited literature should be more comprehensive and also take into consideration recent QM studies on peptides and proteins that allow a better understanding of the quality of the herein used DFT reference values; I will elaborate on this point more in detail under “validity of the findings”. I also noticed that some cited articles lack a volume and page number; I therefore recommend that the authors carefully check their list of references for any mistakes.

b) While the general quality of the figures is very good, I find Figures 3 - 6 hard to read. In these figure, the authors attempt to compare geometries for two levels of theory with each other. While one does note differences, the structures lack a common reference point that would allow the reader to better gauge whether these differences are due to structural changes or due to slightly different viewing angles. Would it be possible to improve these figures and to align the geometries with respect to one of the involved molecules?

Experimental design
I identified some issues that would make it hard for others to exactly reproduce the results:

a) page 2, line 78: The authors say that during the geometry optimisations, the position of some atoms were restrained. It is not clear which atoms these are and why that choice was made.

b) In section 3.3 (and in Tables 2 and 3) the authors refer to different models for each of the five tested reactions that all vary in system size. This is the first time in the manuscript that the authors refer to this and it is hard to follow how exactly these models differ from each other and what they actually look like. It therefore makes it hard to follow this section and I recommend a revision.

Validity of the findings
As mentioned above, I do understand what the authors are trying to achieve with their study. It is supposed to only serve as a starting point for the future development of an accurate benchmark set. Also, it is likely that the authors’ current computational capabilities are limited and that they therefore use previously published structures and energies. Unfortunately, the chosen level of theories for the geometry optimisations [B3LYP/6-31G(d,p)] and the singlepoint energy calculations [B3LYP/6-311+G(2d,2p)] are questionable for the studied systems and properties. I will first elaborate on this statement, starting with the quality of the geometries, before I will conclude with a recommendation on how the authors could address this problem without substantially delaying publication.

The structural stability of biomolecular systems - and of any sizeable molecule or molecular aggregate – is heavily influenced by important London-dispersion (van-der-Waals) effects, and it is a well-known fact among quantum chemists that B3LYP does not cover these effects. Moreover, small basis sets, such as 6-31G(d,p) induce the so-called basis-set superposition error (BSSE), which is an artificial overstabilisation of noncovalent interaction energies, including London dispersion and hydrogen bonds. BSSE exists both for inter- and intramolecular interactions and it has been demonstrated in the literature that it distorts molecular structures, see e.g. early works by van Mourik and co-workers on peptide conformers:
-van Mourik, T.; Karamertzanis, P. G.; Price, S. L., J. Phys. Chem. A 2006, 110, 8.
- Holroyd, L. F.; van Mourik, T., Chem. Phys. Lett. 2007, 442, 42.

Later, Goerigk and Reimers showed how error compensation between the lack of a proper dispersion treatment and BSSE can artificially generate structures of seemingly acceptable quality. However, it was also demonstrated that this error compensation cannot be controlled and that in fact using dispersion and BSSE corrections led to more accurate and reliable structures. This was first discussed for gas-phase structures of peptides, and later for the crystal structure of a protein fragment:
- Goerigk, L.; Reimers, J. R., J. Chem. Theory Comput. 2013, 9, 3240.
- Goerigk, L.; Collyer, C. A..; Reimers, J. R., J. Phys. Chem. B 2014, 118, 14612.

Efficient corrections for both London dispersion and BSSE exist that do not compromise the computational efficiency of B3LYP/6-31G(d,p) and they should also be mentioned (e.g. Grimme’s DFT-D3(BJ) and gCP corrections). This is particularly important, as this journal is directed at a readership that may not be familiar with these methodological developments and the dispersion and BSSE problems.

While BSSE is negligible for the 6-311+G(2d,2p) basis set, the singlepoint energies still suffer from the London-dispersion problem. The authors already mention the large GMTKN30 benchmark set in their manuscript. An additional study on GMTKN30 with nearly 50 methods has conclusively shown how dispersion can influence reaction energies and barrier heights by several kcal/mol (Goerigk, L.; Grimme, S.; Phys. Chem. Chem. Phys. 2011, 13, 6670.). A proper treatment of dispersion is therefore crucial. That same study has also shown that B3LYP is worse than the average of more than 20 hybrid density functionals for reaction energies. Consequently, using B3LYP/6-311+G(2d,2p) as a benchmark for the study of other methods may therefore distort the findings, and statistical values such as mean absolute deviations have to be interpreted with caution. In fact, it may well be that some of the tested semi-empirical methods perform much better than what the reported MADs may suggest. In this context, note that a recent study by Karton and Goerigk has demonstrated how the quality of the benchmark reference heavily influences the interpretation of low-level methods used to calculate barrier heights of pericyclic reactions (J. Comp. Chem. 2015, 36, 622). If one analyses the results of this study carefully, one can also see the importance of dispersion; for instance, it lowers the B3LYP barrier height of the Diels-Alder reaction between 1,3-butadiene and ethene by nearly 6 kcal/mol!

After this explanation, please let me emphasise that I am not against the way this study has been carried out. The studied reactions are highly interesting and the actual reaction barriers and energies provide useful insights for quantum chemists. Recalculating the reported numbers may therefore not be necessary at this stage and it can be postponed to a future study. Nevertheless, I am of the opinion that this manuscript can benefit from a discussion of the above points, particularly in the outlook section. Dispersion and BSSE should not be neglected and this manuscript is an ideal platform to convey this message to a non-expert readership that may not be aware of how the field of DFT approximations has changed over the past years. Such a discussion should also mention that also the tested semi-empirical methods lack from a proper description of noncovalent interaction energies and that others have combined them with DFT-D2 or DFT-D3-type corrections (DFTB3-D3, PM6-D3, etc. ) and with additional hydrogen-bond corrections for PM6 (see e.g. the works by Hobza and Korth). One or two paragraphs discussing the importance of these points are therefore sufficient.

Comments for the author
page 3, line 104: Is the quotation mark at the beginning a mistake or does it introduce a literal quote from another paper? If the latter is the case, a second quotation mark at the end of the quote is missing.

Reviewer 3 (Anonymous)
Basic reporting
I think this paper represents a nice effort of building a set for benchmarking barrier height in modelling biochemical systems. The authors provide themselves the benchmark for several semiempirical methods, using B3LYP as a reference point. Nevertheless, as the authors acknowledge, they provide a limited number of systems studied and a reference state (B3LYP) which could not be considered the state of the art. Thus, in my opinion the manuscript should be considered more as proposing the benchmark idea than a rigorous initial step towards it. In this regards I miss some section underlying how to expand the set, with some practical example. It could be even more interesting some efforts in making that part easier (semi-automatic) from standard QM or QM/MM calculations (although this part could be expanded in the future).

Experimental design
Since most reports are based on the comparison with B3LYP data (it seems to me that from other sources) I wander if the authors have reevaluated some of these numbers. Being single point evaluations it should not require excessive computational time.

Validity of the findings
To help the inspection of the results (and differences between B3LYP and semiempirical methods), I would like to see the distances (at least the most changing ones) being added in Figures 2-5.

Comments for the author
minor comments:

.) (line 47): "...TS structures are known to dependent significantly"
.) (line 104) " ...kcal/mol. “The larger..." (a non closed ")
.) (line 116) "PM7 (to 4.0 and" (non closed parenthesis)