## Sunday, March 23, 2014

### A general chemistry course meets Wolfram Alpha

Even the most thoughtful, dedicated teachers spend enormously more time worrying about their lectures than they do about their homework assignments, which I think is a mistake. Extended, highly focused mental processing is required to build those little proteins that make up the long-term memory. No matter what happens in the relatively brief period students spend in the classroom, there is not enough time to develop the long-term memory structures required for subject mastery.
To ensure that the necessary extended effort is made, and that it is productive, requires carefully designed homework assignments, grading policies, and feedback.

In a previous post I showed some examples of how some chemistry problems posted on Reddit and Yahoo can be solved simply by typing them into Wolfram Alpha (WA) and suggested that we should revisit the general chemistry curriculum in light of new tools such as WA.

To get more data I signed up for the Introduction to Chemistry MOOC offered by Coursera.  In this course the homework consists of 8 quizzes, each with between 11 and 15 questions plus a Pre-Course Concept Assessment Quiz. Here's what I found.

Pre-Course Concept Assessment Quiz
7 out of 15 questions could be done by typing them into WA.  For example: Solve the following system of two equations with two unknowns: x + y = 1 and 5x + y = 2. This question tests for manual skills not really needed anymore, much like "what is the square root of 2?".

This question was much better: An architect presents a 3 inch wide by 4 inch deep by 3 inch tall model of a new central campus dorm. If the final building foundation is 126 feet wide, then how tall will the building be?  Of course this can also be solved with WA but the student must reformulate the question first.  I usually don't count questions such as this as square-root-of-2 problems.

Week 1 Introduction
4 out of 15 questions could be done by typing them into WA. For example: Wavelength of orange light is 0.00000060 m, scientific notation is ______ m.  A conceptual question on scientific notation would be much more useful.  Questions like this are also trivial in WA: A Boeing 747 carries 1.834 x 10^5 liters of jet fuel. Convert this volume to cm3.  These are "wasted" questions.

4 questions concerned significant figures, which WA does not handle.  For example: Perform the following calculation and input the answer expressed to the correct number of significant figures:  80720 ÷ (15.3 – 7.009) × 1.86.  This is an important skill that must be taught.  Even though it's not used in the rest of the course :).
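The bookkeeping that WA skips here can be sketched in a few lines of Python. This is only an illustration, assuming the usual textbook rules (decimal places for addition and subtraction, significant figures for multiplication and division). The helper `round_sig` is a hypothetical name of my own, not something taken from the course or from WA.

```python
import math

def round_sig(x, n):
    """Round x to n significant figures."""
    if x == 0:
        return 0.0
    return round(x, n - 1 - math.floor(math.log10(abs(x))))

# Subtraction: 15.3 has one decimal place, so the difference is only good
# to one decimal place (8.3, i.e. two significant figures).
difference = 15.3 - 7.009

# Carry full precision through the arithmetic and round only at the end.
raw = 80720 / difference * 1.86

# The two-significant-figure difference limits the final answer.
answer = round_sig(raw, 2)
print(f"{answer:.1e}")
```

Under those rules the answer is reported as 1.8 x 10^4. The point is that the student has to decide how many figures survive; the arithmetic itself is the easy part.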

Week 2 Matter and Energy
10 out of 13 questions could be done by typing them into WA.  For example: Name the following compound: CaF2.  Even: Which of the following neutral atoms has the smallest first ionization energy? Si, Sc, Sr, B, N.  A much better question is why the order is what it is, but how to phrase that as a multiple choice question?

Here's one that WA couldn't answer: Identify each of the following ions with their correct chemical symbol: Species with 8 protons and 10 electrons; Species with 30 protons and 28 electrons.

Week 3 Chemical Composition, Solutions, and Dissolution Equations

Here's one that requires some thought: A sample of sodium dichromate, Na2Cr2O7, is placed into a container by itself. The sample of material in the container is analyzed, and it is found to contain exactly 0.67 moles of sodium atoms. How many moles of oxygen atoms are in this sample?
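The thought required is mostly about reading the formula; once the atom ratio is identified the arithmetic is short. A minimal sketch (the variable names are mine, and it assumes the sample is pure Na2Cr2O7 as the question states):

```python
# Na2Cr2O7: each formula unit contains 2 Na atoms and 7 O atoms.
NA_PER_FORMULA_UNIT = 2
O_PER_FORMULA_UNIT = 7

moles_na = 0.67  # given in the problem

# Moles of formula units first, then moles of O atoms.
moles_formula_units = moles_na / NA_PER_FORMULA_UNIT
moles_o = moles_formula_units * O_PER_FORMULA_UNIT

print(f"{moles_o:.1f} mol O")
```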

Week 4 Chemical Composition, Solutions, Dissolution and Precipitation
1 out of 11 questions could be done by typing them into WA and it is: Determine the oxidation state of the nitrogen in each of the following molecules or ions: NO2^-1

WA is simply not yet equipped to handle questions like: A solution is known to contain only one type of anion. Addition of Tl1+ ion to the solution had no apparent effect (all ions remained in solution), but addition of Ba2+ ion resulted in a precipitate. Which anion is present? SO4^2-, Cl1-, I1-, NO3^1-

Week 5 no quiz

Week 6 Atomic Structure
5 out of 14 questions could be done by typing them into WA.  For example: Use the periodic table to write the electron configuration for the following element: Ba. By itself it is a pointless question. What you really want to know is: "How many core and valence electrons are in the following neutral atom? Se" which is also readily available from WA.

And just because a question cannot be answered easily with WA doesn't mean it's a good question. For example, what's the point of this question?: Give the number of s, p, d, and f electrons in the following neutral atom when it is in the ground state.

I liked this one though:
Below is the energy level diagram (not drawn to scale) representing the transitions made by an electron in a hydrogen atom that result in the observed lines of both the absorption and emission spectra. Some are in the visible region, and some are not.  4 different energy photons are represented (approximate wavelengths are given in parentheses): infrared (~ 10^-4 m), red (~ 10^-6 m), blue (~ 10^-7 m), ultraviolet (~ 10^-8 m).  Match the transition (a - h) with the photon described (approximate wavelengths are given in parentheses.) Your answer input should be a single, lower case letter. (Please note: This is not a problem for which a calculator is required. Your knowledge of the Bohr model of the atom and the relative energies of transitions is all that is needed.)

Week 7 Molecular Structures and Shapes
4 out of 11 questions could be done by typing them into WA.  Drawing Lewis structures is becoming a square-root-of-2 problem: Select the correct Lewis structure for the following ion. N3-. But most of the questions in this section are quite good and not immediately answerable by WA.

Week 8 Ideal Gas Law and Intermolecular Forces
2 out of 11 questions could be done by typing them into WA.  For example: A container of 8.03 x 10^-3 moles of hydrogen gas has a volume of 20.9 mL and a temperature of 20.8 degrees C.  You could argue that this example requires some processing of the information given, but certainly all the "heavy lifting" in terms of units and conversions is done.

But, again, most of the questions in this section are quite good and not immediately answerable by WA.

Week 9 Solution Calculations
7 out of 12 questions could be done by typing them into WA.  For example, imagine being completely stuck on "What is the mass of fructose, also known as fruit sugar (C6H12O6), in a 127 mL sample of glucose solution that has a concentration of a 1.44 M?". Simply type "127 mL 1.44 M fructose" into WA.
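For comparison, here is roughly what WA is doing behind that one-line query, written out as a small Python sketch. The atomic masses are rounded standard values that I have typed in by hand, so treat the last digit with suspicion.

```python
# Standard atomic masses in g/mol (rounded values, entered by hand).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}

# Fructose, C6H12O6.
formula = {"C": 6, "H": 12, "O": 6}
molar_mass = sum(ATOMIC_MASS[el] * n for el, n in formula.items())

volume_l = 127 / 1000   # 127 mL expressed in litres
concentration = 1.44    # mol/L

# moles = volume x molarity, mass = moles x molar mass
moles = volume_l * concentration
mass_g = moles * molar_mass
print(f"{mass_g:.1f} g")
```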

Surprisingly WA can't handle "What is the mass percent concentration of the solution if 11.9 g of ethanol is dissolved in 67.4 g of water? " directly.
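The reformulation WA needs is just the definition of mass percent: mass of solute divided by total mass of solution. A minimal sketch, with `mass_percent` as a hypothetical helper name:

```python
def mass_percent(solute_g, solvent_g):
    """Mass percent of solute in a two-component solution."""
    return 100 * solute_g / (solute_g + solvent_g)

# 11.9 g of ethanol dissolved in 67.4 g of water.
percent = mass_percent(11.9, 67.4)
print(f"{percent:.1f} %")
```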

Summary
So, 42% (48/113) of the questions can be easily done with WA and at least a third are what I would call square-root-of-2 questions - questions that are no longer really meaningful in and of themselves.  And in many weeks the majority of the questions are like that.  That's a waste of very valuable student time and attention.

I should mention that there are also advanced problem sets that "need to be completed in order to achieve a statement of accomplishment with distinction."  Many of these are quite interesting problems that I think could be assigned to all students if they are taught to use WA effectively.  This is what we should be aiming for:
1. An 8-year-old child who weighs 66 pounds needs to be treated for a novel influenza A (H1N1) infection. For a child of this size, the total daily recommended amount of the antiviral drug Oseltamivir is 4.0 mg of drug per kg of body weight. The total daily amount of medication should be divided into two equal doses. (Source: Clinical Infectious Diseases 2009; 48:1003–1032.) A liquid suspension of this medication contains 12 mg Oseltamivir per mL. How many mg of the antiviral drug should be given to the child for her first dose?
    1. Look for the chemistry terms and unfamiliar words. Do you understand all of the terms?
    2. What is the question asking for?
    3. Write down in words a short sketch for how you would solve this problem. What are the steps to solve this problem? Is any necessary information missing? Which information is provided that you do not need to answer the question?
    4. Solve Problem 1: the answer is _______ mg.

## Friday, March 21, 2014

### Citations: some numbers from Denmark

Just out of curiosity I checked Web of Science (WOS) and found:

The most cited paper with a co-author working in Denmark is "Improved methods for building protein models in electron-density maps and the location of errors in these models" published in 1991 in Acta Crystallographica Section A with 12,625 citations.  The second most cited paper has 6,303 citations.  The Acta Cryst. A paper is the 7th most cited paper on the topic of "chemistry" (as defined by WOS) worldwide.

If we restrict the search to "chemistry" (as defined by WOS) then it is "Improved prediction of signal peptides: SignalP 3.0" published in Journal of Molecular Biology in 2004 with 4,260 citations.

I would classify the latter paper as bioinformatics and for some reason the Acta Cryst. A paper didn't show up, so let's add "dept chem" to the search instead of restricting the search by subject. One of the co-authors on the Acta Cryst. A paper is from the Department of Chemistry at the University of Aarhus, so that's the top one. The next one is "Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites" published in Protein Engineering in 1997 with 4,368 citations.

That still smells like bioinformatics to me, but just shows how versatile chemists are.  Anyway, the third most cited paper is definitely in the realm of traditional chemistry: "Peptidotriazoles on solid phase: [1,2,3]-triazoles by regiospecific copper(I)-catalyzed 1,3-dipolar cycloadditions of terminal alkynes to azides" published in Journal of Organic Chemistry in 2002 with 3,251 citations.

## Tuesday, March 18, 2014

### ROC curves and picking cutoffs

We just got the 2nd round of reviews for Luca De Vico's latest PLoS ONE paper.  In the paper we try to predict HIV protease mutants that will cleave a particular peptide sequence and we use peptide-protein interaction energies as a measure of cleavability.  How well does this work?  The reviewer suggested ROC curves to quantify this.  Here's how it works.

We have 11 naturally occurring peptides that we know are cleavable (there are also some non-natural peptides that I'll ignore in this post) and 42 that we know are non-cleavable. Here are the computed interaction energies (in kcal/mol) for all cleavable peptides and for the non-cleavable peptides with interaction energies < -40 kcal/mol.

| Cleavable (11) | Non-cleavable (42) |
| --- | --- |
| -72 | -68 |
| -68 | -68 |
| -63 | -64 |
| -54 | -63 |
| -49 | -62 |
| -45 | -62 |
| -45 | -57 |
| -44 | -52 |
| -42 | |
| -47 | |
| -41 | |

If we say that peptides with interaction energies < -40 kcal/mol are cleavable then we will have correctly predicted that all 11 cleavable peptides are cleavable, but also that 8 non-cleavable peptides will be cleavable.  Put another way, our "true positive" rate is 100% (11/11) and our "false positive" rate is 19% (8/42).

If we pick -45 kcal/mol as the cutoff the numbers are 91% and 10%: we have fewer false positives but we miss some true positives. The plot of true vs false positives is an ROC curve:

In a perfect world our true positive rate would be 100% and our false positive rate would be 0%, so we are looking for the point closest to (0, 1) on the plot, which happens to be the one for -45 kcal/mol.
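Here is a small Python sketch of that "closest to the corner" criterion, using only the two cutoffs and the rounded rates quoted above. A real scan would cover every candidate cutoff; `distance_to_ideal` is a hypothetical helper name, and Euclidean distance is my assumption for what "closest" means.

```python
import math

# (true positive rate, false positive rate) at each cutoff, as quoted above.
roc_points = {
    -40: (1.00, 0.19),
    -45: (0.91, 0.10),
}

def distance_to_ideal(tpr, fpr):
    """Euclidean distance from the ROC point (fpr, tpr) to the corner (0, 1)."""
    return math.hypot(fpr - 0.0, tpr - 1.0)

# Pick the cutoff whose ROC point lies nearest the ideal corner.
best_cutoff = min(roc_points, key=lambda c: distance_to_ideal(*roc_points[c]))

for cutoff, (tpr, fpr) in roc_points.items():
    print(cutoff, round(distance_to_ideal(tpr, fpr), 3))
print("best cutoff:", best_cutoff)
```

Of these two candidates, -45 kcal/mol comes out closer to the corner, in line with the choice above.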

We can also quantify how good this approach is in general by finding the area under the curve, which will range from 1 (perfect) to 0.5 (useless) and, for example, compare two different methods for calculating the interaction energies.
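A minimal sketch of the area calculation, assuming the trapezoidal rule over (false positive rate, true positive rate) points. The function name `roc_auc` is mine, and the two test curves are the idealized limiting cases rather than our data.

```python
def roc_auc(points):
    """Area under an ROC curve by the trapezoidal rule.

    points: iterable of (false_positive_rate, true_positive_rate) pairs.
    The endpoints (0, 0) and (1, 1) are added if they are missing.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

perfect = roc_auc([(0.0, 1.0)])  # curve passes through the top-left corner
useless = roc_auc([(0.5, 0.5)])  # curve lies on the diagonal
print(perfect, useless)
```

Feeding in the full set of points from two different energy methods would give two areas that can be compared directly.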

## Friday, March 7, 2014

### Open access and proposal review: two data points

One worry often expressed on-line with regard to publishing in open access journals is how it will impact one's chances of getting funded.  Here are two data points in that regard.

I just got two reviews back on a proposal I submitted to the Danish National Research Foundation, entitled "Quantum Biochemistry: New methods for computer aided design of new enzymes and drugs". The reviews are non-anonymous: one reviewer is from the US and the other from Finland, and neither was suggested by me nor appears as an author in the papers I reference.

Here's the relevant section of the proposal (note to self: include eLife next time)
Publication and Dissemination
All theoretical developments and applications will be published in peer-reviewed journals. As far possible we will publish in open access journals or journals with an open access option, to allow access to as many people as possible. However, any successful application to enzyme or drug design will be submitted to Nature or Science. All new theoretical methods will be incorporated into the GAMESS program, which is distributed free of charge to both academia and industry, and is the most popular non-commercial quantum chemistry program in the world.
Also, during the last two years I have published mostly in PLoS ONE and PeerJ, including all the results pertaining to this proposal.

Here's what the review form asks the reviewers to comment on with respect to this point:
Please write your comments on the overall considerations in the proposal with regard to the publication/dissemination/patenting of research results, briefly explaining both the strengths and weaknesses.
Here's what the reviewers wrote.

Reviewer 1
I’m glad that the researchers demonstrate a commitment to publishing their work and in open access journals and their software freely. I was somewhat disheartened to see the suggestion that the work will be submitted to Nature and Science only if the work is extremely successful (see Randy Sheckman’s thoughts on this, insightful even though they are not that they are all valid from my perspective).
Reviewer 2
I personally very much favour the open-access model, which is nicely taken shown in this application. Especially, since the created computational method will be freely available I am really happy with this part.
I don't know yet if the proposal will be funded, but if it isn't, it won't be because of the reviewers' views on open access.

## Wednesday, March 5, 2014

### Top 10 reasons to not share your data (and why you should anyway)

Much has been made about the recently announced data policy at PLoS (see this post for a summary of sorts or Google #plosfail). Reading some of this I was reminded of this excellent piece of writing by Randall J. LeVeque.  It is entitled "Top 10 reasons to not share your code (and why you should anyway)" but most of it applies equally well to data in my opinion.  Some excerpts follow.

Before discussing computer code, I'd like you to join me in a thought experiment. Suppose we lived in a universe where the standards for publication of mathematical theorems are quite different: papers present theorems without proofs, and readers are expected to simply believe the author when it is stated that the theorem has been proved.
In this alternative universe the reputation of the author would play a much larger role in deciding whether a paper containing a theorem could be published. ...  Eventually some agitators might come along and suggest that it would be better if mathematical papers contained proofs. Many arguments would be put forward for why this is a bad idea. Here are some of them ...
1. The proof is too ugly to show anyone else. It would be too much work to rewrite it neatly so others could read it. And anyway it's just a one-off proof for this particular theorem, and not intended for others to see, or to use the ideas for proving other theorems. My time is much better spent proving another result and publishing more papers rather than putting more effort into this theorem, which I've already proved.
2. I didn't work out all the details. Some tricky cases I didn't want to deal with, but the proof works fine for most cases, such as the ones I used in the examples in the paper. (Well, actually I discovered that some cases don't work, but they will probably never arise in practice.)
3. I didn't actually prove the theorem, my student did.  And the student has since moved to Wall Street, and thrown away the proof, since of course dissertations also need not include proofs.  But the student was very good, so I am sure it was correct.
4. Giving the proof to my competitors would be unfair to me. It took years to prove this theorem, and the same idea can be used to prove other theorems. I should be able to publish at least 5 more papers before sharing the proof. If I share it now my competitors can use the ideas in it without having to do any work, and perhaps without even giving me credit since they won't have to reveal their proof technique in their papers.
5. The proof is valuable intellectual property. The ideas in this proof are so great that I might be able to commercialize them some day, so I'd be crazy to give them away.
6. Including proofs would make math papers much longer. Journals wouldn't want to publish them and who would want to read them?
7. Referees will never agree to check proofs. It would be too hard to check correctness of long proofs and finding referees would become impossible.  It's already hard enough to find good referees and get them to submit reviews in finite time.  Requiring them to certify the correctness of proofs would bring the whole mathematical publishing business crashing down.
8. The proof uses sophisticated mathematical machinery that most readers/referees don't know. Their wetware cannot fully execute the proof, so what's the point in making it available to them?
9. My proof invokes other theorems with unpublished (proprietary) proofs. So it won't help to publish my proof - readers still will not be able to fully verify correctness.
10. Readers who have access to my proof will want user support. Anyone who can't figure out all the details will send email requesting that I help them understand it, and asking how to modify the proof to prove their own theorem. I don't have time or staff to provide such support.

### This spam mail made me laugh

From: xxx <xxx@synpeptide.cn>
Subject: The Molecule Calculator: A Web Application for Fast Quantum Mechanics-Based Estimation of Molecular
Date: March 4, 2014 2:03:02 PM GMT+01:00
To: <jhjensen@chem.ku.dk>

Dear Professor

How have you been? Hope everything goes well on your side.
Please excuse me to take the liberty of writing to you that I have just read a publication The Molecule Calculator: A Web Application for Fast Quantum Mechanics-Based Estimation of Molecular,
During your publication, I know  you used peptides in this research.

I am writing to tell you that our company Synpeptide which you could compare with your Current supplier.

First-Price; In order to show our sincerity, we will offer you 70%-80% of your Current supplier's prices .

Secondly-Delivery time: 1-2 weeks, We promise that if there is any delayed you get your peptides, you will get the peptides for free.

Of caurse, We understand you may worry about the quality of the peptides when you choose a new supplier.
I would like to tell you that Customers'satisfaction is our greatest pursuit, We promise you that if any problems result from the quality of products we offer, you will be refunded ten times of the products' value.

You can try to place an order to our company, then you could know our high quality peptides, at the same time you could compare our price, delivery time and service.

If you are busy today, I appreciate it much if you could forward this email to you students who are working with peptides.

Look forward to hearing from you.

Many thanks.

With kindest regards

xxx

## Sunday, March 2, 2014

### Book review: The Meaning of Life - On cactus finches, evolution and chaos

2014.03.20 update: Danish version

When I won my teaching award I knew to expect an interview in the school paper and, perhaps, the opportunity to speak about teaching at other departments.  However, I did not expect an invitation to speak at a Rotary meeting or a request to review a book for the University of Copenhagen alumni newsletter.  This blogpost is to collect my thoughts for the latter.

The review is part of a new series where alumni and university employees review books written by alumni. I was supplied a few possible author names but couldn't really find anything that looked appealing.  I briefly considered We, the Drowned because the story starts on the island where I grew up, but just couldn't see reading 700 pages by the deadline.

I finally got the idea to suggest a popular-science book.  The invitation stipulated fiction books so after my suggestion was shot down I could write an indignant e-mail on the importance of science literacy and be done with it.

Saxo.com (roughly, the Amazon.com of Denmark) lists only a handful of popular science books (isn't that sad?), but luckily one was written by a UC alum: The Meaning of Life - On cactus finches, evolution and chaos by Peter K. Busk (NB: I read the Danish version).  It looked interesting and was self-published, which I really like because I think that is the way of the future.  It's available only as an e-book (a 277 page PDF file) and costs 79 DKK (49 DKK for the English version).

The book outlines the case for the author's "theory of positive deltaS": that each individual living being is driven to maximize entropy (deltaS or chaos).  According to this theory, one excellent way of maximizing entropy is to produce offspring (i.e. more entropy producing units), but other ways include altruism, play, and consumerism - behavior that is difficult to explain via natural selection.

The book is divided into 10 chapters and each chapter is divided into short titled 1-2 page segments, which helps make the book very readable.  The first four chapters introduce the necessary scientific background to the non-expert: natural selection, the nature of a scientific theory, the laws of thermodynamics, and enzymes. This might sound like forbidding/boring stuff but the author makes use of some fun and memorable analogies: the genome becomes a handwritten book copied by medieval monks (polymerase enzymes) with varying degrees of sobriety who introduce errors in the text (mutations). Similarly, the first law of thermodynamics becomes a bookkeeper and entropy the result of a fox in the henhouse. These analogies worked very nicely for me and I actually learned a few new things about biology (the author is a molecular biologist) such as epigenetics and experiments performed to test natural selection.

Chapters 5 and 6 outline the combination of natural selection and thermodynamics to form the theory of positive deltaS and present scientific observations that are consistent with the theory. The remaining four chapters interpret human behavior in terms of the theory, discuss its implications for pollution and overpopulation, and summarize the book.  Humans are the masters of entropy production - turning chemical energy into heat - and this mastery has made us a superior species from a reproductive point of view but also a potential danger to ourselves. Can we save ourselves from overpopulation by maximizing our entropy production through other means than reproduction?

Is the theory right?  If you're asking to help you decide whether to read the book, the answer is "it doesn't matter". The facts in the book are correct, interesting and presented in an entertaining fashion and the interpretation of the facts in the light of the presented theory is sufficiently plausible (it would be hard to offer more solid quantitative or theoretical support in a book intended for the general reader) not to detract from the reading. I am happy I read it and I think you will be too.

By lucky coincidence (or possibly to maximize my entropy production) I came across this item while reading the book, which says many of the same things and links to a JCP paper with equations and everything.  To be a bit more precise the author (Jeremy England) uses non-equilibrium thermodynamics to argue "the more irreversible  the macroscopic process ... the more positive must be the minimum total entropy production".

England goes on to note "that exponential growth of the kind just described [Darwinian "fitness"] is a highly irreversible process: in a selective sweep where the fittest replicator comes to dominate in a population, the future almost by definition looks very different from the past." (However, no argument is provided for whether replication provides the maximum irreversibility.)

The equations indicate that "the replicator that dissipates more heat has the potential to grow accordingly faster. Moreover, we know by conservation of energy that this heat has to be generated in one of two different ways: either from energy initially stored in reactants out of which the replicator gets built (such as through the hydrolysis of sugar) or else from work done on the system by some time-varying external driving field (such as through the absorption of light during photosynthesis). In other words, basic thermodynamic constraints derived from exact considerations in statistical physics tell us that a self-replicator’s maximum potential fitness is set by how effectively it exploits sources of energy in its environment to catalyze its own reproduction. Thus, the empirical, biological fact that reproductive fitness is intimately linked to efficient metabolism now has a clear and simple basis in physics."

It looks like experiments and simulations are planned or underway to test this.  I look forward to seeing the results.