Friday 28 March 2008

Pybel as a generic API for cheminformatics libraries - proof of concept using CDK

I'm very interested in interoperability of open source chemistry codes. Following a comment by Egon on a recent post of mine, I started wondering whether the Pybel API could used with other cheminformatics libraries as a backend.

The advantage of this for the user would be (a) to reduce the learning curve - if you know how to use Pybel, you can access any of several different cheminformatics libraries with the same syntax, (b) the same scripts could be used to carry out a particular analysis using different cheminformatics libraries - different libraries may have different fingerprints, descriptors or implementations of particular algorithms (this is of course also useful for cross-checking the results of different programs) and (c) help reduce the divide between different cheminformatics toolkits (interoperability!!).

The rationale behind Pybel (described in the paper) lends itself to this use. Pybel doesn't attempt to wrap all the functionality of OpenBabel, but only the most common tasks in cheminformatics. For advanced options, or additional functionality, you can go behind the scenes and access OpenBabel directly. As a result, I propose that the Pybel API represents a generic API (one of many possible, of course) for accessing any cheminformatics library.

To test this, I have created CDKabel, a proof of concept which shows that the Chemistry Development Kit (CDK) can be accessed using Pybel syntax through Jython. CDKabel does not yet pass all of the Pybel tests, but there's enough to show that the approach has some merit. Compare the following: here's some Python code using Pybel and OpenBabel:
C:\Documents and Settings\oboyle>python25
Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32
bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more inf
ormation.
>>> from pybel import *
>>> for mol in readfile("sdf", "head.sdf"):
... print "Molecule has molwt of %.2f and %d atoms" %
(mol.molwt, len(mol.atoms))
...
Molecule has molwt of 122.12 and 15 atoms
Molecule has molwt of 332.49 and 28 atoms
>>>
Now here's some Jython code with CDKabel and CDK:
D:\Tools\CDK>set CLASSPATH=cdk-1.0.2.jar

D:\Tools\CDK>..\jython2.2.1\jython
Jython 2.2.1 on java1.6.0_05
Type "copyright", "credits" or "license" for more informa
tion.
>>> from cdkabel import *
>>> for mol in readfile("sdf", "head.sdf"):
... print "Molecule has molwt of %.2f and %d atoms" %
(mol.molwt, len(mol.atoms))
...
Molecule has molwt of 122.04 and 15 atoms
Molecule has molwt of 331.96 and 28 atoms
>>>
Well, at least they agree on the number of atoms :-) (It's my fault - CDK has like, ten different ways of calculating the molecular mass, and I just chose randomly :-) )

I've only spent a few minutes throwing CDKabel together, so it doesn't do much beyond the example shown. However, if interested, you can download it and try it for yourself.

I'd appreciate comments on the idea that there is a core Python API that could be usefully applied to several cheminformatics libraries. Would anyone use CDKabel if it were available?

Wednesday 26 March 2008

MadMol - Chemistry-aware code

There are definitely better things you can do with the functional group fingerprint (FP4) of OpenBabel, but instead, I've created MadMol. Basically, give MadMol a SMILES string such as CC(=O)CCl and it will tell you all you ever wanted to know about that molecule:
That there's a Primary_carbon. Isn't that a Alkylchloride? Wake up and smell the Ketone, cause that's what it is. Most people wouldn't realise this is a C_ONS_bond. It's a 1,3-Tautomerizable (or your money back). I don't believe it, it's a Rotatable_bond! Wake up and smell the CH-acidic, cause that's what it is.

And here's the code (requires Pybel, and the file SMARTS_InteLigand.txt):

import sys
import random

import pybel

def readsmartsfile(filename="SMARTS_InteLigand.txt"):
patterns = []
inputfile = open(filename, "r")
for line in inputfile:
line = line.strip()
if line and line[0]!="#":
colon = line.find(":")
name = line[:colon]
smarts = line[colon+1:].strip()
patterns.append([pybel.Smarts(smarts), name])
return patterns

phrases = ["I don't believe it, it's a %s!",
"Isn't that a %s?",
"It's a whadyamacallit, a %s.",
"Looks like a %s to me.",
"That there's a %s.",
"Most people wouldn't realise this is a %s.",
"It's a %s (or your money back).",
"Wow, a %s. Last time I saw one of these, I hit the"
"fire alarm and ran.",
"Could be a %s...yes, I'm sure of it.",
"It's a %s if I've ever seen one.",
"Wake up and smell the %s, cause that's what it is.",
"It's a %s. I wish I had one.",
"You've hit the jackpot, you and your %s!",
"A %s. You know, back in the day, we used to have"
"fun with these.",
"It takes me back years, this %s does."]

if __name__=="__main__":
if not len(sys.argv)==2:
sys.exit("You need a SMILES string like CC(=O)CCl")
molecule = pybel.readstring("smi", sys.argv[1])
print "So you want me to tell you about %s, do you?\n" % (
sys.argv[1],)
patterns = readsmartsfile()
print " ".join([random.choice(phrases) % name for
smarts, name in patterns if smarts.findall(molecule)])

Friday 21 March 2008

Python - the scripting language of chemistry

Python is a cross-platform scripting language which is easy to learn, and encourages readability and elegant code. If you're a chemist, it's also the most widely-used scripting language out there:Leave a comment below if you feel I've missed something. Here are some dated but related links: ChemPython.org, the text of Andrew Dalke's talk from PyCon 2004, and Python for Chemistry in 21 Minutes.

Monday 17 March 2008

Cube file considered harmful

Several quantum chemistry packages can be encouraged to output the electron density or the magnitude of the wavefunction for a particular orbital at grid points in a volume. This sort of volumetric data is commonly known as a cube file, as this is the name that Gaussian uses, although the volume may not necessarily be a cube. Such a cube file can be used to plot molecular orbitals at various isosurface values.

And that basically, is all they're good for. Nice pictures.

So, what's wrong with that? Well, try doing some sort of analysis with them. Let's say you're into conceptual DFT and want to start partitionally the electron density in weird and wonderful ways. You'll soon find that the cube file is unusable.

For example, during my PhD I wrote a Python script to automate the calculation of Hirshfeld atomic charges using cube files. To do this you need the electron density of the molecule itself, as well as that for each atom on its own. In my thesis, I have a test case using HCN. This required a cube of volume 10x10x12Ang3 with a grid spacing of 0.075Ang. That's pretty fine spacing, but it's required for accurate evaluation of the volume integration. And that's a pretty big cube, but it turns out that electron density falls off pretty slowly with distance.

I gave up trying calculate Hirshfeld charges for the molecules in which I was actually interested, ruthenium polypyridyl complexes. I worked out that a cube file of the appropriate size would be 13GB in size. And like I said, you don't just need one.

But what if I had access to the basis set itself and the molecular orbital coefficients? Then I could have bypassed the cube file and accessed the underlying mathematical functions. At a recent meeting on Tools, Visualisation and GUIs organised by CCP1, the GAMESS-UK developers have agreed to work with cclib to make this happen. This would allow users/developers to write software to analyse the electron density independent of any particular QM package. If you're interested, get in touch.

To finish, here are some pretty pictures relating to Hirshfeld charge analysis. First here's HCN:The next image shows (a) the promolecule (sum of the electron densities of each of the atoms on its own), (b) the electron density, and (c) the difference between a & b, referred to as the deformation density, which shows how electron density has changed on formation of the molecule.And finally, for the carbon atom, (a) the free atom density, (b) the bonded atom density, and (c) the atomic deformation density.

Monday 10 March 2008

Pybel publication in Chemistry Central Journal

Our Pybel paper has just been published in the new Open Access journal, Chemistry Central Journal:
Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit Noel M O'Boyle, Chris Morley and Geoffrey R Hutchison. Chemistry Central Journal 2008, 2:5. [DOI]

Here's the abstract:
Background

Scripting languages such as Python are ideally suited to common programming tasks in cheminformatics such as data analysis and parsing information from files. However, for reasons of efficiency, cheminformatics toolkits such as the OpenBabel toolkit are often implemented in compiled languages such as C++. We describe Pybel, a Python module that provides access to the OpenBabel toolkit.

Results

Pybel wraps the direct toolkit bindings to simplify common tasks such as reading and writing molecular files and calculating fingerprints. Extensive use is made of Python iterators to simplify loops such as that over all the molecules in a file. A Pybel Molecule can be easily interconverted to an OpenBabel OBMol to access those methods or attributes not wrapped by Pybel.

Conclusion

Pybel allows cheminformaticians to rapidly develop Python scripts that manipulate chemical information. It is open source, available cross-platform, and offers the power of the OpenBabel toolkit to Python programmers.

The paper describes the rationale behind Pybel, its implementation and some examples of its use. If you are looking for help on how to install or use Pybel, see the Python page on the OpenBabel website, or the examples page.

And if you don't like Python (!), you can also access OpenBabel from Perl, Ruby, Java and C++. So check it out.