Noel O'Blog: September 2007

Friday 14 September 2007

RDKit: Not just yet another cheminformatics toolkit

I'm sitting here with a bandage covering my head, as RDKit has just blown my mind.

RDKit is a cheminformatics toolkit written in C++ and Python. It had been developed in-house at a company called Rational Discovery (hence the RD) since 2001. In Feb 2006, it appeared on SourceForge under a liberal license (BSD, except for the GPL Qt code), an appearance which presumably coincided with the demise of Rational Discovery (the company, not the concept, that is :-) ). And there it stayed, actively maintained by two developers, but unknown among the open source chemistry community until...

A month ago, I happened to be glancing through the SourceForge software map of chemistry software and I was intrigued by the description of RDKit as "A collection of cheminformatics and machine-learning software developed at Rational Discovery". The website was pretty minimal and there didn't even appear to be any documentation. I dashed off an email to Greg Landrum, the main developer (who it turns out is also the developer of YAeHMOP (Yet Another extended Huckel Molecular Orbital Package) ), and asked him what the story was. Two days ago, he returned from holidays and pointed me to the correct website and the documentation, and I couldn't believe what I was seeing...

Some features that I think are cool:
(1) Molecules based on the Boost Graph Library
(2) All the Python stuff works for me on Windows!
(3) 2D depiction!!!
(4) 2D depiction that mimics 3D conformations!!!
(5) 2D --> 3D conversion using a method similar to Rajarshi's smi23D! (doesn't use stochastic proximity embedding though)

Here's a summary of some of the rest: SMILES, substructure searching, sophisticated fingerprints, machine learning stuff, a GUI, clustering, MACCS keys, descriptors (84 or so), chemical reaction transformations, implementation of Recap (not sure what this is, but there's a ref in the docs), basic pharmacophore stuff, and two types of SSSR (there's some text about this in the docs).

Cool! There's obviously a lot of overlap between OpenBabel and RDKit, and hopefully we can use this to both projects' advantage in terms of testing against each other and developing interfaces. In the meantime, here's to diversity, and to discovering that someone else has implemented 2D depiction in C++ so I Don't Have To.

For more info, see www.rdkit.org, and in particular, the Python interface documentation which gives a good overview.

Image credit: Toolkit by Neil T (CC BY-SA 2.0)

Wednesday 12 September 2007

Exclusive: PRISM's response to scientific community's outrage

In a world exclusive, I can reveal the considered response of PRISM to the detailed arguments raised by Peter Suber, Peter Murray-Rust, Stevan Harnad, etc., etc.:

For more thoughtful lampoonery, see Trapped in the USA and PISD.

* Image credit: This is a derivative of the image in the Wikipedia article "All your base are belong to us". I plead fair use, and a sense of humour.

Wikifying chemistry

There has been some interest in using wikis to annotate molecules: e.g. the ChemSpider blog where Antony is interested in using a local wiki to annotate entries, chem-bla-ics where Egon is trying to ensure that molecular data on Wikipedia can be accessed like a database, and most recently where PMR has commented on DBPedia.

This idea is already being used in biology. The Rfam database at the Sanger Institute uses Wikipedia directly to create annotations for the major RNA families, each of which has a page on Wikipedia. The full list of Wikipedia pages seems to be available here. From what I've heard second-hand, interested academics are invited to contribute to the pages on their favourite families. Every day the pages are downloaded and backed up, and the information is made available through the Rfam database. All edits are tracked using Wikipedia's own tracking facilities (i.e. watchlists) so that vandalism is easily detected, although apparently this hasn't been a problem.

I'm not sure how much this idea could be used in chemistry (perhaps a database of drugs...?), but it sure is some food for thought...

Wednesday 5 September 2007

Scooby, D. D., where are you? - Searching for papers online

One of my pet annoyances is searching on a journal website for an article that I know exists, but not being able to find it. Let's take a real-life example...

I met Douglas Hawkins at the ACS and wanted to look up his papers in JCIM/JCICS. I've bookmarked the JCIM TOC page, so I go there. At the top is a handy little shortcut for finding papers by a particular author. So I type in "Hawkins" next to the Author drop-down box, and click "Search"...

10 documents. Of which half are Douglas Hawkins (the right guy), and the other half are Donald Hawkins (the wrong guy). In addition, both JCIM and JCICS were searched. Great. This is where I should have stopped. Instead I decided I only wanted to find those papers by Douglas Hawkins. The following are my attempts to find his papers:

Hawkins, D. M.: 7 articles. All false positives.
Hawkins D M: ditto
"Hawkins, D. M.": 0 documents
D M Hawkins: It's my favourite 7 articles again.
D Hawkins: 24 articles. No articles by D. M. Hawkins included.

But how can I have found 24 articles for the last search? That's more than I found with just "Hawkins". Wait a second, I've been moved onto the "Advanced Article Search" page, which is searching all of the journals. So what did it find instead? It found "...Mass, J. D.; Hawkins, A. R.;". Pretty advanced searching, eh?

Thankfully, I happen to know that unlike most chemistry journals, the ACS has contributed data to PubMed. So off I go, and try:

Hawkins, D. M.: first hit contains "M. M. Hawkins"
DM Hawkins: first hit contains "Hawkins EC"
Hawkins DM: jackpot!

Now I just want those papers where Basak is a co-author:
Hawkins DM Basak: jackpot!

Incidentally, this is the first time that PubMed has ever worked for me. This is because I always try "DM Hawkins Basak", which gives no hits (even after reading the instructions I was never able to find anything). And how do I limit the results to a particular journal? Why isn't there a drop-down box with a list of journals? Why do I have to read the instructions for the obscure syntax used by PubMed every time I want to find a paper?
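For the record, the obscure syntax mostly comes down to field tags in square brackets, such as [Author] and [Journal]. Here is a minimal sketch of how such a query could be assembled and turned into a URL for NCBI's E-utilities esearch service. The helper names pubmed_term and esearch_url are made up for illustration, and nothing is actually sent over the network, so I make no promises about what comes back.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_term(authors, journal=None):
    """Build a PubMed query string using field tags (hypothetical helper).

    authors: names in PubMed's "Surname Initials" form, e.g. "Hawkins DM".
    journal: optional journal title or abbreviation.
    """
    parts = [f"{author}[Author]" for author in authors]
    if journal:
        # Quote the journal so that it is treated as a single phrase.
        parts.append(f'"{journal}"[Journal]')
    return " AND ".join(parts)


def esearch_url(term, retmax=20):
    """Return an esearch URL for the given query (hypothetical helper)."""
    query = urlencode({"db": "pubmed", "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


term = pubmed_term(["Hawkins DM", "Basak"], journal="J Chem Inf Model")
print(term)
# Hawkins DM[Author] AND Basak[Author] AND "J Chem Inf Model"[Journal]
print(esearch_url(term))
```

Putting the tags on explicitly means the search engine does not have to guess which word is the surname, which is exactly where "DM Hawkins Basak" seems to go wrong.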

What about Google Scholar? "DM Hawkins" or "D. M. Hawkins" gives 1700 hits, of which the first page at least are true positives. Advanced Search allows me to specify the journal, and I get 1 hit with "J Chem Inf Comput Sci", a different hit for "J Chem Inf Comp Sci" (of which there are 5 versions, one of which has "Comput Sci" instead of "Comp Sci") and 1 for "J Chem Inf Model". In fact, Hawkins has 3 papers in JCICS and 2 in JCIM. You look good, Google, but you're not trying very hard.

Searching by author should be easy. It's not quantum mechanics. It's not even rocket science. Can't they make it work better?
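To back up the claim that this is not rocket science, here is a rough sketch of the sort of name normalisation I have in mind. It is not how any of these sites actually work. The functions normalise_author and same_author are invented for this post, and the heuristic assumes Western-style names in which short all-capital tokens are initials, so a surname written as "LI" would fool it.

```python
import re


def _looks_like_initials(token):
    # Assumption: up to three capital letters, e.g. "D", "DM", are initials.
    return token.isupper() and len(token) <= 3


def normalise_author(name):
    """Reduce an author string to (lower-case surname, initials)."""
    if "," in name:
        # "Surname, Given" form: everything before the comma is the surname.
        surname, given = name.split(",", 1)
        surname = surname.strip()
        given_tokens = re.findall(r"[A-Za-z]+", given)
    else:
        tokens = re.findall(r"[A-Za-z]+", name)
        if not tokens:
            raise ValueError("empty author name")
        # Take the last token that is not a block of initials as the surname.
        words = [i for i, t in enumerate(tokens) if not _looks_like_initials(t)]
        pos = words[-1] if words else len(tokens) - 1
        surname = tokens[pos]
        given_tokens = tokens[:pos] + tokens[pos + 1:]
    initials = ""
    for token in given_tokens:
        # Keep blocks of initials whole; shorten full first names to one letter.
        initials += token if _looks_like_initials(token) else token[0].upper()
    return surname.lower(), initials


def same_author(query, candidate):
    """True if the surnames agree and the query initials are a prefix."""
    q_surname, q_initials = normalise_author(query)
    c_surname, c_initials = normalise_author(candidate)
    return q_surname == c_surname and c_initials.startswith(q_initials)


for variant in ["Hawkins, D. M.", "Hawkins D M", "D M Hawkins", "Hawkins DM"]:
    print(variant, "->", normalise_author(variant))

print(same_author("D Hawkins", "Hawkins, D. M."))  # True
print(same_author("D Hawkins", "Hawkins, A. R."))  # False
```

With something like this, every variant I typed above collapses to the same key, and "D Hawkins" no longer matches "Mass, J. D.; Hawkins, A. R.".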

Monday 3 September 2007

Python code for Huckel theory calculations

Over at Chemical Quantum Images, Felix has posted Python code that will read a file containing a molecular structure and calculate the molecular orbitals using Huckel theory. The code uses the OpenBabel Python module to read the structure, and NumPy, the Python extension for numerical computing, to do the heavy lifting (to diagonalise a matrix).

Now, if I only understood the theory...:-)
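For anyone in the same boat, the NumPy half of the recipe can be sketched in a few lines. This is not Felix's code. It is a minimal illustration that hard-codes the connectivity of the four carbons in butadiene rather than reading a structure with OpenBabel, and it assumes the usual simple Huckel setup: every atom has the same energy alpha, every bonded pair has the same interaction beta, and everything else is zero. The function name huckel_orbitals and the default values (alpha = 0, beta = -1) are my own choices.

```python
import numpy as np


def huckel_orbitals(adjacency, alpha=0.0, beta=-1.0):
    """Return (energies, coefficients) for a simple Huckel calculation.

    adjacency: square 0/1 matrix, 1 where two pi-system atoms are bonded.
    alpha, beta: the two Huckel parameters (illustrative defaults).
    """
    adjacency = np.asarray(adjacency, dtype=float)
    # Huckel matrix: alpha on the diagonal, beta between bonded atoms.
    hamiltonian = alpha * np.eye(len(adjacency)) + beta * adjacency
    # The matrix is symmetric, so eigh applies; energies come back ascending.
    energies, coefficients = np.linalg.eigh(hamiltonian)
    return energies, coefficients


# Butadiene: four carbons in a chain, C1-C2-C3-C4.
butadiene = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

energies, coefficients = huckel_orbitals(butadiene)
print(np.round(energies, 3))
```

With these defaults the four energies should come out at roughly -1.618, -0.618, 0.618 and 1.618, i.e. in units of |beta| relative to alpha, and each column of coefficients holds the atomic contributions to the corresponding orbital.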
