Monday 26 September 2011

chemfp 1.0 - Get your fingerprints off this

Andrew Dalke has just released V1.0 of his chemical fingerprint package, chemfp. The goal of the project is to produce a standardised file format called FPS for chemical fingerprints, as well as a set of tools around them. The project itself is a Python module with some superfast C code.

Why is this needed? Well, right now, if you wanted to use several different toolkits to generate fingerprints and then compare their performance for some application (e.g. 2D similarity searching), you would have to go to quite some lengths to figure out how to handle each toolkit's own file format. You could of course avoid file formats completely and use a common API such as Cinfony, but it's often useful to precalculate the fingerprints and store them in a file (also, you may not have access to the toolkit, only a file of fingerprints).

So, if you have the appropriate toolkit, you can use chemfp to generate fingerprints in the FPS format. For example, if you have the Python bindings for Open Babel, you can generate the FP2, FP3, FP4 and MACCS fingerprints in FPS format. The other supported toolkits, OEChem and RDKit, each come with their own fingerprints. Of course, it would make sense for this format to be supported by the toolkits themselves, and indeed Cactvs already supports the FPS format, as does Rajarshi's fingerprint R package. Direct FPS support by Open Babel is also on the cards.
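To give a feel for why a shared text format is handy, here is a minimal sketch of a reader for FPS-style data. It assumes the layout as I understand it from the chemfp documentation: header lines starting with "#" (the first being "#FPS1"), then one record per line made of a hex-encoded fingerprint, a tab, and an identifier. The function name read_fps and the sample data are made up for illustration, and this is not chemfp's own reader, which you should prefer in practice.

```python
import io


def read_fps(lines):
    """Parse FPS-style text into (header dict, list of (identifier, bytes)).

    Hypothetical helper for illustration only; not part of chemfp.
    """
    header = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines are assumed to look like "#key=value";
            # a bare line such as "#FPS1" is stored with a value of None.
            key, sep, value = line[1:].partition("=")
            header[key] = value if sep else None
            continue
        # Data lines: hex fingerprint, tab, identifier (extra fields ignored).
        fields = line.split("\t")
        records.append((fields[1], bytes.fromhex(fields[0])))
    return header, records


# Made-up 16-bit fingerprints, just to exercise the parser.
SAMPLE = "#FPS1\n#num_bits=16\n00ff\tmol1\n0f0f\tmol2\n"
header, records = read_fps(io.StringIO(SAMPLE))
```

Because each record is just text, any toolkit (or a few lines of script like the above) can produce or consume such a file without linking against the toolkit that generated it.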

What about the tools around this format? Right now, there's only a similarity search tool, but that is already very useful as (for example) it supports "many against many" searches, a feature I have heard requested by several Open Babel users. More tools are on the way though. It's also possible to write your own tools using the Python API. For example, Andrew has written up examples on generating a distance matrix and drawing a cluster dendrogram (in about 20 lines of code, albeit with the help of a few libraries), and on Taylor-Butina clustering (this one maybe 40 lines).
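To make "many against many" concrete, here is a rough pure-Python sketch of the idea, assuming Tanimoto similarity (the usual measure for 2D fingerprint searching) over fingerprints held as bytes. The names tanimoto and search_many, the threshold value and the sample fingerprints are all my own inventions for illustration; this does not use the chemfp API, and chemfp's C code will be vastly faster.

```python
def tanimoto(a: bytes, b: bytes) -> float:
    """Tanimoto similarity: bits set in both / bits set in either."""
    x = int.from_bytes(a, "big")
    y = int.from_bytes(b, "big")
    either = bin(x | y).count("1")
    if either == 0:
        # Two empty fingerprints: treated as 0.0 here by convention.
        return 0.0
    return bin(x & y).count("1") / either


def search_many(queries, targets, threshold=0.7):
    """Compare every query against every target.

    Returns {query_id: [(target_id, score), ...]} with hits at or above
    the threshold, best score first. Hypothetical helper, not chemfp's.
    """
    hits = {}
    for qid, qfp in queries:
        scored = [(tid, tanimoto(qfp, tfp)) for tid, tfp in targets]
        kept = [hit for hit in scored if hit[1] >= threshold]
        hits[qid] = sorted(kept, key=lambda hit: -hit[1])
    return hits


# Made-up 16-bit fingerprints.
queries = [("q1", bytes.fromhex("00ff"))]
targets = [
    ("t1", bytes.fromhex("00ff")),
    ("t2", bytes.fromhex("0f0f")),
    ("t3", bytes.fromhex("ff00")),
]
results = search_many(queries, targets, threshold=0.3)
```

The same pairwise scores, taken over a single set of fingerprints and converted to 1 - similarity, are what you would feed into a distance matrix for clustering.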

So check it out at http://chem-fingerprints.googlecode.com. In particular, the documentation is excellent.

Image credit: Jack of Spades

1 comment:

Andrew Dalke said...

Thanks for spreading the word, Noel!