Tuesday 2 February 2010

Similarity searching using OpenBabel - find similar molecules in ChEMBLdb

You can use the 'babel' program provided with OpenBabel (on Windows as well as Linux) to search in a database for molecules similar to a particular query. The full details are in the Fingerprint Tutorial on the OpenBabel wiki, but here is a case study using ChEMBLdb which is available as an SDF file of 517261 molecules.

Note that we are using the default OpenBabel fingerprint for all of these analyses. This fingerprint is FP2, a path-based fingerprint (somewhat similar to the Daylight fingerprints).

(1) Download Version 2 of ChEMBLdb from ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/.

(2) After unzipping it, make a fastsearch index (this took 18 minutes on my machine, for the 500K+ molecules).
babel chembl_02.sdf -ofs

(3) Let's use the first molecule in the sdf file as a query. Using Notepad (or on Linux, "head -79 chembl_02.sdf") extract the first molecule and save it as "first.sdf". Note that the molecules in the ChEMBL sdf do not have titles; instead, their IDs are stored in the "chebi_id" property field.
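
If you would rather not count lines, the first record can also be cut at the SDF record delimiter, a line starting with "$$$$". The following is a minimal sketch, not part of the original workflow; it builds a tiny stand-in file (demo.sdf, with made-up content) so that it can be run without downloading anything.

```shell
# Build a tiny two-record stand-in for chembl_02.sdf. The content is
# hypothetical; only the "$$$$" record delimiter matters here.
tmpdir=$(mktemp -d)
printf '%s\n' 'mol A line 1' 'mol A line 2' '$$$$' \
              'mol B line 1' '$$$$' > "$tmpdir/demo.sdf"

# Print every line up to and including the first line that starts with
# "$$$$", then quit. No trailing anchor is used, so a file with Windows
# line endings is still matched.
sed '/^\$\$\$\$/q' "$tmpdir/demo.sdf" > "$tmpdir/first.sdf"

cat "$tmpdir/first.sdf"
```

On the real file the same idea would be "sed '/^\$\$\$\$/q' chembl_02.sdf > first.sdf", which should give the same result as the head command above.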

(4) This first molecule is 100183. Check its ChEMBL page. It's pretty weird, but is there anything similar in ChEMBLdb? Let's find the 5 most similar molecules. Because we have created the fastsearch index, this is extremely fast; on my machine it takes just 2 seconds:
babel chembl_02.fs mostsim.sdf -Sfirst.sdf -at5

(5) The results are stored in mostsim.sdf, but how similar are these molecules to the query?
babel first.sdf mostsim.sdf -ofpt
>
> Tanimoto from first mol = 1
Possible superstructure of first mol
> Tanimoto from first mol = 0.986301
> Tanimoto from first mol = 0.924051
Possible superstructure of first mol
> Tanimoto from first mol = 0.869048
Possible superstructure of first mol
> Tanimoto from first mol = 0.857143
6 molecules converted
76 audit log messages
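
For reference, the Tanimoto values reported here follow the usual definition for bit-string fingerprints: the number of bits set in both fingerprints divided by the number of bits set in either. In the sketch below, a and b are the numbers of bits set in fingerprints A and B, and c is the number set in both; these symbols are my own notation, not OpenBabel's.

```latex
T(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{c}{a + b - c}
% If every bit of the query A is also set in B (a possible superstructure),
% then c = a and the expression reduces to T = a / b.
```

As an illustration only (the counts are not read from the actual fingerprints), 72 shared bits out of 73 set in either fingerprint would give 72/73 ≈ 0.986301. A value of 1 means the two fingerprints are identical, which is what we expect when the query is compared with its own entry in the database.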

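(6) That's all very well, but it would be nice to show the ChEBI IDs. Let's copy the content of the "chebi_id" property field into the title field (the --append option adds it to the existing title, which is empty here), write the result to mostsim_withtitle.sdf, and repeat step 5.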
babel mostsim.sdf mostsim_withtitle.sdf --append "chebi_id"
babel first.sdf mostsim_withtitle.sdf -ofpt
>
>100183 Tanimoto from first mol = 1
Possible superstructure of first mol
>124893 Tanimoto from first mol = 0.986301
>206983 Tanimoto from first mol = 0.924051
Possible superstructure of first mol
>207022 Tanimoto from first mol = 0.869048
Possible superstructure of first mol
>607087 Tanimoto from first mol = 0.857143
6 molecules converted
76 audit log messages
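
If you want the IDs and scores as a table, for example to paste into a spreadsheet, the output can be post-processed with awk. This is only a sketch: it assumes the line layout shown above, takes the output text from this post as its sample input, and relies on each "Possible superstructure" line referring to the molecule printed just before it. The variable names (fpt_output, table) are my own.

```shell
# Sample input: the titled fpt output from step 6, copied from this post.
fpt_output='>
>100183 Tanimoto from first mol = 1
Possible superstructure of first mol
>124893 Tanimoto from first mol = 0.986301
>206983 Tanimoto from first mol = 0.924051
Possible superstructure of first mol
>207022 Tanimoto from first mol = 0.869048
Possible superstructure of first mol
>607087 Tanimoto from first mol = 0.857143'

# For each Tanimoto line, remember the ID (title without the leading ">")
# and the score (last field). A following "Possible superstructure" line
# flags that molecule. Each record is printed as ID<TAB>score<TAB>flag.
table=$(printf '%s\n' "$fpt_output" | awk '
  function emit() { if (id != "") printf "%s\t%s\t%s\n", id, score, flag }
  /^>.*Tanimoto from first mol = / {
    emit(); id = substr($1, 2); score = $NF; flag = "no"; next
  }
  /^Possible superstructure/ { flag = "yes" }
  END { emit() }
')

printf '%s\n' "$table"
```

With real data you would pipe the babel command's output into the same awk program instead of using the pasted text.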

(7) Here are the ChEMBL pages for these molecules: 100183, 124893, 206983, 207022, 607087. I think it is fair to say that they are pretty similar. In particular, the output states that 206983 and 207022 are possible superstructures of the query molecule, and that is indeed true.
