Cassidy clustering

These are notes to myself from Teague/Cassidy for the new cluster.

I've written some code to generate Daylight-like fingerprints (Base-64) from the output of ChemAxon's generatemd. It's a wonderfully named program called "binstr2base64.py", although it can do the reverse conversion as well. It's currently in /raid1/people/teague/Code/cluster/binstr2base64.py. It seems to work fine, but there's a small bug when reading from standard input after a SIGKILL interrupt. This is not really a big deal, as it's pretty hard to reproduce under normal circumstances.

I haven't verified that the Base-64 encoding is compatible with the mappings in fast_tanimoto.c, but the decode and encode methods are consistent with each other, i.e. running generatemd's output through binstr2base64.py twice gives back the same fingerprint (sans delimiters).
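The round-trip check described above can be sketched in a few lines of Python. This is only an illustration, not the actual binstr2base64.py: the function names are made up, the '|' delimiter in the sample is a guess at what the bit-string output might contain, and it uses the standard Base-64 alphabet from Python's base64 module, which may or may not match the mapping in fast_tanimoto.c.

```python
import base64


def bits_to_base64(bits: str) -> str:
    """Pack a string of '0'/'1' characters into bytes and Base-64 encode it.

    Hypothetical helper: any character other than '0' or '1' is treated
    as a delimiter and dropped.
    """
    bits = "".join(ch for ch in bits if ch in "01")
    if not bits or len(bits) % 8:
        raise ValueError("need a non-empty bit string whose length is a multiple of 8")
    raw = int(bits, 2).to_bytes(len(bits) // 8, "big")
    return base64.b64encode(raw).decode("ascii")


def base64_to_bits(text: str) -> str:
    """Decode Base-64 text back into a delimiter-free string of '0'/'1'."""
    raw = base64.b64decode(text)
    return "".join(f"{byte:08b}" for byte in raw)


if __name__ == "__main__":
    sample = "1010|0000|1111|0101"  # made-up 16-bit fingerprint with delimiters
    encoded = bits_to_base64(sample)
    decoded = base64_to_bits(encoded)
    # Round trip: the bits come back without delimiters, and re-encoding
    # them reproduces the same Base-64 text.
    assert decoded == sample.replace("|", "")
    assert bits_to_base64(decoded) == encoded
    print(encoded, decoded)
```

A check like this only shows that encoding and decoding agree with each other. It says nothing about whether the alphabet matches what fast_tanimoto.c expects.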

How To (using eMolecules as a demo with 1024-bit ECFP4-like fingerprints):

cd /raid1/people/teague/Code/cluster/test
gzip -dc /raid9/db/zinc-may08/byvendor/emol/emol_p0.smi.gz > emol_p0.smi
/raid3/software/jchem/current/bin/generatemd c emol_p0.smi -k CF -f 1024 -2 | /raid1/people/teague/Code/cluster/binstr2base64.py | paste -d' ' emol_p0.smi - > emol_p0.smi.fp
rm emol_p0.smi

The resulting emol_p0.smi.fp will be a BIG file (and I've already generated it), so if you want a smaller test simply to verify IO & decoding, you can use test.smi, which contains the first 10 or so compounds. You'll have to generate the fingerprints for it yourself, however.
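For a quick IO and decoding check on a small fingerprint file, something like the following could be used. It's a rough sketch under several assumptions: the fingerprint is the last whitespace-separated field on each line (as the paste -d' ' step above would produce), it is standard Base-64, and it should decode to exactly 1024 bits. The helper names and the test.smi.fp file name are hypothetical.

```python
import base64


def check_fingerprint_line(line: str, n_bits: int = 1024) -> bool:
    """Return True if the last field of the line decodes to exactly n_bits bits."""
    fields = line.split()
    if len(fields) < 2:  # need at least a SMILES string plus a fingerprint
        return False
    try:
        raw = base64.b64decode(fields[-1], validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return len(raw) * 8 == n_bits


def count_bad_lines(lines, n_bits: int = 1024) -> int:
    """Count non-blank lines whose fingerprint field fails the check."""
    return sum(
        1
        for line in lines
        if line.strip() and not check_fingerprint_line(line, n_bits)
    )


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_fp.py test.smi.fp
    with open(sys.argv[1]) as handle:
        print("bad lines:", count_bad_lines(handle))
```

If binstr2base64.py turns out to use a different alphabet or padding, the decoding step here would need to change accordingly.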

A note for Cassidy: If you experience Python errors, run the following (use python-env.sh instead if you're using bash):

source /raid3/software/python/bin/python-env.csh

I'll verify that the binary-to-Base-64 mappings are correct, but I'd recommend checking them yourself as well.