Difference between revisions of "Docking Analysis in DOCK3.8"

 
(13 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
== Location of new scripts/Install Instructions ==
 
== Location of new scripts/Install Instructions ==
  
/wynton/home/btingle/bin/top_poses
+
You can retrieve these scripts from the DOCK 3.8 repository @ https://github.com/docking-org/DOCK/tree/dev
  
All programs described are located on this directory for now. Copy the directory to your own $HOME or wherever you see fit. Github link soon.
+
<nowiki>
 +
git clone https://github.com/docking-org/DOCK.git
 +
cd DOCK
 +
git checkout dev</nowiki>
 +
 
 +
The scripts are located @ analysis/top_poses in the DOCK repository.
 +
 
 +
=== Python 3.8 ===
 +
 
 +
You need to include a link to a python3.8 executable in the top_poses directory for run_top_poses.bash and run_top_poses_mr.bash to work. This needs to be a link, you cannot copy the executable- it expects to be installed in particular directory. There are no pip requirements, just a blank python 3.8 install.  
  
Note the link to python3.8 in this directory. You need to include a link to a python3.8 executable in your personal bin directory. There are no pip requirements, just a blank python 3.8 install. You can also just use mine @ /wynton/home/btingle/soft/python-3.8-install/bin/python3.8
+
<b>On Wynton you can use the version installed @ /wynton/group/bks/soft/python-versions/python-3.8-install</b>
 +
 
 +
If you want to install python3.8 on your own, try the following:
 +
 
 +
<nowiki>
 +
wget https://www.python.org/ftp/python/3.8.8/Python-3.8.8.tgz
 +
 
 +
# MY_SOFT is the directory you want to install to
 +
tar -C $MY_SOFT -xzf Python-3.8.8.tgz
 +
pushd $MY_SOFT/Python-3.8.8
 +
./configure --prefix=$MY_SOFT
 +
make && make install
 +
popd
 +
 
 +
# add the new python 3.8 executable to your path to use
 +
export PATH=$PATH:$MY_SOFT/python-3.8-install/bin
 +
 
 +
# optional: clean up the configuration files
 +
# rm -r $MY_SOFT/Python-3.8.8.tgz
 +
# rm Python-3.8.8.tgz</nowiki>
  
 
== Scripts Description ==
 
== Scripts Description ==
Line 13: Line 41:
 
==== Description ====
 
==== Description ====
  
Main pose retrieval algorithm, runs on multiple cores. 7 cores is recommended and also the default.
+
Main pose retrieval algorithm, runs on multiple processes.
  
 
Input can be a directory or a file. If input is a directory, the script will use a recursive find command to locate all test.mol2.gz* files residing in the directory structure.
 
Input can be a directory or a file. If input is a directory, the script will use a recursive find command to locate all test.mol2.gz* files residing in the directory structure.
Line 36: Line 64:
  
 
  <nowiki>
 
  <nowiki>
python3.8 top_poses.py <input> <output> <<ncores>></nowiki>
+
usage: top_poses.py [-h] [-n NPOSES] [-o OUTPREFIX] [-j NPROCESSES] [--id-file INPUT_ID_FILE] dockresults_path
 +
 
 +
Retrieve the top N poses from docking results
 +
 
 +
positional arguments:
 +
  dockresults_path      Can be either a directory containing docking results, or a file where each line points to a docking results file.
 +
 
 +
optional arguments:
 +
  -h, --help            show this help message and exit
 +
  -n NPOSES            How many top poses to retrieve, default of 150000
 +
  -o OUTPREFIX          Output file prefix. Each run will produce two files, a mol2.gz containing pose data, and a .scores file containing relevant score information. Default is "top_poses"
 +
  -j NPROCESSES        How many processes should be dedicated to this run, default is 8.
 +
  --id-file INPUT_ID_FILE
 +
                        Only retrieve poses matching ids specified in an external file.
 +
</nowiki>
  
 
=== run_top_poses.bash ===
 
=== run_top_poses.bash ===
Line 43: Line 85:
  
 
Wrapper script for top_poses.py, can be used to submit individual pose jobs. Will run with 7 cores allocated.
 
Wrapper script for top_poses.py, can be used to submit individual pose jobs. Will run with 7 cores allocated.
 +
 +
Anything in args will be passed to top_poses.py on top of any arguments generated by the script. Valid additional arguments would be -n {X} or --id-file {X}.
  
 
==== Usage ====
 
==== Usage ====
  
 
  <nowiki>
 
  <nowiki>
run_top_poses.bash <input> <output></nowiki>
+
run_top_poses.bash <input> <output> <args></nowiki>
  
 
==== Typical qsub usage ====
 
==== Typical qsub usage ====
  
 
  <nowiki>
 
  <nowiki>
qsub -wd $PWD run_top_poses.bash <input> <output></nowiki>
+
qsub -wd $PWD run_top_poses.bash <input> <output> <args></nowiki>
  
 
=== run_top_poses_mr.bash ===
 
=== run_top_poses_mr.bash ===
Line 69: Line 113:
  
 
Only works on sge for right now. Tested on Wynton.
 
Only works on sge for right now. Tested on Wynton.
 +
 +
Anything in args will be passed to top_poses.py on top of any arguments generated by the script. Valid additional arguments would be -n {X} or --id-file {X}.
  
 
==== Usage ====
 
==== Usage ====
  
 
  <nowiki>
 
  <nowiki>
run_top_poses_mr.bash <input> <staging directory> <<batch size>></nowiki>
+
run_top_poses_mr.bash <input> <staging directory> <args></nowiki>
  
 
== Checking Logs ==
 
== Checking Logs ==
  
After your jobs have finished, check the logs to see if anything went wrong. If everything went smoothly, there should be nothing in the .err logs, and each .out log should end with a string of text that looks like this:
+
After your jobs have finished, check the logs to see if anything went wrong.  
 +
 
 +
<nowiki>
 +
<staging directory>/logs</nowiki>
 +
 
 +
If everything went smoothly, there should be an output file corresponding to each input file, there should be nothing in the .err logs, and each .out log should end with a string of text that looks like this:
  
 
  <nowiki>
 
  <nowiki>
Line 88: Line 139:
  
 
If you submitted with run_top_poses_mr.bash, all you need to do is to run it again with the same parameters as before. The script detects existing output and will only re-submit as necessary. This will also update the output_final.poses.mol2.gz file.
 
If you submitted with run_top_poses_mr.bash, all you need to do is to run it again with the same parameters as before. The script detects existing output and will only re-submit as necessary. This will also update the output_final.poses.mol2.gz file.
 +
 +
You may also see a message that looks like this:
 +
 +
<nowiki>
 +
short timeout reached while retrieving pose... trying again! curr=...</nowiki>
 +
 +
This just indicates slowness in the file reading, and is common to see at the beginning of a log or when the filesystem is under high load.

Latest revision as of 17:13, 27 July 2021

Location of new scripts/Install Instructions

You can retrieve these scripts from the DOCK 3.8 repository @ https://github.com/docking-org/DOCK/tree/dev

git clone https://github.com/docking-org/DOCK.git
cd DOCK
git checkout dev

The scripts are located @ analysis/top_poses in the DOCK repository.

Python 3.8

You need to include a link to a python3.8 executable in the top_poses directory for run_top_poses.bash and run_top_poses_mr.bash to work. This must be a link; you cannot copy the executable, because it expects to stay in the directory where it was installed. There are no pip requirements; a blank Python 3.8 install is sufficient.
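As a minimal sketch of the linking step, the following creates a symbolic link to an existing interpreter inside a target directory. It assumes the link should be named python3.8 and that a symbolic link is what the wrapper scripts expect; `link_python` is a hypothetical helper and is not part of the DOCK repository.

```python
import os
import sys
import tempfile


def link_python(python_exe: str, top_poses_dir: str, name: str = "python3.8") -> str:
    """Create (or refresh) a symlink called `name` in top_poses_dir.

    The link name "python3.8" is an assumption for illustration.
    """
    link_path = os.path.join(top_poses_dir, name)
    if os.path.islink(link_path):
        os.remove(link_path)  # replace a stale link rather than failing
    os.symlink(os.path.abspath(python_exe), link_path)
    return link_path


if __name__ == "__main__":
    # Demo in a throwaway directory, linking the interpreter running this script.
    with tempfile.TemporaryDirectory() as demo_dir:
        link = link_python(sys.executable, demo_dir)
        print(link, "->", os.readlink(link))
```

On Wynton, `python_exe` would point at the python3.8 binary inside the shared install mentioned below, and `top_poses_dir` at your checkout of analysis/top_poses.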

On Wynton you can use the version installed @ /wynton/group/bks/soft/python-versions/python-3.8-install

If you want to install python3.8 on your own, try the following:

wget https://www.python.org/ftp/python/3.8.8/Python-3.8.8.tgz

# MY_SOFT is the directory you want to install to
tar -C $MY_SOFT -xzf Python-3.8.8.tgz
pushd $MY_SOFT/Python-3.8.8
# install into the same directory that is added to PATH below
./configure --prefix=$MY_SOFT/python-3.8-install
make && make install
popd

# add the new python 3.8 executable to your path to use
export PATH=$PATH:$MY_SOFT/python-3.8-install/bin

# optional: clean up the extracted source tree and the downloaded tarball
# rm -r $MY_SOFT/Python-3.8.8
# rm Python-3.8.8.tgz

Scripts Description

top_poses.py

Description

Main pose retrieval algorithm, runs on multiple processes.

Input can be a directory or a file. If input is a directory, the script will use a recursive find command to locate all test.mol2.gz* files residing in the directory structure.

If input is a file, each line in the file should map to a valid pose file, e.g.:

/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0000/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0001/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0002/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0003/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0004/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0005/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0006/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0007/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0008/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0009/test.mol2.gz
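If you prefer to build such an input file yourself, for example to restrict a run to part of a directory tree, something along these lines could work. It is only a sketch that mimics the recursive search for test.mol2.gz* described above; `find_pose_files` and `write_pose_list` are hypothetical helpers, not functions from top_poses.py.

```python
import fnmatch
import os


def find_pose_files(root: str, pattern: str = "test.mol2.gz*") -> list:
    """Recursively collect files under `root` whose name matches `pattern`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            if fnmatch.fnmatch(fname, pattern):
                matches.append(os.path.join(dirpath, fname))
    return sorted(matches)


def write_pose_list(paths, list_file: str) -> None:
    """Write one pose file path per line, the format expected for file input."""
    with open(list_file, "w") as fh:
        for path in paths:
            fh.write(path + "\n")
```

Calling `write_pose_list(find_pose_files("docked_chunks"), "pose_files.txt")` would produce a list file that can be passed as the input argument; both names here are placeholders.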

Output is where the top poses will be written out when the script has finished, e.g. /scratch/top_poses.mol2.gz. The number of poses retrieved is set with -n (default 150000, per the usage text below).

Usage

usage: top_poses.py [-h] [-n NPOSES] [-o OUTPREFIX] [-j NPROCESSES] [--id-file INPUT_ID_FILE] dockresults_path

Retrieve the top N poses from docking results

positional arguments:
  dockresults_path      Can be either a directory containing docking results, or a file where each line points to a docking results file.

optional arguments:
  -h, --help            show this help message and exit
  -n NPOSES             How many top poses to retrieve, default of 150000
  -o OUTPREFIX          Output file prefix. Each run will produce two files, a mol2.gz containing pose data, and a .scores file containing relevant score information. Default is "top_poses"
  -j NPROCESSES         How many processes should be dedicated to this run, default is 8.
  --id-file INPUT_ID_FILE
                        Only retrieve poses matching ids specified in an external file.
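When launching top_poses.py from another Python script, it can help to assemble the argument list programmatically. The sketch below uses only the flags shown in the usage text above; `build_top_poses_command` is a hypothetical helper, and the interpreter and script paths are placeholders you would adjust.

```python
import shlex


def build_top_poses_command(dockresults_path, nposes=None, outprefix=None,
                            nprocesses=None, id_file=None,
                            python="python3.8", script="top_poses.py"):
    """Return an argv list for top_poses.py; flags left as None are omitted."""
    cmd = [python, script]
    if nposes is not None:
        cmd += ["-n", str(nposes)]
    if outprefix is not None:
        cmd += ["-o", outprefix]
    if nprocesses is not None:
        cmd += ["-j", str(nprocesses)]
    if id_file is not None:
        cmd += ["--id-file", id_file]
    cmd.append(dockresults_path)  # the positional argument goes last
    return cmd


if __name__ == "__main__":
    argv = build_top_poses_command("docked_chunks", nposes=1000, outprefix="top1k")
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in argv))
```

The returned list can be handed to `subprocess.run` directly, which avoids shell quoting problems with unusual paths.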

run_top_poses.bash

Description

Wrapper script for top_poses.py, can be used to submit individual pose jobs. Will run with 7 cores allocated.

Anything in args will be passed to top_poses.py on top of any arguments generated by the script. Valid additional arguments would be -n {X} or --id-file {X}.

Usage

run_top_poses.bash <input> <output> <args>

Typical qsub usage

qsub -wd $PWD run_top_poses.bash <input> <output> <args>

run_top_poses_mr.bash

Description

Map-reduce script to submit a number of analysis jobs and combine their results. The preferred method of running large analysis workloads.

Input field is evaluated the same as in top_poses.py.

Staging directory should be an NFS directory writable by your user. This is where input/output will be stored by the script.

Final output will show up in <staging directory>/output_final.poses.mol2.gz

Batch size refers to how many pose files will be evaluated by each job. The default is 1000, though you may want to change this depending on the size and number of your pose files.

Currently only works on SGE. Tested on Wynton.
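To illustrate the map-reduce idea in general terms, the sketch below splits a list of pose files into batches and then merges per-batch results into a single top-N list. It is not the script's actual implementation: `make_batches` and `merge_top_n` are hypothetical names, results are represented as simple (score, name) tuples, and it assumes that a lower score is better.

```python
import heapq


def make_batches(pose_files, batch_size=1000):
    """Map step: split the pose file list into chunks of at most batch_size."""
    return [pose_files[i:i + batch_size]
            for i in range(0, len(pose_files), batch_size)]


def merge_top_n(batch_results, n):
    """Reduce step: keep the n best (score, name) tuples across all batches.

    Assumes lower scores are better; returns results sorted best first.
    """
    all_items = (item for batch in batch_results for item in batch)
    return heapq.nsmallest(n, all_items)
```

With 2500 pose files and the default batch size, this would yield three batches of 1000, 1000, and 500 files, each of which would correspond to one job.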

Anything in args will be passed to top_poses.py on top of any arguments generated by the script. Valid additional arguments would be -n {X} or --id-file {X}.

Usage

run_top_poses_mr.bash <input> <staging directory> <args>

Checking Logs

After your jobs have finished, check the logs in the following directory to see if anything went wrong:

<staging directory>/logs

If everything went smoothly, there should be an output file corresponding to each input file, nothing in the .err logs, and each .out log should end with text that looks like this:

received all input!
joining threads...
done processing! writing out...
299900 / 300000

If you find an output file that doesn't end like this, you may wish to re-attempt that particular job.
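Checking many logs by hand gets tedious, so a small script can flag the suspicious ones. The sketch below assumes the logs directory contains files ending in .out and .err and that a finished job prints the "done processing! writing out..." line near the end of its .out log, as in the example above; `check_logs` is a hypothetical helper, and the exact log file names are not specified here.

```python
import glob
import os

DONE_MARKER = "done processing! writing out..."


def check_logs(log_dir, tail_lines=5):
    """Return a list of (path, reason) tuples for logs that look problematic."""
    problems = []
    for out_log in sorted(glob.glob(os.path.join(log_dir, "*.out"))):
        with open(out_log, errors="replace") as fh:
            tail = [line.strip() for line in fh.read().splitlines()[-tail_lines:]]
        if DONE_MARKER not in tail:
            problems.append((out_log, "missing completion marker"))
    for err_log in sorted(glob.glob(os.path.join(log_dir, "*.err"))):
        if os.path.getsize(err_log) > 0:
            problems.append((err_log, "non-empty .err log"))
    return problems


if __name__ == "__main__":
    # "staging/logs" is a placeholder for <staging directory>/logs.
    for path, reason in check_logs("staging/logs"):
        print(f"{path}: {reason}")
```

An empty result suggests every job finished cleanly; anything listed is a candidate for re-submission.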

If you submitted with run_top_poses_mr.bash, all you need to do is to run it again with the same parameters as before. The script detects existing output and will only re-submit as necessary. This will also update the output_final.poses.mol2.gz file.

You may also see a message that looks like this:

short timeout reached while retrieving pose... trying again! curr=...

This only indicates that file reading is slow. It is common to see at the beginning of a log or when the filesystem is under high load.