2dload

From DISI
Revision as of 17:18, 9 November 2020 by Btingle

New 2D instructions

New commands

1. pre_process_partition.bash [partition_id] [tranches]

2. python 2dload.py add [partition_id] [preprocess_file] [catalog_shortname]

3. python 2dload.py rollback [partition_id] list

4. python 2dload.py rollback [partition_id] [shortname_list]

5. 2dwrapper.bash [partition_id] [catalog_shortname] [tranches]

6. python 2dload.py postgres [partition_id] [port_number] [shortname_list]
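
Commands 1 and 2 are meant to be run in sequence: preprocess first, then add the resulting ".pre" file. As a rough sketch (not the actual contents of 2dwrapper.bash), the two command lines for one partition could be assembled like this. The helper name build_load_commands is hypothetical, the ".pre" path follows the /tmp/${PARTITION_ID}.pre location given in the description of command 1, and the tranches argument is passed through untouched because its format is not specified on this page.

```python
def build_load_commands(partition_id, catalog_shortname, tranches):
    """Return the argv lists for the preprocess step and the add step."""
    pid = str(partition_id)
    # Assumption: the preprocessing tarball is written to /tmp/<partition_id>.pre
    pre_file = "/tmp/%s.pre" % pid
    preprocess = ["pre_process_partition.bash", pid, str(tranches)]
    add = ["python", "2dload.py", "add", pid, pre_file, catalog_shortname]
    return preprocess, add


# "<tranches>" is a placeholder, not a real value
for cmd in build_load_commands(110, "s", "<tranches>"):
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run once the preprocessing jobs for the partition are known to have finished.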

Command Description

1. The new preprocessing command. Preprocessing is now separate from loading, so its results can be saved elsewhere in case the database needs to be reloaded from scratch. The script launches a number of Slurm jobs to preprocess molecules and exits once they have all completed. It saves a tarball of the collected results to /tmp/${PARTITION_ID}.pre

2. The new command for adding data to a database. As before, a partition ID and a catalog shortname are required. In addition, it requires a ".pre" preprocessing file produced by pre_process_partition.bash.

3. A new feature that lets you view the data currently loaded for a particular database. Any database created before the update displays "legacy archive" in its catalog size field when this command is run.

Example:

python 2dload.py rollback 110 list

tranche|date           |short  |catalog size
=======|===============|=======|===============
H26P370|08.31.19.11    |s      |3548322
H26P370|09.01.12.21    |m      |8485914
===============================================
H26P380|08.31.19.59    |s      |3996581
H26P380|09.01.13.50    |m      |8713228
===============================================
H26P390|08.31.20.48    |s      |4887562
H26P390|09.01.15.10    |m      |6547271
===============================================
H26P400|08.31.21.42    |s      |5240524
H26P400|09.01.16.16    |m      |4465551
===============================================
[11.05.09.49]: total substance table size: 44660929
[11.05.09.49]: total supplier table size:  22587771
[11.05.09.49]: total catalog table size:   45884953

python 2dload.py rollback 111 list

tranche|date           |short  |catalog size
=======|===============|=======|===============
H26P340|08.31.19.17    |s      |legacy archive
H26P340|09.01.13.15    |m      |legacy archive
H26P340|10.06.18.46    |s      |legacy archive
H26P340|10.07.11.32    |m      |legacy archive
===============================================
H26P350|08.31.20.09    |s      |legacy archive
H26P350|09.01.15.34    |m      |legacy archive
H26P350|10.06.21.55    |s      |legacy archive
H26P350|10.08.00.25    |m      |legacy archive
===============================================
H26P360|08.31.21.05    |s      |legacy archive
H26P360|09.01.17.34    |m      |legacy archive
H26P360|10.07.02.29    |s      |legacy archive
H26P360|10.08.15.00    |m      |legacy archive
===============================================
[11.05.09.49]: total substance table size: 48379301
[11.05.09.49]: total supplier table size:  23877467
[11.05.09.49]: total catalog table size:   50178572

(As you can see, partition 111 still has duplicate archives. IMPORTANT: before doing any loading, you must delete *all* erroneous archives in the source directory.)

4. Lets you roll the database back to a previous state. The target state is set by the shortname_list argument, a comma-separated list of the catalogs that should remain loaded after the rollback. For example:

python 2dload.py rollback 0 list

tranche|date           |short  |catalog size
=======|===============|=======|===============
H00P000|08.31.19.17    |s      |100000
H00P000|08.31.19.18    |m      |2000
H00P000|09.31.17.15    |u      |-56

(Uh oh, something is wrong with catalog u. Let's roll it back.)

python 2dload.py rollback 0 s,m
...
python 2dload.py rollback 0 list

tranche|date           |short  |catalog size
=======|===============|=======|===============
H00P000|08.31.19.17    |s      |100000
H00P000|08.31.19.18    |m      |2000

5. Wrapper script that will perform the entire loading process in one go, including preprocessing.

6. Not implemented yet. Intended for loading 2D data into Postgres. Expects a partition, a port, and a list of shortnames; the listed shortnames from this partition will be loaded into the Postgres server at the specified port number. I've noticed some issues with the data we've been loading into Postgres: specifically, the catalog information in the Postgres databases is incorrect. We might need to reload the Postgres databases at some point, so be aware.
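
The listing from command 3 can be used to work out the shortname_list for command 4. The following is a minimal sketch that assumes the output looks exactly like the examples on this page (pipe-separated rows, "=" rulers, bracketed total lines). The function names are hypothetical, and treating a non-positive catalog size as "broken" is an assumption made for illustration, not a rule enforced by 2dload.py.

```python
def parse_rollback_list(text):
    """Parse `rollback [partition_id] list` output into (tranche, date, short, size) rows."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows have exactly four fields; skip the header, rulers, and totals.
        if len(parts) != 4 or parts[0] == "tranche" or set(parts[0]) == {"="}:
            continue
        rows.append(tuple(parts))
    return rows


def shortnames_to_keep(rows):
    """Return a comma-separated shortname list, leaving out catalogs with a size <= 0."""
    order, bad = [], set()
    for _tranche, _date, short, size in rows:
        if short not in order:
            order.append(short)
        try:
            if int(size) <= 0:
                bad.add(short)
        except ValueError:
            # Non-numeric sizes such as "legacy archive" cannot be checked; keep them.
            pass
    return ",".join(s for s in order if s not in bad)


SAMPLE = """\
tranche|date           |short  |catalog size
=======|===============|=======|===============
H00P000|08.31.19.17    |s      |100000
H00P000|08.31.19.18    |m      |2000
H00P000|09.31.17.15    |u      |-56
"""

print(shortnames_to_keep(parse_rollback_list(SAMPLE)))  # prints: s,m
```

The printed string is what would be passed as the last argument of python 2dload.py rollback 0 s,m in the example for command 4. Review the list by hand before running a rollback.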

Additional

To run the 2dload.py script, you must use a Python 3 environment. Running it under Python 2 fails with an "AttributeError: __exit__" error.
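
One way to get a clearer message than "AttributeError: __exit__" is an explicit interpreter check before any real work starts. This is only a sketch of such a guard, not something 2dload.py is known to contain; the function name require_python3 is hypothetical. It is written so that it also parses under Python 2, which is the case it needs to catch.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Exit with a readable message unless the interpreter is Python 3 or newer."""
    if version_info[0] < 3:
        sys.exit("2dload.py requires Python 3; found Python %d.%d"
                 % (version_info[0], version_info[1]))


# Check the running interpreter before doing anything else.
require_python3()
```

The version_info parameter exists only so the check can be exercised with a fake version tuple.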