Difference between revisions of "2dload.py"

From DISI
Jump to navigation Jump to search
Line 30: Line 30:
 
</nowiki>
 
</nowiki>
  
Why not just upload the data directly to postgres? Great question- inserting into a postgres table is slow (especially when the table has multiple indexes), and gets slower the larger the table gets. In our case, we also need to select query the data after we insert it to create the catalog table input. This leads to a bit of a paradox- in order to select query faster, we need to add indexes to the table, but in order to insert faster, we need to remove them. The solution is to sidestep postgres entirely and user our sorting method to ensure that data going in is already checked for uniqueness.
+
Why not just upload the data directly to postgres? Great question- inserting into a postgres table is slow (especially when the table has multiple indexes), and gets slower the larger the table gets. In our case, we also need to select query the data after we insert it to create the catalog table input. This leads to a bit of a paradox- in order to select query faster, we need to add indexes to the table, but in order to insert faster, we need to remove them. The solution is to sidestep postgres entirely and user our sorting method to ensure that data going in is already checked for uniqueness. This allows us to disable all indexes and triggers, copy the data to postgres in bulk, then re-enable indexes and triggers. This allows for the maximum insert performance possible.

Revision as of 05:19, 25 December 2020

2dload.py is BKSLab's ZINC22 database management program, created by Benjamin Tingle.

2dload.py has three basic functionalities:

  • 2dload.py add
  • 2dload.py rollback
  • 2dload.py postgres

2dload.py operates on the level of ZINC22 partitions. (link, or whatever)

Adding data with 2dload

python 2dload.py add ${partition id} ${preprocessed input} ${catalog shortname}

(refer to partitions.txt to see the range of tranche space each partition id is associated with)

The add function will extract new entries from a preprocessed input file to each database table, for each tranche in the partition.

The algorithm is such:

The input file contains multiple tranche input files.
Each tranche input file has two columns, one for molecule SMILES, and another for supplier codes.
For each tranche:
    The input file is first split into its two component columns. (substance, supplier)
    For each input column:
        The input column file is concatenated with it's corresponding table file to a temporary file, which is then sorted and combed through by a uniqueness algorithm.
        The effect of this is to create two new files, one containing all entries in the column file that are new to the table, and another containing the resulting primary key of each column input line. If an input line was a duplicate, the primary key will be that of the corresponding original entry in the database.
    The resulting id files for each column are pasted onto one another. This new file is the input to our catalog table.
    The catalog input undergoes a process identical to that of the previous columns, except we don't bother to take the resulting ids this time.

Why not just upload the data directly to postgres? Great question- inserting into a postgres table is slow (especially when the table has multiple indexes), and gets slower the larger the table gets. In our case, we also need to select query the data after we insert it to create the catalog table input. This leads to a bit of a paradox- in order to select query faster, we need to add indexes to the table, but in order to insert faster, we need to remove them. The solution is to sidestep postgres entirely and user our sorting method to ensure that data going in is already checked for uniqueness. This allows us to disable all indexes and triggers, copy the data to postgres in bulk, then re-enable indexes and triggers. This allows for the maximum insert performance possible.