Difference between revisions of "2dload.py"

From DISI

Revision as of 05:03, 25 December 2020

2dload.py is BKSLab's ZINC22 database management program, created by Benjamin Tingle.

2dload.py has three basic functionalities:

  • 2dload.py add
  • 2dload.py rollback
  • 2dload.py postgres

2dload.py operates at the level of ZINC22 partitions.

Adding data with 2dload

python 2dload.py add ${partition id} ${preprocessed input} ${catalog shortname}

(Refer to partitions.txt to see which range of tranche space each partition id is associated with.)

The add function extracts new entries from a preprocessed input file and adds them to each database table, for each tranche in the partition.

The algorithm is as follows:

The input file contains multiple tranche input files.
Each tranche input file has two columns, one for molecule SMILES, and another for supplier codes.
For each tranche:
    The input file is first split into its two component columns (substance, supplier).
    For each input column:
        The input column file is concatenated with its corresponding table file into a temporary file, which is then sorted and passed through a uniqueness algorithm.
        The effect of this is to create two new files, one containing all entries in the column file that are new to the table, and another containing the resulting primary key of each column input line. If an input line was a duplicate, the primary key will be that of the corresponding original entry in the database.
    The resulting id files for each column are pasted together side by side, line by line. This new file is the input to the catalog table.
    The catalog input undergoes a process identical to that of the previous columns, except that the resulting ids are not kept this time.
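
The steps above can be illustrated with a small in-memory sketch. This is not the code in 2dload.py: the real program works on files, while this sketch uses dicts and lists to show the same idea. The function names (merge_column, add_tranche), the whitespace-separated input lines, and the rule that new primary keys continue from the largest existing key are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def merge_column(table: Dict[str, int],
                 column: List[str]) -> Tuple[List[Tuple[str, int]], List[int]]:
    """Hypothetical helper: merge one input column into one table.

    Returns (new_entries, ids): the entries that are new to the table, and
    the primary key that each input line resolves to.
    """
    # Assumption: new keys continue from the largest existing key.
    next_id = max(table.values(), default=0) + 1

    # "Concatenate" table rows and input rows, tagging each with its origin.
    # Origin 0 (table) sorts before origin 1 (input) for equal values, so an
    # existing key is always seen before any duplicate input line.
    rows = [(value, 0, key) for value, key in table.items()]
    rows += [(value, 1, line_no) for line_no, value in enumerate(column)]
    rows.sort()

    new_entries: List[Tuple[str, int]] = []
    ids = [0] * len(column)
    current_value, current_id = None, 0
    for value, origin, ref in rows:
        if value != current_value:
            # First row of a run of equal values decides the key.
            current_value = value
            if origin == 0:
                current_id = ref          # already in the table
            else:
                current_id = next_id      # new to the table
                next_id += 1
                new_entries.append((value, current_id))
        if origin == 1:
            ids[ref] = current_id         # key for this input line
    return new_entries, ids


def add_tranche(lines: List[str],
                substance_table: Dict[str, int],
                supplier_table: Dict[str, int],
                catalog_table: Dict[str, int]):
    """Hypothetical driver for one tranche input (SMILES, supplier code)."""
    # Split the input into its two component columns.
    smiles = [line.split()[0] for line in lines]
    codes = [line.split()[1] for line in lines]

    new_substances, substance_ids = merge_column(substance_table, smiles)
    new_suppliers, supplier_ids = merge_column(supplier_table, codes)

    # "Paste" the two id columns side by side to form the catalog input.
    catalog_input = [f"{a}\t{b}" for a, b in zip(substance_ids, supplier_ids)]

    # Same process for the catalog table, but the resulting ids are discarded.
    new_catalog, _ = merge_column(catalog_table, catalog_input)
    return new_substances, new_suppliers, new_catalog
```

Sorting the combined rows puts duplicates next to each other, so a single pass is enough to tell new entries from existing ones. That is the same reason the file-based description sorts the temporary file before running the uniqueness step.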