ZINC22 Partitions

Revision as of 04:27, 25 December 2020 by Btingle (talk | contribs)

The ZINC22 molecule database is sorted into buckets that we call tranches. Each tranche covers a specific range of physicochemical properties, namely heavy atom count and logP value. By creating a separate database for each tranche, we reduce the number of potential collisions when inserting new data: although we don't know whether a particular piece of new data has a counterpart in the database, we do know which database to check it against.

There is a problem, though: there are a lot of tranches, and they vary greatly in size. How much they vary depends on the chemical data used as input and on how much input data there is. This makes managing resources difficult. Considering that a large portion of these tranches are small, unimportant, or both, does it really make sense to have a separately managed database for each tranche, of which there are thousands? It would be better if each of our database units were roughly equal in size to every other unit, and if the total number of units were closer to a hundred than to a thousand or more, so that manually administering each one is feasible.
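The "we know which database to check" idea can be sketched as a key lookup. This is only an illustration: the logP bin edges, the function names, and the use of in-memory sets as stand-ins for databases are all assumptions, not the real ZINC22 boundaries or schema.

```python
from bisect import bisect_right

# Hypothetical logP bin edges; the actual ZINC22 tranche boundaries may differ.
LOGP_EDGES = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]


def tranche_key(heavy_atoms, logp):
    """Return (heavy atom count, logP bin index) identifying a tranche."""
    return heavy_atoms, bisect_right(LOGP_EDGES, logp)


def database_for(heavy_atoms, logp, databases):
    """Return the one per-tranche store that could already hold this molecule.

    `databases` maps tranche keys to sets, standing in for real databases.
    """
    return databases.setdefault(tranche_key(heavy_atoms, logp), set())


databases = {}
# Only the store for this molecule's tranche has to be checked for duplicates.
database_for(20, 2.5, databases).add("example-molecule")
```

Because the key is computed from the molecule itself, a duplicate check never has to touch more than one tranche's store.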


Thus partitions were created. A partition consists of one or more physicochemically adjacent tranches that together form a square or rectangle on the tranche "grid".
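The rectangle constraint is easy to check mechanically. A minimal sketch, assuming each tranche is addressed by integer grid coordinates (heavy atom count, logP bin index); the function name is hypothetical:

```python
def is_rectangle(tranches):
    """True if the (hac, logp_bin) cells exactly fill their bounding box."""
    cells = set(tranches)
    if not cells:
        return False
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    # All cells lie inside the bounding box, so matching the box area
    # means there are no holes and no missing corners.
    return len(cells) == width * height
```

A single tranche counts as a 1x1 rectangle, which matches "one or more tranches" above.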

Each partition is generated to be roughly equal in size to other partitions of the same importance. Partitions of higher importance will have fewer molecules per partition.

  • To judge whether two partition candidates are equal in size, you need to know how large each tranche will be. This will inevitably change over the course of ZINC22's life, but we want a static configuration, so we settle for a large initial sample. The data used to generate the partitions came from a set of 22 billion molecules.

This required us to assign a drug-discovery "importance" to broad regions of our physicochemical space. Each of the broad regions we painted was assigned a relative importance value and had partitions generated within it.
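The page does not describe the actual generation algorithm, so the following is only a sketch of one way importance could translate into partition size. It assumes a region is a rectangle on the tranche grid, that partitions are cut as vertical strips along the heavy-atom axis, and that the target size is a hypothetical `base_size` divided by the region's importance.

```python
def partition_region(counts, hac_range, bin_range, importance, base_size):
    """Split a rectangular region into column strips of roughly equal size.

    counts: {(hac, logp_bin): molecule count} taken from the sample.
    hac_range, bin_range: inclusive (low, high) bounds of the region.
    importance: relative weight; higher means fewer molecules per partition.
    base_size: molecules per partition at importance 1 (hypothetical knob).
    Returns a list of ((hac_lo, hac_hi), bin_range, molecule_count).
    """
    target = base_size / importance
    partitions, start, running = [], hac_range[0], 0
    for hac in range(hac_range[0], hac_range[1] + 1):
        # Add the whole column of tranches at this heavy atom count.
        running += sum(counts.get((hac, b), 0)
                       for b in range(bin_range[0], bin_range[1] + 1))
        if running >= target:
            partitions.append(((start, hac), bin_range, running))
            start, running = hac + 1, 0
    if start <= hac_range[1]:
        # Leftover columns form a final, possibly undersized partition.
        partitions.append(((start, hac_range[1]), bin_range, running))
    return partitions
```

Every strip produced this way is a rectangle on the grid, and doubling a region's importance halves the target size, so the same region is cut into more, smaller partitions.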


To be clear: database partitions are still managed at the level of tranches (we want insert performance), but a partition creates a logical grouping of tranches that we humans can interface with.
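One way to picture this logical grouping is a lookup table from tranche to partition. A minimal sketch, assuming partitions are stored as inclusive rectangles over (heavy atom count, logP bin index); the function name and data layout are hypothetical, not ZINC22's actual configuration format.

```python
def build_partition_index(partitions):
    """Map each tranche key to the id of the partition that contains it.

    partitions: {partition_id: ((hac_lo, hac_hi), (bin_lo, bin_hi))},
    with all bounds inclusive.
    """
    index = {}
    for pid, ((hac_lo, hac_hi), (bin_lo, bin_hi)) in partitions.items():
        for hac in range(hac_lo, hac_hi + 1):
            for b in range(bin_lo, bin_hi + 1):
                if (hac, b) in index:
                    raise ValueError(f"tranche {(hac, b)} is in two partitions")
                index[(hac, b)] = pid
    return index


index = build_partition_index({
    "p1": ((10, 11), (0, 1)),  # four tranches
    "p2": ((12, 12), (0, 1)),  # two tranches
})
```

Inserts would still go to the individual tranche, while an administrator can ask which partition a tranche belongs to with `index[(11, 1)]`.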

Pictured below is a colorful rendering of our current partitions. The X-axis of the image is heavy atom count and the Y-axis is logP value. You can see some of the regions of importance very clearly: the large regions on the right side of tranche space hold few molecules or are unimportant, so each is assigned just one partition. The more active regions toward the center are the more important ones. It's hard to tell, but there are fifteen different regions in this image, corresponding to a 3x5 cut on tranche space.

Partitions new.PNG