Difference between revisions of "DUDE"

From DISI

Revision as of 20:24, 12 July 2012

This is the wiki page for DUD-E (A Directory of Useful Decoys, Enhanced). DUD-E is on the web at http://dude.docking.org.

This page contains documentation and an FAQ. It may also be used for posting errors and omissions in DUD-E, and for commenting on the database's design and usefulness.

Detailed file documentation

First, you may not need any of these files. They are provided in an attempt to be completely transparent, but the "all" files linked on the DUD-E website contain the files you should need: receptor, crystal ligand, actives (isomeric SMILES, mol2, and SDF), and decoys (same three formats).

  • Folders like P29274, P30543, or P46616: these are Swiss-Prot codes. Each directory contains preparation files specific to that individual code, which is often species-specific.
  • Docking, docking_auto
  • dudgen_clustered, dudgen_ecfp4 - The clustered sets live in dudgen_clustered, while the full raw sets are in dudgen_ecfp4. Inside each, the ligands.charge file contains the mapping from ChEMBL IDs to unique property sets, and the search/decoys.*.picked files contain the actual decoys for each protonation set.

The file formats are as follows:

  • ligands.charge - gives unique protonation states of input ligands.
    Format: one ligand protonation form per line
      SMILES input_id protonation_id mwt logp rb hba hbd charge
  • search/
    • decoys.<protonation_id>.picked - contains matched decoys for each unique ligand protonation state
      Format: ligand protomer and then 50 matched decoys
        first line: ligand SMILES input_id protonation_id
        SMILES ZINC_ID ZINC_Protonation_ID
  • actives_*, including: actives_combined.ism, actives_final.ism, actives_murcko_1.ism, actives_murcko_1_30_nM.ism, actives_murcko_enumeration.ism, actives_nM_chembl.ism, actives_nM_combined.ism, actives_scaffolds.ism, actives_trimmed.txt, actives_final.mol2.gz, actives_final.sdf.gz
  • common_scaffolds.ism
  • crystal_ligand.mol2
  • decoys_*, including: decoys_final.ism, decoys_scaffolds.ism, decoys_tabbed.ism, decoys_to_scaffolds.ism, decoys_final.mol2.gz, decoys_final.sdf.gz
  • inactives_*, including: inactives_combined.ism, inactives_nM_chembl.ism, inactives_nM_combined.ism
  • marginal_*, including: marginal_actives_combined.ism, marginal_actives_nM_chembl.ism, marginal_actives_nM_combined.ism, marginal_inactives_combined.ism, marginal_inactives_nM_chembl.ism, marginal_inactives_nM_combined.ism
  • pdb_analyze.txt, pdb_blessed.txt
  • receptor.pdb
  • scaffold_count.txt
  • subset_decoys.py in the target directory can convert the full dudgen_ecfp4 set into a dudgen_clustered-type set, given a list of molregno IDs.
  • uniprot.txt
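
As a rough illustration of the ligands.charge and decoys.<protonation_id>.picked layouts described above, here is a minimal Python sketch of a reader. It assumes whitespace-separated columns, no header row in ligands.charge, and that the word "ligand" may appear as a literal first token on the first line of a picked file; none of this is confirmed by this page, and the function and class names are hypothetical, so check the details against the actual files.

```python
from typing import List, NamedTuple, Tuple


class LigandProtomer(NamedTuple):
    """One line of ligands.charge (column order as documented on this page)."""
    smiles: str
    input_id: str
    protonation_id: str
    mwt: float
    logp: float
    rb: float
    hba: float
    hbd: float
    charge: float


class Decoy(NamedTuple):
    """One decoy line of a decoys.<protonation_id>.picked file."""
    smiles: str
    zinc_id: str
    zinc_protonation_id: str


def parse_ligands_charge(text: str) -> List[LigandProtomer]:
    """Parse ligands.charge content; assumes 9 whitespace-separated columns."""
    protomers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9:
            continue  # skip blank or unexpected lines
        # Assumption: all six property columns are numeric; kept as floats.
        props = [float(value) for value in fields[3:]]
        protomers.append(LigandProtomer(fields[0], fields[1], fields[2], *props))
    return protomers


def parse_picked(text: str) -> Tuple[Tuple[str, ...], List[Decoy]]:
    """Parse a .picked file: first line is the ligand protomer, the rest are decoys."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("empty .picked content")
    first = rows[0]
    # Assumption: the first line may begin with the literal token "ligand".
    if first[0] == "ligand":
        first = first[1:]
    ligand = tuple(first[:3])  # (SMILES, input_id, protonation_id)
    decoys = [Decoy(*row[:3]) for row in rows[1:] if len(row) >= 3]
    return ligand, decoys
```

For example, `parse_picked(open(path).read())` would return the ligand protomer and its list of decoys, where `path` points at one file under search/.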


FAQ

Q1. How are duplicates removed?

In the paper, you said "We then remove duplicate decoys from the ligand set by sorting decoys from least to most duplicated and assigned each decoy to the protonated ligand which has the least number of decoys already assigned." Which molecules did you consider to be duplicates, and how is the similarity defined and calculated? Please explain!

A1.

Decoys are uniquely identified by their ZINC protonation ids. We ensure that a particular protonation id (prot_id) is only assigned to one ligand of a given target.

  • 0) Start after filtering for the 25% most dissimilar decoys
  • 1) map each prot_id to the number of different ligands it could be assigned to
  • 2) sort from the prot_ids that hit the fewest ligands to those that hit the most
  • 3) loop over that sorted list, starting with the most constrained decoys
  • 4) assign each prot_id to the ligand with the fewest other prot_ids so far

The effect is to spread the decoys as evenly as possible among the ligands they could belong to.
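
The steps above can be sketched in Python as follows. This is an illustration only, not the actual DUD-E code: the function name, the input layout (a dict from ligand ID to its candidate decoy prot_ids, taken after the dissimilarity filter in step 0), and the tie-breaking by ID are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def assign_decoys(candidates: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Assign each decoy prot_id to exactly one of the ligands it matched."""
    # Step 1: map each prot_id to the set of ligands it could be assigned to.
    ligands_for = defaultdict(set)
    for ligand, prot_ids in candidates.items():
        for prot_id in prot_ids:
            ligands_for[prot_id].add(ligand)

    # Step 2: sort prot_ids from those hitting the fewest ligands to the most.
    # Ties are broken by prot_id for determinism (an assumption of this sketch).
    order = sorted(ligands_for, key=lambda p: (len(ligands_for[p]), p))

    assigned: Dict[str, List[str]] = {ligand: [] for ligand in candidates}

    # Steps 3-4: walk the sorted list, most constrained decoys first, and give
    # each prot_id to the candidate ligand holding the fewest decoys so far.
    for prot_id in order:
        target = min(ligands_for[prot_id], key=lambda lig: (len(assigned[lig]), lig))
        assigned[target].append(prot_id)
    return assigned
```

With the made-up input `{"ligA": ["d1", "d2", "d3"], "ligB": ["d2", "d3"], "ligC": ["d3"]}`, d1 can only go to ligA, d2 then goes to the emptier ligB, and d3 goes to the still-empty ligC, so each ligand ends up with one decoy.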

END OF FAQ