On the following pages, the installation and usage of VSFlow are described in detail.
VSFlow is an open-source command-line tool written in Python for the ligand-based virtual screening of large compound libraries. It includes a substructure-based, a fingerprint-based and a shape-based virtual screening tool. Additionally, it provides a tool to standardize compound libraries and generate conformers.
First of all, you need a working installation of Anaconda (https://www.anaconda.com/products/individual) or Miniconda (https://conda.io/en/latest/miniconda.html). Both are available for all major platforms.
Second, you need to clone the VSFlow GitHub repository to your system or download the zip file and unpack it (in the following called the repository folder).
All following instructions assume working with a bash shell!
Navigate into the repository folder.
Now, you can install the required dependencies with the provided environment.yml file within the repository folder as follows:
conda env create --quiet --force --file environment.yml
conda activate vsflow
Alternatively, you can also create a new conda environment and install the dependencies manually:
conda create -n vsflow python=3.7
conda activate vsflow
conda install -c rdkit -c conda-forge -c viascience -c schrodinger rdkit xlrd xlsxwriter pdfrw fpdf pymol molvs matplotlib
The Python dependencies are:
Now, you can install VSFlow as follows:
pip install . --use-feature=in-tree-build
Always make sure the conda environment is activated. Now you can run VSFlow as follows:
vsflow {mode} {arguments}
For example, the following command will display all included modes (substructure, fpsim, shape, preparedb, managedb) and the general usage:
vsflow -h
To display all possible arguments for a particular mode, type as follows:
vsflow {mode} -h
For example, with the following command all arguments for mode substructure are shown:
vsflow substructure -h
VSFlow contains a tool to prepare compound libraries for virtual screening. It allows for standardization of the molecules, generation of fingerprints and generation of multiple conformers. As input files, sdf/sdf.gz, csv and Excel files (containing molecules as SMILES or InChI) and several text file formats (.smi, .sma, .ich, .tsv, .txt containing SMILES/InChI) are supported. The output file is a "virtual screening database" (.vsdb) file. The vsdb file is a pickle file containing all molecules in a special dictionary format ready to use with VSFlow.
Always make sure the conda environment is activated.
vsflow preparedb {arguments}
The following command will display the help with all available arguments:
vsflow preparedb -h
Required:
OR
Optional:
Navigate into the examples folder.
For the following examples, you can download all ligands from the PDB database (with ideal geometries) or the Chembl database directly within VSFlow as follows (you need a working internet connection), e.g.:
vsflow preparedb -d pdb -o pdb_ligs
With the above command, the file containing the PDB ligands is automatically downloaded and written to the output file named pdb_ligs.vsdb in the examples folder. You can perform all operations described in the following using this file as input.
However, to quickly demonstrate the usage, you can specify the SD file containing all approved FDA drugs (approx. 1600 compounds, downloaded from public Zinc database: https://zinc.docking.org/substances/subsets/fda/) in the examples folder as input:
vsflow preparedb -i fda.sdf -o fda_drugs
In the above example, the input is simply converted to a vsdb file, without performing any standardization or conformer/fingerprint generation. It is generally recommended to convert frequently used compound libraries to vsdb files because loading is typically much faster. If you do not provide an output file name, the output is written to prep_database.vsdb by default. It is not necessary to provide the file extension for the output file; the database is always saved as a pickle file with the file extension .vsdb. However, it is essential to provide the file extension for the input file since VSFlow recognizes the file format from its file extension. Other supported input file formats are text files (.smi, .sma, .ich, .txt/.txt.gz, .csv/.csv.gz, .tsv/.tsv.gz), gzipped SD files (.sdf.gz) and Excel files (.xlsx).
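Because the input format is inferred from the file extension, a gzipped file such as library.sdf.gz has to be matched on its full double suffix. The following is only a minimal sketch of how such a lookup could work. The function name detect_input_format and the returned format labels are hypothetical and are not taken from the VSFlow source code.

```python
from pathlib import Path

# Extensions named in the text above; the format labels are illustrative assumptions.
SUPPORTED = {
    ".sdf": "sdf", ".sdf.gz": "sdf",
    ".smi": "text", ".sma": "text", ".ich": "text",
    ".txt": "text", ".txt.gz": "text",
    ".csv": "text", ".csv.gz": "text",
    ".tsv": "text", ".tsv.gz": "text",
    ".xlsx": "excel",
}


def detect_input_format(filename):
    """Return (format_label, is_gzipped) based on the file extension only."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    if not suffixes:
        raise ValueError(f"{filename!r} has no file extension")
    gzipped = suffixes[-1] == ".gz"
    # Gzipped files are matched on their last two suffixes (e.g. ".sdf.gz"),
    # all other files on the last suffix only.
    key = "".join(suffixes[-2:]) if gzipped else suffixes[-1]
    if key not in SUPPORTED:
        raise ValueError(f"unsupported file extension: {key}")
    return SUPPORTED[key], gzipped
```

A file without any extension raises an error in this sketch, which mirrors the requirement that the input extension must always be given.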
By specifying the -s/--standardize flag, all compounds in the database are additionally standardized:
vsflow preparedb -i fda.sdf -o fda_std -s
Standardization includes: standardization according to molvs [1] rules, disconnecting metals and salts and removing charges. Additionally, it is possible to generate the canonical tautomer (-can/--canonicalize argument) and store it in addition to the standardized molecule in the database file:
vsflow preparedb -i fda.sdf -o fda_std -s -can
Standardization and canonicalization are generally recommended and are required to use some screening capabilities of VSFlow properly.
Since tautomer generation can be time-consuming for larger compound databases, parallelization is possible to speed things up via the -np/--nproc argument, e.g. by running on 6 cores/threads (probably available on most modern machines):
vsflow preparedb -i fda.sdf -o fda_std -s -can -np 6
Parallelization is done via Python's built-in multiprocessing module.
By specifying the -f/--fingerprint argument, the selected fingerprints are generated for all molecules and stored within the output database file:
vsflow preparedb -i fda.sdf -o fda_std_fps -s -can -f ecfp -np 6
The following fingerprints are supported:
Via the -r/--radius argument, the radius (default: 2) for the circular fingerprints ecfp and fcfp can be changed. The -nb/--nbits argument changes the bit size of the fingerprint (default: 2048), and if the --no_chiral flag is specified, chirality is ignored, e.g.:
vsflow preparedb -i fda.sdf -o fda_fps -f ecfp -r 3 -nb 4096 --no_chiral
By specifying the -c/--conformers flag, 3D conformers for all database compounds are generated:
vsflow preparedb -i fda.sdf -o fda_confs -c
You can optionally use the --seed argument to specify a seed for the random number generator for reproducibility purposes:
vsflow preparedb -i fda.sdf -o fda_std_confs -c --seed 42
Caveat: 3D information of molecules read from an input SD file is overwritten when generating conformers!
By default, 20 conformers per molecule are generated. This may be changed using the --nconfs argument. By specifying the --rms_thresh argument, only those conformers out of --nconfs which have an RMSD deviation (calculated on heavy atoms) greater than the specified value are retained, e.g.:
vsflow preparedb -i fda.sdf -o fda_std_confs -c --nconfs 10 --rms_thresh 0.3
With the above statement, 10 conformers per compound are generated and only those with an RMSD deviation greater than 0.3 are retained. The --rms_thresh argument is useful to keep only those conformers with significant differences.
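The effect of an RMSD threshold can be illustrated with a small, pure-Python sketch. It assumes that the conformers are already aligned and given as equally ordered lists of heavy-atom (x, y, z) coordinates. The function names are hypothetical, and the greedy strategy shown here is an assumption for illustration, not necessarily how VSFlow or RDKit prune conformers internally.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("conformers must have the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def prune_conformers(conformers, rms_thresh):
    """Greedily keep a conformer only if its RMSD to every kept one exceeds rms_thresh."""
    kept = []
    for conf in conformers:
        if all(rmsd(conf, other) > rms_thresh for other in kept):
            kept.append(conf)
    return kept
```

With a threshold of 0.3, a conformer that deviates by only 0.1 from an already retained conformer would be discarded, whereas one that deviates by 1.0 would be kept.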
Since conformer generation can be time-consuming, it is reasonable to parallelize tasks using the -np/--nproc argument. Here you can specify the number of cores/threads of your system to be used for parallelization. VSFlow makes sure that no more threads are used than are available. With the following command, molecules are standardized, 30 conformers per molecule are generated and fingerprints are calculated using 6 cores/threads:
vsflow preparedb -i fda.sdf -o fda_s_confs_fps -s -c -f ecfp --nconfs 30 -np 6
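Limiting the requested number of processes to what the machine offers can be sketched in a few lines. The function name effective_nproc is hypothetical, and silently clamping the value (instead of, for example, aborting with an error) is an assumption about the behaviour, not a statement about the VSFlow code.

```python
import os


def effective_nproc(requested):
    """Clamp the requested number of worker processes to the range [1, available]."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, available))
```

On a machine with 4 logical cores, effective_nproc(6) would therefore give 4 under this assumption.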
To further speed up calculations, the conformer generation itself can be further distributed to all available threads of the system via the C++ code of RDKit with the --boost flag:
vsflow preparedb -i fda.sdf -o fda_std_confs -s -c -f ecfp -np 6 --boost
This will distribute the calculations to 6 threads but will additionally use all other available threads to generate the conformers. You may see how the run time differs for the above examples on your machine.
Caveat: Make sure you do not run other important stuff when parallelizing to all available threads since this may slow down your machine!
Instead of specifying the -o/--output argument, it is also possible to "integrate" the database into VSFlow using the -int/--integrate and -intg/--int_global arguments. Do not provide a file extension, just specify a shortcut name:
vsflow preparedb -i fda.sdf -int fda -s -c
With the above command, the prepared database is saved as file named fda.vsdb to the folder "VSFlow_Databases" in the user's HOME directory. VSFlow can now access the database from throughout the system, the user only needs to pass the shortcut name to the -d/--database argument, e.g. for a substructure search (see page Substructure Search for more information):
vsflow substructure -smi C1=CN=CC=C1 -d fda -o pyr_subs.sdf
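How a shortcut name might be told apart from a regular file path can be sketched as follows. This is an illustration under stated assumptions only: the function name resolve_database is hypothetical, and the rule "a name without a file extension is a shortcut" is a guess based on the instruction above not to provide a file extension for integrated databases.

```python
from pathlib import Path


def resolve_database(name_or_path, db_dir=None):
    """Return the path of a database given a file path or an integrated shortcut name."""
    candidate = Path(name_or_path)
    if candidate.suffix:
        # Looks like a file name, e.g. "pdb_ligs.vsdb" or "fda.sdf.gz".
        return candidate
    # No extension: treat the value as a shortcut for an integrated database.
    if db_dir is None:
        db_dir = Path.home() / "VSFlow_Databases"
    return Path(db_dir) / f"{name_or_path}.vsdb"
```

The optional db_dir parameter stands in for a database folder that has been changed from its default location.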
When the -intg/--int_global argument is specified instead, the prepared database is also saved to the folder $HOME/VSFlow_Databases by default:
vsflow preparedb -i fda.sdf -intg fda -s
Both paths can be changed in the mode "managedb" (see Page "Manage Databases" for more information).
It may be useful to change the global database path (-intg/--int_global) if VSFlow is run on a server with multiple users and some databases should be accessible for all users.
VSFlow can perform a substructure search in compound libraries (databases). The database file must have the format .sdf, .sdf.gz or .vsdb (see Page "Prepare Databases" for how to generate a .vsdb file). The query molecules/patterns can be provided in multiple formats (.sdf, .xlsx, .csv, .tsv, .smi, .sma, .ich) or can be directly specified as SMILES or SMARTS on the command-line. The implementation of the substructure search is based on the "GetSubstructMatches" functionality available for RDKit Mol objects.
vsflow substructure {arguments}
To display the help with all available arguments:
vsflow substructure -h
Required:
-sma , --smarts
specify SMARTS string on command line in double quotes
-d , --database
specify path/name of the database file [sdf, sdf.gz or vsdb] or specify the shortcut for an integrated database (not required if a default database is set, see Page "Manage Databases" for more information)
Optional:
Navigate into the examples folder.
In the following examples, two database files are used: the FDA-approved drugs downloaded from the Zinc database (https://zinc.docking.org/substances/subsets/fda/) and prepared on the Page 'Prepare Databases' (fda_std.vsdb), and the file generated by downloading the ligands from the PDB database (pdb_ligs.vsdb, see Page "Prepare Databases" for more information). The sample query files used contain three fragment-like compounds.
To perform a basic substructure search, specify an input file containing the query molecules and a database file:
vsflow substructure -i {.sdf, .xlsx, .csv, .smi, .sma, .ich, .txt} -d {.sdf, .sdf.gz, .vsdb}
To try it, use the following example:
vsflow substructure -i fragments.sdf -d fda.sdf
If no name for the output file (-o, --output) is specified, the output containing the matching database compounds is written to an SD file named substructure.sdf by default.
In addition to an SD file, the substructure results can also be written to a csv or xlsx file. Just specify the respective file extension, e.g.:
vsflow substructure -i fragments.sdf -d pdb_ligs.vsdb -o fragment_subs_pdb.xlsx
It is also possible to specify the query directly as SMILES (-smi/--smiles) or SMARTS (-sma/--smarts) on the command-line:
vsflow substructure -smi "FC1=CC=CC=C1" -d pdb_ligs.vsdb -o smiles_subs_pdb.csv
vsflow substructure -sma "[C,N]1CCCCC1" -d pdb_ligs.vsdb -o smarts_subs_pdb.sdf
Please make sure to put the SMILES or SMARTS in double quotes!
It is also possible to specify multiple SMILES or SMARTS at the same time, e.g.:
vsflow substructure -smi "FC1=CC=CC=C1" -smi "C1=CN=CC=C1" -d pdb_ligs_std.vsdb -o smiles_subs_pdb.csv
With the above command, the two SMILES are treated as separate queries. By specifying the --combine flag, the two SMILES are treated as one query. Only molecules in the database containing both queries as substructures are now written to the output file:
vsflow substructure -smi "FC1=CC=CC=C1" -smi "C1=CN=CC=C1" -d pdb_ligs.vsdb -o 2subs_pdb.csv --combine
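The difference between separate and combined queries comes down to a set intersection over the hits of the individual queries. The sketch below assumes that the substructure matching has already been done and that its result is available as a dictionary mapping each query to the set of matching database compound identifiers. This data layout and the function names are hypothetical.

```python
def separate_hits(matches):
    """Without --combine: every query keeps its own (sorted) hit list."""
    return {query: sorted(ids) for query, ids in matches.items()}


def combined_hits(matches):
    """With --combine: keep only compounds that were matched by every query."""
    hit_sets = [set(ids) for ids in matches.values()]
    if not hit_sets:
        return []
    return sorted(set.intersection(*hit_sets))
```

For two queries with the hits {A, B, C} and {B, C, D}, the combined result would contain only B and C.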
You can also specify text files containing one SMILES (.smi file) or SMARTS pattern (.sma file) per line. Have a closer look at the files in the examples folder, e.g.:
vsflow substructure -i sample.sma -d pdb_ligs.vsdb -o sample_sub_pdb.csv
If the query molecules are provided as an xlsx or csv file, VSFlow tries to automatically detect the molecule column (containing SMILES or InChI) as well as the separator (for csv files). If the detection fails, an error message is displayed asking you to specify the parameters manually. You can do so via the --mol_column and --delimiter arguments:
vsflow substructure -i fragments.csv -d pdb_ligs.vsdb --mol_column smiles --delimiter ","
When providing multiple query molecules at the same time, it may be useful to generate a dedicated output file for every query molecule with the -mf/--multfile flag:
vsflow substructure -i fragments.sdf -d pdb_ligs.vsdb -o fragment_subs_pdb.xlsx -mf
In the above case, three excel files (_1 - _3) containing the results for the respective query molecule are generated.
VSFlow also offers the possibility to directly filter the screening results by some physicochemical properties with the --filter argument, e.g.:
vsflow substructure -i fragments.csv -d pdb_ligs.vsdb -o frags_subs_filt.sdf --filter mw_300
With the above command, only those substructure results with a maximum molecular weight of 300 Da are written to the output file. In the following table, the supported properties and their shortcuts are summarized:
property | shortcut | example |
---|---|---|
Molecular Weight in Da | mw | mw_300 |
cLogP value | logp | logp_1.5 |
number of H donors | nhdo | nhdo_3 |
number of H acceptors | nhac | nhac_5 |
number of rotatable bonds | nrob | nrob_5 |
number of aromatic rings | naro | naro_3 |
number of heteroaromatic rings | nhet | nhet_3 |
topological polar surface area in Å² | tpsa | tpsa_80 |
Make sure to separate the shortcut for the property and the value with an underscore (see examples in the table). Please note that the specified values are always considered as upper limit, e.g. nhac_5 means that only compounds with a maximum of 5 hydrogen acceptors are considered. It is also possible to combine multiple properties, e.g.:
vsflow substructure -i fragments.sdf -d fda_std.vsdb -o frags_fda_filt.sdf --filter mw_400 --filter logp_3
With the above statement, only those database compounds containing the respective substructure and possessing a maximum molecular weight of 400 Da and a maximum clogP of 3 are written to the output file.
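The filter syntax (shortcut, underscore, upper limit) can be illustrated with a short sketch. The shortcuts are taken from the table above, whereas the function names and the representation of a compound as a dictionary of property values are hypothetical.

```python
# Property shortcuts as listed in the table above.
SHORTCUTS = {"mw", "logp", "nhdo", "nhac", "nrob", "naro", "nhet", "tpsa"}


def parse_filters(filter_args):
    """Turn strings such as 'mw_400' into a {shortcut: upper_limit} dictionary."""
    limits = {}
    for arg in filter_args:
        shortcut, sep, value = arg.partition("_")
        if not sep or shortcut not in SHORTCUTS:
            raise ValueError(f"invalid filter: {arg!r}")
        limits[shortcut] = float(value)
    return limits


def passes_filters(properties, limits):
    """A compound passes if every filtered property is at or below its upper limit."""
    return all(properties[name] <= limit for name, limit in limits.items())
```

Several filters are combined with a logical AND in this sketch, so a compound has to satisfy every specified limit to be kept.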
It is also possible to return only full matches, i.e. only database compounds that are identical to a query, if any (strictly speaking this is no longer a substructure search, but it can be useful sometimes), by specifying the -fm/--fullmatch flag:
vsflow substructure -i fragments.sdf -d pdb_ligs.vsdb -o frags_pdb_full.sdf -fm
In addition to the sdf/xlsx/csv output files, VSFlow can also generate a PDF file for the substructure results with 2D depiction of the molecule structure, highlighting of the substructure and annotation of molecular properties. Simply specify the --pdf flag:
vsflow substructure -i fragments.sdf -d fda_std.vsdb -o frags_subs_fda.sdf --pdf
If the -p/--properties flag is specified, all properties listed in the above table are additionally written to all output files:
vsflow substructure -i fragments.sdf -d fda_std.vsdb -o frags_subs_fda.sdf --pdf -p
For larger databases and numerous query molecules, it might be useful to parallelize the substructure search with the -np/--nproc argument on multiple cores/threads. VSFlow makes sure you do not use more threads than available:
vsflow substructure -i fragments.csv -d pdb_ligs.vsdb -o frags_subs_2.sdf -np 12
The above command performs the substructure search using 12 cores/threads of your system. Parallelization is done with Python's built-in multiprocessing capabilities.
The substructure search can be performed using different modes (std, all_tauts, can_taut, no_std) which can be specified with the -m/--mode argument. By default, the mode "std" is selected. To properly use the different modes, the database should be standardized with the "preparedb" functionality of VSFlow (see Page 'Prepare Databases'). Here is an overview of what each mode does:
The different modes are useful to overcome issues arising from different molecule representations between query and database molecules. For example, the mode "can_taut" using the canonical tautomer may be useful to properly identify identical compounds despite being represented in different tautomeric states. To do so, run the substructure search in mode "can_taut" and return only full matches (see also above):
vsflow substructure -i sample.csv -d pdb_ligs.vsdb -o sample_full.sdf -fm -m can_taut
When using the mode "all_tauts", all possible tautomers (up to 100 by default, may be changed with -nt/--ntauts argument) of the query molecules are used for the substructure search. This may also help to identify matching compounds, particularly if the database is not standardized.
VSFlow includes a tool to search for similar compounds in large databases via fingerprint-based molecular similarity. It contains all fingerprints currently implemented in the RDKit (Morgan, RDKit, Topological Torsion and Atom Pairs fingerprint and MACCS keys) and different similarity measures (Tanimoto, Tversky, Cosine, Dice, Sokal, Russel, Kulczynski and McConnaughey similarity). It is possible to search for the n-nearest similar compounds as well as to provide a cutoff value for the similarity.
vsflow fpsim {arguments}
To display the help with all available arguments:
vsflow fpsim -h
Required:
-smi , --smiles
specify SMILES string on command line in double quotes
-d , --database
specify path/name of the database file [sdf, sdf.gz or vsdb] or specify the shortcut for an integrated database
Optional:
Navigate into the examples folder.
To perform a basic similarity search, simply specify an input file containing the query molecules (-i/--input) and a database file (-d/--database):
vsflow fpsim -i {.sdf, .xlsx, .csv, .smi, .sma, .ich} -d {.sdf, .sdf.gz, .vsdb}
You can try it with some example files provided in the repository folder (see Pages "Prepare Databases" and "Substructure Search" for more details):
vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb
If no output file is specified, the results are written to an SD file named fingerprint.sdf. By default, a 2048 bit feature Morgan Fingerprint with radius 2 (FCFP4-like fingerprint) is used and the similarity is calculated with the Tanimoto metric. The 10 most similar compounds per query molecule are written to the output file by default. All of this may be changed with the appropriate arguments, e.g.:
vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -f ecfp -r 3 -nb 4096 -s dice -t 20 -o sample_sim.xlsx
With the above statement, a non-feature Morgan Fingerprint (-f/--fingerprint) with 4096 bits (-nb/--nbits) and radius 3 (-r/--radius), a so-called ECFP6-like fingerprint, is used. Similarities are calculated with the Dice metric (-s/--similarity) and the 20 most similar compounds (-t/--top_hits) are written to the output Excel file named sample_sim.xlsx (-o/--output). In the following, the implemented fingerprints and similarity metrics and their shortcuts are listed:
It is also possible to provide a cutoff value (-c/--cutoff) for the respective similarity metric. The cutoff value must be between 0 and 1. Only compounds with a similarity higher than the cutoff value are now returned and written to the output file, e.g.:
vsflow fpsim -i sample.sdf -d fda_std.vsdb -o sample_fda_sim.sdf -s dice -c 0.6
In the above case, only database molecules with a Dice similarity greater than 0.6 are written to the output file, if any (there should be none; set the cutoff to 0.5 to get some results).
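The interplay of similarity metric, top hits and cutoff can be sketched in pure Python by representing each fingerprint as the set of its on-bits. The Tanimoto and Dice formulas are the standard definitions. The function names, the dictionary layout of the database and the rule that a given cutoff replaces the top-n limit are assumptions made for this illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bits: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a, b):
    """Dice similarity of two sets of on-bits: 2 |A & B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def search(query, database, metric=tanimoto, top_hits=10, cutoff=None):
    """Rank database entries by similarity, then apply a cutoff or a top-n limit."""
    scored = sorted(
        ((metric(query, fp), name) for name, fp in database.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if cutoff is not None:
        # Assumption: only entries strictly above the cutoff are returned.
        return [(name, score) for score, name in scored if score > cutoff]
    return [(name, score) for score, name in scored[:top_hits]]
```

Note that for the same pair of fingerprints the Dice value is never lower than the Tanimoto value, so a cutoff chosen for one metric does not carry over directly to the other.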
When using the non-symmetrical Tversky similarity, the Tversky alpha parameter (--tver_alpha, weighs the query molecule, default is 0.5) and the Tversky beta parameter (--tver_beta, weighs the database molecule, default is 0.5) can additionally be adjusted:
vsflow fpsim -i sample.sdf -d fda_std.vsdb -o sample_fda_tver.sdf -s tver --tver_alpha 0.7 --tver_beta 0.3
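The role of the two Tversky parameters can be seen in a small sketch that again represents fingerprints as sets of on-bits. The formula is the standard Tversky index. The function name and the argument names are chosen for this illustration only.

```python
def tversky(query, target, alpha=0.5, beta=0.5):
    """Tversky index: |Q & T| / (|Q & T| + alpha * |Q - T| + beta * |T - Q|)."""
    common = len(query & target)
    only_query = len(query - target)    # bits set only in the query, weighted by alpha
    only_target = len(target - query)   # bits set only in the database molecule, weighted by beta
    denominator = common + alpha * only_query + beta * only_target
    return common / denominator if denominator else 0.0
```

With alpha = beta = 0.5 the index equals the Dice similarity, and with alpha = beta = 1 it equals the Tanimoto similarity. With unequal weights the result depends on which molecule is the query, which is why the metric is called non-symmetrical.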
When providing numerous queries at the same time, it might be useful to generate separate output files for each query molecule with the -mf/--multfile flag:
vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -o sample_sim_pdb.xlsx -mf
Instead of SD files, the query molecules may be also read from csv or excel files (containing the molecule as SMILES or InChI representations). VSFlow will try to recognize the separator (for csv files) as well as the column containing the SMILES/InChI automatically. However, if things go wrong, it is possible to specify all necessary parameters separately:
vsflow fpsim -i sample.csv -d pdb_ligs.vsdb --delimiter ";" --mol_column smiles
SMILES (-smi/--smiles) can also be specified directly on the command-line, e.g.:
vsflow fpsim -smi "NCCC1=C(CCC2CCCC2)C=CN=C1" -d pdb_ligs.vsdb -o smiles_sim_pdb.xlsx
Please make sure to put the SMILES in double quotes!
It is also possible to specify multiple SMILES at the same time, e.g.:
vsflow fpsim -smi "NCCC1=C(CCC2CCCC2)C=CN=C1" -smi "OC1=C(CC2CCCCC2)C=NC=C1" -d fda.sdf -o fda_sim.sdf
VSFlow also offers the possibility to directly filter the screening results by some common physicochemical properties using the --filter argument. The following table summarizes the supported properties and their shortcuts together with some examples:
property | shortcut | example |
---|---|---|
Molecular Weight in Da | mw | mw_300 |
cLogP value | logp | logp_1.5 |
number of H donors | nhdo | nhdo_3 |
number of H acceptors | nhac | nhac_5 |
number of rotatable bonds | nrob | nrob_5 |
number of aromatic rings | naro | naro_3 |
number of heteroaromatic rings | nhet | nhet_3 |
topological polar surface area in Å² | tpsa | tpsa_80 |
Please make sure to always separate the shortcut and the value with an underscore, as shown in the examples in the table. Please note that the specified values are always considered as upper limit, e.g. nhac_5 means that only compounds with a maximum of 5 hydrogen acceptors are considered. In the following example, only those molecules with a Dice similarity higher than 0.5 and with a maximum cLogP of 2 are written to the output file, if any:
vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -s dice -c 0.5 --filter logp_2
You can quickly check if the filter worked by writing all calculated properties from the table to the output files (-p/--properties flag, see table above), in this case an excel file for convenience:
vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -o sample_sim_filt.xlsx -s dice -c 0.5 -p --filter logp_2
It is also possible to specify multiple properties at the same time, e.g.:
vsflow fpsim -i sample.sdf -d fda.sdf -o sample_sim_filt.xlsx --filter logp_2 --filter mw_400
By default, the chirality of molecules is considered when generating fcfp, ecfp, atom pairs and topological torsion fingerprints, e.g. two enantiomers will have different fingerprints. This can be changed with the --no_chiral flag:
vsflow fpsim -i sample.sdf -d fda.sdf --no_chiral
VSFlow offers the possibility to visualize the screening results as PDF file (--pdf flag) with 2D depictions of the molecules. In the following example, a PDF file containing the screening results along with some annotations, including the similarity value, the used fingerprint and the used similarity metric is generated in addition to the SD file:
vsflow fpsim -i sample.sdf -d fda.sdf -o sample_sim.sdf --pdf
If the --simmap flag is additionally specified, RDKit similarity maps are generated along with the 2D depiction of the molecule in the PDF file:
vsflow fpsim -i sample.sdf -d fda.sdf -o sample_sim_maps.sdf --pdf --simmap
Note: similarity maps cannot be generated for MACCS keys or when using the Tversky similarity metric. In these cases, only a 2D depiction of the molecule is written to the PDF file.
With the -p/--properties flag, all properties listed in the above table are additionally annotated for every result molecule in all output files:
vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -o sample_sim_maps.sdf -p --pdf --simmap
When screening numerous query molecules in large compound databases, it might be useful to parallelize the similarity search on multiple cores/threads with the -np/--nproc argument, e.g.:
vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -np 12
Python's built-in multiprocessing module is used for parallelization.
For large, frequently used compound databases it might be useful to calculate the respective fingerprint for each compound only once and store it along with the molecule in the database. This will speed up subsequent similarity searches since only the fingerprints for the query molecules have to be generated during the search. You can use the mode preparedb of VSFlow to pre-calculate fingerprints and store them within the virtual screening database (.vsdb) file (see Page "Prepare Databases" for more information). As an example, you can directly download (you need a working internet connection) all ligands from the PDB database and calculate the respective fingerprints as follows:
vsflow preparedb -d pdb -o pdb_fps -f ecfp -r 3 -nb 4096
The above command will download the PDB ligands, generate a 4096 bit ECFP6-like Morgan fingerprint for all ligands and store the fingerprint along with the molecule in a virtual screening database file named pdb_fps.vsdb in the examples folder.
Of course you can also use a file as input, e.g. the FDA drugs in the examples folder:
vsflow preparedb -i fda.sdf -o fda_fps -f ecfp -r 3 -nb 4096
Now, you can use the databases with pre-calculated fingerprints for the similarity search by passing "from_db" to the -f/--fingerprint argument, for example as follows:
vsflow fpsim -i sample.sdf -o sample_sim.xlsx -d pdb_fps.vsdb -f from_db
VSFlow now reads the fingerprint parameters from the database file and calculates the fingerprints for the query molecules accordingly, then performs the similarity calculations.
The similarity search can be performed using different modes (std, all_tauts, can_taut, no_std) which can be specified with the -m/--mode argument. By default, the mode "std" is selected. To properly use the different modes, the database should be standardized with the "preparedb" functionality of VSFlow (see Page 'Prepare Databases'). Here is an overview of what each mode does:
The different modes are useful to overcome issues arising from different molecule representations between query and database molecules. For example, the mode "can_taut" using the canonical tautomer may be useful to properly identify identical compounds despite being represented in different tautomeric states:
vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -o sample_canon.sdf -m can_taut
When using the mode "all_tauts", all possible tautomers (up to 100 by default, may be changed with -nt/--ntauts argument) of the query molecules are used for the similarity search. This may help to identify similar compounds particularly if the database is not standardized.
VSFlow contains a tool to screen compound databases based on shape similarity between query molecules and database molecules. For that, the database must contain molecules with 3D information, typically several conformers per molecule. The query molecules can be either 2D or 3D. If the queries are 2D, VSFlow generates one (or more, if desired) energy-minimized 3D conformer(s) for each query molecule. Conformers for the database molecules can be generated with the "preparedb" functionality of VSFlow, if necessary (see Page "Prepare Databases"). The shape screening is done by combining several functionalities implemented in the RDKit. The single steps are outlined in the following:
In Step 1, 3D conformers for the 2D query molecules are generated and optimized with the MMFF94 force field. A user-specified number of low-energy conformers (by default only the lowest-energy conformer) is passed on to the next steps. In Step 2, all conformers of each query molecule are aligned to all conformers of each database molecule (using either MMFF94 force field parameters or Crippen atomic logP contributions). In Step 3, the shape similarity is calculated for every conformer pair (TanimotoDist, TverskyShape or ProtrudeDist) and the most similar conformer pair for every query/database molecule pair is selected. In Step 4, a 3D pharmacophore fingerprint is generated for the most similar conformer pair selected in Step 3, and in Step 5 the fingerprint similarity is calculated. In Step 6, a "Combo" score is calculated as the average of the shape similarity and the pharmacophore similarity. This combo score is used to rank the database molecules in Step 7.
The main usage of the tool may be the screening of query molecules in a single, biologically active conformation – for example ligands from the PDB database – against a database containing compounds in multiple conformations.
vsflow shape {arguments}
To display the help with all available arguments:
vsflow shape -h
Required:
Optional:
Navigate into the examples folder.
To perform a basic shape screening, simply specify an input file for the query molecules and a database file. The database file must contain 3D information for the molecules, the query file can also be 2D:
vsflow shape -i {.sdf, .xlsx, .csv, .smi, .txt, .ich} -d {.sdf, .sdf.gz, .vsdb}
The example file containing all approved drugs downloaded from ZINC database in the examples folder (fda.sdf) contains 3D information for most of the molecules (but not for all). It may be used as the database file, VSFlow will skip the 2D structures in the file. As query, the ligand from PDB entry 2BML (XED.sdf in the examples folder) in its bioactive conformation could be used:
vsflow shape -i XED.sdf -d fda.sdf -o XED_shapesim
It is sufficient to specify a prefix for the output file. The results are always written to an SD file and in case the query file contains multiple molecules, a separate output file for the results of each query molecule is always generated. In the above case, one file named XED_shapesim_1.sdf is generated containing the 10 most similar database compounds with the 3D coordinates of the preferred conformer. Additionally, an SD file containing the 3D coordinates of the query molecule is always generated (named XED_shapesim_1_query.sdf).
In the above example, only one conformer per molecule is present in the database file. In a typical scenario, the database may contain multiple conformers of the same molecule. VSFlow will interpret conformers in a query SD file as part of a single multi-conformer molecule as long as they are contiguous in the file and have the same canonical SMILES. You can create multiple conformers for a single molecule with the preparedb mode of VSFlow (see Page "Prepare Databases" for more information). The file fda_confs.vsdb in the examples folder, containing 20 conformers per molecule (32300 in total), was generated as follows:
vsflow preparedb -i fda.sdf -o fda_confs -c -np 12
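The rule that only contiguous entries with the same canonical SMILES are merged into one multi-conformer molecule can be illustrated with itertools.groupby, which groups consecutive items only. The sketch assumes that each SD file entry has already been reduced to a (canonical_smiles, coordinates) tuple. This representation and the function name are hypothetical.

```python
from itertools import groupby


def group_conformers(records):
    """Merge consecutive records sharing a canonical SMILES into one multi-conformer entry."""
    return [
        (smiles, [coords for _, coords in group])
        for smiles, group in groupby(records, key=lambda record: record[0])
    ]
```

If the same molecule appears again later in the file, separated by other molecules, it starts a new entry in this sketch instead of being merged with the earlier one.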
You can also try this file for a shape search:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim
The above search takes a couple of minutes because it is only run on a single core. It is generally recommended to parallelize the shape search on multiple cores using the -np/--nproc argument to speed things up, e.g. on 6 cores/threads:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 6
VSFlow takes care you do not use more cores than available on your system. Python's built-in multiprocessing tools are used in the first place for parallelization. However, to further speed things up, it is additionally possible to use all available threads of your machine for the 3D alignment step via the C++ code of RDKit. To do so, simply specify the --boost flag:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 6 --boost
Caveat: Avoid running other demanding jobs at the same time, because your machine may slow down. Parallelizing only the alignment step via RDKit's C++ code is also possible when using only one core:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim --boost
You can compare the run times of the above examples on your machine.
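The limit on the number of cores mentioned above could be implemented along the following lines. This is a sketch under assumptions, not VSFlow's actual logic: the function name effective_nproc and the fallback to a single core are hypothetical.

```python
import os


def effective_nproc(requested, available=None):
    """Clamp the requested number of processes to what the machine offers.

    available defaults to os.cpu_count(); if that cannot be determined,
    a single core is assumed. The result is never smaller than 1.
    """
    if available is None:
        available = os.cpu_count() or 1
    return max(1, min(requested, available))
```

With this kind of clamp, requesting 6 processes on a 4-core machine would simply result in 4 worker processes.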
By default, the 10 most similar compounds for each query molecule are written to the output SD file. This may be changed with the -t/--top_hits argument, e.g. to return the 20 most similar compounds:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -t 20
The default method for the alignment of the query conformer(s) to the database conformers is via MMFF atom types and charges. This may be changed to use calculated Crippen atomic logP contributions for the alignment instead with the -a/--align_method argument:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -a crippen
The combo score (see above) is used by default to identify the most similar compounds. It is the mean of the shape similarity and the 3D pharmacophore fingerprint similarity. Alternatively, the shape similarity or the 3D pharmacophore similarity can be used on its own via the -s/--score argument:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 -s shape
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 -s pharmacophore
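As a small worked example of the combo score, the following Python sketch averages the two similarity values and ranks compounds by the result. The function names and the dictionary layout are hypothetical and chosen for illustration; they do not reflect VSFlow's internal data structures.

```python
def combo_score(shape_sim, pharm_sim):
    """Mean of shape similarity and 3D pharmacophore fingerprint similarity."""
    return (shape_sim + pharm_sim) / 2


def rank_by_combo(scores, n=10):
    """Return the names of the n best compounds, highest combo score first.

    scores: dict mapping a compound name to (shape_sim, pharm_sim).
    """
    ranked = sorted(
        scores, key=lambda name: combo_score(*scores[name]), reverse=True
    )
    return ranked[:n]
```

A compound with a shape similarity of 0.75 and a pharmacophore similarity of 0.25 therefore has a combo score of 0.5.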
Instead of simply returning the n most similar compounds, it is also possible to provide a cutoff value (between 0 and 1) for the respective score using the -c/--cutoff argument:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 -s shape -c 0.5
In the above case, only database compounds with a Tanimoto shape similarity higher than 0.5 are written to the output file(s), if any.
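The difference between returning a fixed number of hits and applying a cutoff can be illustrated as follows. This is a minimal sketch with hypothetical names; the strict comparison follows the wording "higher than" in the text.

```python
def hits_above_cutoff(similarities, cutoff):
    """Keep only compounds whose score is strictly higher than the cutoff.

    similarities: dict mapping a compound name to its score (0 to 1).
    The result may be empty if no compound exceeds the cutoff.
    """
    return {name: s for name, s in similarities.items() if s > cutoff}
```

Unlike a top-n selection, the number of returned compounds depends entirely on the data, so an empty result is possible.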
The shape similarity is calculated according to the Shape Tanimoto metric by default. Alternatively, the Shape Protrude metric (shortcut: protr) or the Shape Tversky metric (shortcut: tver) may be selected with the --shape_simi argument:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --shape_simi protr
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --shape_simi tver
When using the non-symmetrical Shape Tversky metric, the alpha parameter (default = 0.5, weighs the query molecule) and the beta parameter (default = 0.5, weighs the database molecule) can additionally be adjusted, e.g.:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --shape_simi tver --tver_alpha 0.95 --tver_beta 0.05
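To make the role of alpha and beta more concrete, here is a sketch of the textbook Tversky index applied to molecular volumes. It assumes, for illustration only, that each volume is represented as a Python set of occupied grid points; the actual calculation in VSFlow is delegated to RDKit and may differ in detail.

```python
def tversky(query, db, alpha=0.5, beta=0.5):
    """Textbook Tversky index on two sets of occupied grid points.

    alpha weighs the part of the query volume not covered by the database
    molecule, beta weighs the part of the database volume not covered by
    the query.
    """
    common = len(query & db)
    only_query = len(query - db)
    only_db = len(db - query)
    denom = common + alpha * only_query + beta * only_db
    return common / denom if denom else 0.0


def tanimoto(query, db):
    """Tanimoto is the special case of Tversky with alpha = beta = 1."""
    return tversky(query, db, alpha=1.0, beta=1.0)
```

With alpha close to 1 and beta close to 0, as in the command above, a database molecule is penalized mainly for leaving parts of the query uncovered, not for being larger than the query. Swapping query and database molecule then changes the result, which is why the metric is non-symmetrical.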
In a comparable manner, the metric used to calculate the 3D pharmacophore fingerprint similarity can be adjusted independently. Simply pass the respective shortcut to the --pharm_simi argument, for example:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --pharm_simi dice
The pharmacophore fingerprint may be further customized regarding the pharmacophore features to be used for its generation. By default, the pharmacophore definitions by Gobbi et al. are used. It is also possible to use some basic or minimal definitions from the RDKit instead with the --pharm_feats argument, e.g.:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --pharm_feats basic
You can have a deeper look into the pharmacophore definition files (.fdef) in the resources folder in the repository to see which SMARTS are used to define pharmacophores in each case.
VSFlow offers the possibility to quickly visualize the shape screening results as PDF (--pdf flag) and/or PyMOL (--pymol) file:
vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --pdf --pymol
The PDF file contains 2D depictions of the molecules. Additionally, the combo score, the shape similarity coefficient and the 3D pharmacophore fingerprint similarity value are annotated. The PyMOL file visualizes the 3D overlay of the query molecule and the returned database molecules.
The query compounds can also be provided as 2D representations. VSFlow automatically recognizes 2D structures and generates 3D conformers for them. In principle, it is therefore possible to specify a mixed 2D/3D input file as query, although this is rarely useful. By default, 100 conformers are generated for each 2D query molecule and optimized with the MMFF94 force field, and only the lowest-energy conformer is used for the subsequent shape screening. This may be changed with the --nconfs and --keep_confs arguments:
vsflow shape -i sample2D.sdf -d fda.sdf -o sample_shapesim -np 12 --nconfs 300 --keep_confs 5
In the above example, 300 conformers are generated and optimized and the 5 conformers with the lowest energy are used for the subsequent shape screening.
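The selection of the lowest-energy conformers can be pictured with the following sketch. It assumes that the force-field energies are already available as a plain list indexed by conformer ID; the function name and this data layout are hypothetical and not taken from VSFlow.

```python
import heapq


def keep_lowest_energy(energies, keep_confs=1):
    """Return the IDs of the keep_confs conformers with the lowest energy.

    energies: list of energies, where the list index is the conformer ID.
    The IDs are returned in order of increasing energy.
    """
    return heapq.nsmallest(
        keep_confs, range(len(energies)), key=energies.__getitem__
    )
```

With --nconfs 300 and --keep_confs 5, the energies list in this sketch would hold 300 values and the function would return 5 conformer IDs.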
To generate reproducible conformers, a seed for the random number generator can be specified with the --seed argument:
vsflow shape -i sample2D.sdf -d fda.sdf -o sample_shapesim -np 12 --seed 42
It is also possible to directly specify SMILES representations of the query molecules on the command line with the -smi/--smiles argument:
vsflow shape -smi "CNCCC1=C(C2=CC=CC=C2)C=CN=C1" -d fda.sdf -o sample_shapesim -np 12 --keep_confs 3
Please make sure to put the SMILES in double quotes!
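The quotes are needed because characters such as parentheses have a special meaning in bash. When calling VSFlow from a Python script, one way to sidestep manual quoting is to build the command as a list of arguments. The sketch below uses only the Python standard library (shlex.join requires Python 3.8 or later) and only the arguments shown in the example above; the helper names are hypothetical.

```python
import shlex


def shape_command(smiles, database, output, nproc=1):
    """Build the argument list for a shape search with a SMILES query."""
    return [
        "vsflow", "shape",
        "-smi", smiles,
        "-d", database,
        "-o", output,
        "-np", str(nproc),
    ]


def as_shell_line(args):
    """Render the argument list as one safely quoted shell command line."""
    return shlex.join(args)
```

The argument list can be passed directly to subprocess.run without a shell, in which case no quoting is needed at all. Note that shlex.join uses single quotes instead of double quotes, which is equally valid in bash for this SMILES.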
As already outlined on Page "Prepare Databases", frequently used databases may be "integrated" into VSFlow. They are then stored in the folder "VSFlow_Databases" in the user's HOME directory ($HOME/VSFlow_Databases) or in the folder "Databases" in the VSFlow repository/installation folder ($Repository/Databases). For example, the following command will download all PDB ligands and store them in a file named pdb.vsdb in the "VSFlow_Databases" folder in the HOME directory (see Page "Prepare Databases" for more information):
vsflow preparedb -d pdb -int pdb
You could also integrate the FDA drugs provided in the examples folder, additionally standardize them and generate fingerprints:
vsflow preparedb -i fda.sdf -int fda -s -f ecfp -np 6
You can now use the mode managedb to interact with the integrated databases.
vsflow managedb {arguments}
To show all arguments:
vsflow managedb -h
You can show all integrated databases (that means all .vsdb files in the folder "VSFlow_Databases" in the HOME directory) as follows:
vsflow managedb -s
This will show you a table listing the databases with their name/shortcut, date created, standardized yes or no, canonical tautomer included yes or no, number of conformers per molecule, calculated fingerprints, and total number of compounds.
You can also move .vsdb files manually to the folder. They are then also considered as "integrated" and are listed when using the above command.
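Since every .vsdb file in the folder counts as integrated, the list of database shortcuts can be approximated by scanning that folder. The following sketch assumes the default location $HOME/VSFlow_Databases and assumes that the shortcut equals the file name without the extension; the function list_integrated is hypothetical and does not read the additional metadata that managedb -s displays.

```python
from pathlib import Path


def list_integrated(folder=None):
    """Return the sorted names of all .vsdb files in the database folder.

    folder defaults to $HOME/VSFlow_Databases (the default local path
    mentioned in the text). A missing folder yields an empty list.
    """
    folder = Path(folder) if folder is not None else Path.home() / "VSFlow_Databases"
    if not folder.is_dir():
        return []
    return sorted(p.stem for p in folder.glob("*.vsdb"))
```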
One advantage of integrated databases is that you can use them from anywhere on the system by passing only the shortcut/name to the -d/--database argument. You do not need to specify the full path.
You can customize two paths for integrating databases:
vsflow managedb --set_local new/path/to/integrate/database
The above command will change the local database path ($HOME/VSFlow_Databases) to the new path.
vsflow managedb --set_global new/path/to/integrate/database/globally
The above command will set a global database path. This may be useful if VSFlow is run on a server and some databases should be accessible to different users.
It is also possible to remove integrated databases directly using VSFlow. Simply pass the database shortcut to the --remove argument:
vsflow managedb --remove shortcut/name
This will also delete the file from the disk. Alternatively, you can delete the file manually.
If you have a preferred database, you can also set it as the default database:
vsflow managedb --set_default shortcut/name
Now, if you do not specify the -d/--database argument (in mode substructure, fpsim and shape), this database is used by default.
The following files are supported using the -i/--input argument:
The following files are supported using the -d/--database argument:
All files can be found in the examples folder in this GitHub repository. For further explanation, please read the paper (in progress).
fda.sdf: FDA-approved drugs from the ZINC database, >1600 molecules
2gqg_C_1N1.sdf: from PDB
vsflow substructure -sma "s:1:c:n:c:c:1" -d fda.sdf -o substructure.sdf --pdf
Searches for a thiazole substructure in the drugs. Generates an SD file (substructure.sdf) as well as a PDF file (substructure.pdf) as output. For more information, see Chapter 3.
vsflow fpsim -d fda.sdf -o fingerprint.xlsx --pdf --simmap -smi "Cc1cccc(c1NC(=O)c2cnc(s2)Nc3cc(nc(n3)C)N4CCN(CC4)CCO)Cl"
Default parameters (see Chapter 4). Creates an Excel output file (fingerprint.xlsx) and a PDF output file (fingerprint.pdf).
vsflow preparedb -i fda.sdf -np 8 -c -o fda_multiple_confs.vsdb
Multiple conformers needed for shape screening. Chapter 2 provides more information on how to prepare databases.
vsflow shape -i 2gqg_C_1N1.sdf -np 8 -d fda_multiple_confs.vsdb -o shape.sdf --pymol
Default parameters (see Chapter 5 for more details). Creates two SD files and a PyMOL session file: the query molecule (shape_1_query.sdf), the hits found (shape_1.sdf), and the hits together with the query molecule (shape_1.pse).