Table of contents

Home
1-Installation-and-General-Usage
2-Prepare-Databases
3-Substructure-Search
4-Fingerprint-Similarity
5-Shape-Screening
6-Manage-Databases
7-Supported-File-Formats
8-Use-cases-from-publication

Home

Welcome to the VSFlow Wiki!

The following pages describe the installation and usage of VSFlow in detail.

VSFlow is an open-source command-line tool written in Python for the ligand-based virtual screening of large compound libraries. It includes a substructure-based, a fingerprint-based and a shape-based virtual screening tool. Additionally, it provides a tool to standardize compound libraries and generate conformers.

1-Installation-and-General-Usage

Installation instructions

First of all, you need a working installation of Anaconda (https://www.anaconda.com/products/individual) or Miniconda (https://conda.io/en/latest/miniconda.html). Both are available for all major platforms.

Second, you need to clone the VSFlow GitHub repository to your system or download the zip file and unpack it (in the following called the repository folder).

All following instructions assume working with a bash shell!

Navigate into the repository folder.

Now, you can install the required dependencies with the provided environment.yml file within the repository folder as follows:

conda env create --quiet --force --file environment.yml
conda activate vsflow

Alternatively, you can also create a new conda environment and install the dependencies manually:

conda create -n vsflow python=3.7
conda activate vsflow
conda install -c rdkit -c conda-forge -c viascience -c schrodinger rdkit xlrd xlsxwriter pdfrw fpdf pymol molvs matplotlib

The Python dependencies are:

Python = 3.7
RDKit >= 2019.09.3
FPDF >= 1.7.2
PDFRW >= 0.4
XlsxWriter >= 1.2.7
Xlrd >= 1.2.0
PyMOL >= 2.3.4
Molvs >= 0.1.1
Matplotlib >= 3.3.4

Now, you can install VSFlow as follows:

pip install . --use-feature=in-tree-build

General Usage

Always make sure the conda environment is activated. Now you can run VSFlow as follows:

vsflow {mode} {arguments}

For example, the following command will display all included modes (substructure, fpsim, shape, preparedb, managedb) and the general usage:

vsflow -h

To display all possible arguments for a particular mode, type as follows:

vsflow {mode} -h

For example, with the following command all arguments for mode substructure are shown:

vsflow substructure -h

2-Prepare-Databases

Databases

VSFlow contains a tool to prepare compound libraries for virtual screening. It allows for standardization of the molecules, generation of fingerprints and generation of multiple conformers. As input files, sdf/sdf.gz, csv and Excel files (containing molecules as SMILES or InChI) and several text file formats (.smi, .sma, .ich, .tsv, .txt containing SMILES/InChI) are supported. The output file is a "virtual screening database" (.vsdb) file. The vsdb file is a pickle file containing all molecules in a special dictionary format ready to use with VSFlow.
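Since a .vsdb file is a pickle file, writing and reading it follows the usual pattern of Python's pickle module. The following is a minimal sketch of that round trip only. The dictionary layout and the key names (name, smiles) are hypothetical placeholders, because the actual internal structure of a .vsdb file is not described here. As with any pickle file, only load files from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical layout: one entry per molecule, keyed by an integer index.
# The real .vsdb dictionary format may differ.
database = {
    0: {"name": "pyridine", "smiles": "C1=CN=CC=C1"},
    1: {"name": "fluorobenzene", "smiles": "FC1=CC=CC=C1"},
}


def save_db(db, path):
    """Serialize the dictionary to disk with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(db, handle)


def load_db(path):
    """Load a pickled dictionary back into memory."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "example.vsdb"
    save_db(database, db_path)
    restored = load_db(db_path)

print(restored == database)
```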

General Usage

Always make sure the conda environment is activated.

vsflow preparedb {arguments}

The following command will display the help with all available arguments:

vsflow preparedb -h

Arguments:

Required:

-i, --input
specify path of input file
OR
-d, --download
specify shortcut for database that should be downloaded [chembl or pdb]

Optional:

-o , --output
specify name (and path) of output file without file extension [default: prep_database]
-int , --integrate
specify shortcut for database; saves database to $HOME/VSFlow_Databases
-intg, --int_global
specify shortcut for database; stores prepared database within the repository folder
-s, --standardize
standardizes molecules, removes salts and associated charges
-can, --canonicalize
if specified, the canonical tautomer for every molecule is generated and stored in the output database file
-c, --conformers
generates multiple 3D conformers for database molecules
-np , --nproc
specify number of processors to run application in multiprocessing mode
-f, --fingerprint
if specified, the selected fingerprint is generated for database molecules [rdkit, ecfp, fcfp, ap, tt, maccs]
-r, --radius
specify radius for circular fingerprints ecfp and fcfp [default: 2]
-nb, --nbits
specify bit size of fingerprints [default: 2048]
--no_chiral
if specified, chirality of molecules will not be considered for fingerprint generation
--max_tauts
maximum number of tautomers to be enumerated during standardization process [default: 100]
--nconfs
maximum number of conformers generated [default: 20]
--rms_thresh
if specified, only those conformations out of --nconfs that are at least this different are retained (RMSD calculated on heavy atoms)
--seed
specify seed for random number generator, for reproducibility
--boost
distributes conformer generation across all available threads of your CPU
--header
Specify number of row in csv/xlsx file containing the column names, if not automatically recognized [e.g. 1 for first row]
--mol_column
Specify name (or position) of mol column [SMILES/InChI] in csv/xlsx file, if not automatically recognized [e.g. 'SMILES' or '1' (for first column)]
--delimiter
Specify delimiter of csv file, if not automatically recognized
-h, --help
show this help message and exit

Detailed Usage

Navigate into the examples folder.
For the following examples, you can download all ligands from the PDB database (with ideal geometries) or the Chembl database directly within VSFlow as follows (you need a working internet connection), e.g.:

vsflow preparedb -d pdb -o pdb_ligs

With the above command, the file containing the PDB ligands is automatically downloaded and written to the output file named pdb_ligs.vsdb in the examples folder. You can perform all operations described in the following using this file as input.

However, to quickly demonstrate the usage, you can specify the SD file containing all approved FDA drugs (approx. 1600 compounds, downloaded from public Zinc database: https://zinc.docking.org/substances/subsets/fda/) in the examples folder as input:

vsflow preparedb -i fda.sdf -o fda_drugs

In the above example, the input is simply converted to a vsdb file, without any standardization or conformer/fingerprint generation. It is generally recommended to convert frequently used compound libraries to vsdb files because they typically load much faster. If you do not provide an output file name, the output is written to prep_database.vsdb by default. It is not necessary to provide the file extension for the output file; databases are always saved as pickle files with the file extension .vsdb. However, it is essential to provide the file extension for the input file, since VSFlow recognizes the file format from it. Other supported input file formats are text files (.smi, .sma, .ich, .txt/.txt.gz, .csv/.csv.gz, .tsv/.tsv.gz), gzipped SD files (.sdf.gz) and Excel files (.xlsx).
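To illustrate why the input file extension matters, here is a minimal sketch of extension-based format detection, including the double extensions of gzipped files. The function name detect_format and the exact error handling are hypothetical and not taken from the VSFlow source; only the extension lists come from the text above.

```python
from pathlib import Path

# Extensions taken from the text above.
SUPPORTED = {".sdf", ".smi", ".sma", ".ich", ".txt", ".csv", ".tsv", ".xlsx"}
# Formats that are listed as also accepted in gzipped form.
GZIP_ALLOWED = {".sdf", ".txt", ".csv", ".tsv"}


def detect_format(filename):
    """Return (extension, is_gzipped) for a supported input file name."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if gzipped:
        suffixes = suffixes[:-1]  # look at the extension in front of .gz
    if not suffixes or suffixes[-1] not in SUPPORTED:
        raise ValueError(f"unsupported or missing file extension: {filename}")
    if gzipped and suffixes[-1] not in GZIP_ALLOWED:
        raise ValueError(f"gzipped input not supported for: {filename}")
    return suffixes[-1], gzipped


print(detect_format("fda.sdf"))
print(detect_format("library.csv.gz"))
```

A file name without an extension, such as fda, raises a ValueError in this sketch, which mirrors the requirement that the input extension must be given.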

By specifying the -s/--standardize flag, all compounds in the database are additionally standardized:

vsflow preparedb -i fda.sdf -o fda_std -s

Standardization includes: standardization according to molvs [1] rules, disconnecting metals and salts and removing charges. Additionally, it is possible to generate the canonical tautomer (-can/--canonicalize argument) and store it in addition to the standardized molecule in the database file:

vsflow preparedb -i fda.sdf -o fda_std -s -can

Standardization and canonicalization are generally recommended and are required to use some screening capabilities of VSFlow properly.

Since tautomer generation can be time-consuming for larger compound databases, parallelization is possible to speed things up via the -np/--nproc argument, e.g. by running on 6 cores/threads (probably available on most modern machines):

vsflow preparedb -i fda.sdf -o fda_std -s -can -np 6

Parallelization is done via Python's built-in multiprocessing module.

By specifying the -f/--fingerprint argument, the selected fingerprints are generated for all molecules and stored within the output database file:

vsflow preparedb -i fda.sdf -o fda_std_fps -s -can -f ecfp -np 6

The following fingerprints are supported:

Fingerprints

fcfp: FCFP-like Morgan fingerprint from the RDKit (extended connectivity fingerprint with pharmacophore feature definitions, circular fingerprint)
ecfp: ECFP-like Morgan fingerprint from the RDKit (extended connectivity fingerprint, circular fingerprint)
rdkit: RDKit fingerprint (Daylight-like fingerprint, substructure fingerprint)
ap: Atom Pairs fingerprint from the RDKit
tt: Topological Torsion fingerprint from the RDKit
maccs: SMARTS-based implementation of the 166 public MACCS keys from the RDKit

Via the -r/--radius argument, the radius (default: 2) for the circular fingerprints ecfp and fcfp can be changed. The -nb/--nbits argument changes the bit size of the fingerprint (default: 2048), and if the --no_chiral flag is specified, chirality is ignored, e.g.:

vsflow preparedb -i fda.sdf -o fda_fps -f ecfp -r 3 -nb 4096 --no_chiral

By specifying the -c/--conformers flag, 3D conformers for all database compounds are generated:

vsflow preparedb -i fda.sdf -o fda_confs -c

You can optionally use the --seed argument to specify a seed for the random number generator for reproducibility purposes:

vsflow preparedb -i fda.sdf -o fda_std_confs -c --seed 42

Caveat: 3D information of molecules read from an input SD file is overwritten when generating conformers!
By default, 20 conformers per molecule are generated. This may be changed using the --nconfs argument. By specifying the --rms_thresh argument, only those conformers out of --nconfs which have an RMSD (calculated on heavy atoms) greater than the specified value are retained, e.g.:

vsflow preparedb -i fda.sdf -o fda_std_confs -c --nconfs 10 --rms_thresh 0.3

With the above command, 10 conformers per compound are generated and only those with an RMSD greater than 0.3 are retained. Note that the compounds are only standardized if the -s/--standardize flag is added. The --rms_thresh argument is useful to keep only those conformers with significant differences.
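The effect of an RMSD threshold can be illustrated with a simplified greedy filter. This is only a sketch under stated assumptions: conformers are given as lists of heavy-atom coordinates in identical atom order, no alignment is performed before the RMSD is calculated (real implementations typically align conformers first), and the function names rmsd and prune_conformers are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def prune_conformers(conformers, rms_thresh):
    """Keep a conformer only if it is at least rms_thresh away from all kept ones."""
    kept = []
    for conf in conformers:
        if all(rmsd(conf, other) >= rms_thresh for other in kept):
            kept.append(conf)
    return kept


# Made-up two-atom "conformers" purely for illustration
conformers = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.5, 0.1, 0.0)],  # nearly identical to the first
    [(0.0, 0.0, 0.0), (0.0, 1.5, 0.0)],  # clearly different
]
print(len(prune_conformers(conformers, rms_thresh=0.3)))
```

In this toy input the second conformer is discarded because it lies within 0.3 of the first one, so fewer conformers than requested remain.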

Since conformer generation can be time-consuming, it is reasonable to parallelize tasks using the -np/--nproc argument. Here you can specify the number of cores/threads of your system to be used for parallelization. VSFlow ensures that no more threads are used than are available. With the following command, molecules are standardized, 30 conformers are generated and fingerprints are calculated using 6 cores/threads:

vsflow preparedb -i fda.sdf -o fda_s_confs_fps -s -c -f ecfp --nconfs 30 -np 6
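The two ideas behind -np/--nproc, capping the requested number of workers at what the machine offers and splitting the work into chunks for Python's multiprocessing module, can be sketched as follows. All function names are hypothetical, process_chunk is only a placeholder for the real per-molecule work, and this is not VSFlow's actual implementation.

```python
import os
from multiprocessing import Pool


def effective_nproc(requested):
    """Cap the requested worker count at the number of CPUs reported by the OS."""
    available = os.cpu_count() or 1
    return max(1, min(requested, available))


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal parts."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Placeholder for the real work; here it only measures string lengths."""
    return [len(smiles) for smiles in chunk]


def run_parallel(items, requested_nproc):
    """Process items in worker processes and return results in input order."""
    nproc = effective_nproc(requested_nproc)
    chunks = split_into_chunks(items, nproc)
    with Pool(processes=nproc) as pool:
        results = pool.map(process_chunk, chunks)
    return [value for chunk in results for value in chunk]


smiles = ["C1=CN=CC=C1", "FC1=CC=CC=C1", "CCO", "CC(=O)O", "c1ccccc1"]
print(effective_nproc(6))
print(split_into_chunks(smiles, 2))
```

When run_parallel is used in a script, it should be called from within an if __name__ == "__main__": block, because some platforms start worker processes by re-importing the main module.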

To further speed up calculations, the conformer generation itself can be further distributed to all available threads of the system via the C++ code of RDKit with the --boost flag:

vsflow preparedb -i fda.sdf -o fda_std_confs -s -c -f ecfp -np 6 --boost

This will distribute the calculations to 6 threads but will additionally use all other available threads to generate the conformers. You may see how the run time differs for the above examples on your machine.
Caveat: Make sure you are not running other demanding tasks when parallelizing to all available threads, since this may slow down your machine!

Integration of Databases

Instead of specifying the -o/--output argument, it is also possible to "integrate" the database into VSFlow using the -int/--integrate and -intg/--int_global arguments. Do not provide a file extension, just specify a shortcut name:

vsflow preparedb -i fda.sdf -int fda -s -c

With the above command, the prepared database is saved as a file named fda.vsdb in the folder "VSFlow_Databases" in the user's HOME directory. VSFlow can now access the database from anywhere on the system; the user only needs to pass the shortcut name to the -d/--database argument, e.g. for a substructure search (see Page "Substructure Search" for more information):

vsflow substructure -smi C1=CN=CC=C1 -d fda -o pyr_subs.sdf

When the -intg/--int_global argument is specified instead, the prepared database is also saved to the folder $HOME/VSFlow_Databases by default:

vsflow preparedb -i fda.sdf -intg fda -s

Both paths can be changed in the mode "managedb" (see Page "Manage Databases" for more information).
It may be useful to change the global database path (-intg/--int_global) if VSFlow is run on a server with multiple users and some databases should be accessible for all users.

3-Substructure-Search

Substructure Search

VSFlow can perform a substructure search in compound libraries (databases). The database file must have the format .sdf, .sdf.gz or .vsdb (see Page "Prepare Databases" for how to generate a .vsdb file). The query molecules/patterns can be provided in multiple formats (.sdf, .xlsx, .csv, .tsv, .smi, .sma, .ich) or can be specified directly as SMILES or SMARTS on the command line. The implementation of the substructure search is based on the "GetSubstructMatches" functionality available for RDKit Mol objects.

General Usage

vsflow substructure {arguments}

To display the help with all available arguments:

vsflow substructure -h

Arguments

Required:

-i , --input
specify path of input file for query molecules
OR
-smi , --smiles
specify SMILES string on command line in double quotes
OR
-sma , --smarts
specify SMARTS string on command line in double quotes
-d , --database
specify path/name of the database file [sdf, sdf.gz or vsdb] or specify the shortcut for an integrated database (not required if a default database is set, see Page "Manage Databases" for more information)

Optional:

-o , --output
specify name/path of output file, supported formats are .sdf, .xlsx, .csv [default: substructure.sdf]
-m , --mode
choose a mode for substructure search [std, all_tauts, can_taut, no_std]
-np , --nproc
specify the number of processors to be used to run the application in multiprocessing mode
-fm, --fullmatch
if specified, only full matches are returned
-p, --properties
if specified, calculated molecular properties (e.g. MW, TPSA etc.) are written to the output files
-nt , --ntauts
maximum number of query tautomers to be enumerated when mode (-m/--mode) all_tauts is used [default: 100]
-mf, --multfile
if specified, a separate output file for every query molecule is generated
--filter
specify molecular property to filter screening results, see documentation
--mol_column
specify name (or position) of mol column [SMILES/InChI] in csv/xlsx file, if not automatically recognized
--delimiter
specify delimiter of csv file, if not automatically recognized
--pdf
generate a pdf file in addition to the output file(s) with visualization of the 2D structures and substructure highlighting
--combine
if specified, multiple arguments provided via -smi/--smiles or -sma/--smarts are combined to one query
-h, --help
show this help message and exit

Detailed Usage

Navigate into the examples folder.
The following examples use the FDA drugs database prepared on the Page "Prepare Databases" (fda_std.vsdb, compounds downloaded from the Zinc database: https://zinc.docking.org/substances/subsets/fda/) and the database generated by downloading the ligands from the PDB database (pdb_ligs.vsdb, see Page "Prepare Databases" for more information). The sample query files contain three fragment-like compounds.

To perform a basic substructure search, specify an input file containing the query molecules and a database file:

vsflow substructure -i {.sdf, .xlsx, .csv, .smi, .sma, .ich, .txt} -d {.sdf, .sdf.gz, .vsdb}

To try it, use the following example:

vsflow substructure -i fragments.sdf -d fda.sdf

If no name for the output file (-o, --output) is specified, the output containing the matching database compounds is written to an SD file named substructure.sdf by default.

In addition to an SD file, the substructure results can also be written to a csv or xlsx file. Just specify the respective file extension, e.g.:

vsflow substructure -i fragments.sdf -d pdb_ligs.vsdb -o fragment_subs_pdb.xlsx

It is also possible to specify the query directly as SMILES (-smi/--smiles) or SMARTS (-sma/--smarts) on the command-line:

vsflow substructure -smi "FC1=CC=CC=C1" -d pdb_ligs.vsdb -o smiles_subs_pdb.csv

vsflow substructure -sma "[C,N]1CCCCC1" -d pdb_ligs.vsdb -o smarts_subs_pdb.sdf

Please make sure to put the SMILES or SMARTS in double quotes!

It is also possible to specify multiple SMILES or SMARTS at the same time, e.g.:

vsflow substructure -smi "FC1=CC=CC=C1" -smi "C1=CN=CC=C1" -d pdb_ligs.vsdb -o smiles_subs_pdb.csv

With the above command, the two SMILES are treated as separate queries. By specifying the --combine flag, the two SMILES are treated as one query. Only database molecules containing both queries as substructures are then written to the output file:

vsflow substructure -smi "FC1=CC=CC=C1" -smi "C1=CN=CC=C1" -d pdb_ligs.vsdb -o 2subs_pdb.csv --combine

You can also specify text files containing one SMILES (.smi file) or SMARTS pattern (.sma file) per line. Have a closer look at the files in the examples folder, e.g.:

vsflow substructure -i sample.sma -d pdb_ligs.vsdb -o sample_sub_pdb.csv

If the query molecules are provided as xlsx or csv file, VSFlow tries to automatically detect the molecule column (containing SMILES or InChI) as well as the separator (for csv files). In case it does not work at some point, an error message is displayed asking you to specify the parameters manually. You can do so via the --mol_column and --delimiter argument:

vsflow substructure -i fragments.csv -d pdb_ligs.vsdb --mol_column smiles --delimiter ","
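How automatic detection of the separator and the molecule column can work in principle is sketched below with csv.Sniffer from the Python standard library. This is an illustration only: whether VSFlow uses csv.Sniffer is not stated here, and the header names in MOL_COLUMN_NAMES as well as the function name sniff_table are assumptions.

```python
import csv
import io

# Hypothetical header names that mark the molecule column.
MOL_COLUMN_NAMES = ("smiles", "inchi")


def sniff_table(text):
    """Guess the delimiter and the index of the molecule column of a CSV text."""
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t")
    rows = list(csv.reader(io.StringIO(text), delimiter=dialect.delimiter))
    header = [cell.strip().lower() for cell in rows[0]]
    for index, name in enumerate(header):
        if name in MOL_COLUMN_NAMES:
            return dialect.delimiter, index
    # Corresponds to the case where the parameters must be given manually.
    raise ValueError("molecule column not found; specify it manually")


sample = (
    "name;smiles\n"
    "pyridine;C1=CN=CC=C1\n"
    "fluorobenzene;FC1=CC=CC=C1\n"
)
delimiter, mol_column = sniff_table(sample)
print(repr(delimiter), mol_column)
```

If the header does not contain a recognizable column name, the sketch raises an error, which is the situation in which --mol_column and --delimiter are needed.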

When providing multiple query molecules at the same time, it may be useful to generate a dedicated output file for every query molecule with the -mf/--multfile flag:

vsflow substructure -i fragments.sdf -d pdb_ligs.vsdb -o fragment_subs_pdb.xlsx -mf

In the above case, three Excel files (_1 - _3) containing the results for the respective query molecules are generated.

VSFlow also offers the possibility to directly filter the screening results by some physicochemical properties with the --filter argument, e.g.:

vsflow substructure -i fragments.csv -d pdb_ligs.vsdb -o frags_subs_filt.sdf --filter mw_300

With the above command, only those substructure results with a maximum molecular weight of 300 Da are written to the output file. In the following table, the supported properties and their shortcuts are summarized:

Property table

property	shortcut	example
Molecular Weight in Da	mw	mw_300
cLogP value	logp	logp_1.5
number of H donors	nhdo	nhdo_3
number of H acceptors	nhac	nhac_5
number of rotatable bonds	nrob	nrob_5
number of aromatic rings	naro	naro_3
number of heteroaromatic rings	nhet	nhet_3
topological polar surface area in Å²	tpsa	tpsa_80

Make sure to separate the shortcut for the property and the value with an underscore (see examples in the table). Please note that the specified values are always considered as upper limit, e.g. nhac_5 means that only compounds with a maximum of 5 hydrogen acceptors are considered. It is also possible to combine multiple properties, e.g.:

vsflow substructure -i fragments.sdf -d fda_std.vsdb -o frags_fda_filt.sdf --filter mw_400 --filter logp_3

With the above statement, only those database compounds containing the respective substructure and possessing a maximum molecular weight of 400 Da and a maximum clogP of 3 are written to the output file.
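The filter syntax (shortcut, underscore, upper limit) can be illustrated with a small parser. This is a sketch, not VSFlow's code: the function names parse_filters and passes are hypothetical, and the property values of the example hits are made up.

```python
# Shortcuts as listed in the property table.
KNOWN_SHORTCUTS = {"mw", "logp", "nhdo", "nhac", "nrob", "naro", "nhet", "tpsa"}


def parse_filters(expressions):
    """Turn strings such as 'mw_300' into a {shortcut: upper_limit} mapping."""
    limits = {}
    for expression in expressions:
        shortcut, sep, value = expression.partition("_")
        if not sep or shortcut not in KNOWN_SHORTCUTS:
            raise ValueError(f"invalid filter expression: {expression!r}")
        limits[shortcut] = float(value)
    return limits


def passes(properties, limits):
    """True if every filtered property is at or below its upper limit."""
    return all(properties[key] <= limit for key, limit in limits.items())


limits = parse_filters(["mw_400", "logp_3"])

# Made-up property values purely for illustration
hits = [
    {"id": "A", "mw": 350.4, "logp": 2.1},
    {"id": "B", "mw": 420.5, "logp": 1.0},
    {"id": "C", "mw": 180.2, "logp": 3.6},
]
passed = [hit["id"] for hit in hits if passes(hit, limits)]
print(passed)
```

Only hit A satisfies both limits here, since B exceeds the molecular weight limit and C exceeds the cLogP limit.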

It is also possible to return only full matches, i.e. only compounds identical to the queries, if any, by specifying the -fm/--fullmatch flag (strictly speaking, this is no longer a substructure search, but it can be useful at times):

vsflow substructure -i fragments.sdf -d pdb_ligs.vsdb -o frags_pdb_full.sdf -fm

In addition to the sdf/xlsx/csv output files, VSFlow can also generate a PDF file for the substructure results with 2D depiction of the molecule structure, highlighting of the substructure and annotation of molecular properties. Simply specify the --pdf flag:

vsflow substructure -i fragments.sdf -d fda_std.vsdb -o frags_subs_fda.sdf --pdf

If the -p/--properties flag is specified, all properties listed in the above table are additionally written to all output files:

vsflow substructure -i fragments.sdf -d fda_std.vsdb -o frags_subs_fda.sdf --pdf -p

For larger databases and numerous query molecules, it might be useful to parallelize the substructure search with the -np/--nproc argument on multiple cores/threads. VSFlow makes sure you do not use more threads than available:

vsflow substructure -i fragments.csv -d pdb_ligs.vsdb -o frags_subs_2.sdf -np 12

The above command performs the substructure search using 12 cores/threads of your system. Parallelization is done with Python's built-in multiprocessing capabilities.

Usage of -m/--mode argument

The substructure search can be performed using different modes (std, all_tauts, can_taut, no_std), which can be specified with the -m/--mode argument. By default, the mode "std" is selected. To properly use the different modes, the database should be standardized with the "preparedb" functionality of VSFlow (see Page "Prepare Databases"). Here is an overview of what each mode does:

std: standardizes the query molecules (molvs standardization, removal of salts and neutralization of charges), uses standardized database molecules for substructure search (if database is standardized)
all_tauts: standardizes the query molecules and generates a maximum of n tautomers (default: n=100, can be changed with -nt/--ntauts argument) for all query molecules, performs substructure search with all query tautomers on standardized database molecules (if database is standardized)
can_taut: standardizes the query molecules and generates the canonical tautomer (according to molvs) for each query molecule, performs the substructure search with the canonical query tautomer on the canonical tautomer of the database molecules (if database is standardized/canonicalized)
no_std: no standardization of query molecules

The different modes are useful to overcome issues arising from different molecule representations between query and database molecules. For example, the mode "can_taut" using the canonical tautomer may be useful to properly identify identical compounds despite being represented in different tautomeric states. To do so, run the substructure search in mode "can_taut" and return only full matches (see also above):

vsflow substructure -i sample.csv -d pdb_ligs.vsdb -o sample_full.sdf -fm -m can_taut

When using the mode "all_tauts", all possible tautomers (up to 100 by default, may be changed with the -nt/--ntauts argument) of the query molecules are used for the substructure search. This may also help to identify matching compounds, particularly if the database is not standardized.

4-Fingerprint-Similarity

Fingerprint Similarity Search

VSFlow includes a tool to search for similar compounds in large databases via fingerprint-based molecular similarity. It contains all fingerprints currently implemented in the RDKit (Morgan, RDKit, Topological Torsion and Atom Pairs fingerprint and MACCS keys) and different similarity measures (Tanimoto, Tversky, Cosine, Dice, Sokal, Russel, Kulczynski and McConnaughey similarity). It is possible to search for the n most similar compounds as well as to provide a cutoff value for the similarity.

General Usage

vsflow fpsim {arguments}

To display the help with all available arguments:

vsflow fpsim -h

Arguments

-h, --help
show this help message and exit

Required:

-i, --input
specify path of input file for query molecules
OR
-smi , --smiles
specify SMILES string on command line in double quotes
-d , --database
specify path/name of the database file [sdf, sdf.gz or vsdb] or specify the shortcut for an integrated database

Optional:

-o , --output
specify path/name of output file [default: fingerprint.sdf]
-m , --mode
choose a mode for similarity search [std, all_tauts, can_taut, no_std]
-np , --nproc
specify the number of processors to be used to run the application in multiprocessing mode
-f , --fingerprint
specify fingerprint to be used [rdkit, ecfp, fcfp, ap, tt, maccs, from_db], [default: fcfp]
-r , --radius
specify radius of circular fingerprints ecfp and fcfp [default: 2]
-nb , --nbits
specify bit size of fingerprints [default: 2048]
-s , --similarity
specify fingerprint similarity metric to be used [tan, dice, cos, sok, russ, kulc, mcco, tver], [default: tan]
-t , --top_hits
specify maximum number of molecules with highest similarity to keep [default: 10]
-c , --cutoff
specify cutoff value for similarity coefficient
-nt , --ntauts
maximum number of query tautomers to be enumerated when mode (-m/--mode) all_tauts is used [default: 100]
-mf, --multfile
if specified, a separate output file for every query molecule is generated
-p, --properties
if specified, calculated molecular properties (e.g. MW, TPSA etc.) are written to the output files
--filter
specify molecular property to filter screening results, see documentation
--mol_column
specify name (or position) of mol column [SMILES/InChI] in csv/xlsx file, if not automatically recognized
--delimiter
specify delimiter of csv file if not automatically recognized
--pdf
generate a pdf file in addition to the output file(s) with visualization of the 2D structures and annotation of similarity coefficient
--simmap
if specified, RDKit similarity maps for supported fingerprints are generated in the pdf file
--no_chiral
if specified, chirality will not be considered for fingerprint generation
--tver_alpha
specify alpha parameter (weighs database molecule) for Tversky similarity [default: 0.5]
--tver_beta
specify beta parameter (weighs query molecule) for Tversky similarity [default: 0.5]

Detailed Usage

Navigate into the examples folder.
To perform a basic similarity search, simply specify an input file containing the query molecules (-i/--input) and a database file (-d/--database):

vsflow fpsim -i {.sdf, .xlsx, .csv, .smi, .sma, .ich} -d {.sdf, .sdf.gz, .vsdb}

You can try it with some example files provided in the repository folder (see Pages "Prepare Databases" and "Substructure Search" for more details):

vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb

If no output file is specified, the results are written to an SD file named fingerprint.sdf. By default, a 2048 bit feature Morgan Fingerprint with radius 2 (FCFP4-like fingerprint) is used and the similarity is calculated with the Tanimoto metric. The 10 most similar compounds per query molecule are written to the output file by default. All of this may be changed with the appropriate arguments, e.g.:

vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -f ecfp -r 3 -nb 4096 -s dice -t 20 -o sample_sim.xlsx

With the above command, a non-feature Morgan fingerprint (-f/--fingerprint) with 4096 bits (-nb/--nbits) and radius 3 (-r/--radius), a so-called ECFP6-like fingerprint, is used. Similarities are calculated with the Dice metric (-s/--similarity) and the 20 most similar compounds (-t/--top_hits) are written to the output Excel file named sample_sim.xlsx (-o/--output). In the following, the implemented fingerprints and similarity metrics and their shortcuts are listed:

Fingerprints

fcfp: FCFP-like Morgan fingerprint from the RDKit (extended connectivity fingerprint with pharmacophore feature definitions, circular fingerprint)
ecfp: ECFP-like Morgan fingerprint from the RDKit (extended connectivity fingerprint, circular fingerprint)
rdkit: RDKit fingerprint (Daylight-like fingerprint, substructure fingerprint)
ap: Atom Pairs fingerprint from the RDKit
tt: Topological Torsion fingerprint from the RDKit
maccs: SMARTS-based implementation of the 166 public MACCS keys from the RDKit

Similarity Metrics

tan: Tanimoto Similarity
dice: Dice Similarity
cos: Cosine Similarity
sok: Sokal Similarity
russ: Russel Similarity
kulc: Kulczynski Similarity
mcco: McConnaughey Similarity
tver: Tversky Similarity
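For orientation, three of the listed metrics are sketched below in plain Python, with each fingerprint represented as the set of its on-bit indices. These are the standard textbook formulas and only an illustration; VSFlow itself relies on the RDKit, and the function names are hypothetical.

```python
import math


def tanimoto(a, b):
    """Common on-bits divided by the on-bits set in either fingerprint."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def dice(a, b):
    """Twice the common on-bits divided by the summed on-bit counts."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Common on-bits divided by the geometric mean of the on-bit counts."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


# Made-up on-bit sets for illustration
query = {1, 2, 3, 4}
candidate = {3, 4, 5, 6}
print(tanimoto(query, candidate), dice(query, candidate), cosine(query, candidate))
```

For the same pair of fingerprints, the Dice value is never smaller than the Tanimoto value, which is worth keeping in mind when choosing a cutoff.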

It is also possible to provide a cutoff value (-c/--cutoff) for the respective similarity metric. The cutoff value must be between 0 and 1. Only compounds with a similarity higher than the cutoff value are now returned and written to the output file, e.g.:

vsflow fpsim -i sample.sdf -d fda_std.vsdb -o sample_fda_sim.sdf -s dice -c 0.6

In the above case, only database molecules with a Dice similarity greater than 0.6 are written to the output file, if any (there should be none; set the cutoff to 0.5 to get some results).
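The difference between keeping the n most similar compounds (-t/--top_hits) and keeping all compounds above a cutoff (-c/--cutoff) can be sketched as follows. The function select_hits is hypothetical and the similarity values are made up. How VSFlow behaves when both options are given is not assumed here, so the sketch simply lets the cutoff take precedence.

```python
import heapq


def select_hits(scores, top_hits=10, cutoff=None):
    """Return (name, score) pairs sorted by decreasing similarity.

    With a cutoff, keep every compound scoring above it; otherwise keep
    the top_hits best-scoring compounds.
    """
    if cutoff is not None:
        kept = [(name, score) for name, score in scores.items() if score > cutoff]
        return sorted(kept, key=lambda pair: pair[1], reverse=True)
    return heapq.nlargest(top_hits, scores.items(), key=lambda pair: pair[1])


# Made-up similarity values for illustration only
scores = {"mol_a": 0.82, "mol_b": 0.55, "mol_c": 0.61, "mol_d": 0.30}
top_two = select_hits(scores, top_hits=2)
above_cutoff = select_hits(scores, cutoff=0.5)
print(top_two)
print(above_cutoff)
```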

When using the non-symmetrical Tversky similarity, the Tversky alpha parameter (--tver_alpha, weighs the query molecule, default is 0.5) and the Tversky beta parameter (--tver_beta, weighs the database molecule, default is 0.5) can additionally be adjusted:

vsflow fpsim -i sample.sdf -d fda_std.vsdb -o sample_fda_tver.sdf -s tver --tver_alpha 0.7 --tver_beta 0.3
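The asymmetry of the Tversky similarity can be seen in a small plain-Python sketch that uses sets of on-bit indices. The function name tversky and the argument names first and second are hypothetical; the sketch deliberately does not assume which of the two corresponds to the query and which to the database molecule in VSFlow.

```python
def tversky(first, second, alpha=0.5, beta=0.5):
    """Tversky index for two sets of on-bit indices.

    alpha weighs bits found only in `first`, beta weighs bits found only
    in `second`.
    """
    common = len(first & second)
    only_first = len(first - second)
    only_second = len(second - first)
    denom = alpha * only_first + beta * only_second + common
    return common / denom if denom else 0.0


# Made-up on-bit sets: `small` is fully contained in `large`
small = {1, 2}
large = {1, 2, 3, 4, 5, 6}
print(tversky(small, large, alpha=1.0, beta=0.0))
print(tversky(large, small, alpha=1.0, beta=0.0))
print(tversky(small, large, alpha=1.0, beta=1.0))
```

Swapping the two arguments changes the result unless alpha equals beta. With alpha = beta = 1 the formula reduces to the Tanimoto similarity, and with alpha = beta = 0.5 to the Dice similarity.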

When providing numerous queries at the same time, it might be useful to generate separate output files for each query molecule with the -mf/--multfile flag:

vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -o sample_sim_pdb.xlsx -mf

Instead of SD files, the query molecules may also be read from csv or Excel files (containing the molecules as SMILES or InChI representations). VSFlow will try to recognize the separator (for csv files) as well as the column containing the SMILES/InChI automatically. However, if this fails, it is possible to specify all necessary parameters separately:

vsflow fpsim -i sample.csv -d pdb_ligs.vsdb --delimiter ";" --mol_column smiles

SMILES (-smi/--smiles) can also be specified directly on the command-line, e.g.:

vsflow fpsim -smi "NCCC1=C(CCC2CCCC2)C=CN=C1" -d pdb_ligs.vsdb -o smiles_sim_pdb.xlsx

Please make sure to put the SMILES in double quotes!

It is also possible to specify multiple SMILES at the same time, e.g.:

vsflow fpsim -smi "NCCC1=C(CCC2CCCC2)C=CN=C1" -smi "OC1=C(CC2CCCCC2)C=NC=C1" -d fda.sdf -o fda_sim.sdf

VSFlow also offers the possibility to directly filter the screening results by some common physicochemical properties using the --filter argument. The following table summarizes the supported properties and their shortcuts together with some examples:

Property table

property	shortcut	example
Molecular Weight in Da	mw	mw_300
cLogP value	logp	logp_1.5
number of H donors	nhdo	nhdo_3
number of H acceptors	nhac	nhac_5
number of rotatable bonds	nrob	nrob_5
number of aromatic rings	naro	naro_3
number of heteroaromatic rings	nhet	nhet_3
topological polar surface area in Å²	tpsa	tpsa_80

Please make sure to always separate the shortcut and the value with an underscore, as shown in the examples in the table. Please note that the specified values are always considered as upper limit, e.g. nhac_5 means that only compounds with a maximum of 5 hydrogen acceptors are considered. In the following example, only those molecules with a Dice similarity higher than 0.5 and with a maximum cLogP of 2 are written to the output file, if any:

vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -s dice -c 0.5 --filter logp_2

You can quickly check if the filter worked by writing all calculated properties from the table to the output files (-p/--properties flag, see table above), in this case an excel file for convenience:

vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -o sample_sim_filt.xlsx -s dice -c 0.5 -p --filter logp_2

It is also possible to specify multiple properties at the same time, e.g.:

vsflow fpsim -i sample.sdf -d fda.sdf -o sample_sim_filt.xlsx --filter logp_2 --filter mw_400

By default, the chirality of molecules is considered when generating fcfp, ecfp, atom pairs and topological torsion fingerprints, i.e. two enantiomers will have different fingerprints. This can be changed with the --no_chiral flag:

vsflow fpsim -i sample.sdf -d fda.sdf --no_chiral

VSFlow offers the possibility to visualize the screening results as PDF file (--pdf flag) with 2D depictions of the molecules. In the following example, a PDF file containing the screening results along with some annotations, including the similarity value, the used fingerprint and the used similarity metric is generated in addition to the SD file:

vsflow fpsim -i sample.sdf -d fda.sdf -o sample_sim.sdf --pdf

If the --simmap flag is additionally specified, RDKit similarity maps are generated along with the 2D depiction of the molecule in the PDF file:

vsflow fpsim -i sample.sdf -d fda.sdf -o sample_sim_maps.sdf --pdf --simmap

Note: similarity maps cannot be generated for MACCS keys or when using the Tversky similarity metric. In these cases, only a 2D depiction of the molecule is written to the PDF file.

With the -p/--properties flag, all properties listed in the above table are additionally annotated for every result molecule in all output files:

vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -o sample_sim_maps.sdf -p --pdf --simmap

When screening numerous query molecules in large compound databases, it might be useful to parallelize the similarity search on multiple cores/threads with the -np/--nproc argument, e.g.:

vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -np 12

Python's built-in multiprocessing module is used for parallelization.

Using pre-calculated fingerprints

For large, frequently used compound databases it might be useful to calculate the respective fingerprint for each compound only once and store it along with the molecule in the database. This will speed up subsequent similarity searches since only the fingerprints for the query molecules have to be generated during the search. You can use the preparedb mode of VSFlow to pre-calculate fingerprints and store them within the virtual screening database (.vsdb) file. As an example, you can directly download all ligands from the PDB database (a working internet connection is required) and calculate the respective fingerprints as follows (see Page "Prepare Databases" for more information):

vsflow preparedb -d pdb -o pdb_fps -f ecfp -r 3 -nb 4096

The above command will download the PDB ligands, generate a 4096 bit ECFP-6-Morgan fingerprint for all ligands and store the fingerprint along with the molecule in a virtual screening database file named pdb_fps.vsdb in the examples folder.

Of course you can also use a file as input, e.g. the FDA drugs in the examples folder:

vsflow preparedb -i fda.sdf -o fda_fps -f ecfp -r 3 -nb 4096

Now, you can use the databases with pre-calculated fingerprints for the similarity search by passing "from_db" to the -f/--fingerprint argument, for example as follows:

vsflow fpsim -i sample.sdf -o sample_sim.xlsx -d pdb_fps.vsdb -f from_db

VSFlow now reads the fingerprint parameters from the database file and calculates the fingerprints for the query molecules accordingly, then performs the similarity calculations.

Usage of -m/--mode argument

The similarity search can be performed using different modes (std, all_tauts, can_taut, no_std) which can be specified with the -m/--mode argument. By default, the mode "std" is selected. To properly use the different modes, the database should be standardized with the "preparedb" functionality of VSFlow (see Page 'Prepare Databases'). Here is an overview of what the respective modes do:

std: standardizes query molecules (molvs standardization, salt removal and charge neutralization), uses standardized database molecules for similarity search if database is standardized
all_tauts: standardizes query molecules and generates a maximum of n tautomers (default: n=100, can be changed with -nt/--ntauts argument) for all query molecules, performs similarity search with all query tautomers on standardized database molecules if database is standardized
can_taut: standardizes query molecules and generates the canonical tautomer (according to molvs) for each query molecule, performs the similarity search with the canonical query tautomer on the canonical tautomer of the database molecules if database is standardized
no_std: no standardization of query molecules

vsflow fpsim -i sample.sdf -d pdb_ligs.vsdb -o sample_canon.sdf -m can_taut

When using the mode "all_tauts", all possible tautomers (up to 100 by default, may be changed with -nt/--ntauts argument) of the query molecules are used for the similarity search. This may help to identify similar compounds particularly if the database is not standardized.

5-Shape-Screening

Shape-based Screening

VSFlow contains a tool to screen compound databases based on shape similarity between query molecules and database molecules. For that, the database must contain molecules with 3D information, typically several conformers per molecule. The query molecules can be either 2D or 3D. If the queries are 2D, VSFlow generates one (or more, if desired) energy-minimized 3D conformer(s) for each query molecule. Conformers for the database molecules can be generated with the "preparedb" functionality of VSFlow, if necessary (see Page "Prepare Databases"). The shape screening is done by combining several functionalities implemented in the RDKit. The single steps are outlined in the following:

Step 1 (if necessary): Generation of 3D conformers (RDKit ETKDGv3) and subsequent optimization and energy-minimization (MMFF94 force field) of the query molecules (if queries are 2D)
Step 2: Alignment of all conformers of each query molecule to all conformers of every database molecule based on either MMFF94 forcefield parameters or Crippen logP contributions (RDKit Open3DAlign)
Step 3: Calculating the shape similarity between query and database molecule for all alignments (RDKit rdShapeHelpers), select the most similar conformers for each query/database molecule pair
Step 4: Calculating a 3D pharmacophore fingerprint for each most similar conformer pair (RDKit Pharm2D)
Step 5: Calculating the similarity between each most similar conformer pair based on the calculated 3D pharmacophore fingerprints
Step 6: Combining shape similarity and 3D pharmacophore similarity to a Combo score ((shape similarity + pharmacophore similarity) / 2)
Step 7: Sorting database molecules based on Combo score

In Step 1, 3D conformers for the 2D query molecules are generated and optimized with the MMFF94 forcefield. A user-specified number of low energy conformer(s) (by default only the lowest energy conformer) are returned for the next steps. In Step 2, all conformers of each query molecule are aligned to all conformers of each database molecule (MMFF94 force field parameters or Crippen atomic logP contributions). In Step 3, for every conformer pair the shape similarity is calculated (TanimotoDist, TverskyShape or ProtrudeDist) and the most similar conformer pair for every query/database molecule pair is selected. In Step 4, for the most similar conformer pair selected in Step 3 a 3D pharmacophore fingerprint is generated and in step 5 the fingerprint similarity is calculated. In Step 6, a "Combo" score is calculated as average of the shape similarity and pharmacophore similarity. This combo score is used to rank the database molecules in Step 7.
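Steps 6 and 7 can be sketched in a few lines of Python. This is only an illustration with made-up similarity values; the function names are hypothetical and it is not the code used by VSFlow.

```python
# Illustrative sketch with made-up scores; not VSFlow's actual code.

def combo_score(shape_sim, pharm_sim):
    """Step 6: average of shape and pharmacophore similarity."""
    return (shape_sim + pharm_sim) / 2


def rank_by_combo(hits, top_hits=10):
    """Step 7: sort (name, shape_sim, pharm_sim) tuples by Combo score."""
    scored = [(name, combo_score(s, p)) for name, s, p in hits]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_hits]


# Hypothetical best conformer pair scores for three database molecules.
hits = [
    ("mol_1", 0.80, 0.40),
    ("mol_2", 0.60, 0.70),
    ("mol_3", 0.90, 0.10),
]
for name, score in rank_by_combo(hits, top_hits=2):
    print(name, round(score, 2))
```

Here mol_3 has the highest shape similarity but is ranked last, because its low pharmacophore similarity pulls the Combo score down.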

The main usage of the tool may be the screening of query molecules in a single, biologically active conformation – for example ligands from the PDB database – against a database containing compounds in multiple conformations.

General Usage

vsflow shape {arguments}

To display the help with all available arguments:

vsflow shape -h

Arguments

-h, --help
show this help message and exit

Required:

-i , --input
specify path/name of input file
OR
-smi , --smiles
specify SMILES string on command line in double quotes
-d , --database
specify path of the database file [.sdf/.sdf.gz or .vsdb] or specify the shortcut for an integrated database

Optional:

-o , --output
specify name of output file [default: shape.sdf]
-np , --nproc
Specify the number of processors to run the application in multiprocessing mode
-t , --top_hits
Maximum number of molecules with highest score to keep [default: 10]
-a , --align_method
select method for molecular alignment [mmff, crippen], [default: mmff]
-s , --score
select score to be used to rank the results
-c , --cutoff
if specified, only molecules with score above cutoff value are written to the output file(s)
--seed
specify seed for random number generator, for reproducibility
--keep_confs
if queries are 2D: number of conformations per query molecule to keep after energy minimization [default: 1]
--nconfs
if queries are 2D: maximum number of query conformations to be enumerated [default: 100]
--boost
distributes query conformer generation and 3D alignment on all available cores/threads of your system
--pharm_feats
select pharmacophore feature definitions to be used for calculation of 3D pharmacophore fps [gobbi, basic, minimal], [default: gobbi]
--shape_simi
specify measure to be used to compare shape similarity [tan, protr, tver], [default: tver]
--fp_simi
specify measure to be used to calculate similarity from 3D pharmacophore fps [tan, dice, cos, sok, russ, kulc, mcco, tver], [default: tan]
--tver_alpha
specify alpha parameter (weighs database molecule) for Tversky similarity [default: 0.5]
--tver_beta
specify beta parameter (weighs query molecule) for Tversky similarity [default: 0.5]
--pdf
generate a pdf file for the results with 2D structures and annotations
--pymol
generate PyMOL file with 3D conformations for results
--mol_column
specify name (or position) of mol column [SMILES/InChI] in csv/xlsx file, if not automatically recognized
--delimiter
specify delimiter of csv file if not automatically recognized

Detailed Usage

Navigate into the examples folder.
To perform a basic shape screening, simply specify an input file for the query molecules and a database file. The database file must contain 3D information for the molecules, the query file can also be 2D:

vsflow shape -i {.sdf, .xlsx, .csv, .smi, .txt, .ich} -d {.sdf, .sdf.gz, .vsdb}

The example file containing all approved drugs downloaded from ZINC database in the examples folder (fda.sdf) contains 3D information for most of the molecules (but not for all). It may be used as the database file, VSFlow will skip the 2D structures in the file. As query, the ligand from PDB entry 2BML (XED.sdf in the examples folder) in its bioactive conformation could be used:

vsflow shape -i XED.sdf -d fda.sdf -o XED_shapesim

It is sufficient to specify a prefix for the output file. The results are always written to an SD file and in case the query file contains multiple molecules, a separate output file for the results of each query molecule is always generated. In the above case, one file named XED_shapesim_1.sdf is generated containing the 10 most similar database compounds with the 3D coordinates of the preferred conformer. Additionally, an SD file containing the 3D coordinates of the query molecule is always generated (named XED_shapesim_1_query.sdf).
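The naming scheme of the output files can be summarized with the following Python sketch. It only mirrors the file names mentioned in this wiki; the function name is hypothetical, and it is an assumption that a trailing ".sdf" in the -o value is dropped before the query number is appended.

```python
# Illustrative sketch of the output naming scheme; not VSFlow's actual code.

def shape_output_names(prefix, n_queries):
    """Return (hits_file, query_file) names for each query, numbered from 1."""
    # Assumption: a trailing ".sdf" in the -o value is dropped before numbering.
    stem = prefix[:-4] if prefix.endswith(".sdf") else prefix
    return [
        (f"{stem}_{i}.sdf", f"{stem}_{i}_query.sdf")
        for i in range(1, n_queries + 1)
    ]


print(shape_output_names("XED_shapesim", 1))
# [('XED_shapesim_1.sdf', 'XED_shapesim_1_query.sdf')]
```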

In the above example, only one conformer per molecule is present in the database file. In a typical scenario, the database may contain multiple conformers of the same molecule. VSFlow will interpret conformers in a query SD file as part of a single multi-conformer molecule as long as they are contiguous in the file and have the same canonical SMILES. You can create multiple conformers for a single molecule with the preparedb mode of VSFlow (see Page "Prepare Databases" for more information). The file fda_confs.vsdb in the examples folder, containing 20 conformers per molecule (32300 in total), was generated as follows:

vsflow preparedb -i fda.sdf -o fda_confs -c -np 12
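The rule that only contiguous entries with the same canonical SMILES are treated as conformers of one molecule can be pictured with Python's itertools.groupby, which merges consecutive items with equal keys. The records below are hypothetical, and this is a sketch of the idea, not VSFlow's SD file reader.

```python
from itertools import groupby

# Hypothetical (canonical SMILES, conformer label) records in file order.
records = [
    ("CCO", "conf_a"),
    ("CCO", "conf_b"),
    ("c1ccccc1", "conf_c"),
    ("CCO", "conf_d"),  # same SMILES, but not contiguous with the first two
]

# groupby only merges consecutive records that share the same key.
molecules = [
    (smiles, [label for _, label in group])
    for smiles, group in groupby(records, key=lambda record: record[0])
]

for smiles, confs in molecules:
    print(smiles, confs)
```

In this sketch the last entry forms a molecule of its own, because it is separated from the other two ethanol conformers by a different structure.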

You can also try this file for a shape search:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim

The above search takes a couple of minutes because it is only run on a single core. It is generally recommended to parallelize the shape search on multiple cores using the -np/--nproc argument to speed things up, e.g. on 6 cores/threads:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 6

VSFlow makes sure that you do not use more cores than are available on your system. Python's built-in multiprocessing tools are used for parallelization in the first place. However, to further speed things up, it is additionally possible to use all available threads of your machine for the 3D alignment step via the C++ code of RDKit. To do so, simply specify the --boost flag:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 6 --boost

Caveat: Please make sure you do not run other demanding jobs at the same time, because your machine may slow down! Parallelizing only the alignment step via RDKit's C++ code is also possible when using only one core:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim --boost

You may see how the run time differs for the above examples on your machine.
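Limiting the requested number of processes to the cores that are actually present could look like the following Python sketch. The function name is hypothetical and the exact check performed by VSFlow may differ.

```python
import os


def effective_nproc(requested):
    """Clamp the requested process count to the range 1..available cores."""
    # os.cpu_count() may return None, in which case one core is assumed.
    available = os.cpu_count() or 1
    return max(1, min(requested, available))


print(effective_nproc(6))
```

On a machine with fewer than 6 cores/threads, the sketch returns the number of available cores instead of 6.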

By default, the 10 most similar compounds for each query molecule are written to the output SD file. This may be changed with -t/--top_hits argument, e.g. to return the 20 most similar compounds:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -t 20

The default method for the alignment of the query conformer(s) to the database conformers is via MMFF atom types and charges. This may be changed to use calculated Crippen atomic logP contributions for the alignment instead with the -a/--align_method argument:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -a crippen

The Combo score (see above) is used by default to identify the most similar compounds. The combo score is the mean value of the shape similarity and the 3D pharmacophore fingerprint similarity. Instead, it is also possible to use the shape similarity or 3D pharmacophore similarity on its own via the -s/--score argument:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 -s shape

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 -s pharmacophore

Instead of simply returning the n most similar compounds, it is also possible to provide a cutoff value (between 0 and 1) for the respective score using the -c/--cutoff argument:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 -s shape -c 0.5

In the above case, only database compounds with a Tanimoto shape similarity higher than 0.5 are written to the output file(s), if any.

The shape similarity is calculated according to the Shape Tanimoto metric by default. Instead, the Shape Protrude (shortcut: protr) or Shape Tversky metric (shortcut: tver) may be used with the --shape_simi argument:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --shape_simi protr

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --shape_simi tver

When using the non-symmetrical Shape Tversky metric, the alpha parameter (default = 0.5, weighs the query molecule) and the beta parameter (default = 0.5, weighs the database molecule) can additionally be adjusted, e.g.:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --shape_simi tver --tver_alpha 0.95 --tver_beta 0.05
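The effect of the two Tversky parameters can be seen in a small Python sketch. For simplicity it works on plain sets of features rather than on molecular volumes, and it does not model which molecule is the query; it is an illustration of the Tversky index, not VSFlow's or RDKit's implementation.

```python
# Illustrative sketch of the Tversky index on feature sets.

def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky index: common / (common + alpha * only_a + beta * only_b)."""
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 0.0


a = {1, 2, 3, 4}
b = {3, 4, 5}

print(tversky(a, b, 1.0, 1.0))    # equals the Tanimoto coefficient
print(tversky(a, b, 0.5, 0.5))    # equals the Dice coefficient
print(tversky(a, b, 0.95, 0.05))  # asymmetric: differs from the next line
print(tversky(b, a, 0.95, 0.05))
```

With unequal alpha and beta, swapping the two inputs changes the result, which is why the metric is called non-symmetrical.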

In a comparable manner, the metric used to calculate the 3D pharmacophore fingerprint similarity can be adjusted independently:

Similarity Metrics for 3D pharmacophore fingerprints

tan: Tanimoto Similarity
dice: Dice Similarity
cos: Cosine Similarity
sok: Sokal Similarity
russ: Russel Similarity
kulc: Kulczynski Similarity
mcco: McConnaughey Similarity
tver: Tversky Similarity

Simply pass the respective shortcut to the --fp_simi argument, for example:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --fp_simi dice

The pharmacophore fingerprint may be further customized regarding the pharmacophore features to be used for its generation. By default, the pharmacophore definitions by Gobbi et al. are used. It is also possible to use some basic or minimal definitions from the RDKit instead with the --pharm_feats argument, e.g.:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --pharm_feats basic

You can have a deeper look into the pharmacophore definition files (.fdef) in the resources folder in the repository to see which SMARTS are used to define pharmacophores in each case.

VSFlow offers the possibility to quickly visualize the shape screening results as PDF (--pdf flag) and/or PyMOL (--pymol) file:

vsflow shape -i XED.sdf -d fda_confs.vsdb -o XED_sim -np 12 --pdf --pymol

The PDF file contains 2D depictions of the molecules. Additionally, the combo score, the shape similarity coefficient and the 3D pharmacophore fingerprint similarity value are annotated. The PyMOL file visualizes the 3D overlay of the query molecule and the returned database molecules.

Usage with 2D query molecules

The query compounds can also be provided as 2D representations. VSFlow automatically recognizes the 2D structures and generates 3D conformers for them, i.e. it is in principle possible to specify a mixed 2D/3D input file as query (although this rarely makes sense). By default, 100 conformers for each 2D query molecule are generated and optimized with the MMFF94 force field. Only the lowest energy conformer is used for the subsequent shape screening by default. This may be changed with the --nconfs and --keep_confs arguments:

vsflow shape -i sample2D.sdf -d fda.sdf -o sample_shapesim -np 12 --nconfs 300 --keep_confs 5

In the above example, 300 conformers are generated and optimized and the 5 conformers with the lowest energy are used for the subsequent shape screening.

To generate reproducible conformers, a seed for the random number generator can be specified with the --seed argument:

vsflow shape -i sample2D.sdf -d fda.sdf -o sample_shapesim -np 12 --seed 42
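The interplay of --nconfs, --keep_confs and --seed can be pictured with the following Python sketch. The conformer energies are random stand-ins drawn from a seeded generator, since real conformer generation requires the RDKit; all function names are hypothetical.

```python
import heapq
import random


def lowest_energy_conformers(energies, keep_confs=1):
    """Return the ids of the keep_confs conformers with the lowest energy."""
    return heapq.nsmallest(keep_confs, energies, key=energies.get)


def fake_energies(nconfs, seed):
    """Stand-in for conformer generation: random energies from a seeded RNG."""
    rng = random.Random(seed)
    return {conf_id: rng.uniform(20.0, 60.0) for conf_id in range(nconfs)}


run_1 = lowest_energy_conformers(fake_energies(300, seed=42), keep_confs=5)
run_2 = lowest_energy_conformers(fake_energies(300, seed=42), keep_confs=5)
print(run_1 == run_2)  # True: the same seed reproduces the same selection
```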

It is also possible to directly specify SMILES representations of the query molecules on the command line with the -smi/--smiles argument:

vsflow shape -smi "CNCCC1=C(C2=CC=CC=C2)C=CN=C1" -d fda.sdf -o sample_shapesim -np 12 --keep_confs 3

Please make sure to put the SMILES in double quotes!

6-Manage-Databases

Manage Integrated Databases

As already outlined on Page "Prepare Databases", frequently used databases may be "integrated" into VSFlow. They are then stored in the folder "VSFlow_Databases" in the user's HOME directory ($HOME/VSFlow_Databases) or in the folder "Databases" in the VSFlow repository/installation folder ($Repository/Databases), e.g. the following will download all PDB ligands and store them as file named pdb.vsdb in the "VSFlow_Databases" folder in the HOME directory (see Page "Prepare Databases" for more information):

vsflow preparedb -d pdb -int pdb

You could also integrate the FDA drugs provided in the examples folder, additionally standardize them and generate fingerprints:

vsflow preparedb -i fda.sdf -int fda -s -f ecfp -np 6

You can now use the mode managedb to interact with the integrated databases.

General Usage

vsflow managedb {arguments}

To show all arguments:

vsflow managedb -h

Arguments

-h, --help
show this help message and exit
-s, --show
show currently integrated databases in VSFlow
--set_default
specify name/shortcut of database to be set as default
--remove
specify name/shortcut of database to be removed
--set_global
change path of folder where integrated databases which should be shared between different users are stored. Useful if VSFlow runs on a server. Default path is path/to/repository/Databases
--set_local
change path of folder where the user's integrated databases are stored, default is $HOME/VSFlow_Databases

Detailed Usage

You can show all integrated databases (that means all .vsdb files in the folder "VSFlow_Databases" in the HOME directory) as follows:

vsflow managedb -s

This will show you a table listing the databases with their name/shortcut, date created, standardized yes or no, canonical tautomer included yes or no, number of conformers per molecule, calculated fingerprints, and total number of compounds.

You can also move .vsdb files manually to the folder. They are then also considered as "integrated" and are listed when using the above command.

One advantage of integrated databases is the fact that you can use them from throughout the system by only passing the shortcut/name to -d/--database argument, you do not need to specify the full path.

You can customize two paths for integrated databases:

vsflow managedb --set_local new/path/to/integrate/database

The above command will change the local database path ($HOME/VSFlow_Databases) to the new path.

vsflow managedb --set_global new/path/to/integrate/database/globally

The above command will set a global database path. This may be useful if VSFlow is run on a server and some databases should be accessible for different users.

It is also possible to remove integrated databases directly using VSFlow. Simply pass the database shortcut to the --remove argument:

vsflow managedb --remove shortcut/name

This will also delete the file from the disk, so you could also manually delete the file instead.

If you have a preferred database, you can also set it as the default database:

vsflow managedb --set_default shortcut/name

Now, if you do not specify the -d/--database argument (in mode substructure, fpsim and shape), this database is used by default.

7-Supported-File-Formats

File Formats

Query input files

The following files are supported using the -i/--input argument:

.sdf
.sdf.gz
.csv (containing SMILES or InChI)
.csv.gz
.tsv (containing SMILES or InChI)
.tsv.gz
.txt (containing SMILES or InChI)
.txt.gz
.xlsx (containing SMILES or InChI)
.smi (Example: http://ligand-expo.rcsb.org/dictionaries/Components-smiles-stereo-oe.smi)
.ich (Example: http://ligand-expo.rcsb.org/dictionaries/Components-inchi.ich)
.sma (see file sample.sma in the examples folder)
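Checking whether a query file has one of the extensions listed above could be done as in the following Python sketch. The function name is hypothetical, and comparing the extension case-insensitively is an assumption, not documented VSFlow behavior.

```python
# Illustrative sketch; the tuple mirrors the query formats listed above.
QUERY_FORMATS = (
    ".sdf.gz", ".csv.gz", ".tsv.gz", ".txt.gz",
    ".sdf", ".csv", ".tsv", ".txt", ".xlsx", ".smi", ".ich", ".sma",
)


def query_format(filename):
    """Return the matching extension of a query file, or None if unsupported."""
    name = filename.lower()  # assumption: extensions are matched case-insensitively
    for ext in QUERY_FORMATS:
        if name.endswith(ext):
            return ext
    return None


print(query_format("sample.sdf"))     # .sdf
print(query_format("library.sdf.gz")) # .sdf.gz
print(query_format("ligand.mol2"))    # None
```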

Database files

The following files are supported using the -d/--database argument:

.sdf
.sdf.gz
.vsdb (see Page "Prepare Databases" how to generate the file)

Output files

.sdf
.csv (not in mode shape)
.xlsx (not in mode shape)
.pdf
.pse (only in mode shape)

8-Use-cases-from-publication

Use cases from publication

All files can be found in the examples folder in this GitHub repository. For further explanation, please read the paper (in progress).

database fda.sdf: FDA-approved drugs from the ZINC database, >1600 mols
sd file 2gqg_C_1N1.sdf: from PDB

Substructure Search

vsflow substructure -sma "s:1:c:n:c:c:1" -d fda.sdf -o substructure.sdf --pdf

Searches for a thiazole substructure in the drugs. Generates an SD file (substructure.sdf) as well as a PDF file (substructure.pdf) as output. For more information, see Chapter 3.

Fingerprint Similarity

vsflow fpsim -d fda.sdf -o fingerprint.xlsx --pdf --simmap -smi "Cc1cccc(c1NC(=O)c2cnc(s2)Nc3cc(nc(n3)C)N4CCN(CC4)CCO)Cl"

Default parameters (see Chapter 4). Creates excel (fingerprint.xlsx) and pdf output files (fingerprint.pdf).

Shape Similarity

Prepare database

vsflow preparedb -i fda.sdf -np 8 -c -o fda_multiple_confs.vsdb

Multiple conformers needed for shape screening. Chapter 2 provides more information on how to prepare databases.

Run shape screen

vsflow shape -i 2gqg_C_1N1.sdf -np 8 -d fda_multiple_confs.vsdb -o shape.sdf --pymol

Default parameters (see Chapter 5 for more details); creates two sd files and a pymol session file: query molecule (shape_1_query.sdf), found hits (shape_1.sdf), hits and query molecule (shape_1.pse).