Optuna Pt.2


Track Optuna Hyperparameter Tuning with MLflow


Author: Marcel Baltruschat (@GitHub)
Date: 18.03.2022
License: MIT


Installation with Conda

conda create -n opt_ml -c conda-forge python=3.9 jupyterlab rdkit scikit-learn optuna mlflow ipywidgets multiprocess pytorch

Remark 1: ipywidgets is not used directly, but some imports trigger a warning if it is not installed.
Remark 2: multiprocess is a fork of Python's multiprocessing module that supports interactive usage of Pool on macOS.
Remark 3: On Windows, Pool-based multiprocessing does not seem to work in Jupyter notebooks and is therefore disabled.


Imports and Settings

In [1]:
import platform
OS = platform.system()

if OS == 'Linux':
    from multiprocessing import Pool
elif OS == 'Darwin':
    from multiprocess import Pool

import sys
import warnings
from subprocess import Popen

import mlflow
import numpy as np
import optuna
import pandas as pd
import rdkit
import torch
from rdkit.Chem import AllChem as Chem, Descriptors, Crippen
from rdkit.DataStructs import ConvertToNumpyArray
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from torch import nn, optim
from torch.utils.data import DataLoader, Dataset
In [2]:
optuna.logging.set_verbosity(optuna.logging.WARNING)

random_seed = 42
num_cores = 12

Used Versions

In [3]:
print(f'Python: {sys.version.split("|")[0]}\nMLflow: {mlflow.__version__}\nOptuna: {optuna.__version__}\nRDKit: {rdkit.__version__}\nPyTorch: {torch.__version__}')
Python: 3.9.10 
MLflow: 1.24.0
Optuna: 2.10.0
RDKit: 2021.09.5
PyTorch: 1.10.2


Start MLflow Server

The server is terminated when the kernel is stopped, restarted, or interrupted.

In [4]:
# Adjust host and port as necessary
mlflow_proc = Popen(['mlflow', 'ui'])  # '-h', '0.0.0.0', '-p', '8891'
[2022-03-18 09:02:29 +0100] [48103] [INFO] Starting gunicorn 20.1.0
[2022-03-18 09:02:29 +0100] [48103] [INFO] Listening at: http://127.0.0.1:5000 (48103)
[2022-03-18 09:02:29 +0100] [48103] [INFO] Using worker: sync
[2022-03-18 09:02:29 +0100] [48104] [INFO] Booting worker with pid: 48104

You can already visit http://localhost:5000
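
If the server should be stopped without touching the kernel, the process handle can also be terminated explicitly. The following is a minimal sketch: `stop_process` is a hypothetical helper (not part of MLflow), and a sleeping Python child process stands in for `mlflow ui` so that the snippet runs without a server.

```python
import subprocess
import sys


def stop_process(proc, timeout=10):
    """Ask a child process to terminate and wait until it has exited.

    Hypothetical helper for illustration; falls back to kill() if the
    process does not exit within `timeout` seconds.
    """
    if proc.poll() is None:  # process is still running
        proc.terminate()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


# Stand-in for Popen(['mlflow', 'ui']): a child process that just sleeps
demo_proc = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])
exit_code = stop_process(demo_proc)
print('process stopped:', demo_proc.poll() is not None)
```

In this notebook the equivalent call would be `stop_process(mlflow_proc)`.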


Loading Example Dataset

The original dataset was published by Ogura et al. (2019) [1].
The only changes made were the conversion from XLSX to CSV and the filtering out of all molecules with invalid valences.

In [5]:
df = pd.read_csv('datasets/example_dataset_hERG.csv', names=['smi', 'act'], header=0)
print(len(df))
df.head(2)
190464
Out[5]:
smi act
0 CCOC(=O)C1CCN(CC1)C(C)C(=O)c2c(C)[nH]c3cc(C)ccc23 1
1 COc1ccc(NC(=O)N2CCC3(CCN(C)CC3)CC2)cc1F 1
In [6]:
df.act.value_counts()
Out[6]:
0    184044
1      6420
Name: act, dtype: int64

=> Very unbalanced dataset
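
The degree of imbalance can be quantified directly from the two counts printed above. A small sketch with illustrative variable names:

```python
# Class counts taken from the value_counts() output above
n_inactive = 184044  # act == 0
n_active = 6420      # act == 1

imbalance_ratio = n_inactive / n_active
active_fraction = n_active / (n_inactive + n_active)

print(f'inactive : active = {imbalance_ratio:.1f} : 1')
print(f'active fraction   = {active_fraction:.2%}')
```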

Perform Undersampling Based on MolWt and LogP

In [7]:
def data_from_smi(smi):
    mol = Chem.MolFromSmiles(smi)
    mw_logp = Descriptors.MolWt(mol) / 100 * Crippen.MolLogP(mol)
    return mol, mw_logp
In [8]:
if OS == 'Windows':
    res = np.array(list(map(data_from_smi, df.smi)))
else:
    with Pool(num_cores) as p:
        res = np.array(p.map(data_from_smi, df.smi))
df['ROMol'], df['mw_logp'] = res[:, 0], res[:, 1]
df.sort_values('mw_logp', inplace=True)
In [9]:
act0 = df.query('act == 0').reset_index(drop=True)
ix = np.linspace(0, len(act0) - 1, num=len(df) - len(act0), dtype=np.int32)
df = pd.concat([df.query('act == 1'), act0.loc[ix]])
df.act.value_counts()
Out[9]:
1    6420
0    6420
Name: act, dtype: int64
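
The undersampling step keeps evenly spaced rows of the inactive molecules after sorting by `mw_logp`, so the retained subset spans the whole descriptor range. Below is a dependency-free sketch of the same index selection on toy data. `evenly_spaced_indices` is a hypothetical stand-in for `np.linspace(..., dtype=np.int32)`, which truncates the spaced positions to integers in a similar way.

```python
def evenly_spaced_indices(n_items, n_pick):
    """Return n_pick indices spread evenly over range(n_items), endpoints included."""
    if n_pick == 1:
        return [0]
    step = (n_items - 1) / (n_pick - 1)
    return [int(i * step) for i in range(n_pick)]


# Toy data: 20 already sorted "inactive" values, of which 5 should be kept
sorted_values = [round(0.5 * i, 1) for i in range(20)]
picked = evenly_spaced_indices(len(sorted_values), 5)
subset = [sorted_values[i] for i in picked]
print(picked)
print(subset)
```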

Calculating Morgan Fingerprints (FCFP6)

In [10]:
def mol_to_FCFP6(mol):
    fp = Chem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=True)
    ar = np.empty(2048, dtype=np.uint8)
    ConvertToNumpyArray(fp, ar)
    return ar
In [11]:
x_data = np.array(list(map(mol_to_FCFP6, df.ROMol)))
x_data.shape
Out[11]:
(12840, 2048)

Split Into Training and Test Datasets

In [12]:
y_data = np.array(df.act)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data,
                                                    test_size=0.1,
                                                    stratify=y_data,
                                                    random_state=random_seed,
                                                    shuffle=True)
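
`stratify=y_data` keeps the class proportions the same in the training and test sets. The idea can be sketched without scikit-learn as a per-class shuffle and split. `stratified_split` is a hypothetical helper for illustration only and is not how `train_test_split` is implemented.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Split sample indices so that each class sends about test_size of its members to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 50 actives and 50 inactives, 10 % held out
toy_labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(toy_labels, test_size=0.1, seed=42)
n_test_active = sum(toy_labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_test_active)
```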

Optimize Models with Optuna and Track Results with MLflow

In [13]:
# Activate autologging for Scikit-learn
mlflow.sklearn.autolog()

Scikit-Learn Random Forest (with auto logging)

To avoid evaluating duplicate parameter sets, uncomment the commented-out lines in the objective function.

In [14]:
# seen_param = []

def rf_obj(trial):
    param = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 1000, step=50),
        'max_depth':    trial.suggest_int('max_depth', 1, 20),
        'criterion':    trial.suggest_categorical('criterion', ['gini', 'entropy']),
        'max_features': trial.suggest_categorical('max_features', ['auto', 'sqrt', 'log2', None]),
        'random_state': random_seed,
        'n_jobs':       num_cores,
    }
    # if param in seen_param:
    #     raise optuna.exceptions.TrialPruned()
    # else:
    #     seen_param.append(param)
    metrics = {}
    model = RandomForestClassifier(**param)
    with mlflow.start_run():
        model.fit(x_train, y_train)
        metrics['test_kappa'] = cohen_kappa_score(y_test, model.predict(x_test))
        metrics['training_kappa'] = cohen_kappa_score(y_train, model.predict(x_train))
        mlflow.log_metrics(metrics)
    return metrics['test_kappa']
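
The commented-out duplicate check compares each new parameter dictionary against a list of earlier ones. The sketch below shows the same idea with a set of serialized parameter sets, which keeps the lookup cheap as the number of trials grows. `is_duplicate` is a hypothetical helper; inside an objective, a `True` result would be followed by `raise optuna.exceptions.TrialPruned()` as in the commented lines.

```python
import json


def is_duplicate(param, seen):
    """Return True if an identical parameter set was registered before; otherwise register it."""
    # sort_keys makes the key independent of dictionary insertion order
    key = json.dumps(param, sort_keys=True)
    if key in seen:
        return True
    seen.add(key)
    return False


seen_param = set()
first = {'n_estimators': 100, 'max_depth': 5, 'criterion': 'gini', 'max_features': None}
second = {'max_depth': 5, 'n_estimators': 100, 'criterion': 'gini', 'max_features': None}
third = {'n_estimators': 150, 'max_depth': 5, 'criterion': 'gini', 'max_features': None}

results = [is_duplicate(p, seen_param) for p in (first, second, third)]
print(results)
```

Serializing to JSON also handles list-valued entries such as `hidden_layer_sizes`, which could not be stored in a set directly.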
In [15]:
# Create a new MLflow experiment and set it as active
mlflow.set_experiment('hERG Random Forest')

# Creates a new Optuna study for maximizing an outcome
study = optuna.create_study(direction='maximize')

# MLflow currently uses scikit-learn functions for metric calculation that were deprecated with version 1.0
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # Optimize the objective function
    study.optimize(rf_obj, n_trials=100)
2022/03/18 09:03:12 INFO mlflow.tracking.fluent: Experiment with name 'hERG Random Forest' does not exist. Creating a new experiment.

Scikit-Learn MLP (with auto logging)

In [16]:
# seen_param = []

def nn_obj(trial):
    n_hidden_layers = trial.suggest_int('n_hidden_layers', 0, 4)
    n_neurons = trial.suggest_int('n_neurons', 16, 128) if n_hidden_layers > 0 else 0
    param = {
        'hidden_layer_sizes': [n_neurons] * n_hidden_layers,
        'activation':         trial.suggest_categorical('activation', ['identity', 'logistic', 'tanh', 'relu']),
        'solver':             trial.suggest_categorical('solver', ['lbfgs', 'sgd', 'adam']),
        'alpha':              trial.suggest_float('alpha', 0.00001, 0.1, log=True),
        'learning_rate_init': trial.suggest_float('learning_rate', 0.00001, 0.1, log=True),
        'max_iter':           trial.suggest_int('epochs', 20, 300),
        'random_state':       random_seed,
    }
    # if param in seen_param:
    #     raise optuna.exceptions.TrialPruned()
    # else:
    #     seen_param.append(param)
    metrics = {}
    model = MLPClassifier(**param)
    with mlflow.start_run():
        mlflow.log_params(dict(n_hidden_layers=n_hidden_layers, n_neurons=n_neurons))
        model.fit(x_train, y_train)
        metrics['test_kappa'] = cohen_kappa_score(y_test, model.predict(x_test))
        metrics['training_kappa'] = cohen_kappa_score(y_train, model.predict(x_train))
        mlflow.log_metrics(metrics)
    return metrics['test_kappa']
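
`log=True` makes Optuna search `alpha` and `learning_rate` on a logarithmic scale, so each order of magnitude between 1e-5 and 1e-1 receives similar attention. The effect can be illustrated with plain random sampling. This sketch only demonstrates a log-uniform distribution and does not reproduce Optuna's sampler.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(10000)]

# Count how many samples fall into each of the four decades 1e-5..1e-4, ..., 1e-2..1e-1
decades = [0, 0, 0, 0]
for value in samples:
    decade = int(math.floor(math.log10(value))) + 5
    decades[min(max(decade, 0), 3)] += 1  # clamp to guard against rounding at the bounds
print(decades)
```

With uniform sampling on a linear scale, roughly 90 % of the draws would instead land in the top decade.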
In [17]:
# Create a new MLflow experiment and set it as active
mlflow.set_experiment('hERG MLP')

# Creates a new Optuna study for maximizing an outcome
study = optuna.create_study(direction='maximize')

# MLflow currently uses scikit-learn functions for metric calculation that were deprecated with version 1.0
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # Optimize the objective function
    study.optimize(nn_obj, n_trials=100)
2022/03/18 19:38:15 INFO mlflow.tracking.fluent: Experiment with name 'hERG MLP' does not exist. Creating a new experiment.

PyTorch MLP (manual logging)

In [18]:
class hERGDataset(Dataset):
    def __init__(self, x_data, y_data):
        self.x_data = x_data.astype(np.float32)
        self.y_data = y_data.astype(np.float32).reshape(-1, 1)

    def __len__(self):
        return len(self.y_data)

    def __getitem__(self, idx):
        return self.x_data[idx], self.y_data[idx]


train_ds = hERGDataset(x_train, y_train)
test_ds = hERGDataset(x_test, y_test)
In [19]:
# seen_param = []

def pt_nn_obj(trial):
    param = dict(
        n_hidden_layers = trial.suggest_int('n_hidden_layers', 0, 3),
        n_neurons       = trial.suggest_int('n_neurons', 16, 128),
        act             = trial.suggest_categorical('activation', ['Sigmoid', 'Tanh', 'ReLU']),
        lr              = trial.suggest_float('learning_rate', 0.00001, 0.1, log=True),
        epochs          = trial.suggest_int('epochs', 20, 300),
        batch_size      = trial.suggest_int('batch_size', 1, 128),
        random_seed     = random_seed,
        optimizer       = 'Adam',
        criterion       = 'binary_crossentropy',
    )
    if param['n_hidden_layers'] == 0:
        param['n_neurons'] = 0
    # if param in seen_param:
    #     raise optuna.exceptions.TrialPruned()
    # else:
    #     seen_param.append(param)

    torch.manual_seed(param['random_seed'])
    np.random.seed(param['random_seed'])

    act_func = getattr(nn, param['act'])  # look up the activation class by name instead of using eval

    # Build the network layer by layer. With zero hidden layers the input maps
    # directly to the output, which avoids a zero-width nn.Linear(2048, 0).
    layers = []
    in_features = 2048
    for _ in range(param['n_hidden_layers']):
        layers.append(nn.Linear(in_features, param['n_neurons']))
        layers.append(act_func())
        in_features = param['n_neurons']
    layers.append(nn.Linear(in_features, 1))  # output layer
    layers.append(nn.Sigmoid())
    model = nn.Sequential(*layers)

    opt = optim.Adam(model.parameters(), lr=param['lr'])
    crit = nn.BCELoss()

    train_loader = DataLoader(train_ds, batch_size=param['batch_size'], shuffle=True)
    test_loader = DataLoader(test_ds, batch_size=param['batch_size'], shuffle=False)

    with mlflow.start_run():
        mlflow.log_params(param)
        for i in range(param['epochs']):
            model.train()
            train_loss = 0
            train_kappa = 0
            for data, labels in train_loader:
                opt.zero_grad()
                out = model(data)
                loss = crit(out, labels)
                loss.backward()
                opt.step()
                train_loss += loss.item()
                train_kappa += cohen_kappa_score(labels.data.numpy(), out[:,-1].detach().numpy().round())

            model.eval()
            metrics = dict(train_loss=train_loss / len(train_loader), train_kappa=train_kappa / len(train_loader))
            with torch.no_grad():
                test_loss = 0
                test_kappa = 0
                for data, labels in test_loader:
                    out = model(data)
                    test_loss += crit(out, labels).item()
                    test_kappa += cohen_kappa_score(labels.data.numpy(), out[:,-1].numpy().round())
            metrics['test_loss'] = test_loss / len(test_loader)
            metrics['test_kappa'] = test_kappa / len(test_loader)
            mlflow.log_metrics(metrics, step=i + 1)
        mlflow.pytorch.log_model(model, 'model')
    return metrics['test_kappa']
In [20]:
# Create a new MLflow experiment and set it as active
mlflow.set_experiment('hERG PyTorch NN')

# Creates a new Optuna study for maximizing an outcome
study = optuna.create_study(direction='maximize')

# MLflow currently uses scikit-learn functions for metric calculation that were deprecated with version 1.0
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # Optimize the objective function
    study.optimize(pt_nn_obj, n_trials=100)
2022/03/18 20:17:35 INFO mlflow.tracking.fluent: Experiment with name 'hERG PyTorch NN' does not exist. Creating a new experiment.
[W 2022-03-18 20:47:08,936] Trial 20 failed, because the objective function returned nan.
[W 2022-03-18 21:12:11,468] Trial 28 failed, because the objective function returned nan.
[W 2022-03-18 21:22:34,855] Trial 35 failed, because the objective function returned nan.
[W 2022-03-18 21:34:25,375] Trial 38 failed, because the objective function returned nan.
[W 2022-03-18 21:41:10,054] Trial 42 failed, because the objective function returned nan.
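
The `nan` failures are consistent with how the objective averages Cohen's kappa over mini-batches: kappa is (p_o - p_e) / (1 - p_e), and when a batch contains only one class in both labels and predictions, p_e equals 1 and the ratio becomes 0/0. This is a plausible explanation rather than a confirmed diagnosis. The sketch below uses a hand-written `kappa` function (a simplified stand-in for `cohen_kappa_score`) to show the degenerate case next to an epoch-level alternative that collects all predictions before scoring.

```python
import math


def kappa(y_true, y_pred):
    """Cohen's kappa for two equally long label sequences; nan if undefined."""
    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    if p_expected == 1.0:
        return math.nan  # 0/0: only one class present in both sequences
    return (p_observed - p_expected) / (1 - p_expected)


# Two mini-batches of labels and rounded predictions
batches = [
    ([1, 0, 1, 0], [1, 0, 0, 0]),
    ([1, 1], [1, 1]),  # single-class batch, kappa undefined
]

per_batch = [kappa(t, p) for t, p in batches]
mean_of_batches = sum(per_batch) / len(per_batch)  # nan propagates into the mean

# Alternative: score all samples of the epoch at once
all_true = [t for ts, _ in batches for t in ts]
all_pred = [p for _, ps in batches for p in ps]
epoch_kappa = kappa(all_true, all_pred)

print(mean_of_batches, epoch_kappa)
```

Small batch sizes make single-class batches more likely, so computing the metric once per epoch over all samples is the more robust choice.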

While Optuna is "studying" you can already investigate finished results on the MLflow server webpage...

http://localhost:5000

All results that are shown on the MLflow page are retrieved from the local folder mlruns in the current directory.


References

[1] Ogura, K., Sato, T., Yuki, H. et al. Support Vector Machine model for hERG inhibitory activities based on the integrated hERG database using descriptor selection by NSGA-II. Sci Rep 9, 12220 (2019). https://doi.org/10.1038/s41598-019-47536-3

Disclaimer

The configurations for modelling and hyperparameter optimization might be suboptimal for the specific task and dataset. Since the purpose of this notebook is to show how the MLflow tracking feature can be used together with Optuna, tuning these configurations was not a priority.