Cohen's kappa
We love using Cohen's kappa in our group as a metric for testing our binary classification models. In this notebook, I will give a quick demo on how you can use Cohen's kappa for your binary classification models and make them great again!
The following work is based on my opinion. Credit for all the software used and demonstrated belongs to its respective authors and the community.
License: MIT
What is Cohen's kappa?
Cohen's kappa, or simply kappa, was introduced by Jacob Cohen in 1960.
Some definitions:
Cohen defined kappa in the original paper as:
The coefficient $k$ is simply the proportion of chance-expected disagreements which do not occur.
The Wikipedia definition reads, "Cohen's kappa coefficient is a statistic that is used to measure inter-rater reliability for qualitative items".
The scikit-learn definition reads, "a score that expresses the level of agreement between two annotators on a classification problem". The formula: $$\kappa = \frac{p_{o} - p_{e}}{1 - p_{e}}$$ where $p_{o}$ is the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio) and $p_{e}$ is the expected agreement when both annotators assign labels randomly.
It is also commonly used by cheminformaticians to evaluate their classification models more transparently than accuracy alone allows.
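As a quick, hand-rolled illustration of the formula (the numbers below are made up for this demo and are not from the dataset used later): suppose two raters label 100 items, 90 of which belong to the majority class, and they agree on 94 of them. The plain accuracy, which equals $p_{o}$, looks great, yet kappa is far more sober because most of that agreement is already expected by chance.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score
# Hypothetical labels from two raters: 90x (0,0), 2x (0,1), 4x (1,0), 4x (1,1)
rater_a = np.array([0]*90 + [0]*2 + [1]*4 + [1]*4)
rater_b = np.array([0]*90 + [1]*2 + [0]*4 + [1]*4)
p_o = accuracy_score(rater_a, rater_b)                    # observed agreement = 0.94
p_e = (np.mean(rater_a == 0) * np.mean(rater_b == 0)
       + np.mean(rater_a == 1) * np.mean(rater_b == 1))   # chance agreement ~ 0.87
print((p_o - p_e) / (1 - p_e))                            # ~ 0.54, computed by hand
print(cohen_kappa_score(rater_a, rater_b))                # same value from scikit-learn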
Good reads:
- The original publication
- Prof. Czodrowski's paper, "Count on kappa"
- Many tutorials available online!
How to calculate kappa for your models?
The easiest way to calculate kappa for your models is to use the cohen_kappa_score function offered by the scikit-learn package! That's it!
from sklearn.metrics import cohen_kappa_score
Once your model's predictions are ready, you can use this function as:
cohen_kappa_score(original_values, predicted_values)
Kappa values range from -1 to 1. A score below 0 means the model is predicting worse than random, and a value of 1 indicates a perfect model. While there is no definitive standard for what value of kappa your model should reach, the rule of thumb is simply that the higher your model's kappa, the better it is (without falling into the trap of overfitting).
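As a tiny sanity check of the two extremes, here is a toy example (labels made up for this demo):
from sklearn.metrics import cohen_kappa_score
y_true = [0, 0, 1, 1, 0, 1]
print(cohen_kappa_score(y_true, y_true))              # 1.0  -> perfect agreement
print(cohen_kappa_score(y_true, [1, 1, 0, 0, 1, 0]))  # -1.0 -> systematic disagreement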
Let's get modelling! I will use the same toy example I used last time in my Optuna 5 mins of fame notebook.
Toy problem: Models to predict a molecule's average molecular weight from its 2D fingerprint.
Let us prepare, in advance, the data and input strategy that we will use repeatedly. I will import a subset of 50,000 randomly selected compounds from ChEMBL 29.
The only difference this time is that we predict an arbitrary weight class instead of performing regression.
import pandas as pd
df = pd.read_parquet('Shortlisted_Chembl29-4Feb2022.parquet.gzip')
df.info()
df.head()
Let us use RDKit's Descriptors module to calculate the average molecular weight of these compounds
from rdkit.Chem import AllChem as Chem
from rdkit.Chem import Descriptors, MolFromSmiles
df['MolWt'] = df['Smiles'].apply(lambda x: Descriptors.MolWt(MolFromSmiles(x)))
df.head()
Generating Morgan fingerprints with nBits = 2048 and radius 3
from rdkit import DataStructs
import numpy as np
def fetch_fp(smile):
    # Morgan fingerprint (radius 3, 2048 bits) converted to a numpy array
    fp = Chem.GetMorganFingerprintAsBitVect(MolFromSmiles(smile), 3, nBits=2048)
    arr = np.zeros((0, ), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
df['Fingerprint'] = df['Smiles'].map(fetch_fp)
df.head()
Let us see the distribution of MolWt
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("seaborn-poster")
df.MolWt.plot.hist(alpha=0.5, figsize=(6, 5))
plt.xlabel('Average Molecular Weight')
plt.ylabel('Number of Molecules')
plt.show()
Let's arbitrarily designate all the molecules with a molecular weight below 400 as class 0 and the rest as class 1.
df.loc[df['MolWt'] < 400, 'Weight_Class'] = 0
df.loc[df['MolWt'] >= 400, 'Weight_Class'] = 1
feat = 'Weight_Class'
df.head()
df[feat].value_counts()
df[feat].value_counts().plot(kind='bar')
plt.title("Class separation - Full dataset")
plt.show()
#Hmm, almost balanced!
#Let's define X = fingerprints and Y as class
X = np.stack(df.Fingerprint)
X.shape
type(X)
Y = df[feat].copy()
Y = Y.to_numpy()
Y.shape
type(Y)
#Dividing into training and test sets
import sklearn.model_selection as model_selection
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, train_size=0.8, test_size=0.2, stratify= df[feat])
#stratifying because it's always a good idea to maintain the same balance of classes in the training and test sets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"Y_train shape: {Y_train.shape}")
print(f"Y_test shape: {Y_test.shape}")
#Using XGBoost classifier
from xgboost import XGBClassifier
clf = XGBClassifier(use_label_encoder = False)
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
df_test = pd.DataFrame()
df_test['Weight_Class_Original'] = Y_test
df_test['Weight_Class_Predicted'] = Y_pred
df_test.head()
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, ConfusionMatrixDisplay
balanced_accuracy_score(df_test['Weight_Class_Original'], df_test['Weight_Class_Predicted'])
cohen_kappa_score(df_test['Weight_Class_Original'], df_test['Weight_Class_Predicted'])
#confusion matrix
cm = confusion_matrix(df_test['Weight_Class_Original'], df_test['Weight_Class_Predicted'])
cm
plt.rcParams.update({'font.size': 16})
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
disp.plot()
plt.show()
plt.rcParams.update(plt.rcParamsDefault)
plt.style.use("seaborn-poster")
Remember k-fold cross-validation? It is a necessary technique to check the model's performance across the data distribution. Since scikit-learn already offers excellent k-fold cross-validation utilities, we are going to use them.
To get a single-metric cross-validation score, one can simply use the cross_val_score function.
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_validate
skfold = StratifiedKFold(n_splits=10, shuffle = True) #in case your data has imbalanced classes
scores = cross_val_score(clf, X_train, Y_train, cv=skfold,
scoring = 'balanced_accuracy')
print(f"All scores (balanced accuracy): {scores}")
print(f"Mean score: {np.mean(scores)}")
But wait - why not use kappa instead? By default, kappa is not present as a scoring metric here.
import sklearn
sorted(sklearn.metrics.SCORERS.keys())
However, we can use another scikit-learn function called make_scorer
to help us!
from sklearn.metrics import make_scorer
kappa = make_scorer(cohen_kappa_score)
Now, we can use this as the metric
scores = cross_val_score(clf, X_train, Y_train, cv=skfold,
scoring = kappa)
print(f"All scores (kappa): {scores}")
print(f"Mean score: {np.mean(scores)}")
What if we need to see both? Here, we use the basic cross_validate
function and prepare a dictionary object specifying our desired metrics.
scoring = {'kappa': make_scorer(cohen_kappa_score), 'balanced_accuracy': 'balanced_accuracy'}
scores = cross_validate(clf, X_train, Y_train, cv= skfold, scoring=scoring,
return_train_score=True)
This is definitely helpful!
scores
print(scores['test_kappa'])
print(np.mean(scores['test_kappa']))
print(scores['train_kappa'])
print(np.mean(scores['train_kappa']))
print(scores['test_balanced_accuracy'])
print(np.mean(scores['test_balanced_accuracy']))
print(scores['train_balanced_accuracy'])
print(np.mean(scores['train_balanced_accuracy']))
Good, now your model is cross-validated!
Moving on - remember Optuna? The hyperparameter optimization software? Can we use Optuna to improve our models and keep kappa as the metric?
YES!
Just for demo purposes, we will try to optimize these two hyperparameters (with a rather limited number of trials).
import optuna
def xgb_class_objective(trial):
    # Optimizing the booster category
    booster = trial.suggest_categorical('booster', ["gbtree", "gblinear", "dart"])
    # Optimizing the learning rate
    learning_rate = trial.suggest_float('learning_rate', 1e-2, 1.0, log=True)
    xgb_model = XGBClassifier(booster=booster, learning_rate=learning_rate, use_label_encoder=False)
    xgb_model.fit(X_train, Y_train)
    Y_test_pred_xgb = xgb_model.predict(X_test)
    return cohen_kappa_score(Y_test, Y_test_pred_xgb)
xgb_study = optuna.create_study(direction="maximize", study_name = 'xgb_model_optimization')
xgb_study.optimize(xgb_class_objective, n_trials=25)
print("Best params: ", xgb_study.best_params)
print("Best value: ", xgb_study.best_value)
print("Best Trial: ", xgb_study.best_trial)
import plotly
from optuna.visualization import plot_optimization_history
plot_optimization_history(xgb_study)
fig = optuna.visualization.plot_param_importances(xgb_study)
fig.show()
Now, you can carry on with saving the best model as described in the past notebook.
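For completeness, here is a minimal sketch of what that could look like, assuming we simply refit on the training data with the best parameters and persist the model with joblib (the file name is just a placeholder, and the previous notebook may have used a different approach):
import joblib
# Refit a classifier using the best hyperparameters found by Optuna
best_clf = XGBClassifier(**xgb_study.best_params, use_label_encoder=False)
best_clf.fit(X_train, Y_train)
# Persist the fitted model to disk (placeholder file name)
joblib.dump(best_clf, 'xgb_best_kappa_model.joblib')
# Later: best_clf = joblib.load('xgb_best_kappa_model.joblib')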
Summary
We saw how to incorporate Cohen's kappa into your daily modelling life, and we learned that it can also be used to optimize your models.
While kappa is an excellent metric, it is not perfect. You can read about its limitations in this blog by Maarit Widmann under the section, "Pain Points of Cohen’s Kappa". There's also this publication which finds that using kappa for multi-class classification problems might not be the best strategy.