Clustering SARS-COV-2 Mpro Fragment Hits with a Self-Organizing Map
Clustering SARS-COV-2 Mpro Fragment Hits with a Self-Organizing Map¶
Source code & credit: Pat Walters¶
Clustering Fragment Crystal Structures¶
In this notebook we'll use a few Python libraries to analyze a set of non-covalent fragment crystal structures binding the Mpro active site reported in https://fragalysis.diamond.ac.uk/viewer/react/preview/target/Mpro
Here's a brief outline of the workflow.
- Download the x-ray structures from the PDB
- Align the structures to a reference structures
- Write the aligned protein structures to disk
- Extract the ligands and assign bond orders
- Write ligands to disk
- Cluster using a Self-Organizing MAP (SOM)
0. Install libraries¶
Install the necessary packages
!pip install useful_rdkit_utils minisom prody seaborn
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import PandasTools
import pandas as pd
from prody import *
from tqdm.auto import tqdm
from io import StringIO
from operator import itemgetter
from rdkit import RDLogger
import numpy as np
from rdkit.Chem.rdMolTransforms import ComputeCentroid
from minisom import MiniSom
import seaborn as sns
from glob import glob
import os
import useful_rdkit_utils as uru
1. Setup¶
Enable Pandas progress_apply
tqdm.pandas()
A function to extract a ligand from a complex, covert to an RDKit molecule, and assign bond orders.
def prody_ligand_bond_order(protein, res_name, res_smiles):
"""
Extract the ligand with the specified residue name from a Prody molecule and assign bond orders
:param protein: Prody protein
:param res_name: residue with the ligand
:param res_smiles: SMILES for the ligand, used for bond order assignment
:return: RDKit molecule with bond orders assigned
"""
output = StringIO()
sub_mol = protein.select(f"resname {res_name}")
template = AllChem.MolFromSmiles(res_smiles)
template = uru.get_largest_fragment(template)
writePDBStream(output, sub_mol)
pdb_string = output.getvalue()
rd_mol = AllChem.MolFromPDBBlock(pdb_string)
RDLogger.DisableLog('rdApp.warning')
new_mol = AllChem.AssignBondOrdersFromTemplate(template, rd_mol)
RDLogger.EnableLog('rdApp.warning')
return new_mol
df = pd.read_excel("./data/Mpro_frag_pdbs.xlsx")
df
2.2 Extract Ligands and Assign Bond Orders¶
- Align each protein to the reference
- Extract the ligand and assign bond orders
- Build up a list of ligands
Run ProDy
ProDy is a free and open-source Python package for protein structural dynamics analysis.
#Make Prody less chatty
prody.confProDy(verbosity='critical')
Set up a reference protein to align the other proteins to
ref_prot = prody.parsePDB('7K0F')
ref_prot = ref_prot.select("chain A")
rd_ligand_list = []
for p,smi, res in tqdm(df[['PDB','SMILES','Ligand_RESID']].values):
prot = prody.parsePDB(p)
# break if we can't get the PDB
if prot is None:
continue
# loop over chains and find the first one with the specified ligand
hv = prot.getHierView()
for chain in hv:
if chain.select(f"resname {res}"):
# align to the reference
atommaps = prody.alignChains(chain, ref_prot)
top_atommap = atommaps[0]
torf_mapped = top_atommap.getFlags("mapped")
prody.superpose(top_atommap, ref_prot, torf_mapped)
# save the aligned pdb
prody.writePDB(f"{p}_aligned.pdb",chain)
# get the ligand and assign bond orders
rd_lig_mol = prody_ligand_bond_order(chain,res,smi)
rd_lig_mol.SetProp("Name",p)
rd_lig_mol.SetProp("_Name",p)
rd_ligand_list.append(rd_lig_mol)
break
Let's see how many ligands we ended up with
len(rd_ligand_list)
Save the ligands to a file
writer = Chem.SDWriter("ligs_complete.sdf")
for mol in rd_ligand_list:
writer.write(mol)
writer.close()
2.3 Handle Multiple Ligands¶
Some of the pdb files have multiple copies of the same ligand bound in different pockets. Separate these into individual ligands
frag_mol_list = []
for idx,mol in enumerate(rd_ligand_list,1):
name = mol.GetProp("Name")
frag_list = Chem.GetMolFrags(mol,asMols=True)
for frag_idx,frag in enumerate(frag_list):
frag.SetProp("Name",name)
frag.SetProp("Sequence",str(frag_idx))
frag.SetProp("_Name",name)
frag_mol_list.append(frag)
print(f"Read {idx} molecules containing {len(frag_mol_list)} fragments")
Write the separated ligands to a file
writer = Chem.SDWriter("ligs_separated.sdf")
for mol in frag_mol_list:
writer.write(mol)
writer.close()
3. Calculate the Center of Each Ligand¶
Create a dataframe with the ligands
lig_df = pd.DataFrame({'ROMol' : frag_mol_list,
'Name': [x.GetProp("Name") for x in frag_mol_list],
'Sequence': [x.GetProp("Sequence") for x in frag_mol_list]
})
Calculate the geometric center for each ligand
lig_df['center'] = lig_df.ROMol.progress_apply(uru.get_center)
Create a numpy array containing the ligand centers
X = np.stack(lig_df.center)
4. Use a Self-Organizing Map (SOM) to Cluster the Ligands¶
Initialize the SOM
som = MiniSom(5,5,3, random_seed=1)
som.train(X,1000)
cell = [som.winner(i) for i in X]
Create column in the dataframe to hold the cell ids
lig_df['cell'] = cell
lig_df
Create a dataframe to hold the occupied SOM cells
cell_df = lig_df.cell.value_counts().to_frame().reset_index()
cell_df.columns = ["Cell","Count"]
Plot the SOM, the point size is proportional to the number of ligands in a cell. The numbers adjacent to the points are the cluster ids. Adjacent cells represent clusters that are close 3D in space (adjacent pockets) on the protein.
sns.set(rc={'figure.figsize': (10, 10)})
sns.set_style('whitegrid')
sns.set_context('talk')
cell_df['x'] = [x[0] for x in cell_df.Cell]
cell_df['y'] = [x[1] for x in cell_df.Cell]
ax = sns.scatterplot(x="x",y='y',size="Count",data=cell_df,sizes=(40,500),color="lightblue")
ax.legend(title="Count",loc='upper left', bbox_to_anchor=(1.00, 0.75), ncol=1)
cell_df.sort_values('Count',ascending=False)
cell_df['ix'] = range(0,len(cell_df))
for x,y,ix in cell_df[['x','y','ix']].values:
ax.text(x+0.1,y+0.1,ix)
# delete any files named som_cluster*.sdf
for filename in glob("som_cluster*.sdf"):
os.unlink(filename)
# write each cluster as an SDF
for i,x in enumerate(lig_df.cell.value_counts().index):
cluster_df = lig_df.query("cell == @x")
filename = f"som_cluster_{i:03d}.sdf"
print(filename, x, cluster_df.shape[0])
PandasTools.WriteSDF(cluster_df,filename,properties=["Name","Sequence"])
Write each cluster to an SD file. To look at the clusters with PyMol, you can do
pymol 7K0F_aligned.pdb som_cluster*.sdfThen, in PyMol, execute these commands
zoom all set all_states,on
"""MIT License
Copyright (c) 2021 Patrick Walters
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software."""