Clustering SARS-COV-2 Mpro Fragment Hits with a Self-Organizing Map

Clustering SARS-COV-2 Mpro Fragment Hits with a Self-Organizing Map¶

Source code & credit: Pat Walters¶

https://github.com/PatWalters/blog_posts

Clustering Fragment Crystal Structures¶

In this notebook we'll use a few Python libraries to analyze a set of non-covalent fragment crystal structures binding the Mpro active site reported in https://fragalysis.diamond.ac.uk/viewer/react/preview/target/Mpro

Here's a brief outline of the workflow.

Download the x-ray structures from the PDB
Align the structures to a reference structures
Write the aligned protein structures to disk
Extract the ligands and assign bond orders
Write ligands to disk
Cluster using a Self-Organizing MAP (SOM)

0. Install libraries¶

Install the necessary packages

In [ ]:

!pip install useful_rdkit_utils minisom prody seaborn

In [1]:

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import PandasTools
import pandas as pd
from prody import *
from tqdm.auto import tqdm
from io import StringIO
from operator import itemgetter
from rdkit import RDLogger 
import numpy as np
from rdkit.Chem.rdMolTransforms import ComputeCentroid
from minisom import MiniSom 
import seaborn as sns
from glob import glob
import os
import useful_rdkit_utils as uru

1. Setup¶

Enable Pandas progress_apply

In [2]:

tqdm.pandas()

A function to extract a ligand from a complex, covert to an RDKit molecule, and assign bond orders.

In [3]:

def prody_ligand_bond_order(protein, res_name, res_smiles):
    """
    Extract the ligand with the specified residue name from a Prody molecule and assign bond orders
    :param protein: Prody protein
    :param res_name: residue with the ligand
    :param res_smiles: SMILES for the ligand, used for bond order assignment
    :return: RDKit molecule with bond orders assigned
    """
    output = StringIO()
    sub_mol = protein.select(f"resname {res_name}")
    template = AllChem.MolFromSmiles(res_smiles)
    template = uru.get_largest_fragment(template)
    writePDBStream(output, sub_mol)
    pdb_string = output.getvalue()
    rd_mol = AllChem.MolFromPDBBlock(pdb_string)
    RDLogger.DisableLog('rdApp.warning')  
    new_mol = AllChem.AssignBondOrdersFromTemplate(template, rd_mol)
    RDLogger.EnableLog('rdApp.warning')  
    return new_mol

2. Reading and Processing Data From the PDB¶

2.1 Read the Input Data¶

Read the spreadsheet with info on the fragments. The spreadhseet maps the PDB code and ligand residue ID to the ligand SMILES. We can use this information in step 2.2 below to assign bond orders to the ligand.

In [4]:

df = pd.read_excel("./data/Mpro_frag_pdbs.xlsx")

In [5]:

df

Out[5]:

	PDB	SMILES	Ligand_RESID
0	5R7Y	CS(=O)(=O)NCCc1ccccc1	JFM 1001
1	5R7Z	CC(=O)NCCc1c[nH]c(c12)ccc(F)c2	T5Y 404
2	5R80	COC(=O)c1ccc(cc1)S(=O)(=O)N	RZG 404
3	5R81	C1CCN(C)c(c12)cc(cc2)S(=O)(=O)N	RZJ 1001
4	5R82	N#Cc1ccc(nc1)NCC	RZS 1001
5	5R83	c1ncccc1NC(=O)Nc2ccccc2	K0G 404
6	5R84	C1CCCCC1CC(=O)Nc2cccnc2	GWS 1001
7	5RE4	CC(=O)Nc(c1C)cncc1	SZY 404
8	5RE9	C1CN(C)CCN1C(=O)COc(cc2)ccc2C	LPZ 404
9	5REB	OC1CCN(CC1)Cc2ccsc2	T0Y 404
10	5REH	C1CCCCC1NC(=O)NCCc2ccncc2	AWP 404
11	5REZ	C1CC[C@H]([C@H]1C(=O)N)c2ccsc2	T54 404
12	5RF1	NS(=O)(=O)c1ccc(Br)cc1	T5G 404
13	5RF2	N=C(N)CC	HVB 404
14	5RF3	Nc1cncnc1	T5V 404
15	5RF6	N#Cc1ccc(cn1)N2CCCOCC2	NTG 404
16	5RF7	C1CN(C)CCN1C(=O)Cc2c[nH]c(c23)nccc3	T67 404
17	5RFE	N#Cc1ccc(cc1)CNC(=O)N2CCOCC2	JGG 404
18	5RG1	BrC#CCNC(=O)[C@@H](NC(=O)C)Cc1ccc(O)cc1	T9J 404
19	5RGH	Cc(s1)nnc1CN(C2)CCC=C2F	U0M 404
20	5RGI	C1CC1NC(=O)N(C)Cc2cc(C)on2	U0P 404
21	5RGK	c1cccc(F)c1C(=O)NCCc2ccncc2	U0V 404
22	5RGU	c1ncncc1CC(=O)Nc2cc(ccc2)O[C@@H](N3)CC3=O	UGD 1001
23	5RGV	c1ccccc1NC(=O)Cc2cncc(c23)cccc3	UGG 1001
24	5RGW	N#Cc1cc(cnc1)CC(=O)Nc2cccnc2	UGM 1001
25	5RGX	N#Cc1cc(ccc1)CC(=O)Nc(c2C)cncc2	UGP 1001
26	5RGY	COc1cc(ncc1)NC(=O)Cc(c2)ccc(c23)cccc3	UGS 1001
27	5RGZ	N#Cc1cc(ccc1)CC(=O)Nc2cccnc2	UH1 1001
28	5RH0	Cc1ccc(s1)NC(=O)Nc2cccnc2	UH4 1001
29	5RH1	s1c(Cl)ccc1CC(=O)Nc2cccnc2	UGV 1001
30	5RH2	c1cncc(c1C)NC(=O)Cc2cc(Cl)ccc2	UH7 1001
31	5RH3	c1cncc(c1C)NC(=O)[C@H](C)c2cc(Cl)ccc2	UHA 1001
32	5RHD	CS(=O)(=O)c(cc1)ccc1N2CCNCC2	US7 408
33	7K0F	c1ccccc1CNC(=O)C(O)[C@H](C[C@@H](C2=O)CCN2)NC(...	VR4 401

2.2 Extract Ligands and Assign Bond Orders¶

Align each protein to the reference
Extract the ligand and assign bond orders
Build up a list of ligands

Run ProDy
ProDy is a free and open-source Python package for protein structural dynamics analysis.

In [6]:

#Make Prody less chatty
prody.confProDy(verbosity='critical')

Set up a reference protein to align the other proteins to

In [7]:

ref_prot = prody.parsePDB('7K0F')
ref_prot = ref_prot.select("chain A")

In [8]:

rd_ligand_list = []
for p,smi, res in tqdm(df[['PDB','SMILES','Ligand_RESID']].values):
    prot = prody.parsePDB(p)
    # break if we can't get the PDB
    if prot is None:
        continue
    # loop over chains and find the first one with the specified ligand
    hv = prot.getHierView()
    for chain in hv:
        if chain.select(f"resname {res}"):
            # align to the reference
            atommaps = prody.alignChains(chain, ref_prot)
            top_atommap = atommaps[0]
            torf_mapped = top_atommap.getFlags("mapped")
            prody.superpose(top_atommap, ref_prot, torf_mapped)
            # save the aligned pdb
            prody.writePDB(f"{p}_aligned.pdb",chain)
            
            # get the ligand and assign bond orders
            rd_lig_mol = prody_ligand_bond_order(chain,res,smi)
            rd_lig_mol.SetProp("Name",p)
            rd_lig_mol.SetProp("_Name",p)
            rd_ligand_list.append(rd_lig_mol)
            break

Let's see how many ligands we ended up with

In [9]:

len(rd_ligand_list)

Out[9]:

Save the ligands to a file

In [10]:

writer = Chem.SDWriter("ligs_complete.sdf")
for mol in rd_ligand_list:
    writer.write(mol)
writer.close()

2.3 Handle Multiple Ligands¶

Some of the pdb files have multiple copies of the same ligand bound in different pockets. Separate these into individual ligands

In [11]:

frag_mol_list = []
for idx,mol in enumerate(rd_ligand_list,1):
    name = mol.GetProp("Name")
    frag_list = Chem.GetMolFrags(mol,asMols=True)
    for frag_idx,frag in enumerate(frag_list):
        frag.SetProp("Name",name)
        frag.SetProp("Sequence",str(frag_idx))
        frag.SetProp("_Name",name)
        frag_mol_list.append(frag)
print(f"Read {idx} molecules containing {len(frag_mol_list)} fragments")

Read 33 molecules containing 33 fragments

Write the separated ligands to a file

In [12]:

writer = Chem.SDWriter("ligs_separated.sdf")
for mol in frag_mol_list:
    writer.write(mol)
writer.close()

3. Calculate the Center of Each Ligand¶

Create a dataframe with the ligands

In [13]:

lig_df = pd.DataFrame({'ROMol' : frag_mol_list, 
                       'Name': [x.GetProp("Name") for x in frag_mol_list],
                       'Sequence': [x.GetProp("Sequence") for x in frag_mol_list]
                      })

Calculate the geometric center for each ligand

In [14]:

lig_df['center'] = lig_df.ROMol.progress_apply(uru.get_center)

Create a numpy array containing the ligand centers

In [15]:

X = np.stack(lig_df.center)

4. Use a Self-Organizing Map (SOM) to Cluster the Ligands¶

Initialize the SOM

In [16]:

som = MiniSom(5,5,3, random_seed=1)
som.train(X,1000)
cell = [som.winner(i) for i in X]

Create column in the dataframe to hold the cell ids

In [17]:

lig_df['cell'] = cell

In [18]:

lig_df

Out[18]:

	ROMol	Name	center	cell
0	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4350>	5R7Y	[13.249461538461539, 22.272230769230774, 30.09...	(4, 3)
1	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4820>	5R80	[13.964214285714288, 25.18, 26.190714285714286]	(4, 1)
2	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4200>	5R81	[13.6084, 24.955733333333335, 26.351799999999994]	(4, 1)
3	<rdkit.Chem.rdchem.Mol object at 0x7fae049f43c0>	5R82	[13.869000000000002, 23.278363636363636, 28.25...	(4, 2)
4	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4430>	5R83	[10.689875, 22.155312500000004, 26.73431249999...	(0, 1)
5	<rdkit.Chem.rdchem.Mol object at 0x7fae049f49e0>	5R84	[10.98325, 21.9586875, 27.040125000000003]	(0, 0)
6	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4a50>	5RE4	[9.023272727272728, 21.75272727272727, 26.5504...	(0, 3)
7	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4ac0>	5RE9	[12.346777777777781, 25.580666666666662, 31.04...	(3, 3)
8	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4b30>	5REB	[13.781538461538462, 22.16646153846154, 30.418...	(4, 3)
9	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4ba0>	5REH	[5.17038888888889, 24.85977777777778, 27.97638...	(0, 4)
10	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4c10>	5REZ	[12.406538461538462, 23.27292307692307, 27.330...	(2, 2)
11	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4c80>	5RF1	[13.512363636363638, 25.644272727272725, 25.69...	(4, 1)
12	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4cf0>	5RF2	[12.7456, 20.584600000000002, 32.455200000000005]	(4, 4)
13	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4d60>	5RF3	[6.657142857142858, 21.22457142857143, 26.1192...	(1, 4)
14	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4dd0>	5RF6	[12.313399999999998, 21.69826666666667, 30.625...	(3, 4)
15	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4e40>	5RF7	[10.342368421052633, 22.08431578947368, 26.859...	(0, 1)
16	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4eb0>	5RFE	[11.483722222222223, 23.66111111111111, 28.300...	(2, 3)
17	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4f20>	5RG1	[10.03515, 22.9566, 26.73505]	(1, 2)
18	<rdkit.Chem.rdchem.Mol object at 0x7fae049f4f90>	5RGH	[13.306785714285711, 22.283428571428573, 29.35...	(4, 3)
19	<rdkit.Chem.rdchem.Mol object at 0x7fae049f5000>	5RGI	[8.206866666666667, 20.15246666666667, 28.1496...	(1, 4)
20	<rdkit.Chem.rdchem.Mol object at 0x7fae049f5070>	5RGK	[5.286, 24.998944444444447, 28.500944444444446]	(0, 4)
21	<rdkit.Chem.rdchem.Mol object at 0x7fae049f50e0>	5RGU	[11.614136363636364, 23.393636363636368, 25.39...	(2, 0)
22	<rdkit.Chem.rdchem.Mol object at 0x7fae049f5150>	5RGV	[9.3784, 22.70645, 26.665900000000004]	(1, 3)
23	<rdkit.Chem.rdchem.Mol object at 0x7fae049f51c0>	5RGW	[11.120777777777777, 21.964000000000002, 26.59...	(0, 0)
24	<rdkit.Chem.rdchem.Mol object at 0x7fae049f5230>	5RGX	[10.981578947368423, 22.05821052631579, 26.693...	(0, 0)
25	<rdkit.Chem.rdchem.Mol object at 0x7fae049f52a0>	5RGY	[12.213227272727273, 26.963545454545454, 24.93...	(3, 1)
26	<rdkit.Chem.rdchem.Mol object at 0x7fae049f5310>	5RGZ	[11.150666666666668, 21.971944444444446, 26.69...	(0, 0)
27	<rdkit.Chem.rdchem.Mol object at 0x7fae049f5380>	5RH0	[10.2895625, 22.4358125, 26.394125]	(0, 2)
28	<rdkit.Chem.rdchem.Mol object at 0x7fae049f53f0>	5RH1	[10.4460625, 22.320187499999996, 26.7326874999...	(1, 1)
29	<rdkit.Chem.rdchem.Mol object at 0x7fae049f5460>	5RH2	[10.565833333333334, 22.149944444444444, 26.79...	(0, 1)
30	<rdkit.Chem.rdchem.Mol object at 0x7fae049f54d0>	5RH3	[10.488263157894735, 22.096000000000004, 26.96...	(0, 1)
31	<rdkit.Chem.rdchem.Mol object at 0x7fae049f5540>	5RHD	[12.382250000000003, 22.348062499999994, 29.70...	(3, 4)
32	<rdkit.Chem.rdchem.Mol object at 0x7fae049f55b0>	7K0F	[9.61397435897436, 22.3984358974359, 27.662358...	(1, 3)

Create a dataframe to hold the occupied SOM cells

In [19]:

cell_df = lig_df.cell.value_counts().to_frame().reset_index()
cell_df.columns = ["Cell","Count"]

Plot the SOM, the point size is proportional to the number of ligands in a cell. The numbers adjacent to the points are the cluster ids. Adjacent cells represent clusters that are close 3D in space (adjacent pockets) on the protein.

In [20]:

sns.set(rc={'figure.figsize': (10, 10)})
sns.set_style('whitegrid')
sns.set_context('talk')
cell_df['x'] = [x[0] for x in cell_df.Cell]
cell_df['y'] = [x[1] for x in cell_df.Cell]
ax = sns.scatterplot(x="x",y='y',size="Count",data=cell_df,sizes=(40,500),color="lightblue")
ax.legend(title="Count",loc='upper left', bbox_to_anchor=(1.00, 0.75), ncol=1)
cell_df.sort_values('Count',ascending=False)
cell_df['ix'] = range(0,len(cell_df))
for x,y,ix in cell_df[['x','y','ix']].values:
    ax.text(x+0.1,y+0.1,ix)

In [ ]:

# delete any files named som_cluster*.sdf
for filename in glob("som_cluster*.sdf"):
    os.unlink(filename)

# write each cluster as an SDF
for i,x in enumerate(lig_df.cell.value_counts().index):
    cluster_df = lig_df.query("cell == @x")
    filename = f"som_cluster_{i:03d}.sdf"
    print(filename, x, cluster_df.shape[0])
    PandasTools.WriteSDF(cluster_df,filename,properties=["Name","Sequence"])

Write each cluster to an SD file. To look at the clusters with PyMol, you can do

pymol 7K0F_aligned.pdb som_cluster*.sdf

Then, in PyMol, execute these commands

zoom all
set all_states,on

In [1]:

"""MIT License

Copyright (c) 2021 Patrick Walters

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software."""

Out[1]:

'MIT License\n\nCopyright (c) 2021 Patrick Walters\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the "Software"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.'

In [ ]: