Can't convert molecule to fingerprint with rdkit - python

I'm trying to convert molecular smiles into fingerprints using rdkit. I have two smiles:
Nc1cccc(N)n1 and Nc1cc(CSc2ccc(O)cc2)cc(N)n1. The first one was expanded into the second one. In other words, the second molecule contains the first one in its structure.
What I did was use rdkit to remove the common part to obtain smiles of a fragment that differs (CSC1=CC=C(O)C=C1 in kekulized form). I'm trying to convert that fragment into a molecule and then to a fingerprint to calculate similarity with a reference molecule.
[Image: desired transformation]
But I get an error: 'Can't kekulize atoms' with indices of those atoms. This is strange to me because all the smiles (the two input smiles and the resulting fragment smiles) can be easily visualized using MarvinSketch or Chemdraw (software for drawing molecules). I even had Marvin kekulize the fragment smiles and tried making a molecule from that but I still get the same error. Here is my code for removing the fragment:
from rdkit import Chem
from rdkit.Chem import AllChem

def remove_initial_fragment(mol_smiles, fragment_smiles):
    mol = Chem.MolFromSmiles(mol_smiles)  # creates molecule from the longer smiles
    fragment = Chem.MolFromSmiles(fragment_smiles)  # the molecule I want to remove
    rm = AllChem.DeleteSubstructs(mol, fragment)  # creates new molecule
    return Chem.MolToSmiles(rm)  # converts the mol I want back into smiles

smiles_frags = [remove_initial_fragment(x, fragment_smiles) for x in smiles]
mols_frags = [Chem.MolFromSmiles(x) for x in smiles_frags]
In my case, the 'fragment_smiles' is the same for all selected smiles.
But then I get an error when trying to convert molecules from the 'mols_frags' list into fingerprints:
MFP_2 = [AllChem.GetMorganFingerprintAsBitVect(x, 2) for x in mols_frags]
I tried looking online for answers but nothing really helped. I even tried to create kekulized smiles separately and passing them directly as input for creating the fingerprints but I still get the same error.
It's super weird to me because when I try to do the same process with the same code for one set of smiles (fragment, longer smiles, resulting smiles), it works without a problem and I can create the fingerprint without any error. But it seems to me that once I input the smiles/molecules as a list, I get the error. Any idea why this could be? Or do you see any error in my code that I'm unaware of?

With fragment_smiles = 'Nc1cccc(N)n1' and a list like smiles = ['Nc1cc(CSc2ccc(O)cc2)cc(N)n1', 'Nc1cc(COc2ccc(O)cc2)cc(N)n1'], I have no problem getting a fingerprint.
It looks as if, after deleting the substructure, some of the smiles_frags are not valid SMILES.
To find out which SMILES in the list causes the problem, you can use
from rdkit.Chem import AllChem as Chem

fragment = Chem.MolFromSmiles('Nc1cccc(N)n1')
smiles = ['Nc1cc(CSc2ccc(O)cc2)cc(N)n1', 'Nc1cc(COc2ccc(O)cc2)cc(N)n1', 'CC1=CC=Cc2c(N)nc(N)cc12']
for smi in smiles:
    try:
        mol = Chem.MolFromSmiles(smi)
        f1 = Chem.DeleteSubstructs(mol, fragment)
        # Round-trip through SMILES; f2 is None if the fragment SMILES is invalid
        f2 = Chem.MolFromSmiles(Chem.MolToSmiles(f1))
        fp = Chem.GetMorganFingerprintAsBitVect(f2, 2)
    except Exception:
        print('SMILES:', smi)
        f = Chem.DeleteSubstructs(mol, fragment)
        print('smiles_frag:', Chem.MolToSmiles(f))
This will give:
SMILES: CC1=CC=Cc2c(N)nc(N)cc12
smiles_frag: ccccC

Related

rdkit.Chem.rdmolfiles.MolToMolFile(NoneType, str)

I am trying to convert a .smi file to SDF format using the RDKit Python library. I am running the following Python code.
from rdkit import Chem

def convertir_smi_sdf(file_smi):
    with open(file_smi) as fh:
        leer = [i for i in fh]
    print(f"Total SMILES: {len(leer)}")
    cont = 0
    cont_tot = []
    for i in leer:
        nom_mol = i.split()[1]  # molecule name (second column)
        smi_mol = i.split()[0]  # SMILES string (first column)
        mol_smi = Chem.MolFromSmiles(smi_mol)
        Chem.MolToMolFile(mol_smi, f'{nom_mol}.sdf')
        cont += 1
        cont_tot.append(cont)
    print(f"Converted {cont_tot[-1]} SMILES to SDF")
Any help is highly appreciated.
I need this to split a SMILES file into separate SDF files.
Error:
Output:
This kind of error almost always means one thing: the SMILES you are passing in is invalid. Chem.MolFromSmiles returns None for a SMILES it cannot parse or sanitize, and MolToMolFile then fails on that None. In your case, the culprit is the SMILES string Cl[Pt](Cl)([NH4])[NH4]. See its picture below. Both nitrogen atoms form 5 bonds without carrying any positive charge.
When you parse it in RDKit, you'll get a warning like this:
To deal with this, either fix this SMILES manually or skip its sanitization. To skip it, just pass the argument sanitize=False as below:
mol_smi = Chem.MolFromSmiles(smi_mol, sanitize=False)
Just a warning: with sanitize=False, RDKit skips sanitization (including the valence check) for every molecule, so chemically invalid structures pass through without being flagged.
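If you would rather skip the bad entries than write unsanitized molecules, a minimal sketch could look like the one below. The function name convert_smi_lines and its two callable parameters are made up for illustration; with RDKit you would pass Chem.MolFromSmiles as the parser and Chem.MolToMolFile as the writer. It relies only on the parser returning None for a SMILES it rejects.

```python
def convert_smi_lines(lines, parse_smiles, write_mol):
    """Write one file per valid SMILES and return the names that were skipped.

    parse_smiles and write_mol are hypothetical parameters: they stand in for
    Chem.MolFromSmiles and Chem.MolToMolFile so the logic is easy to follow.
    """
    skipped = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Blank or malformed line: no "SMILES name" pair to work with
            continue
        smi, name = fields[0], fields[1]
        mol = parse_smiles(smi)
        if mol is None:
            # The parser rejected this SMILES; remember it and move on
            skipped.append(name)
            continue
        write_mol(mol, f"{name}.sdf")
    return skipped
```

With RDKit this would be called as convert_smi_lines(open('my.smi'), Chem.MolFromSmiles, Chem.MolToMolFile), where 'my.smi' is a placeholder file name. The returned list tells you which entries need to be fixed by hand.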

Find chiral centers rdkit

Working with some molecules and reactions, it seems that chiral centers in SMILES may not be found after applying reactions.
What I get after applying some reactions on a molecule is this SMILES: C[C](C)[C]1[CH+]/C=C(\\C)CC/C=C(\\C)CC1
which actually seems to have a chiral center at carbon 3 [C]. If I use Chem.FindMolChiralCenters(n,force=True,includeUnassigned=True) I get an empty list, which means that no chiral center was found.
The thing is that if I add an H to carbon 3 so it becomes [CH], it is recognized as a chiral center but with an unassigned type (R or S). I tried adding Hs using Chem.AddHs(mol) and then calling Chem.FindMolChiralCenters() again, but didn't get any chiral center.
I was wondering if there is a way to recognize this chiral center even when no H has been added, and to set the proper chiral tag following some kind of rules.
After applying two 1,2-hydride shifts to my initial mol (Chem.MolFromSmiles('C/C1=C\\C[C@H]([C+](C)C)CC/C(C)=C/CC1')) I get the SMILES mentioned before. Since I had an initial chiral tag, I want to know if there is a way to recover chirality lost after reactions.
smarts used for 1,2 hydride shift: [Ch:1]-[C+1:2]>>[C+1:1]-[Ch+0:2]
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('C/C1=C\\C[C@H]([C+](C)C)CC/C(C)=C/CC1')
rxn = AllChem.ReactionFromSmarts('[Ch:1]-[C+1:2]>>[C+1:1]-[Ch+0:2]')
products = list()
for product in rxn.RunReactant(mol, 0):
    Chem.SanitizeMol(product[0])
    products.append(product[0])
print(Chem.MolToSmiles(products[0]))
After applying this reaction twice to the product created, I eventually get this SMILES.
Output:
'C[C](C)[C]1[CH+]/C=C(\\C)CC/C=C(\\C)CC1'
which is where there is supposed to be a chiral center, at carbon 3.
Any idea, or should I report it as a bug?
This is not a bug. I think the stereochemistry is dropped because you don't ask MolToSmiles for an isomeric SMILES (its second argument, isomericSmiles). So when I try:
mol = Chem.MolFromSmiles('C/C1=C\\C[C@H]([C+](C)C)CC/C(C)=C/CC1')
rxn = AllChem.ReactionFromSmarts('[Ch:1]-[C+1:2]>>[C+1:1]-[Ch+0:2]')
products = list()
for product in rxn.RunReactant(mol, 0):
    Chem.SanitizeMol(product[0])
    products.append(product[0])
print(Chem.MolToSmiles(products[0]))
# ps holds the reaction products; its definition is not shown in the original post
Chem.MolToSmiles(ps[0][0])
I obtained exactly the same result as you:
'C[C](C)[CH+]1CC=C(C)CCC=C(C)CC1'
'CC1=CC[CH](CCC(C)=CCC1)=C(C)C'
but when you use this one:
Chem.MolToSmiles(ps[0][0], True)
You can obtain this result:
'CC(C)=[C@H]1C/C=C(\\C)CC/C=C(\\C)CC1'

appending an index to laspy file (.las)

I have two files, one an esri shapefile (.shp), the other a point cloud (.las).
Using laspy and shapefile modules I've managed to find which points of the .las file fall within specific polygons of the shapefile. What I now wish to do is to add an index number that enables identification between the two datasets. So e.g. all points that fall within polygon 231 should get number 231.
The problem is that as of yet I'm unable to append anything to the list of points when writing the .las file. The piece of code that I'm trying to do it in is here:
outFile1 = laspy.file.File("laswrite2.las", mode = "w",header = inFile.header)
outFile1.points = truepoints
outFile1.points.append(indexfromshp)
outFile1.close()
The error I'm getting now is: AttributeError: 'numpy.ndarray' object has no attribute 'append'. I've tried multiple things already including np.append but I'm really at a loss here as to how to add anything to the las file.
Any help is much appreciated!
There are several ways to do this.
LAS files have a classification field, so you could store the indexes there:
las_file = laspy.file.File("las.las", mode="rw")
las_file.classification = indexfromshp
However, if the LAS file has version <= 1.2, the classification field can only store values in the range [0, 31] (it is 5 bits wide), but you can use the 'user_data' field, which can hold values in the range [0, 255].
Or, if you need to store values higher than 255 or want a separate field, you can define a new dimension (see laspy's documentation on how to add extra dimensions).
Your code should be close to something like this:
outFile1 = laspy.file.File("laswrite2.las", mode = "w",header = inFile.header)
# copy fields
for dimension in inFile.point_format:
    dat = inFile.reader.get_dimension(dimension.name)
    outFile1.writer.set_dimension(dimension.name, dat)
outFile1.define_new_dimension(
    name="index_from_shape",
    data_type=7, # uint64_t
    description = "Index of corresponding polygon from shape file"
)
outFile1.index_from_shape = indexfromshp
outFile1.close()

Printing tip labels when using the python module dendropy to calculate pairwise distances between nodes on a phylogenetic tree?

I'm trying to create an array in python that will contain all the pairwise distances between every pair of nodes on a phylogenetic tree. I'm currently using dendropy to do this. (I initially looked at biopython but couldn't find an option to do this). The code I have so far looks like this:
import dendropy

tree_data = []
tree = dendropy.Tree.get(path="gonno_microreact_tree.nwk",schema="newick")
pdc = tree.phylogenetic_distance_matrix()
for i, t1 in enumerate(tree.taxon_namespace[:-1]):
    for t2 in tree.taxon_namespace[i+1:]:
        tip_pair = {}
        tip_dist_list = []
        tip_pair[t1] = t2
        distance = pdc(t1, t2)
        tip_dist_list.append(tip_pair)
        tip_dist_list.append(distance)
        tree_data.append(tip_dist_list)
print(tree_data)
This works well except for the way it writes the tip labels. For example an entry in the tree_data list looks like this:
[{<Taxon 0x7fc4c160b090 'ERS135651'>: <Taxon 0x7fc4c160b150 'ERS135335'>}, 0.0001294946558138355]
But the tips in the newick file are just labelled ERS135651 and ERS135335 respectively. How can I get dendropy to write the array with just the original tip labels so this entry would look like this:
[{ERS135651:ERS135335}, 0.0001294946558138355]
(Also I read the dendropy documentation and I'm aware that it says to use treecalc to do this, like this:
pdc = treecalc.PatristicDistanceMatrix(tree)
But I just get an error saying the command does not exist:
AttributeError: 'module' object has no attribute 'PairisticDistanceMatrix'
)
Any suggestions for how I can get this working?
Converting the tip labels to a string gives the name surrounded by quotation marks, e.g.:
t1 = str(t1)
print(t1)
Gives:
"'ERS135651'"
So using str.replace to remove the extra quotation marks converts the tip label back to its proper name, e.g.:
t1 = t1.replace("'","")
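As an alternative to the string clean-up, DendroPy Taxon objects carry the plain tip name in their label attribute, so the pairs can be built from that directly. Here is a minimal sketch under that assumption; the helper name pairwise_label_distances is made up, and the FakeTaxon stand-in only exists so the snippet runs without a tree file.

```python
from collections import namedtuple
from itertools import combinations

def pairwise_label_distances(taxa, distance):
    """Return (label1, label2, distance) for every unordered pair of taxa.

    taxa: any iterable of objects with a .label attribute
          (e.g. tree.taxon_namespace in DendroPy).
    distance: a callable taking two taxa (e.g. the pdc object above).
    """
    return [(t1.label, t2.label, distance(t1, t2))
            for t1, t2 in combinations(taxa, 2)]

# Stand-in taxa and a constant distance, for demonstration only
FakeTaxon = namedtuple("FakeTaxon", "label")
demo_taxa = [FakeTaxon("ERS135651"), FakeTaxon("ERS135335"), FakeTaxon("ERS000001")]
for row in pairwise_label_distances(demo_taxa, lambda a, b: 0.5):
    print(row)
```

With the tree from the question this would be called as pairwise_label_distances(tree.taxon_namespace, pdc), which avoids both the Taxon repr and the quote stripping.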

Finding exon/ intron borders in a gene

I would like to go through a gene and get a list of 10bp long sequences containing the exon/intron borders from each feature.type =='mRNA'. It seems like I need to use compoundLocation, and the locations used in 'join' but I can not figure out how to do it, or find a tutorial.
Could anyone please give me an example or point me to a tutorial?
Assuming all the info is in the exact format you show in the comment, and that you're looking for 20 bp on either side of each intron/exon boundary, something like this might be a start:
Edit: If you're actually starting from a GenBank record, then it's not much harder. Assuming that the full junction string you're looking for is in the CDS feature info, then:
for f in record.features:
    if f.type == 'CDS':
        jct_info = str(f.location)
converts the "location" information into a string and you can continue as below.
(There are ways to work directly with the location information without converting to a string - in particular you can use "extract" to pull the spliced sequence directly out of the parent sequence -- but the steps involved in what you want to do are faster and more easily done by converting to str and then int.)
import re
jct_info = "join{[0:229](+), [11680:11768](+), [11871:12135](+), [15277:15339](+), [16136:16416](+), [17220:17471](+), [17547:17671](+)"
jctP = re.compile(r"\[\d+:\d+\]")  # raw string avoids invalid-escape warnings
jcts = jctP.findall(jct_info)
print(jcts)
# ['[0:229]', '[11680:11768]', '[11871:12135]', '[15277:15339]', '[16136:16416]', '[17220:17471]', '[17547:17671]']
Now you can loop through the list of start:end values, pull them out of the text and convert them to ints so that you can use them as sequence indexes. Something like this:
# seq is the parent sequence, e.g. record.seq
for jct in jcts:
    start, end = (int(x) for x in jct.strip('[]').split(':'))
    # Slicing never raises IndexError; a negative lower bound would instead
    # wrap around to the end of the sequence, so clamp it at 0
    # (this matters e.g. where start = 0).
    start_20_20 = seq[max(0, start - 20):start + 20]
    end_20_20 = seq[max(0, end - 20):end + 20]
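Putting the pieces together, here is a self-contained sketch of the same idea that you can run on a toy sequence. The function name junction_windows and the flank parameter are made up for illustration, and the location string is assumed to look like the one above. Note that the first start and the last end are the ends of the gene rather than exon/intron borders, so you may want to drop them.

```python
import re

def junction_windows(location, seq, flank=20):
    """Return (position, window) pairs around every exon start and end.

    location: a string such as "join{[0:229](+), [11680:11768](+)}"
    seq: the parent sequence (a str here; slicing a Biopython Seq works alike)
    flank: number of bases to take on each side of the position
    """
    windows = []
    for start, end in re.findall(r"\[(\d+):(\d+)\]", location):
        for pos in (int(start), int(end)):
            lo = max(0, pos - flank)  # clamp: a negative index would wrap around
            windows.append((pos, seq[lo:pos + flank]))
    return windows

toy_seq = "ACGT" * 25  # toy 100 bp sequence
toy_location = "join{[0:30](+), [50:70](+)}"
for pos, window in junction_windows(toy_location, toy_seq, flank=5):
    print(pos, window)
```

Windows that touch either end of the sequence simply come out shorter than 2 * flank, which is why no try/except is needed.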
