I am trying to convert SMILES (.smi) to SDF format using the RDKit Python library. I am running the following Python code.
from rdkit import Chem

def convertir_smi_sdf(file_smi):
    # Each line of the .smi file is expected to be: SMILES, whitespace, name
    leer = [i for i in open(file_smi)]
    print(f"Total SMILES: {len(leer)}")
    cont = 0
    cont_tot = []
    for i in leer:
        nom_mol = i.split()[1]
        smi_mol = i.split()[0]
        mol_smi = Chem.MolFromSmiles(smi_mol)
        # Write one file per molecule, named after the molecule
        Chem.MolToMolFile(mol_smi, f'{nom_mol}.sdf')
        cont += 1
        cont_tot.append(cont)
    print(f"Converted {cont_tot[-1]} SMILES to SDF")
Any help is highly appreciated.
I need this to split the SMILES file into separate SDF files.
Error:
Output:
Errors like this usually mean one thing: the SMILES you're inputting is invalid. In your case, you're getting the error because of the SMILES string Cl[Pt](Cl)([NH4])[NH4], which is invalid. See its picture below. Both nitrogen atoms are forming five bonds (four to hydrogen and one to platinum) without any positive charge on them.
When you parse it in RDKit, you'll get a warning like this:
To deal with this, either fix this SMILES manually or ignore it completely. To ignore it, just pass the argument sanitize=False as below:
mol_smi = Chem.MolFromSmiles(smi_mol, sanitize=False)
Just a warning: sanitize=False turns off sanitization for every SMILES you parse, so all invalid SMILES will pass through silently.
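A minimal sketch of the "ignore it" option that leaves sanitization on and skips only the records that fail. It relies on Chem.MolFromSmiles returning None when a SMILES cannot be parsed or sanitized. The helper name split_valid_records and its parse parameter are hypothetical; parse is a stand-in so the logic can be shown without RDKit installed, and in real use you would pass Chem.MolFromSmiles.

```python
def split_valid_records(lines, parse):
    """Split .smi lines into parsed and rejected records.

    `parse` is any callable that returns None for an invalid SMILES
    (for example rdkit.Chem.MolFromSmiles). Hypothetical helper.
    """
    valid, rejected = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and lines without a name column
        smiles, name = fields[0], fields[1]
        mol = parse(smiles)
        if mol is None:
            rejected.append((name, smiles))
        else:
            valid.append((name, mol))
    return valid, rejected


# Stand-in parser for illustration only: it rejects the one SMILES
# discussed above. Replace it with Chem.MolFromSmiles in real code.
def fake_parse(smiles):
    return None if smiles == "Cl[Pt](Cl)([NH4])[NH4]" else smiles


lines = ["CCO ethanol\n", "Cl[Pt](Cl)([NH4])[NH4] bad_complex\n", "\n"]
valid, rejected = split_valid_records(lines, fake_parse)
print(valid)
print(rejected)
```

Only the entries in valid would then be written with Chem.MolToMolFile, and rejected gives you a list of names to fix by hand.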
I am calculating the structure similarity profile between two molecules using RDKit. When I run the program in Google Colab (rdkit=2020.09.2, python=3.7) it works fine.
I get an error when I run it on my PC (rdkit=2021.03.2, python=3.8.5). The error is a bit strange: the dataframe contains 500 rows, and the code works only for the first 10 rows (0-9). For later rows I get this error:
s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])
ValueError: BitVects must be same length
The block of code is given below
import os

import pandas as pd
from pandas import DataFrame
from rdkit import Chem, DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols

data = pd.read_csv(os.path.join(os.getcwd(), "dataset", "test_ssp.csv"), index_col=None)

# Validate the SMILES and make a list of IDs and canonical SMILES
c_smiles = []
count = 0
for index, row in data.iterrows():
    try:
        cs = Chem.CanonSmiles(row['SMILES'])
        c_smiles.append([row['ID_Name'], cs])
    except Exception:
        count = count + 1
        print('Count Invalid SMILES:', count, row['ID_Name'], row['SMILES'])

# make a list of id, smiles, and mols
ms = []
df = DataFrame(c_smiles, columns=['ID_Name', 'SMILES'])
for index, row in df.iterrows():
    mol = Chem.MolFromSmiles(row['SMILES'])
    ms.append([row['ID_Name'], row['SMILES'], mol])

# make a list of id, smiles, mols, and fingerprints (fp)
fps = []
df_fps = DataFrame(ms, columns=['ID_Name', 'SMILES', 'mol'])
df_fps.head
for index, row in df_fps.iterrows():
    fps_cal = FingerprintMols.FingerprintMol(row['mol'])
    fps.append([row['ID_Name'], fps_cal])

fps_2 = DataFrame(fps, columns=['ID_Name', 'fps'])
fps_2 = fps_2[fps_2.columns[1]]
fps_2 = fps_2.values.tolist()

# compare all fp pairwise without duplicates
# (qu, ta, sim and c_smiles2 are not defined in the snippet as posted)
for n in range(len(fps_2)):
    s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])
    for m in range(len(s)):
        qu.append(c_smiles2[n])
        ta.append(c_smiles2[n+1:][m])
        sim.append(s[m])
Can you tell me why I am getting this error on my PC while the code works fine in Google Colab? How can I solve the issue? Is there any way to install rdkit=2020.09.2?
Reproducible Data
DB00607 [H][C@]12SC(C)(C)[C@@H](N1C(=O)[C@H]2NC(=O)C1=C(OCC)C=CC2=CC=CC=C12)C(O)=O
DB01059 CCN1C=C(C(O)=O)C(=O)C2=CC(F)=C(C=C12)N1CCNCC1
DB09128 O=C1NC2=CC(OCCCCN3CCN(CC3)C3=C4C=CSC4=CC=C3)=CC=C2C=C1
DB04908 FC(F)(F)C1=CC(=CC=C1)N1CCN(CCN2C(=O)NC3=CC=CC=C23)CC1
DB09083 COC1=C(OC)C=C2[C@@H](CN(C)CCCN3CCC4=CC(OC)=C(OC)C=C4CC3=O)CC2=C1
DB08820 CC(C)(C)C1=CC(=C(O)C=C1NC(=O)C1=CNC2=CC=CC=C2C1=O)C(C)(C)C
DB08815 [H][C@@]12[C@H]3CC[C@H](C3)[C@]1([H])C(=O)N(C[C@@H]1CCCC[C@H]1CN1CCN(CC1)C1=NSC3=CC=CC=C13)C2=O
DB09143 [H][C@]1(C)CN(C[C@@]([H])(C)O1)C1=CC=C(NC(=O)C2=CC=CC(=C2C)C2=CC=C(OC(F)(F)F)C=C2)C=N1
DB06237 COC1=C(Cl)C=C(CNC2=C(C=NC(=N2)N2CCC[C@H]2CO)C(=O)NCC2=NC=CC=N2)C=C1
DB01166 O=C1CCC2=C(N1)C=CC(OCCCCC1=NN=NN1C1CCCCC1)=C2
DB00813 CCC(=O)N(C1CCN(CCC2=CC=CC=C2)CC1)C1=CC=CC=C1
To answer first how to install a specific version of RDKit, you can run this command:
conda install -c rdkit rdkit=2020.09.2
Coming to the original question, the error is caused by the function:
FingerprintMols.FingerprintMol()
For reasons internal to that function, it converts the first 10 SMILES to 2048-bit vectors but the 11th SMILES to a 1024-bit vector. This is most likely because FingerprintMol, when called without extra arguments, folds sparse fingerprints toward a target bit density. The older versions are able to handle this mismatch but newer versions can't. There are two options to fix this:
1. Downgrade RDKit to an older version using the command I mentioned above.
2. Fix the length of the vector by passing it as an argument. Basically, replace the line
FingerprintMols.FingerprintMol(row['mol'])
with
FingerprintMols.FingerprintMol(row['mol'], minPath=1, maxPath=7, fpSize=2048,
                               bitsPerHash=2, useHs=True, tgtDensity=0.0,
                               minSize=128)
In the replacement, fpSize is fixed to 2048 and the other arguments are passed explicitly. Please note that you must pass all the arguments and not just fpSize.
Just to extend mnis's answer: since FingerprintMol defaults to RDKFingerprint, you may find it easier to use that directly. It is much more flexible, and you will not have to supply all the arguments. Tested on version 2021.03.3.
Chem.RDKFingerprint(row['mol'], fpSize=2048)
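Before comparing, it can help to confirm which rows actually have a different fingerprint length. The following is only a sketch: group_by_length and its size_of parameter are hypothetical names, and the fingerprints here are plain Python lists standing in for RDKit bit vectors so the example runs without RDKit. For real bit vectors you could pass size_of=lambda fp: fp.GetNumBits().

```python
from collections import defaultdict


def group_by_length(ids, fingerprints, size_of=len):
    """Map each fingerprint length to the IDs that have that length.

    Hypothetical helper; `size_of` returns the number of bits of one
    fingerprint and defaults to len().
    """
    groups = defaultdict(list)
    for id_name, fp in zip(ids, fingerprints):
        groups[size_of(fp)].append(id_name)
    return dict(groups)


# Toy stand-ins for fingerprints: plain lists of bits.
ids = ["mol_1", "mol_2", "mol_3"]
fps = [[0] * 2048, [0] * 2048, [0] * 1024]
lengths = group_by_length(ids, fps)
print(lengths)
```

If the result has more than one key, the rows listed under the minority length are the ones that trigger "BitVects must be same length".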
I'm trying to convert molecular smiles into fingerprints using rdkit. I have two smiles:
Nc1cccc(N)n1 and Nc1cc(CSc2ccc(O)cc2)cc(N)n1. The first one was expanded into the second one. In other words, the second molecule contains the first one in its structure.
What I did was use rdkit to remove the common part to obtain smiles of a fragment that differs (CSC1=CC=C(O)C=C1 in kekulized form). I'm trying to convert that fragment into a molecule and then to a fingerprint to calculate similarity with a reference molecule.
Desired transformation
But I get an error: 'Can't kekulize atoms' with indices of those atoms. This is strange to me because all the smiles (the two input smiles and the resulting fragment smiles) can be easily visualized using MarvinSketch or Chemdraw (software for drawing molecules). I even had Marvin kekulize the fragment smiles and tried making a molecule from that but I still get the same error. Here is my code for removing the fragment:
def remove_initial_fragment(mol_smiles, fragment_smiles):
    mol = Chem.MolFromSmiles(mol_smiles)  # creates molecule from the longer smiles
    fragment = Chem.MolFromSmiles(fragment_smiles)  # the molecule I want to remove
    rm = AllChem.DeleteSubstructs(mol, fragment)  # creates new molecule
    return Chem.MolToSmiles(rm)  # converts the mol I want back into smiles
smiles_frags = [remove_initial_fragment(x, fragment_smiles) for x in smiles]
mols_frags = [Chem.MolFromSmiles(x) for x in smiles_frags]
In my case, the 'fragment_smiles' is the same for all selected smiles.
But then I get an error when trying to convert molecules from the 'mols_frags' list into fingerprints:
MFP_2 = [AllChem.GetMorganFingerprintAsBitVect(x, 2) for x in mols_frags]
I tried looking online for answers but nothing really helped. I even tried to create kekulized smiles separately and passing them directly as input for creating the fingerprints but I still get the same error.
It's super weird to me because when I try to do the same process with the same code for one set of smiles (fragment, longer smiles, resulting smiles), it works without a problem and I can create the fingerprint without any error. But it seems to me that once I input the smiles/molecules as a list, I get the error. Any idea why this could be? Or do you see any error in my code that I'm unaware of?
With fragment_smiles = 'Nc1cccc(N)n1' and a list like smiles = ['Nc1cc(CSc2ccc(O)cc2)cc(N)n1', 'Nc1cc(COc2ccc(O)cc2)cc(N)n1'], I have no problem getting a fingerprint.
It looks as if, after deleting the substructure, some of the smiles_frags are no longer valid SMILES.
To find which SMILES in the list causes the problem, you can use
from rdkit.Chem import AllChem as Chem

fragment = Chem.MolFromSmiles('Nc1cccc(N)n1')
smiles = ['Nc1cc(CSc2ccc(O)cc2)cc(N)n1', 'Nc1cc(COc2ccc(O)cc2)cc(N)n1', 'CC1=CC=Cc2c(N)nc(N)cc12']

for smi in smiles:
    try:
        mol = Chem.MolFromSmiles(smi)
        f1 = Chem.DeleteSubstructs(mol, fragment)
        # Round-trip through SMILES, as in the question
        f2 = Chem.MolFromSmiles(Chem.MolToSmiles(f1))
        fp = Chem.GetMorganFingerprintAsBitVect(f2, 2)
    except Exception:
        # Report the input SMILES and the fragment left after the deletion
        print('SMILES:', smi)
        f = Chem.DeleteSubstructs(mol, fragment)
        print('smiles_frag:', Chem.MolToSmiles(f))
This will give:
SMILES: CC1=CC=Cc2c(N)nc(N)cc12
smiles_frag: ccccC
I am doing a project for my faculty and I am getting an error. I am new to this language and I am following some steps, but the book I am reading is a bit old, so some functions are outdated and I can't get past one error. I searched for the function dataframe.set_value and saw that it was replaced by dataframe.at.
It goes like this :
for index, row in dataset.iterrows():
    home_team = row["Home"]
    visitor_team = row["Away"]
    row["HomeLastWin"]=won_last[home_team]
    dataset.at(index, "HomeLastWin") = won_last[home_team]
    dataset.at(index, "VisitorLastWin") = won_last[visitor_team]
    won_last[home_team] = int(row["HomeWin"])
    won_last[visitor_team] = 1 - int(row["HomeWin"])
The original code found in the book was:
dataset.set_value(index,"HomeLastWin", won_last[home_team])
I understood that the parameters are dataset.at(What_row,What_column) = change_with_this.
The error I am getting is this:
File "<ipython-input-40-acfeaead26ef>", line 7
dataset.at(index, "HomeLastWin") = won_last[home_team]
^
SyntaxError: cannot assign to function call
Thank you for your time and answers!
See pandas documentation here.
You're using .at(), but want to use square brackets with .at[].
dataset.at[index, "HomeLastWin"] = won_last[home_team]
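A small pandas-free sketch of why the parentheses fail. Square brackets on the left of = are routed to the object's __setitem__ method, while parentheses form a function call, which Python refuses as an assignment target when it compiles the code. ToyIndexer is a hypothetical stand-in for an indexer such as .at, not pandas code.

```python
class ToyIndexer:
    """Toy stand-in for an indexer such as DataFrame.at (illustration only)."""

    def __init__(self):
        self.cells = {}

    def __setitem__(self, key, value):
        # obj[row, col] = value arrives here with key == (row, col)
        row, col = key
        self.cells[(row, col)] = value


at = ToyIndexer()
at[0, "HomeLastWin"] = 1  # square brackets: a valid assignment target

# Parentheses make it a function call, which is rejected at compile
# time, before any of the code runs.
try:
    compile('at(0, "HomeLastWin") = 1', "<example>", "exec")
    call_form_is_valid = True
except SyntaxError:
    call_form_is_valid = False

print(at.cells, call_form_is_valid)
```

This is also why the error appears as a SyntaxError for the whole cell instead of an exception raised inside the loop.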
I am unable to get to_numeric to work in the code below:
tt = ['123.00','10,614,163,994.00']
pd.to_numeric(tt)
I get the following error:
ValueError: Unable to parse string "10,614,163,994.00" at position 1
Please help.
to_numeric cannot handle , as a separator for thousands, millions, etc.
You should preprocess tt with something like tt = [n.replace(',','') for n in tt]
The second value in tt is not a number, in the limited definition of number for many parsers. Just remove the commas before trying to do the conversion.
tt = ['123.00','10,614,163,994.00']
tt = [x.replace(',','') for x in tt]
pd.to_numeric(tt)
I would like to go through a gene and get a list of 10 bp long sequences containing the exon/intron borders from each feature with feature.type == 'mRNA'. It seems like I need to use CompoundLocation and the locations used in 'join', but I cannot figure out how to do it or find a tutorial.
Could anyone please give me an example or point me to a tutorial?
Assuming all the info is in the exact format you show in the comment, and that you're looking for 20 bp on either side of each intron/exon boundary, something like this might be a start:
Edit: If you're actually starting from a GenBank record, then it's not much harder. Assuming that the full junction string you're looking for is in the CDS feature info, then:
for f in record.features:
    if f.type == 'CDS':
        jct_info = str(f.location)
converts the "location" information into a string and you can continue as below.
(There are ways to work directly with the location information without converting to a string - in particular you can use "extract" to pull the spliced sequence directly out of the parent sequence -- but the steps involved in what you want to do are faster and more easily done by converting to str and then int.)
import re
jct_info = "join{[0:229](+), [11680:11768](+), [11871:12135](+), [15277:15339](+), [16136:16416](+), [17220:17471](+), [17547:17671](+)"
jctP = re.compile(r"\[\d+:\d+\]")  # raw string avoids invalid-escape warnings
jcts = jctP.findall(jct_info)
jcts
['[0:229]', '[11680:11768]', '[11871:12135]', '[15277:15339]', '[16136:16416]', '[17220:17471]', '[17547:17671]']
Now you can loop through the list of start:end values, pull them out of the text and convert them to ints so that you can use them as sequence indexes. Something like this:
for jct in jcts:
    (start, end) = jct.replace('[', '').replace(']', '').split(':')
    start = int(start)
    # Slicing never raises IndexError, and a negative index would wrap around to
    # the end of the sequence, so clamp the left edge at 0 (e.g. where start = 0)
    start_20_20 = seq[max(0, start - 20):start + 20]
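Putting the two steps together, here is a self-contained sketch that takes a location string in the format shown above and returns a window around every exon boundary. The function name junction_windows and the toy sequence are made up for illustration, it assumes plus-strand coordinates, and it treats a boundary at the very start or end of the sequence as not being a junction.

```python
import re


def junction_windows(location_text, seq, flank=20):
    """Return (position, window) pairs around each exon boundary.

    Boundaries at the very start or end of `seq` are skipped because
    they are not exon/intron junctions. Hypothetical helper.
    """
    windows = []
    for start, end in re.findall(r"\[(\d+):(\d+)\]", location_text):
        for pos in (int(start), int(end)):
            if pos == 0 or pos >= len(seq):
                continue
            # Clamp the left edge: a negative slice index would wrap
            # around to the end of the sequence instead of failing.
            left = max(0, pos - flank)
            windows.append((pos, seq[left:pos + flank]))
    return windows


# Toy 30-base sequence with two exons, [0:10] and [20:30].
seq = "AAAAAAAAAAGTCCCCCCAGTTTTTTTTTT"
result = junction_windows("join{[0:10](+), [20:30](+)}", seq, flank=3)
print(result)
```

With flank=5 each window would be the 10 bp sequence asked for in the question. Slicing works the same way on a plain string as in this toy example, so seq only needs to support ordinary slice syntax.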