ValueError: BitVects must be same length (rdkit) - python

I am calculating the structure similarity profile between two molecules using RDKit. When I run the program in Google Colab (rdkit=2020.09.2, python=3.7) it works fine.
I get an error when I run it on my PC (rdkit=2021.03.2, python=3.8.5). The error is a bit strange: the dataframe contains 500 rows, the code works only for the first 10 rows (0-9), and for the later rows I get an error
s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])
    ValueError: BitVects must be same length
The block of code is given below
import os
import pandas as pd
from pandas import DataFrame
from rdkit import Chem, DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols

data = pd.read_csv(os.path.join(os.path.join(os.getcwd(), "dataset"), "test_ssp.csv"), index_col=None)

# Check the SMILES and make a list of id and canonical SMILES
c_smiles = []
count = 0
for index, row in data.iterrows():
    try:
        cs = Chem.CanonSmiles(row['SMILES'])
        c_smiles.append([row['ID_Name'], cs])
    except:
        count = count + 1
        print('Count Invalid SMILES:', count, row['ID_Name'], row['SMILES'])

# make a list of id, smiles, and mols
ms = []
df = DataFrame(c_smiles, columns=['ID_Name', 'SMILES'])
for index, row in df.iterrows():
    mol = Chem.MolFromSmiles(row['SMILES'])
    ms.append([row['ID_Name'], row['SMILES'], mol])

# make a list of id, smiles, mols, and fingerprints (fp)
fps = []
df_fps = DataFrame(ms, columns=['ID_Name', 'SMILES', 'mol'])
df_fps.head()
for index, row in df_fps.iterrows():
    fps_cal = FingerprintMols.FingerprintMol(row['mol'])
    fps.append([row['ID_Name'], fps_cal])

fps_2 = DataFrame(fps, columns=['ID_Name', 'fps'])
fps_2 = fps_2[fps_2.columns[1]]
fps_2 = fps_2.values.tolist()

# compare all fp pairwise without duplicates
# qu, ta, sim and c_smiles2 are defined elsewhere in the full script
for n in range(len(fps_2)):
    s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])
    for m in range(len(s)):
        qu.append(c_smiles2[n])
        ta.append(c_smiles2[n+1:][m])
        sim.append(s[m])
Can you tell me why I am getting this error on my PC while the code works fine in Google Colab? How can I solve the issue? Is there any way to install rdkit=2020.09.2?
Reproducible Data
DB00607 [H][C@]12SC(C)(C)[C@@H](N1C(=O)[C@H]2NC(=O)C1=C(OCC)C=CC2=CC=CC=C12)C(O)=O
DB01059 CCN1C=C(C(O)=O)C(=O)C2=CC(F)=C(C=C12)N1CCNCC1
DB09128 O=C1NC2=CC(OCCCCN3CCN(CC3)C3=C4C=CSC4=CC=C3)=CC=C2C=C1
DB04908 FC(F)(F)C1=CC(=CC=C1)N1CCN(CCN2C(=O)NC3=CC=CC=C23)CC1
DB09083 COC1=C(OC)C=C2[C@@H](CN(C)CCCN3CCC4=CC(OC)=C(OC)C=C4CC3=O)CC2=C1
DB08820 CC(C)(C)C1=CC(=C(O)C=C1NC(=O)C1=CNC2=CC=CC=C2C1=O)C(C)(C)C
DB08815 [H][C@@]12[C@H]3CC[C@H](C3)[C@]1([H])C(=O)N(C[C@@H]1CCCC[C@H]1CN1CCN(CC1)C1=NSC3=CC=CC=C13)C2=O
DB09143 [H][C@]1(C)CN(C[C@@]([H])(C)O1)C1=CC=C(NC(=O)C2=CC=CC(=C2C)C2=CC=C(OC(F)(F)F)C=C2)C=N1
DB06237 COC1=C(Cl)C=C(CNC2=C(C=NC(=N2)N2CCC[C@H]2CO)C(=O)NCC2=NC=CC=N2)C=C1
DB01166 O=C1CCC2=C(N1)C=CC(OCCCCC1=NN=NN1C1CCCCC1)=C2
DB00813 CCC(=O)N(C1CCN(CCC2=CC=CC=C2)CC1)C1=CC=CC=C1

To answer first on how to install a specific version of RDKit, you can run this command:
conda install -c rdkit rdkit=2020.09.2
Coming to the original question, the error comes from the function:
FingerprintMols.FingerprintMol()
For whatever internal reasons, it converts the first 10 SMILES to 2048-length vectors but the 11th SMILES to a 1024-length vector. The older versions are able to handle this mismatch, but the newer versions can't. There are two options to fix this:
1. Downgrade RDKit to an older version using the command mentioned above.
2. Fix the length of the vector by passing it as an argument. Basically, replace the line
   FingerprintMols.FingerprintMol(row['mol'])
   with
   FingerprintMols.FingerprintMol(row['mol'], minPath=1, maxPath=7, fpSize=2048,
                                  bitsPerHash=2, useHs=True, tgtDensity=0.0,
                                  minSize=128)
In the replacement, all arguments other than fpSize are set to their default values and fpSize is fixed to 2048. Please note that you must pass all the arguments, not just fpSize.

Just to extend mnis's answer: since FingerprintMol defaults to the RDKFingerprint, you may find it easier to use it directly, as it is much more flexible, and you will not have to supply all the arguments. Tested on version 2021.03.3:
Chem.RDKFingerprint(row['mol'], fpSize=2048)
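Putting the two answers together, here is a minimal sketch (not from either answer) of the fingerprinting and similarity step with a fixed fingerprint length, reusing df_fps and the loop structure from the question:

from rdkit import Chem, DataStructs

# Fixed-length fingerprints, so BulkTanimotoSimilarity always receives bit vectors of equal size
fps_2 = [Chem.RDKFingerprint(m, fpSize=2048) for m in df_fps['mol']]

qu, ta, sim = [], [], []
for n in range(len(fps_2)):
    s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n + 1:])
    for m, value in enumerate(s):
        qu.append(df_fps['ID_Name'].iloc[n])
        ta.append(df_fps['ID_Name'].iloc[n + 1 + m])
        sim.append(value)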

Related

Python consumes excessive memory, doesn't complete run even given adequate memory

I am writing code for an information retrieval project, which reads Wikipedia pages in XML format from a file, processes the string (I've omitted this part for the sake of simplicity), tokenizes the strings and builds positional indexes for the terms found on the pages. It then saves the indexes to a file using pickle once, and reads them from that file on subsequent runs to save processing time (I've included the code for those parts, but it's commented out).
After that, I need to fill a 1572 * ~97000 matrix (1572 is the number of Wiki pages, and 97000 is the number of terms found in them). Each Wiki page is a vector of words, and vectors[i][j] is the number of occurrences of the j'th word of the word set in the i'th Wiki page. (Again it's been simplified, but it doesn't matter.)
The problem is that it takes way too much memory to run the code, and even then, from a point somewhere between the 350th and 400th row of the matrix onwards, it doesn't proceed (it doesn't stop either). I thought the problem was with memory, because when its usage exceeded my 7.7GiB RAM and 1.7GiB swap, it stopped and printed:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
But when I added 6GiB of memory by making a swap file for Python 3.7 (using the script recommended here), the program didn't run out of memory but instead got stuck, as I said, at a point between the 350th and 400th iteration of i in the loop at the bottom, with 7.7GiB RAM + 3.9GiB swap occupied. Besides Ubuntu 18.04, I tried it on Windows 10, where the screen simply went black, and on Windows 7, again to no avail.
Next I thought it was a PyCharm issue, so I ran the python file using the python3 file.py command, and it got stuck at the very same point it had with PyCharm. I even used the numpy.float16 datatype to save memory, but it had no effect. I asked a colleague about their matrix dimensions; they were similar to mine, and they weren't having problems with it. Is it malware or a memory leak? Or am I doing something wrong here?
import pickle
from hazm import *
from collections import defaultdict
import numpy as np
'''For each word there's one of these. it stores the word's frequency, and the positions it has occurred in on each wiki page'''
class Positional_Index:
    def __init__(self):
        self.freq = 0
        self.title = defaultdict(list)
        self.text = defaultdict(list)
'''Here I tokenize words and construct indexes for them'''
# tree = ET.parse('Wiki.xml')
# root = tree.getroot()
# index_dict = defaultdict(Positional_Index)
# all_content = root.findall('{http://www.mediawiki.org/xml/export-0.10/}page')
#
# for page_index, pg in enumerate(all_content):
#     title = pg.find('{http://www.mediawiki.org/xml/export-0.10/}title').text
#     txt = pg.find('{http://www.mediawiki.org/xml/export-0.10/}revision') \
#         .find('{http://www.mediawiki.org/xml/export-0.10/}text').text
#
#     title_arr = word_tokenize(title)
#     txt_arr = word_tokenize(txt)
#
#     for term_index, term in enumerate(title_arr):
#         index_dict[term].freq += 1
#         index_dict[term].title[page_index] += [term_index]
#
#     for term_index, term in enumerate(txt_arr):
#         index_dict[term].freq += 1
#         index_dict[term].text[page_index] += [term_index]
#
# with open('texts/indices.txt', 'wb') as f:
#     pickle.dump(index_dict, f)
with open('texts/indices.txt', 'rb') as file:
    data = pickle.load(file)
'''Here I'm trying to keep the number of occurrences of each word on each page'''
page_count = 1572
vectors = np.array([[0 for j in range(len(data.keys()))] for i in range(page_count)], dtype=np.float16)
words = list(data.keys())
word_count = len(words)
const_log_of_d = np.log10(1572)
""" :( """
for i in range(page_count):
    for j in range(word_count):
        vectors[i][j] = (len(data[words[j]].title[i]) + len(data[words[j]].text[i]))
    if i % 50 == 0:
        print("i:", i)
Update: I tried this on a friend's computer; this time it killed the process somewhere between the 1350th and 1400th iteration.
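As a rough check on the arithmetic (not part of the original post): a 1572 × 97000 float16 array is only about 1572 × 97000 × 2 bytes ≈ 0.3 GiB, so the final array itself is not the problem; the temporary nested Python list used to build it (about 152 million list slots) and the unpickled index both add to that. A minimal sketch of allocating the array directly, assuming the same page_count and the data dict from the question:

import numpy as np

page_count = 1572
word_count = len(data.keys())  # ~97000 in the question

# Allocate the matrix in one go instead of materialising a huge nested Python list first
vectors = np.zeros((page_count, word_count), dtype=np.float16)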

Push data to google sheet from dataframe

I'm trying to push data into my Google Sheet with the following code. How can I change the code so that it writes to the 2nd row in the correct column, based on the header that I've created?
First code:
class Header:
    def __init__(self):
        self.No_DOB_Y = 1
        self.No_DOB_M = 2
        self.No_DOB_D = 3
        self.Paid_too_much_little = 4
        self.No_number_of_ins = 5
        self.No_gender = 6
        self.No_first_login = 7
        self.No_last_login = 8
        self.Too_young_old = 9

    def __repr__(self):
        return str(self.__dict__)

    def add_col(self, name):
        setattr(self, name, max(anomali_header.__dict__.values()) + 1)

anomali_header = Header()
2nd part of code (NEW):
# No_gender
a = list(df.loc[df['gender'].isnull()]['id'])
#print(a)
cells = sh3.range(1, 1, len(a), 1)
for i, cell in enumerate(cells):
    cell.value = a[i]
sh3.update_cells(cells)
At the moment it updates starting from cell A1...
This is what I essentially want to do.
As you can see, the code writes the results into the first available cell, which is A1. I essentially want them to appear below my anomali_header column "No_gender", but I'm not sure how to link the 1st part of the code to the 2nd part of the code...
Thanks to v25, the code below works, but rather than going through the checks one by one, I wanted to create a loop which goes through all of them.
I'm trying to run the code below, but it seems I get an error when I use the loop.
Error:
TypeError: 'list' object cannot be interpreted as an integer
Code:
# No_DOB_Y
a = list(df.loc[df['Year'].isnull()]['id'])
# No number of ins
b = list(df.loc[df['number of ins'].isnull()]['id'])
# No_gender
c = list(df.loc[df['gender'].isnull()]['id'])

# Updating anomalies to sheet
condition = [a, b, c]
column = [1, 2, 3]
for j in range(column, condition):
    cells = sh3.range(2, column, len(condition)+1, column)
    for i, cell in enumerate(cells):
        cell.value = condition[i]
    print('end of check')
    sh3.update_cells(cells)
You need to change the range() parameters:
first_row (int) – Row number
first_col (int) – Column number
last_row (int) – Row number
last_col (int) – Column number
So something like:
cells=sh3.range(2, 6, len(a)+1, 6)
Or you could issue the range as a string:
cells=sh3.range('F2:F' + str(len(a)+1))
These numbers may not be perfect, but this should change the positioning. You might need to tweak the digits slightly ;)
UPDATE:
I've encountered an error using a loop, updated my original post
TypeError: 'list' object cannot be interpreted as an integer
This is happening because the function range which you use in the for loop (not to be confused with sh3.range, which is a different function altogether) expects integers, but you're passing it lists.
However, a simpler way to implement this would be to create a list of tuples which map the strings to column integers, then loop based on this. Something like:
col_map = [('Year', 1),
           ('number of ins', 5),
           ('gender', 6)]

for col_tup in col_map:
    df_list = list(df.loc[df[col_tup[0]].isnull()]['id'])
    cells = sh3.range(2, col_tup[1], len(df_list)+1, col_tup[1])
    for i, cell in enumerate(cells):
        cell.value = df_list[i]
    sh3.update_cells(cells)
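Since the column numbers in col_map happen to mirror the attributes of the Header class from the first block (No_DOB_Y=1, No_number_of_ins=5, No_gender=6), the same loop can also be driven from that object. A small sketch under that assumption (hypothetical wiring, reusing anomali_header, df and sh3 from the question):

# Map the dataframe columns to check against the sheet columns stored on the Header object
checks = [('Year', anomali_header.No_DOB_Y),
          ('number of ins', anomali_header.No_number_of_ins),
          ('gender', anomali_header.No_gender)]

for col_name, sheet_col in checks:
    ids = list(df.loc[df[col_name].isnull()]['id'])
    if not ids:
        continue  # nothing to write for this check
    cells = sh3.range(2, sheet_col, len(ids) + 1, sheet_col)
    for cell, value in zip(cells, ids):
        cell.value = value
    sh3.update_cells(cells)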

XLRD: Successfully extracted 2 lists out of 2 sheets, but list comparison won't work

OK, so I have two xlsx sheets, and both sheets have, in their second column at index 1, a list of SIM card numbers. After extracting that data using xlrd, I have successfully printed the contents of both columns into my PowerShell terminal as two lists, along with the quantity of elements in each list.
The first sheet (theirSheet) has 454 entries, the second (ourSheet) has 361. I need to find the 93 that don't exist in the second sheet and put them into (unpaidSims). I could do this manually of course, but I would like to automate this task for the future when I inevitably need to do it again, so I am trying to write this Python script.
Considering Python agrees that I have a list of 454 entries and a list of 361 entries, I thought I just needed to figure out a list comparison. I researched that on Stack Overflow and tried 3 times with 3 different solutions, but each time, when I use that script to produce the third list (unpaidSims), it says 454... meaning it hasn't removed the entries that are duplicated in the smaller list. Please advise.
from os.path import join, dirname, abspath
import xlrd
theirBookFileName = join(dirname(dirname(abspath(__file__))), 'pycel', 'theirBook.xlsx')
ourBookFileName = join(dirname(dirname(abspath(__file__))), 'pycel', 'ourBook.xlsx')
theirBook = xlrd.open_workbook(theirBookFileName)
ourBook = xlrd.open_workbook(ourBookFileName)
theirSheet = theirBook.sheet_by_index(0)
ourSheet = ourBook.sheet_by_index(0)
theirSimColumn = theirSheet.col(1)
ourSimColumn = ourSheet.col(1)
numColsTheirSheet = theirSheet.ncols
numRowsTheirSheet = theirSheet.nrows
numColsOurSheet = ourSheet.ncols
numRowsOurSheet = ourSheet.nrows
# First Attempt at the comparison, but fails and returns 454 entries from the bigger list
unpaidSims = [d for d in theirSimColumn if d not in ourSimColumn]
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
print "\nWe are expecting 93 entries in this new list"
# Second Attempt at the comparison, but fails and returns 454 entries from the bigger list
s = set(ourSimColumn)
unpaidSims = [x for x in theirSimColumn if x not in s]
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
# Third Attempt at the comparison, but fails and returns 454 entries from the bigger list
unpaidSims = tuple(set(theirSimColumn) - set(ourSimColumn))
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
According to the xlrd Documentation, the col method returns "a sequence of the Cell objects in the given column".
It doesn't mention anything about comparison of Cell objects. Digging into the source, it appears that they didn't code any comparison methods into the class. As such, the Python documentation states that the objects will be compared by "object identity". In other words, the comparison will be False unless they are the exact same instance of the Cell class, even if the values they contain are identical.
You need to compare the values of the Cells instead. For example:
unpaidSims = set(sim.value for sim in theirSimColumn) - set(sim.value for sim in ourSimColumn)
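A slightly fuller sketch along the same lines (an illustration only; it assumes the first row of each column is data rather than a header and keeps the original sheet variables):

# Compare the cell values, not the Cell objects themselves
theirSims = set(cell.value for cell in theirSheet.col(1))
ourSims = set(cell.value for cell in ourSheet.col(1))

unpaidSims = theirSims - ourSims
print(len(unpaidSims))   # expecting 93
print(unpaidSims)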

Instructables open source code: Python IndexError: list index out of range

I've seen this error on several other questions but couldn't find the answer.
I'm a complete stranger to Python, but I'm following the instructions from a site and I keep getting this error once I try to run the script:
IndexError: list index out of range
Here's the script:
##//txt to stl conversion - 3d printable record
##//by Amanda Ghassaei
##//Dec 2012
##//http://www.instructables.com/id/3D-Printed-Record/
##
##/*
## * This program is free software; you can redistribute it and/or modify
## * it under the terms of the GNU General Public License as published by
## * the Free Software Foundation; either version 3 of the License, or
## * (at your option) any later version.
##*/
import wave
import math
import struct
bitDepth = 8#target bitDepth
frate = 44100#target frame rate
fileName = "bill.wav"#file to be imported (change this)
#read file and get data
w = wave.open(fileName, 'r')
numframes = w.getnframes()
frame = w.readframes(numframes)#w.getnframes()
frameInt = map(ord, list(frame))#turn into array
#separate left and right channels and merge bytes
frameOneChannel = [0]*numframes#initialize list of one channel of wave
for i in range(numframes):
    frameOneChannel[i] = frameInt[4*i+1]*2**8+frameInt[4*i]#separate channels and store one channel in new list
    if frameOneChannel[i] > 2**15:
        frameOneChannel[i] = (frameOneChannel[i]-2**16)
    elif frameOneChannel[i] == 2**15:
        frameOneChannel[i] = 0
    else:
        frameOneChannel[i] = frameOneChannel[i]
#convert to string
audioStr = ''
for i in range(numframes):
    audioStr += str(frameOneChannel[i])
    audioStr += ","#separate elements with comma
fileName = fileName[:-3]#remove .wav extension
text_file = open(fileName+"txt", "w")
text_file.write("%s"%audioStr)
text_file.close()
Thanks a lot,
Leart
Leart - check these, it may help:
1. Is your input file in the correct format? As I see it, you need to produce that file beforehand before you can use it in this program... Post that file in here as well.
2. Check if your bitrate and frame rates are correct.
3. Just for debugging purposes (if the code is correct, this may not produce correct results, but it is good for testing): you are accessing frameInt[4*i+1], with index i multiplied by 4 and then adding 1 (eventually going beyond the end of frameInt).
Add an 'if' to check the size before accessing the array element in frameInt:
if len(frameInt) > (4*i+1):
Add that statement right after the first occurrence of "for i in range(numframes):" and just before "frameOneChannel[i] = frameInt[4*i+1]*2**8+frameInt[4*i]#separate channels and store one channel in new list"
*watch tab spaces
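Putting point 3 into the original loop (a sketch only, keeping the script's variable names; the 4*i indexing assumes a 16-bit stereo WAV, i.e. 4 bytes per frame):

for i in range(numframes):
    # guard suggested above: only read bytes that actually exist in frameInt
    if len(frameInt) > 4*i + 1:
        frameOneChannel[i] = frameInt[4*i+1]*2**8 + frameInt[4*i]
        if frameOneChannel[i] > 2**15:
            frameOneChannel[i] = frameOneChannel[i] - 2**16
        elif frameOneChannel[i] == 2**15:
            frameOneChannel[i] = 0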

Optimize python file comparison script

I have written a script which works, but I'm guessing it isn't the most efficient. What I need to do is the following:
Compare two csv files that contain user information. It's essentially a member list where one file is a more updated version of the other.
The files contain data such as ID, name, status, etc, etc
Write to a third csv file ONLY the records in the new file that either don't exist in the older file, or contain updated information. For each record, there is a unique ID that allows me to determine if a record is new or previously existed.
Here is the code I have written so far:
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)
old = []
new = []
for row in fOld:
    old.append(row)
for row in fNew:
    new.append(row)

output = []
x = len(new)
i = 0
num = 0
while i < x:
    if new[num] not in old:
        fNewUpdate.writerow(new[num])
    num += 1
    i += 1
fileAin.close()
fileBin.close()
fileCout.close()
In terms of functionality, this script works. However I'm trying to run this on files that contain hundreds of thousands of records and it's taking hours to complete. I am guessing the problem lies with reading both files to lists and treating the entire row of data as a single string for comparison.
My question is: for what I am trying to do, is there a faster, more efficient way to process the two files to create the third file containing only new and updated records? I don't really have a target time; I mostly want to understand whether there are better ways in Python to process these files.
Thanks in advance for any help.
UPDATE to include sample row of data:
123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A
How about something like this? One of the biggest inefficiencies of your code is checking whether new[num] is in old every time: because old is a list, you have to iterate through the entire list for every check. Using a dictionary is much, much faster.
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)
old = {row[0]:row[1:] for row in fOld}
new = {row[0]:row[1:] for row in fNew}
fileAin.close()
fileBin.close()
output = {}
for row_id in new:
    if row_id not in old or not old[row_id] == new[row_id]:
        output[row_id] = new[row_id]

for row_id in output:
    fNewUpdate.writerow([row_id] + output[row_id])

fileCout.close()
difflib is quite efficient: http://docs.python.org/library/difflib.html
Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:
http://en.wikipedia.org/wiki/Merge_sort
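A minimal sketch of that merge-style pass (not from the original answer; it assumes both CSVs are already sorted consistently by the unique ID in column 0, and compares IDs as strings):

import csv

def new_and_updated(old_path, new_path, out_path):
    # Both input files are assumed to be sorted the same way by the unique ID in column 0
    with open(old_path, newline='') as f_old, \
         open(new_path, newline='') as f_new, \
         open(out_path, 'w', newline='') as f_out:
        old_rows = csv.reader(f_old)
        new_rows = csv.reader(f_new)
        writer = csv.writer(f_out)

        old_row = next(old_rows, None)
        for new_row in new_rows:
            # advance past old rows whose ID sorts before the current new ID
            while old_row is not None and old_row[0] < new_row[0]:
                old_row = next(old_rows, None)
            # keep the row if its ID is absent from the old file, or present but changed
            if old_row is None or old_row[0] != new_row[0] or old_row != new_row:
                writer.writerow(new_row)

This streams both files once instead of holding them in memory, which matters at hundreds of thousands of rows.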
