summary: I am searching for misspellings across a bunch of data and it is taking forever
I am iterating through a few CSV files (a million lines total?), and in each I am iterating through a JSON sub-value that has maybe 200 strings to search for. For each loop over the JSON value, I add a column to each dataframe, then use a lambda function with the Levenshtein distance algorithm to find misspellings. I then output every row that contains a potential misspelling.
code:
for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # 50k lines each-ish
    for v in json_data.values():  # 30 ish json values
        for row in v["json_search_string"]:  # 200 ish substrings
            df_temp = df
            df_temp['jelly'] = row
            df_temp['difference'] = df_temp.apply(lambda x: jellyfish.levenshtein_distance(x['search column'], x['jelly']), axis=1)
            df_agg = df_temp[df_temp['difference'] < 3]
            if os.path.isfile(filepath + "levenshtein.csv"):
                with open(filepath + "levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filtered_filepath + "levenshtein.csv")
I've tried the same algorithm before, but just to keep it short, instead of iterating through all JSON values for each CSV, I just did a single JSON value like this:
for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # 50k lines each-ish
    for row in data['z']['json_search_string']:
        # levenshtein algorithm above
The above loop took about 100 minutes to run! (Edit: it takes about 1-3 seconds for the lambda function to run each time.) And there are about 30 of them in the JSON file. Any ideas on how I can condense the algorithm and make it faster? I've thought maybe I could take all 200-ish JSON substrings and add each as a column to each df, then somehow run a lambda function that searches all columns at once, but I am not sure how to do that yet. This way I would only iterate over the 20 files 30 times each, as opposed to the however-many-thousand iterations that the third-layer for loop adds. Thoughts?
Notes:
Here is an example of what the data might look like:
JSON data
{
    "A": {
        "email": "blah",
        "name": "Joe Blah",
        "json_search_string": [
            "Company A",
            "Some random company name",
            "Company B",
            "etc",
            "..."
        ]
    },
    ...
}
And the csv columns:
ID, Search Column, Other Columns
1, Clompany A, XYZ
2, Company A, XYZ
3, Some misspelled company, XYZ
etc
Well, it is really hard to answer a performance-enhancement question.
Depending on the effort you can invest and the performance you need, here are some suggestions.
Small tweak by re-arranging your code logic. Effort: small. Expected enhancement: small. Going through your code, I see that you are comparing words from the CSV files (20+ of them) against a fixed JSON file (only one). Instead of iterating over the JSON data for each CSV file, why not first prepare the fixed word list from the JSON data once, and use it for all following comparisons? The logic is like:
# prepare fixed words from JSON DATA
fixed_words = []
for v in json_data.values():
    fixed_words += v["json_search_string"]

# loop over each file and compare it against the fixed word list
for f in file_list:
    # do the comparison and save.
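A minimal sketch of what the comparison-and-save step could then look like; this is only an illustration that assumes the column name, the output path and the distance threshold of 3 from the question, and keeps jellyfish for the distance itself:
import os

import jellyfish
import pandas as pd

# fixed_words prepared once from json_data, as above
for f in file_list:
    df = pd.read_csv(f, usecols=["search column"])
    for word in fixed_words:
        mask = df["search column"].apply(
            lambda s: jellyfish.levenshtein_distance(s, word) < 3)
        df_agg = df[mask]
        if not df_agg.empty:
            out = filepath + "levenshtein.csv"
            df_agg.to_csv(out, mode="a", header=not os.path.isfile(out))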
Using multiprocessing. Effort: small. Expected enhancement: medium. Since all your work is similar, why not try multiprocessing? You could apply multiprocessing per file OR inside dataframe.apply. There are lots of resources on multiprocessing; please have a look. It is easy to implement for your case.
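For example, a hedged per-file sketch with multiprocessing.Pool; it assumes fixed_words and file_list are defined at module level as in suggestion 1 and reuses the column name from the question:
from multiprocessing import Pool

import jellyfish
import pandas as pd

def match_file(filename):
    # return the rows whose 'search column' is within distance 2 of any fixed word
    df = pd.read_csv(filename, usecols=["search column"])
    keep = df["search column"].apply(
        lambda s: any(jellyfish.levenshtein_distance(s, w) < 3 for w in fixed_words))
    return df[keep]

if __name__ == "__main__":
    with Pool() as pool:  # one worker per CPU core by default
        frames = pool.map(match_file, file_list)
    pd.concat(frames).to_csv("levenshtein.csv", index=False)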
Using another language to implement the Levenshtein distance. The bottleneck of your code is the computation of the Levenshtein distance. You used the jellyfish python package, which is pure Python (of course, performance is not good for a large data set). Here are some other options:
a. An existing python package with a C/C++ implementation. Effort: small. Expected enhancement: high. Thanks to the comment from @Corley Brigman, editdistance is one option you can use (see the sketch after this list).
b. Self-implementation with Cython. Effort: medium. Expected enhancement: medium or high. Check the pandas documentation on enhancing performance.
c. Self-implementation in C/C++ behind a wrapper. Effort: high. Expected enhancement: high. Check the Python documentation on wrapping with C/C++.
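For option a, the change is essentially a one-line swap. A hedged sketch, reusing the column names from the question and assuming editdistance has been installed via pip:
import editdistance  # C++-backed edit-distance package

# same apply call as before, with only the distance function swapped
df['difference'] = df.apply(
    lambda x: editdistance.eval(x['search column'], x['jelly']), axis=1)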
You could combine several of these suggestions to gain even higher performance.
Hope this helps.
You could change your code to:
for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # 50k lines each-ish
    x_search = df['search column']
    for v in json_data.values():  # 30 ish json values
        for row in v["json_search_string"]:  # 200 ish substrings
            mask = [jellyfish.levenshtein_distance(s, row) < 3 for s in x_search]
            df_agg = df[mask]
            if os.path.isfile(filepath + "levenshtein.csv"):
                with open(filepath + "levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filtered_filepath + "levenshtein.csv")
apply returns a copy of a Series, which can be more expensive:
a = range(10**4)
b = range(10**4,2*(10**4))
%timeit [ (x*y) <3 for x,y in zip(a,b)]
%timeit pd.DataFrame([a,b]).apply(lambda x: x[0]*x[1] < 3 )
1000 loops, best of 3: 1.23 ms per loop
1 loop, best of 3: 668 ms per loop
This seems like it should be very simple, but I am not sure of the proper syntax in Python. To streamline my code I want a while loop (or a for loop, if better) to cycle through 9 datasets, using a counter to pick out the correct file on each pass.
I would like to use the "i" variable within the loop so that, for each file with a sequential name, I can get the average of 2 arrays, the max-min of this delta, and the max-min of another array.
Below is example code of what I am trying to do, but the avg(i) and the temp(i) calls in the loop do not seem proper. Thank you very much for any help; I will continue to look for solutions but am unsure how best to phrase this to search for them.
temp1 = pd.read_excel("/content/113VW.xlsx")
temp2 = pd.read_excel("/content/113W6.xlsx")
..-> temp9
i = 1
while i <= 9:
    avg(i) = np.mean(np.array([temp(i)['CC_H='], temp(i)['CC_V=']]), axis=0)
    Delta(i) = (np.max(avg(i))) - (np.min(avg(i)))
    deltaT(i) = (np.max(temp(i)['temperature=']) - np.min(temp(i)['temperature=']))
    i += 1
E.g. the slow method would be to repeat this code for each file:
avg1 =np.mean(np.array([temp1['CC_H='],temp1['CC_V=']]),axis=0)
Delta1=(np.max(avg1))-(np.min(avg1))
deltaT1=(np.max(temp1['temperature='])-np.min(temp1['temperature=']))
avg2 =np.mean(np.array([temp2['CC_H='],temp2['CC_V=']]),axis=0)
Delta2=(np.max(avg2))-(np.min(avg2))
deltaT2=(np.max(temp2['temperature='])-np.min(temp2['temperature=']))
......
Think of things in terms of lists.
temps = []
for name in ('113VW', '113W6', ...):
    temps.append(pd.read_excel(f"/content/{name}.xlsx"))

avg = []
Delta = []
deltaT = []
for data in temps:
    avg.append(np.mean(np.array([data['CC_H='], data['CC_V=']]), axis=0))
    Delta.append(np.max(avg[-1]) - np.min(avg[-1]))
    deltaT.append(np.max(data['temperature=']) - np.min(data['temperature=']))
You could just do your computations inside the first loop, if you don't need the dataframes after that point.
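For instance, a minimal single-loop sketch (assuming the column names and file paths from the question):
import numpy as np
import pandas as pd

avg, Delta, deltaT = [], [], []
for name in ('113VW', '113W6'):  # extend with the remaining seven file names
    data = pd.read_excel(f"/content/{name}.xlsx")
    a = np.mean(np.array([data['CC_H='], data['CC_V=']]), axis=0)
    avg.append(a)
    Delta.append(np.max(a) - np.min(a))
    deltaT.append(np.max(data['temperature=']) - np.min(data['temperature=']))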
The way that I would tackle this problem would be to create a list of filenames, and then iterate through them to do the necessary calculations as per the following:
import pandas as pd
import numpy as np

# Place the files to read into this list
files_to_read = ["/content/113VW.xlsx", "/content/113W6.xlsx"]

results = []
for i, filename in enumerate(files_to_read):
    temp = pd.read_excel(filename)
    avg_val = np.mean(np.array([temp['CC_H='], temp['CC_V=']]), axis=0)
    Delta = (np.max(avg_val)) - (np.min(avg_val))
    deltaT = (np.max(temp['temperature=']) - np.min(temp['temperature=']))
    results.append({"avg": avg_val, "Delta": Delta, "deltaT": deltaT})

# Create a dataframe to show the results
df = pd.DataFrame(results)
print(df)
I have included the enumerate feature to grab the index (or i) should you want to access it for anything, or include it in the results. For example, you could change the results.append line to something like this:
results.append({"index":i, "Filename":filename, "avg":avg_val, "Delta":Delta, "deltaT":deltaT})
Not sure if I understood the question correctly. But if you want to read the files inside a loop using indexes (i variable), you can create a list to hold the contents of the excel files instead of using 9 different variables.
something like
files = []
files.append(pd.read_excel("/content/113VW.xlsx"))
files.append(pd.read_excel("/content/113W6.xlsx"))
...
then use the index variable to iterate over the list
i = 0
while i < 9:
    avg = np.mean(np.array([files[i]['CC_H='], files[i]['CC_V=']]), axis=0)
    ...
    i += 1
P.S.: I am not a Pandas/NumPy expert, so you may have to adapt the code to your needs
I have a dataframe that has a small number of columns but many rows (about 900K right now, and it's going to get bigger as I collect more data). It looks like this:
   Author             Title            Date        Category            Text                                                url
0  Amira Charfeddine  Wild Fadhila 01  2019-01-01  novel               الكتاب هذا نهديه لكل تونسي حس إلي الكتاب يحكي ...   NaN
1  Amira Charfeddine  Wild Fadhila 02  2019-01-01  novel               في التزغريت، والعياط و الزمامر، ليوم نتيجة الب...   NaN
2  253826             1515368_7636953  2010-12-28  /forums/forums/91/  هذا ما ينص عليه إدوستور التونسي لا رئاسة مدى ا...   https://www.tunisia-sat.com/forums/threads/151...
3  250442             1504416_7580403  2010-12-21  /forums/sports/     \n\n\n\n\n\nاعلنت الجامعة التونسية لكرة اليد ا...  https://www.tunisia-sat.com/forums/threads/150...
4  312628             1504416_7580433  2010-12-21  /forums/sports/     quel est le résultat final\n,,,,????                https://www.tunisia-sat.com/forums/threads/150...
The "Text" Column has a string of text that may be just a few words (in the case of a forum post) or it may a portion of a novel and have tens of thousands of words (as in the two first rows above).
I have code that constructs the dataframe from various corpus files (.txt and .json), then cleans the text and saves the cleaned dataframe as a pickle file.
I'm trying to run the following code to analyze how variable the spelling of different words is in the corpus. The functions seem simple enough: one counts the occurrences of a particular spelling variant in each Text row; the other takes a list of such frequencies and computes a Gini coefficient for each lemma (which is just a numerical measure of how heterogeneous the spelling is). It references a spelling_var dictionary that has a lemma as its key and the various ways of spelling that lemma as values (like {'color': ['color', 'colour']}, except not in English).
This code works, but it uses a lot of CPU time. I'm not sure how much, but I use PythonAnywhere for my coding and this code sends me into the tarpit (in other words, it makes me exceed my daily allowance of CPU seconds).
Is there a way to do this so that it's less CPU intensive? Preferably without me having to learn another package (I've spent the past several weeks learning pandas and am liking it, and need to just get on with my analysis). Once I have the code and have finished collecting the corpus, I'll only run it a few times; I won't be running it every day or anything (in case that matters).
Here's the code:
import pickle
import pandas as pd
import re

with open('1_raw_df.pkl', 'rb') as pickle_file:
    df = pickle.load(pickle_file)

spelling_var = {
    'illi': ["الي", "اللي"],
    'besh': ["باش", "بش"],
    ...
}

spelling_df = df.copy()

def count_word(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern)

def compute_gini(freq_list):
    proportions = [f/sum(freq_list) for f in freq_list]
    squared = [p**2 for p in proportions]
    return 1 - sum(squared)

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
    gini = compute_gini(count_list)
    spelling_df[w] = gini
I rewrote two lines in the last double loop; see the comments in the code below. Does this solve your issue?
gini_lst = []

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
    # gini = compute_gini(count_list)  # don't think you need to compute this at every iteration of the inner loop, right?
    # spelling_df[w] = gini            # having this inside of the loop creates a new column at each iteration, which could crash your CPU
    gini_lst.append(compute_gini(count_list))

# this creates a df with a row for each lemma with its associated gini value
df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})
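If you later need the value for a single lemma, a quick lookup on the new frame could look like this (column names as in the snippet above):
# Gini value computed for the lemma 'illi'
print(df_lemma_gini.loc[df_lemma_gini["lemma_column"] == "illi", "gini_column"])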
I have a CSV file and the data pattern looks like this:
I am importing it from the CSV file. In the input data there are some whitespaces, which I am handling with the pattern below. For the output, I want to write a function that takes this file as input and prints the lowest and highest blood pressure. It should also return the average of all mean values. One constraint: I should not use pandas.
I wrote the code block below.
import re

bloods = open(bloodfilename).read().split("\n")
blood_pressure = bloods[4].split(",")[1]
pattern = r"\s*(\d+)\s*\[\s*(\d+)-(\d+)\s*\]"
re.findall(pattern, blood_pressure)

# now extract mean, min and max information from the blood_pressure of each patient and write a new file called blood_pressure_modified.csv
pattern = r"\s*(\d+)\s*\[\s*(\d+)-(\d+)\s*\]"
outputfilename = "blood_pressure_modified.csv"

# create a writeable file
outputfile = open(outputfilename, "w")

for blood in bloods:
    patient_id, blood_pressure = blood.strip().split(",")
    mean = re.findall(pattern, blood_pressure)[0]
    blood_pressure_modified = re.sub(pattern, "", blood_pressure)
    print(patient_id, blood_pressure_modified, mean, sep=",", file=outputfile)

outputfile.close()
The output should look like this:
This is a very simple kind of answer to this. No regex, pandas or anything.
Let me know if this works. I can try making it handle any case where it doesn't.
bloods=open("bloodfilename.csv").read().split("\n")
means = []
'''
Also, rather than having two list of mins and maxs,
we can have just one and taking min and max from this
list later would do the same trick. But just for clarity I kept both.
'''
mins = []
maxs = []
for val in bloods[1:]: #assuming first line is header of the csv
mean, ranges = val.split(',')[1].split('[')
means.append(int(mean.strip()))
first, second = ranges.split(']')[0].split('-')
mins.append(int(first.strip()))
maxs.append(int(second.strip()))
print(f'the lowest and the highest blood pressure are: {min(mins)} {max(maxs)} respectively\naverage of mean values is {sum(means)/len(means)}')
You can also create small helper functions for the string-stripping steps; that's usually a better way to structure the code. I wrote this in a bit of a hurry, so bear with me.
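For example, one possible helper for the splitting, assuming the mean[min-max] layout shown above; purely a sketch:
def parse_pressure(field):
    # split a value like '135 [113-166]' into integer (mean, low, high)
    mean_part, range_part = field.split('[')
    low, high = range_part.split(']')[0].split('-')
    return int(mean_part.strip()), int(low.strip()), int(high.strip())

# usage inside the loop above:
# mean, low, high = parse_pressure(val.split(',')[1])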
Maybe this could help with your question,
Suppose you have a CSV file like this, and want to extract only the min and max values,
SN Number
1 135[113-166]
2 140[110-155]
3 132[108-180]
4 40[130-178]
5 133[118-160]
Then,
import pandas as pd

df = pd.read_csv("REPLACE_WITH_YOUR_FILE_NAME.csv")
results = df['YOUR_NUMBER_COLUMN'].apply(lambda x: x.split("[")[1].strip("]").split("-"))

with open("results.csv", mode="w") as f:
    f.write("MIN," + "MAX")
    f.write("\n")
    for i in results:
        f.write(str(i[0]) + "," + str(i[1]))
        f.write("\n")
After you run the snippet without any errors, there should be a file named results.csv in your current working directory. Open it up and you will have the results.
I need to iterate through two files many million times,
counting the number of appearances of word pairs throughout the files.
(in order to build contingency table of two words to calculate Fisher's Exact Test score)
I'm currently using
from itertools import izip

src = tuple(open('src.txt', 'r'))
tgt = tuple(open('tgt.txt', 'r'))

w1count = 0
w2count = 0
w1 = 'someword'
w2 = 'anotherword'

for x, y in izip(src, tgt):
    if w1 in x:
        w1count += 1
    if w2 in y:
        w2count += 1
    .....
While this is not bad, I want to know if there is any faster way to iterate through two files, hopefully significantly faster.
I appreciate your help in advance.
I still don't quite get what exactly you are trying to do, but here's some example code that might point you in the right direction.
We can use a dictionary or a collections.Counter instance to count all occurring words and pairs in a single pass through the files. After that, we only need to query the in-memory data.
import collections
import itertools
import re

def find_words(line):
    for match in re.finditer(r"\w+", line):
        yield match.group().lower()

counts1 = collections.Counter()
counts2 = collections.Counter()
counts_pairs = collections.Counter()

with open("src.txt") as f1, open("tgt.txt") as f2:
    for line1, line2 in itertools.izip(f1, f2):
        words1 = list(find_words(line1))
        words2 = list(find_words(line2))
        counts1.update(words1)
        counts2.update(words2)
        counts_pairs.update(itertools.product(words1, words2))

print counts1["someword"]
print counts1["anotherword"]
print counts_pairs["someword", "anotherword"]
In general if your data is small enough to fit into memory then your best bet is to:
Pre-process data into memory
Iterate from memory structures
If the files are large, you may be able to pre-process them into data structures (such as your zipped data) and save them in a format such as pickle, which is much faster to load and work with; a separate run can then process that.
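A hedged sketch of that idea, reusing the file names from the question and assuming both files fit in memory; later runs load the cached structure instead of re-reading the text files:
import pickle

# pre-process once: pair up the lines of the two files and cache the result
with open('src.txt') as f1, open('tgt.txt') as f2:
    pairs = list(zip(f1.readlines(), f2.readlines()))

with open('pairs.pkl', 'wb') as out:
    pickle.dump(pairs, out, pickle.HIGHEST_PROTOCOL)

# on subsequent runs, load the cached pairs and iterate from memory
with open('pairs.pkl', 'rb') as inp:
    pairs = pickle.load(inp)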
Just as an out-of-the-box idea:
Have you tried turning the files into Pandas data frames? I.e. I assume you already make a word list out of the input (by removing punctuation such as . and ,) using input.split(' ') or something similar. You can then turn that into DataFrames, perform a word count, and then do a cartesian join:
import pandas as pd
df_1 = pd.DataFrame(src, columns=['word_1'])
df_1['count_1'] = 1
df_1 = df_1.groupby(['word_1']).sum()
df_1 = df_1.reset_index()
df_2 = pd.DataFrame(trg, columns=['word_2'])
df_2['count_2'] = 1
df_2 = df_2.groupby(['word_2']).sum()
df_2 = df_2.reset_index()
df_1['link'] = 1
df_2['link'] = 1
result_df = pd.merge(left=df_1, right=df_2, left_on='link', right_on='link')
del result_df['link']
I use stuff like this for basket analysis, works really well.
I have an interesting problem.
I have a very large (larger than 300MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except for a few new lines have been added and/or removed and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTables. Any ideas on how to solve this problem would be greatly appreciated.
Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matters. It's not that big. Two 300Mb files easily fit into memory.
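A minimal difflib sketch, assuming both files have already been sorted; the file names here are placeholders:
import difflib

with open('old_sorted.csv') as f1, open('new_sorted.csv') as f2:
    old_lines = f1.readlines()
    new_lines = f2.readlines()

with open('differences.txt', 'w') as out:
    # unified_diff yields only the changed lines plus a little context
    for line in difflib.unified_diff(old_lines, new_lines, fromfile='old', tofile='new'):
        out.write(line)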
This is a little bit of a naive implementation but will deal with unsorted data:
import csv

file1_dict = {}
file2_dict = {}

with open('file1.csv') as handle:
    for row in csv.reader(handle):
        file1_dict[tuple(row[:2])] = tuple(row[2:])

with open('file2.csv') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = tuple(row[2:])

with open('outfile.csv', 'w') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.iteritems():
        if key in file2_dict:
            # deal with keys that are in both
            if file2_dict[key] == val:
                writer.writerow(key + val + ('Same',))
            else:
                writer.writerow(key + file2_dict[key] + ('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key + val + ('Removed',))
    # deal with added keys!
    for key, val in file2_dict.iteritems():
        writer.writerow(key + val + ('Added',))
You probably won't be able to "drop in" this solution, but it should get you ~95% of the way there. @S.Lott is right: two 300 MB files will easily fit in memory ... if your files get into the 1-2 GB range then this may have to be modified with the assumption of sorted data.
Something like this is close ... although you may have to change the comparisons around for the Added and Modified cases to make sense:
# assuming both files are sorted by columns 1 and 2
import csv
import datetime
from itertools import imap

def str2date(date_str):
    return datetime.date(*map(int, date_str.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv') as handle1:
    with open('file2.csv') as handle2:
        with open('outfile.csv', 'w') as outhandle:
            writer = csv.writer(outhandle)
            gen1 = imap(convert_tups, csv.reader(handle1))
            gen2 = imap(convert_tups, csv.reader(handle2))
            gen2key, gen2val = gen2.next()
            for gen1key, gen1val in gen1:
                if gen1key == gen2key and gen1val == gen2val:
                    writer.writerow(gen1key + gen1val + ('Same',))
                    gen2key, gen2val = gen2.next()
                elif gen1key == gen2key and gen1val != gen2val:
                    writer.writerow(gen2key + gen2val + ('Modified',))
                    gen2key, gen2val = gen2.next()
                elif gen1key > gen2key:
                    while gen1key > gen2key:
                        writer.writerow(gen2key + gen2val + ('Added',))
                        gen2key, gen2val = gen2.next()
                else:
                    writer.writerow(gen1key + gen1val + ('Removed',))