Python: fast iteration through file

I need to iterate through two files many millions of times,
counting the number of appearances of word pairs throughout the files
(in order to build a contingency table for two words and calculate a Fisher's exact test score).
I'm currently using
from itertools import izip

src = tuple(open('src.txt', 'r'))
tgt = tuple(open('tgt.txt', 'r'))
w1count = 0
w2count = 0
w1 = 'someword'
w2 = 'anotherword'
for x, y in izip(src, tgt):
    if w1 in x:
        w1count += 1
    if w2 in y:
        w2count += 1
.....
While this is not bad, I want to know if there is any faster way to iterate through two files, hopefully significantly faster.
I appreciate your help in advance.

I still don't quite get what exactly you are trying to do, but here's some example code that might point you in the right direction.
We can use a dictionary or a collections.Counter instance to count all occurring words and pairs in a single pass through the files. After that, we only need to query the in-memory data.
import collections
import itertools
import re

def find_words(line):
    for match in re.finditer(r"\w+", line):
        yield match.group().lower()

counts1 = collections.Counter()
counts2 = collections.Counter()
counts_pairs = collections.Counter()

with open("src.txt") as f1, open("tgt.txt") as f2:
    for line1, line2 in itertools.izip(f1, f2):
        words1 = list(find_words(line1))
        words2 = list(find_words(line2))
        counts1.update(words1)
        counts2.update(words2)
        counts_pairs.update(itertools.product(words1, words2))

print counts1["someword"]
print counts1["anotherword"]
print counts_pairs["someword", "anotherword"]

In general, if your data is small enough to fit into memory, then your best bet is to:
Pre-process the data into memory
Iterate over the in-memory structures
If the files are large, you may be able to pre-process them into data structures (such as your zipped data), save those to a format such as pickle that is much faster to load and work with, and then process that in a separate step.
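A rough sketch of the pickle idea, using the same src.txt / tgt.txt files as above (the pairs.pkl file name is just an illustration): pair the lines up once, save the result, and reload it cheaply in later runs.
import pickle
from itertools import izip

with open('src.txt') as f1, open('tgt.txt') as f2:
    pairs = list(izip(f1, f2))          # one (src_line, tgt_line) tuple per line pair

with open('pairs.pkl', 'wb') as out:
    pickle.dump(pairs, out, pickle.HIGHEST_PROTOCOL)

# later, or in a separate script:
with open('pairs.pkl', 'rb') as inp:
    pairs = pickle.load(inp)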

Just as an out-of-the-box idea:
Have you tried turning the files into Pandas DataFrames? I.e. I assume you already build a word list out of the input (by removing punctuation such as . and ,) using input.split(' ') or something similar. You can then turn that into DataFrames, perform a word count and then do a cartesian join:
import pandas as pd

# word counts for the source file
df_1 = pd.DataFrame(src, columns=['word_1'])
df_1['count_1'] = 1
df_1 = df_1.groupby(['word_1']).sum()
df_1 = df_1.reset_index()

# word counts for the target file
df_2 = pd.DataFrame(tgt, columns=['word_2'])
df_2['count_2'] = 1
df_2 = df_2.groupby(['word_2']).sum()
df_2 = df_2.reset_index()

# cartesian join via a constant key
df_1['link'] = 1
df_2['link'] = 1
result_df = pd.merge(left=df_1, right=df_2, left_on='link', right_on='link')
del result_df['link']
I use stuff like this for basket analysis; it works really well.

Related

Python: Use the "i" counter in while loop as digit for expressions

This seems like it should be very simple, but I am not sure of the proper syntax in Python. To streamline my code I want a while loop (or a for loop, if better) to cycle through 9 datasets, using the counter to call out the correct file each time.
I would like to use the i variable within the loop so that for each sequentially named file I can get the average of 2 arrays, the max-min of this delta, and the max-min of another array.
Example code of what I am trying to do is below, but the avg(i) and temp(i) calls in the loop do not seem proper. Thank you very much for any help; I will continue to look for solutions but am unsure how best to phrase this for a search.
temp1 = pd.read_excel("/content/113VW.xlsx")
temp2 = pd.read_excel("/content/113W6.xlsx")
..-> temp9

i = 1
while i <= 9:
    avg(i) = np.mean(np.array([temp(i)['CC_H='], temp(i)['CC_V=']]), axis=0)
    Delta(i) = (np.max(avg(i))) - (np.min(avg(i)))
    deltaT(i) = (np.max(temp(i)['temperature=']) - np.min(temp(i)['temperature=']))
    i += 1
E.g. the slow method would be repeating this code for each file:
avg1 = np.mean(np.array([temp1['CC_H='], temp1['CC_V=']]), axis=0)
Delta1 = (np.max(avg1)) - (np.min(avg1))
deltaT1 = (np.max(temp1['temperature=']) - np.min(temp1['temperature=']))

avg2 = np.mean(np.array([temp2['CC_H='], temp2['CC_V=']]), axis=0)
Delta2 = (np.max(avg2)) - (np.min(avg2))
deltaT2 = (np.max(temp2['temperature=']) - np.min(temp2['temperature=']))
......
Think of things in terms of lists.
import pandas as pd
import numpy as np

temps = []
for name in ('113VW', '113W6', ...):   # list all nine file names here
    temps.append(pd.read_excel(f"/content/{name}.xlsx"))

avg = []
Delta = []
deltaT = []
for data in temps:
    avg.append(np.mean(np.array([data['CC_H='], data['CC_V=']]), axis=0))
    Delta.append(np.max(avg[-1]) - np.min(avg[-1]))
    deltaT.append(np.max(data['temperature=']) - np.min(data['temperature=']))
You could just do your computations inside the first loop, if you don't need the dataframes after that point.
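For instance, a minimal sketch of folding the computations into the reading loop (only the first two file names from the question are shown):
import numpy as np
import pandas as pd

avg, Delta, deltaT = [], [], []
for name in ('113VW', '113W6'):   # add the remaining seven names here
    data = pd.read_excel(f"/content/{name}.xlsx")
    a = np.mean(np.array([data['CC_H='], data['CC_V=']]), axis=0)
    avg.append(a)
    Delta.append(np.max(a) - np.min(a))
    deltaT.append(np.max(data['temperature=']) - np.min(data['temperature=']))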
The way that I would tackle this problem would be to create a list of filenames, and then iterate through them to do the necessary calculations as per the following:
import pandas as pd
import numpy as np

# Place the files to read into this list
files_to_read = ["/content/113VW.xlsx", "/content/113W6.xlsx"]

results = []
for i, filename in enumerate(files_to_read):
    temp = pd.read_excel(filename)
    avg_val = np.mean(np.array([temp['CC_H='], temp['CC_V=']]), axis=0)
    Delta = np.max(avg_val) - np.min(avg_val)
    deltaT = np.max(temp['temperature=']) - np.min(temp['temperature='])
    results.append({"avg": avg_val, "Delta": Delta, "deltaT": deltaT})

# Create a dataframe to show the results
df = pd.DataFrame(results)
print(df)
I have included enumerate to grab the index (or i) should you want to access it for anything or include it in the results. For example, you could change the results.append line to something like this:
results.append({"index":i, "Filename":filename, "avg":avg_val, "Delta":Delta, "deltaT":deltaT})
Not sure if I understood the question correctly, but if you want to read the files inside a loop using indexes (the i variable), you can create a list to hold the contents of the Excel files instead of using 9 different variables.
Something like:
files = []
files.append(pd.read_excel("/content/113VW.xlsx"))
files.append(pd.read_excel("/content/113W6.xlsx"))
...
then use the index variable to iterate over the list
i = 0
while i < len(files):          # list indices run from 0 to 8
    avg_i = np.mean(np.array([files[i]['CC_H='], files[i]['CC_V=']]), axis=0)
    ...
    i += 1
P.S.: I am not a Pandas/NumPy expert, so you may have to adapt the code to your needs

How to compare strings more efficiently when using fuzzywuzzy?

I have a CSV file with ~20,000 words and I'd like to group the words by similarity. To complete such a task, I am using the fantastic fuzzywuzzy package, which seems to work really well and achieves exactly what I am looking for with a small dataset (~100 words).
The words are actually brand names; this is a sample output from the small dataset I just mentioned, where I get the similar brands grouped by name:
[
    ('asos-design', 'asos'),
    ('m-and-s', 'm-and-s-collection'),
    ('polo-ralph-lauren', 'ralph-lauren'),
    ('hugo-boss', 'boss'),
    ('yves-saint-laurent', 'saint-laurent')
]
Now, my problem is that if I run my current code on the full dataset it is really slow, and I don't know how to improve the performance, or how to do it without using two for loops.
This is my code.
import csv
from fuzzywuzzy import fuzz

THRESHOLD = 90

possible_matches = []

with open('words.csv', encoding='utf-8') as csvfile:
    words = []
    reader = csv.reader(csvfile)
    for row in reader:
        word, x, y, *rest = row
        words.append(word)

for i in range(len(words)-1):
    for j in range(i+1, len(words)):
        if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD:
            possible_matches.append((words[i], words[j]))
    print(i)

print(possible_matches)
How can I improve the performance?
For 20,000 words, or brands, any approach that compares each word to each other word, i.e. has quadratic complexity O(n²), may be too slow. For 20,000 it may still be barely acceptable, but for any larger data set it will quickly break down.
Instead, you could try to extract some "feature" from your words and group them accordingly. My first idea was to use a stemmer, but since your words are names rather than real words, this will not work. I don't know how representative your sample data is, but you could try to group the words according to their components separated by -, then get the unique non-trivial groups, and you are done.
words = ['asos-design', 'asos', 'm-and-s', 'm-and-s-collection',
         'polo-ralph-lauren', 'ralph-lauren', 'hugo-boss', 'boss',
         'yves-saint-laurent', 'saint-laurent']

from collections import defaultdict

parts = defaultdict(list)
for word in words:
    for part in word.split("-"):
        parts[part].append(word)

result = set(tuple(group) for group in parts.values() if len(group) > 1)
Result:
{('asos-design', 'asos'),
('hugo-boss', 'boss'),
('m-and-s', 'm-and-s-collection'),
('polo-ralph-lauren', 'ralph-lauren'),
('yves-saint-laurent', 'saint-laurent')}
You might also want to filter out some stop words first, like and, or keep those together with the words around them. This will probably still yield some false positives, e.g. with words like polo or collection that appear with several different brands, but I assume the same is true for using fuzzywuzzy or similar. A bit of post-processing and manual filtering of the groups may be in order.
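For example, a possible refinement of the grouping above, assuming a hand-picked stop-word set (the exact words are an assumption you would tune for your data):
STOP_PARTS = {"and", "the", "of"}   # assumption: adjust for your data

parts = defaultdict(list)
for word in words:
    for part in word.split("-"):
        if part not in STOP_PARTS:
            parts[part].append(word)

result = set(tuple(group) for group in parts.values() if len(group) > 1)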
Try using a list comprehension instead; it is faster than the list.append() method:
with open('words.csv', encoding='utf-8') as csvfile:
    words = [row[0] for row in csv.reader(csvfile)]

possible_matches = [(words[i], words[j])
                    for i in range(len(words)-1)
                    for j in range(i+1, len(words))
                    if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD]

print(possible_matches)
Unfortunately, this way you can't do a print(i) in each iteration, but assuming you only needed print(i) for debugging, it won't affect your final result.
Converting a loop into a list comprehension is extremely easy; consider a loop like this:
for i in iterable_1:
    lst.append(something)
The list comprehension becomes:
lst = [something for i in iterable_1]
For nested loops and conditions, just follow the same logic:
for i in iterable_1:
    for j in iterable_2:
        ...
            if some_condition:
                lst.append(something)

# becomes
lst = [something for i in iterable_1 for j in iterable_2 ... if some_condition]

# Or if you have an else clause:
for i in iterable_1:
    ...
    if some_condition:
        lst.append(something)
    else:
        lst.append(something_else)

lst = [something if some_condition else something_else for i in iterable_1 for j in iterable_2 ...]

pandas algorithm slow: for loops and lambda

Summary: I am searching for misspellings across a bunch of data and it is taking forever.
I am iterating through a few CSV files (a million lines total?), and in each I am iterating through a JSON sub-value that has maybe 200 strings to search for. For each loop over the JSON values, I add a column to each dataframe, then use a lambda function with Levenshtein's distance algorithm to find misspellings. I then output any row that contains a potential misspelling.
code:
for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # 50k lines each-ish
    for v in json_data.values():  # 30 ish json values
        for row in v["json_search_string"]:  # 200 ish substrings
            df_temp = df
            df_temp['jelly'] = row
            df_temp['difference'] = df_temp.apply(
                lambda x: jellyfish.levenshtein_distance(x['search column'], x['jelly']), axis=1)
            df_agg = df_temp[df_temp['difference'] < 3]
            if os.path.isfile(filepath + "levenshtein.csv"):
                with open(filepath + "levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filtered_filepath + "levenshtein.csv")
I've tried the same algorithm before, but to keep it short, instead of iterating through all JSON values for each CSV, I just did a single JSON value, like this:
for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # 50k lines each-ish
    for row in data['z']['json_search_string']:
        # levenshtein algorithm above
The above loop took about 100 minutes to run through! (Edit: it takes about 1-3 seconds for the lambda function to run each time.) And there are about 30 of them in the JSON file. Any ideas on how I can condense the algorithm and make it faster? I've thought maybe I could take all 200-ish JSON substrings and add each as a column to each df and somehow run a lambda function that searches all columns at once, but I am not sure how to do that yet. This way I would only iterate over the 20 files 30 times each, as opposed to the many thousands of iterations that the third layer of for loop adds on. Thoughts?
Notes:
Here is an example of what the data might look like:
JSON data
{
    "A": {
        "email": "blah",
        "name": "Joe Blah",
        "json_search_string": [
            "Company A",
            "Some random company name",
            "Company B",
            "etc",
            "..."
And the csv columns:
ID, Search Column, Other Columns
1, Clompany A, XYZ
2, Company A, XYZ
3, Some misspelled company, XYZ
etc
Well, it is really hard to answer a performance enhancement question.
Depending on the effort and expected gain, here are some suggestions.
Small tweaking by re-arranging your code logic. Effort: small. Expected enhancement: small. Going through your code, I see that you are comparing words from each file (20 of them) against one fixed JSON file. Instead of reading the JSON file for each CSV file, why not first prepare the fixed word list from the JSON file and use it for all the following comparisons? The logic is like:
# prepare the fixed words from the JSON data
fixed_words = []
for v in json_data.values():
    fixed_words += v["json_search_string"]

# loop over each file and compare it against the fixed words
for f in file_list:
    # do the comparison and save
Using multiprocessing. Effort: small. Expected enhancement: medium. Since all your tasks are similar, why not try multiprocessing? You could apply multiprocessing per file OR when doing dataframe.apply. There are lots of resources on multiprocessing; please have a look. It is easy to implement for your case; a rough sketch follows.
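For example, a rough sketch of per-file multiprocessing, assuming a process_file(path) helper that wraps the per-file work from the question (the helper name is hypothetical):
from multiprocessing import Pool

def process_file(path):
    # hypothetical wrapper: read the CSV, run the Levenshtein comparisons,
    # and return (or write) the matching rows for this one file
    return path

if __name__ == "__main__":
    pool = Pool()                                # one worker per CPU core by default
    results = pool.map(process_file, file_list)  # file_list: the CSV paths from the question
    pool.close()
    pool.join()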
Using another language to implement the Levenshtein distance. The bottleneck of your code is the computation of the Levenshtein distance. You used the jellyfish Python package, which is a pure Python implementation (of course, performance is not good for a large set). Here are some other options:
a. An existing Python package with a C/C++ implementation. Effort: small. Expected enhancement: high. Thanks to the comment from @Corley Brigman, editdistance is one option you can use (see the short comparison after this list).
b. Self-implementation in Cython. Effort: medium. Enhancement: medium or high. Check the pandas documentation on enhancing performance.
c. Self-implementation in C/C++ as a wrapper. Effort: high. Expected enhancement: high. Check wrapping with C/C++.
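For option (a), a tiny sanity check, assuming both packages are installed; editdistance.eval(a, b) can serve as a drop-in for the distance call in the lambda:
import editdistance
import jellyfish

# both report a distance of 1 for this single-character difference
print(jellyfish.levenshtein_distance("Clompany A", "Company A"))
print(editdistance.eval("Clompany A", "Company A"))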
You could use several of these suggestions together to gain higher performance.
Hope this is helpful.
You could change your code to:
for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # 50k lines each-ish
    x_search = df['search column']
    for v in json_data.values():  # 30 ish json values
        for row in v["json_search_string"]:  # 200 ish substrings
            # build a boolean mask with a plain list comprehension instead of apply
            mask = [jellyfish.levenshtein_distance(s, row) < 3 for s in x_search]
            df_agg = df[mask]
            if os.path.isfile(filepath + "levenshtein.csv"):
                with open(filepath + "levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filtered_filepath + "levenshtein.csv")
apply returns a copy of a Series, which can be more expensive:
a = range(10**4)
b = range(10**4,2*(10**4))
%timeit [ (x*y) <3 for x,y in zip(a,b)]
%timeit pd.DataFrame([a,b]).apply(lambda x: x[0]*x[1] < 3 )
1000 loops, best of 3: 1.23 ms per loop
1 loop, best of 3: 668 ms per loop

Optimize python file comparison script

I have written a script which works, but I'm guessing isn't the most efficient. What I need to do is the following:
Compare two csv files that contain user information. It's essentially a member list where one file is a more updated version of the other.
The files contain data such as ID, name, status, etc, etc
Write to a third csv file ONLY the records in the new file that either don't exist in the older file, or contain updated information. For each record, there is a unique ID that allows me to determine if a record is new or previously existed.
Here is the code I have written so far:
import csv

fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)

old = []
new = []
for row in fOld:
    old.append(row)
for row in fNew:
    new.append(row)

output = []
x = len(new)
i = 0
num = 0
while i < x:
    if new[num] not in old:
        fNewUpdate.writerow(new[num])
    num += 1
    i += 1

fileAin.close()
fileBin.close()
fileCout.close()
In terms of functionality, this script works. However, I'm trying to run it on files that contain hundreds of thousands of records and it's taking hours to complete. I am guessing the problem lies in reading both files into lists and treating each entire row of data as a single string for comparison.
My question is: for what I am trying to do, is there a faster, more efficient way to process the two files and create the third file containing only new and updated records? I don't really have a target time; I mostly want to understand whether there are better ways in Python to process these files.
Thanks in advance for any help.
UPDATE to include sample row of data:
123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A
How about something like this? One of the biggest inefficiencies of your code is checking whether new[num] is in old every time because old is a list so you have to iterate through the entire list. Using a dictionary is much much faster.
import csv

fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)

# map each unique ID (first column) to the rest of its row
old = {row[0]: row[1:] for row in fOld}
new = {row[0]: row[1:] for row in fNew}
fileAin.close()
fileBin.close()

output = {}
for row_id in new:
    if row_id not in old or not old[row_id] == new[row_id]:
        output[row_id] = new[row_id]

for row_id in output:
    fNewUpdate.writerow([row_id] + output[row_id])
fileCout.close()
difflib is quite efficient: http://docs.python.org/library/difflib.html
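A small sketch of how that could look for the two CSVs (ndiff marks lines unique to the second sequence with a '+ ' prefix):
import difflib

with open('old.csv') as f_old, open('new.csv') as f_new:
    old_lines = f_old.readlines()
    new_lines = f_new.readlines()

# lines that only appear in new.csv (added or changed records)
added_or_changed = [line[2:] for line in difflib.ndiff(old_lines, new_lines)
                    if line.startswith('+ ')]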
Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:
http://en.wikipedia.org/wiki/Merge_sort
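A sketch of that merge-style pass, assuming both row lists are already sorted by a unique ID in the first column (the column position is an assumption):
def new_and_updated(old_rows, new_rows):
    result = []
    i = j = 0
    while j < len(new_rows):
        if i >= len(old_rows) or new_rows[j][0] < old_rows[i][0]:
            result.append(new_rows[j])      # ID exists only in the new file
            j += 1
        elif old_rows[i][0] < new_rows[j][0]:
            i += 1                          # ID exists only in the old file: skip it
        else:
            if old_rows[i] != new_rows[j]:
                result.append(new_rows[j])  # same ID but the record changed
            i += 1
            j += 1
    return result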

Random List of millions of elements in Python Efficiently

I have read that this answer is potentially the best way to randomize a list of strings in Python. I'm just wondering whether that's the most efficient way to do it, because I have a list of about 30 million elements built via the following code:
import json
from sets import Set
from random import shuffle

a = []
for i in range(0, 193):
    json_data = open("C:/Twitter/user/user_" + str(i) + ".json")
    data = json.load(json_data)
    for j in range(0, len(data)):
        a.append(data[j]['su'])

new = list(Set(a))
print "Cleaned length is: " + str(len(new))

## Take Cleaned List and Randomize it for Analysis
shuffle(new)
If there is a more efficient way to do it, I'd greatly appreciate any advice on how to do it.
Thanks,
A couple of possible suggestions:
import json
from random import shuffle

a = set()
for i in range(193):
    with open("C:/Twitter/user/user_{0}.json".format(i)) as json_data:
        data = json.load(json_data)
        a.update(d['su'] for d in data)

print("Cleaned length is {0}".format(len(a)))

# Take Cleaned List and Randomize it for Analysis
new = list(a)
shuffle(new)
the only way to know if this is faster is to profile it!
do you prefer sets.Set to the built-in set() for a reason?
I have introduced a with clause (preferred way of opening files, as it guarantees they get closed)
it did not appear that you were doing anything with 'a' as a list except converting it to a set; why not make it a set from the start?
rather than iterating on an index and then doing a lookup with it, I just iterate over the data items...
which makes it easily rewritable as a generator expression
If you think you're going to shuffle, you're probably better off using the solution from this thread. For realz.
randomly mix lines of 3 million-line file
Basically the shuffle algorithm has a very low period (meaning it can't hit all the possible orderings of 3 million lines, let alone 30 million elements). If you can load the data in memory then your best bet is, as they say, to assign a random number to each line and sort that badboy.
See this thread. And here, I did it for you so you didn't mess anything up (that's a joke):
import json
import random
from operator import itemgetter

a = set()
for i in range(0, 193):
    json_data = open("C:/Twitter/user/user_" + str(i) + ".json")
    data = json.load(json_data)
    a.update(d['su'] for d in data)

print "Cleaned length is: " + str(len(a))

# assign a random sort key to each element, sort, then strip the keys off
new = [(random.random(), el) for el in a]
new.sort()
new = map(itemgetter(1), new)
I don't know if it will be any faster but you could try numpy's shuffle.
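A minimal sketch of that, assuming new is the deduplicated list built above (np.random.shuffle shuffles in place along the first axis):
import numpy as np

arr = np.array(new)      # convert the list of strings to a NumPy array
np.random.shuffle(arr)   # in-place shuffle
new = arr.tolist()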
