How to compare strings more efficiently when using fuzzywuzzy? - python

I have a CSV file with ~20000 words and I'd like to group the words by similarity. For this task I am using the fantastic fuzzywuzzy package, which seems to work really well and achieves exactly what I am looking for with a small dataset (~100 words).
The words are actually brand names. This is a sample output from the small dataset that I just mentioned, where I get the similar brands grouped by name:
[
    ('asos-design', 'asos'),
    ('m-and-s', 'm-and-s-collection'),
    ('polo-ralph-lauren', 'ralph-lauren'),
    ('hugo-boss', 'boss'),
    ('yves-saint-laurent', 'saint-laurent')
]
My problem is that running my current code on the full dataset is really slow, and I don't know how to improve the performance or how to do it without two nested for loops.
This is my code.
import csv
from fuzzywuzzy import fuzz

THRESHOLD = 90
possible_matches = []

with open('words.csv', encoding='utf-8') as csvfile:
    words = []
    reader = csv.reader(csvfile)
    for row in reader:
        word, x, y, *rest = row
        words.append(word)

for i in range(len(words)-1):
    for j in range(i+1, len(words)):
        if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD:
            possible_matches.append((words[i], words[j]))
    print(i)

print(possible_matches)
How can I improve the performance?

Any approach that compares each word (or brand) to every other word, i.e. has quadratic complexity O(n²), may be too slow. For 20,000 words it may still be barely acceptable, but for any larger data set it will quickly break down.
Instead, you could try to extract some "feature" from your words and group them accordingly. My first idea was to use a stemmer, but since your words are names rather than real words, this will not work. I don't know how representative your sample data is, but you could try to group the words according to their components separated by -, then get the unique non-trivial groups, and you are done.
words = ['asos-design', 'asos', 'm-and-s', 'm-and-s-collection',
         'polo-ralph-lauren', 'ralph-lauren', 'hugo-boss', 'boss',
         'yves-saint-laurent', 'saint-laurent']

from collections import defaultdict

parts = defaultdict(list)
for word in words:
    for part in word.split("-"):
        parts[part].append(word)

result = set(tuple(group) for group in parts.values() if len(group) > 1)
Result:
{('asos-design', 'asos'),
('hugo-boss', 'boss'),
('m-and-s', 'm-and-s-collection'),
('polo-ralph-lauren', 'ralph-lauren'),
('yves-saint-laurent', 'saint-laurent')}
You might also want to filter out some stop words first, like and, or keep those together with the words around them. This will probably still yield some false-positives, e.g. with words like polo or collection that may appear with several different brands, but I assume that the same is true for using fuzzywuzzy or similar. A bit of post-processing and manual filtering of the groups may be in order.
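For example, a minimal sketch of the stop-word idea (the contents of STOP_WORDS are just an assumption for illustration; tune them to your data):
from collections import defaultdict

STOP_WORDS = {"and", "s"}          # hypothetical stop components

words = ['asos-design', 'asos', 'm-and-s', 'm-and-s-collection',
         'polo-ralph-lauren', 'ralph-lauren', 'hugo-boss', 'boss',
         'yves-saint-laurent', 'saint-laurent']

parts = defaultdict(list)
for word in words:
    for part in word.split("-"):
        if part in STOP_WORDS:     # skip trivial components
            continue
        parts[part].append(word)

result = set(tuple(group) for group in parts.values() if len(group) > 1)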

Try using a list comprehension instead; it is faster than repeatedly calling the list.append() method:
with open('words.csv', encoding='utf-8') as csvfile:
    words = [row[0] for row in csv.reader(csvfile)]

possible_matches = [(words[i], words[j])
                    for i in range(len(words)-1)
                    for j in range(i+1, len(words))
                    if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD]

print(possible_matches)
Unfortunately, this way you can't do a print(i) in each iteration, but assuming you only needed print(i) for debugging, it won't affect your final result.
Converting a loop into a list comprehension is straightforward. Say you have a loop like this:
for i in iterable_1:
    lst.append(something)
The list comprehension becomes:
lst = [something for i in iterable_1]
For nested loops and conditions, just follow the same logic:
for i in iterable_1:
    for j in iterable_2:
        ...
            if some_condition:
                lst.append(something)

# becomes

lst = [something for i in iterable_1 for j in iterable_2 ... if some_condition]

# Or if you have an else clause:
for i in iterable_1:
    ...
    if some_condition:
        lst.append(something)
    else:
        lst.append(something_else)

lst = [something if some_condition else something_else for i in iterable_1 for j in iterable_2 ...]
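As a concrete, made-up illustration of the nested pattern (the word list and the first-letter condition are just placeholders):
words = ["boss", "hugo-boss", "asos", "asos-design"]

# loop version
pairs = []
for i in range(len(words) - 1):
    for j in range(i + 1, len(words)):
        if words[i][0] == words[j][0]:
            pairs.append((words[i], words[j]))

# equivalent list comprehension
pairs = [(words[i], words[j])
         for i in range(len(words) - 1)
         for j in range(i + 1, len(words))
         if words[i][0] == words[j][0]]

print(pairs)   # [('asos', 'asos-design')]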

Related

Parse list of strings for speed

Background
I have a function called get_player_path that takes in a list of strings player_file_list and an int value total_players. For the sake of example I have reduced the list of strings and also set the int value to a very small number.
Each string in player_file_list has the form year-date/player_id/some_random_file.file_extension or
year-date/player_id/IDATs/some_random_number/some_random_file.file_extension
Issue
What I am essentially trying to achieve here is to go through this list and store every unique year-date/player_id path in a set until its length reaches the value of total_players.
My current approach does not seem the most efficient to me and I am wondering if I can speed up my function get_player_path in any way.
Code
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_file = player_file.split("/")
        file_path = f"{player_file[0]}/{player_file[1]}/"
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)


player_file_list = [
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
]

print(get_player_path(player_file_list, 2))
Output
['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']
Let's analyze your function first:
your loop should take linear time (O(n)) in the length of the input list, assuming the path lengths are bounded by a relatively "small" number;
the sorting takes O(n log(n)) comparisons.
Thus the sorting has the dominant cost when the list becomes big. You can micro-optimize your loop as much as you want, but as long as you keep that sorting at the end, your effort won't make much of a difference with big lists.
Your approach is fine if you're just writing a Python script. If you really needed performance with huge lists, you would probably be using some other language. Nonetheless, if you really care about performance (or just want to learn new stuff), you could try one of the following approaches:
replace the generic sorting algorithm with something specific for strings; see here for example
use a trie, removing the need for sorting; this could be theoretically better but probably worse in practice (see the sketch after this list)
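A rough dict-based trie sketch of that idea (the sample list is shortened; the per-node iteration still sorts a handful of child characters, but there is no comparison sort over whole strings):
def trie_insert(root, s):
    node = root
    for ch in s:
        node = node.setdefault(ch, {})
    node[None] = True                       # mark the end of a stored string

def trie_walk(node, prefix=""):
    if None in node:                        # a complete string ends here
        yield prefix
    for ch in sorted(k for k in node if k is not None):
        for s in trie_walk(node[ch], prefix + ch):
            yield s

sample = ["2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
          "2020-10-27/31001804320549/31001804320549.json"]

root = {}
for player_file in sample:
    date, player_id, _rest = player_file.split("/", 2)
    trie_insert(root, date + "/" + player_id + "/")

print(list(trie_walk(root)))
# ['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']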
Just for completeness, as a micro-optimization, assuming the date has a fixed length of 10 characters:
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        end = player_file.find('/', 12)   # <--- len(date) + len('/') + 1
        file_path = player_file[:end]     # <---
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
If the IDs have fixed length too, as in your example list, then you don't need any split or find, just:
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # 1 is for the slash between date and id
...
for player_file in player_file_list:
    file_path = player_file[:LENGTH]
    ...
EDIT: fixed the LENGTH initialization, I had forgotten to add 1
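Put together, a self-contained version of that variant could look like this (DATE_LENGTH = 10 comes from the question; ID_LENGTH = 14 is an assumption based on the sample ids):
DATE_LENGTH = 10
ID_LENGTH = 14
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # 1 is for the slash between date and id

def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_files_to_process.add(player_file[:LENGTH])  # fixed-width prefix slice
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)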
I'll leave this solution here, which can be further improved; hope it helps.
player_file_list = (
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)

def get_player_path(l, n):
    pfl = set()
    for i in l:
        i = "/".join(i.split("/")[0:2])
        if i not in pfl:
            pfl.add(i)
            if len(pfl) == n:
                return pfl
    if n > len(pfl):
        print("not enough matches")
        return

print(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}
Use a dict so that you don't have to sort: dicts preserve insertion order (as of Python 3.7), and your list is already sorted. If you still need to sort, you can always wrap the return value in sorted(). Add import re and replace your function as follows:
def get_player_path(player_file_list, total_players):
    dct = {re.search(r'^\w+-\w+-\w+/\w+', pf).group(): 1 for pf in player_file_list}
    return [k for i, k in enumerate(dct.keys()) if i < total_players]

Python: fast iteration through file

I need to iterate through two files many millions of times,
counting the number of appearances of word pairs throughout the files
(in order to build a contingency table of two words to calculate a Fisher's Exact Test score).
I'm currently using
from itertools import izip

src = tuple(open('src.txt', 'r'))
tgt = tuple(open('tgt.txt', 'r'))

w1count = 0
w2count = 0
w1 = 'someword'
w2 = 'anotherword'

for x, y in izip(src, tgt):
    if w1 in x:
        w1count += 1
    if w2 in y:
        w2count += 1
    .....
While this is not bad, I want to know if there is any faster way to iterate through two files, hopefully significantly faster.
I appreciate your help in advance.
I still don't quite get what exactly you are trying to do, but here's some example code that might point you in the right direction.
We can use a dictionary or a collections.Counter instance to count all occurring words and pairs in a single pass through the files. After that, we only need to query the in-memory data.
import collections
import itertools
import re

def find_words(line):
    for match in re.finditer(r"\w+", line):
        yield match.group().lower()

counts1 = collections.Counter()
counts2 = collections.Counter()
counts_pairs = collections.Counter()

with open("src.txt") as f1, open("tgt.txt") as f2:
    for line1, line2 in itertools.izip(f1, f2):
        words1 = list(find_words(line1))
        words2 = list(find_words(line2))
        counts1.update(words1)
        counts2.update(words2)
        counts_pairs.update(itertools.product(words1, words2))

print counts1["someword"]
print counts1["anotherword"]
print counts_pairs["someword", "anotherword"]
In general if your data is small enough to fit into memory then your best bet is to:
Pre-process data into memory
Iterate from memory structures
If the files are large, you may be able to pre-process them into data structures (such as your zipped data) and save the result in a format such as pickle, which is much faster to load and work with; then, in a separate run, load that file and do the processing.
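For example, a minimal sketch of that two-phase idea, using a Counter over one of the files and pickle as the save format (the file names are just the ones from the question):
import collections
import pickle

# pre-processing run: build the in-memory structure once and save it
counts = collections.Counter()
with open("src.txt") as f:
    for line in f:
        counts.update(line.split())

with open("counts.pickle", "wb") as out:
    pickle.dump(counts, out)

# analysis run: loading the pickle is much faster than re-parsing the text
with open("counts.pickle", "rb") as inp:
    counts = pickle.load(inp)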
Just as an out-of-the-box-thinking solution:
Have you tried turning the files into Pandas data frames? I.e. I assume you already build a word list out of the input (by removing punctuation such as . and ,) using input.split(' ') or something similar. You can then turn that into DataFrames, perform a word count and then make a cartesian join:
import pandas as pd

df_1 = pd.DataFrame(src, columns=['word_1'])
df_1['count_1'] = 1
df_1 = df_1.groupby(['word_1']).sum()
df_1 = df_1.reset_index()

df_2 = pd.DataFrame(tgt, columns=['word_2'])
df_2['count_2'] = 1
df_2 = df_2.groupby(['word_2']).sum()
df_2 = df_2.reset_index()

df_1['link'] = 1
df_2['link'] = 1

result_df = pd.merge(left=df_1, right=df_2, left_on='link', right_on='link')
del result_df['link']
I use stuff like this for basket analysis, works really well.

Append Rows of Different Lengths to the Same Variable

I am trying to append a lengthy list of rows to the same variable. It works great for the first thousand or so iterations in the loop (all of which have the same lengths), but then, near the end of the file, the rows get a bit shorter, and while I still want to append them, I am not sure how to handle it.
The script gives me an out of range error, as expected.
Here is what the part of code in question looks like:
ii = 0
NNCat = []
NNCatelogue = []

while ii <= len(lines):
    NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii],
             nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
    NNCatelogue.append(NNCat)
    ii = ii + 1

print NNCatelogue, ii
Any help on this would be greatly appreciated!
I'll answer the question you didn't ask first ;) : how can this code be more pythonic?
Instead of
ii = 0
NNCat = []
NNCatelogue = []

while ii <= len(lines):
    NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii],
             nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
    NNCatelogue.append(NNCat)
    ii = ii + 1
you should do
NNCat = []
NNCatelogue = []

for ii, line in enumerate(lines):
    NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii],
             nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
    NNCatelogue.append(NNCat)
During each pass ii will be incremented by one for you and line will be the current line.
As for your short lines, you have two choices
Use a special value (such as None) to fill in when you don't have a real value
check the length of nn1, nn2, ..., nn11 to see if they are large enough
The second solution will be much more verbose, hard to maintain, and confusing. I strongly recommend using None (or another special value you create yourself) as a placeholder when there is no data.
def gvop(vals, indx):  # get value or padding
    return vals[indx] if indx < len(vals) else None

NNCatelogue = [(gvop(ev_id, ii), gvop(nn1, ii), gvop(nn2, ii), gvop(nn3, ii), gvop(nn4, ii),
                gvop(nn5, ii), gvop(nn6, ii), gvop(nn7, ii), gvop(nn8, ii), gvop(nn9, ii),
                gvop(nn10, ii), gvop(nn11, ii)) for ii in xrange(0, len(lines))]
By defining this other function to return either the correct value or padding, you can ensure rows are the same length. You can change the padding to anything, if None is not what you want.
Then the list comp creates a list of tuples as before, except containing padding in cases where some of the lines in the input are shorter.
from itertools import izip_longest
NNCatelogue = list(izip_longest(ev_id, nn1, nn2, ... nn11, fillvalue=None))
See here for the documentation of izip_longest. Do yourself a favour and skip the list() around the iterator if you don't need it. In many cases you can use the iterator just as well as the list, and you save a lot of memory, especially with long lists like the ones you're grouping together here.
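A tiny, made-up demonstration of the fill behaviour (the list() here is only so the result can be printed):
from itertools import izip_longest

print list(izip_longest([1, 2, 3], ["a", "b"], fillvalue=None))
# [(1, 'a'), (2, 'b'), (3, None)]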

For data error checking: Is there a way to avoid using a dictionary for a list

I have data that looks like this:
Observation 1
Type : 1
Color: 2
Observation 2
Color: 2
Resolution: 3
Originally what I had done was to attempt to create a csv that looked like:
1,2
2,3 # Only problem here is that the data should look like this 1,2,\n ,2,3 #
I performed the following operation:
while linecache.getline(filename, curline):
    for i in range(2):
        data_manipulated = linecache.getline(filename, curline).rstrip()
        datamanipulated2 = data_manipulated.split(":")
        datamanipulated2.pop(0)
        lines.append(':'.join(datamanipulated2))
This is quite a large dataset, and I tried to find ways to verify that the above problem doesn't happen so that I can compile the data appropriately and with checks. I came across dictionaries; however, performance is a big issue for me and I would prefer lists if that's possible (at least, my understanding is that dictionaries can be significantly slower?). I was just wondering if anyone had any suggestions on the quickest and most robust way to do this.
How about something like:
import re

input_file = open('/path/to/input.file')
results = []

for row in input_file:
    m = re.match(r'Observation (\d+)', row)
    if m:
        observation = m.group(1)
        continue
    m = re.match(r'Color: (\d+)', row)
    if m:
        results.append((observation, m.group(1)))
        print "{0},{1}".format(*results[-1])
You can speed this up by using precompiled regular expressions.
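For example, a sketch of the same loop with the two patterns compiled once up front (same assumptions about the input format and path as above):
import re

OBSERVATION_RE = re.compile(r'Observation (\d+)')
COLOR_RE = re.compile(r'Color: (\d+)')

results = []
with open('/path/to/input.file') as input_file:
    for row in input_file:
        m = OBSERVATION_RE.match(row)
        if m:
            observation = m.group(1)
            continue
        m = COLOR_RE.match(row)
        if m:
            results.append((observation, m.group(1)))
            print "{0},{1}".format(*results[-1])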

Random List of millions of elements in Python Efficiently

I have read that this answer is potentially the best way to randomize a list of strings in Python. I'm just wondering whether that's the most efficient way to do it, because I have a list of about 30 million elements built via the following code:
import json
from sets import Set
from random import shuffle

a = []
for i in range(0, 193):
    json_data = open("C:/Twitter/user/user_" + str(i) + ".json")
    data = json.load(json_data)
    for j in range(0, len(data)):
        a.append(data[j]['su'])

new = list(Set(a))
print "Cleaned length is: " + str(len(new))

## Take Cleaned List and Randomize it for Analysis
shuffle(new)
If there is a more efficient way to do it, I'd greatly appreciate any advice on how to do it.
Thanks,
A couple of possible suggestions:
import json
from random import shuffle

a = set()
for i in range(193):
    with open("C:/Twitter/user/user_{0}.json".format(i)) as json_data:
        data = json.load(json_data)
        a.update(d['su'] for d in data)

print("Cleaned length is {0}".format(len(a)))

# Take Cleaned List and Randomize it for Analysis
new = list(a)
shuffle(new)
the only way to know if this is faster is to profile it!
do you prefer sets.Set to the built-in set() for a reason?
I have introduced a with clause (preferred way of opening files, as it guarantees they get closed)
it did not appear that you were doing anything with 'a' as a list except converting it to a set; why not make it a set from the start?
rather than iterate on an index, then do a lookup on the index, I just iterate on the data items...
which makes it easily rewriteable as a generator expression
If you think you're going to do a shuffle, you're probably better off using the solution from this thread. For realz.
randomly mix lines of 3 million-line file
Basically the random number generator behind shuffle has a limited period, meaning it can't produce all the possible orderings of 3 million lines, let alone 30 million. If you can load the data in memory then your best bet is as they say. Basically assign a random number to each line and sort that badboy.
See this thread. And here, I did it for you so you didn't mess anything up (that's a joke),
import json
import random
from operator import itemgetter

a = set()
for i in range(0, 193):
    json_data = open("C:/Twitter/user/user_" + str(i) + ".json")
    data = json.load(json_data)
    a.update(d['su'] for d in data)

print "Cleaned length is: " + str(len(a))

new = [(random.random(), el) for el in a]
new.sort()
new = map(itemgetter(1), new)
I don't know if it will be any faster but you could try numpy's shuffle.
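For example, a rough sketch with a stand-in list (whether this actually beats random.shuffle on 30 million Python strings is something you would have to profile):
import numpy as np

new = ["user_a", "user_b", "user_c", "user_d"]  # stand-in for the real cleaned list
arr = np.array(new, dtype=object)               # dtype=object keeps the original string objects
np.random.shuffle(arr)                          # in-place shuffle
new = arr.tolist()
print new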
