I want to find similarities in a long list of strings. That is, for every string in the list, I need all similar strings from the same list. Earlier I used Fuzzywuzzy, which gave me the accuracy I wanted using fuzz.partial_token_sort_ratio. The only problem is the time it takes, since the list contains ~50k entries of strings up to 40 characters long: the run took up to 36 hours for the 50k strings.
To improve the runtime I tried the rapidfuzz library (inspired by an answer here), which reduced the time to around 12 hours while giving the same output as Fuzzywuzzy. Later I tried TF-IDF with cosine similarity using the string-grouper library (inspired by this blog), which gave some fantastic time improvements. Investigating the results closely, though, the string-grouper method missed matches like 'DARTH VADER' and 'VADER', which were caught by fuzzywuzzy and rapidfuzz. This can be understood from the way TF-IDF works, and it seems to miss short strings altogether.
Is there any workaround to improve the matching of string-grouper in this example, or to improve the time taken by rapidfuzz? Any faster iteration methods? Or any other way to make the problem workable?
The data is preprocessed and contains all strings in CAPS without special characters or numbers.
Time taken per iteration is ~1s. Here is the code for rapidfuzz:
from rapidfuzz import process, fuzz

results = []  # one list of matches per name
for index, row in df.iterrows():
    results.append(process.extract(row['names'], df['names'],
                                   scorer=fuzz.partial_token_set_ratio, score_cutoff=80))
Super fast solution; here is the code for string-grouper:
from string_grouper import match_strings

matches = match_strings(df['names'])
Some similar problems with fuzzywuzzy are discussed here: Fuzzy string matching in Python.
Also, in general, are there any other programming languages I could switch to, like R, that could maybe speed this up? Just curious...
Thanks for your help 😊
It is possible to change the minimum similarity with min_similarity and the size of the n-grams with ngram_size in string-grouper's match_strings function. For this specific example you could use a higher ngram_size, but that might cause you to miss other hits again.
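As a minimal sketch of tuning those two keyword arguments (the values below are purely illustrative, not recommendations for your data):

import pandas as pd
from string_grouper import match_strings

names = pd.Series(['DARTH VADER', 'VADER', 'LUKE SKYWALKER'])

# Loosen the similarity threshold and set the n-gram size explicitly;
# both values are assumptions to be tuned against your own matches.
matches = match_strings(names, min_similarity=0.5, ngram_size=3)
print(matches)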
You should give tfidf-matcher a try; it didn't work for my specific use case, but it might be a good fit for yours.
tfidf_matcher worked wonderfully for me. No hassle, just one function to call, and you can set how many n-grams you'd like to split the word into and the number of close matches you'd like, and you get a confidence value for each match. It's also fast enough: looking up a string in a dataset of around 230k words took around 3 seconds at most.
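For reference, a rough sketch of how that call looks; the parameter names (original, lookup, k_matches, ngram_length) are from the tfidf_matcher README as I remember it, so treat the exact signature as an assumption and check the package docs:

import pandas as pd
import tfidf_matcher as tm

queries = pd.Series(['VADER', 'DART VADER'])              # strings to look up
catalogue = pd.Series(['DARTH VADER', 'LUKE SKYWALKER'])  # strings to match against

# k_matches: how many close matches to return per query;
# ngram_length: size of the character n-grams (both assumed parameter names).
result = tm.matcher(original=queries, lookup=catalogue, k_matches=2, ngram_length=3)
print(result)  # a DataFrame of matches with a confidence value per match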
The sampling_table parameter is only used in the tf.keras.preprocessing.sequence.skipgrams method once to test if the probability of the target word in the sampling_table is smaller than some random number drawn from 0 to 1 (random.random()).
If you have a large vocabulary and a sentence that uses a lot of infrequent words, doesn't this cause the method to skip a lot of the infrequent words when creating skipgrams? Given the values of a sampling_table that is log-linear like a Zipf distribution, doesn't this mean you can end up with no skipgrams at all?
I am very confused by this. I am trying to replicate the Word2Vec tutorial and don't understand how the sampling_table is being used.
In the source code, these are the lines in question:
if sampling_table[wi] < random.random():
    continue
This looks like the frequent-word-downsampling feature common in word2vec implementations. (In the original Google word2vec.c code release, and the Python Gensim library, it's adjusted by the sample parameter.)
In practice, it's likely sampling_table has been precalculated so that the rarest words are always used, common words skipped a little, and the very-most-common words skipped a lot.
That seems to be the intent reflected by the comment for make_sampling_table().
You could go ahead and call make_sampling_table() with a probe vocabulary size, say 1000, and see what sampling_table it gives back. I suspect it will be small numbers early (so the most common words get skipped a lot) and values growing toward 1.0 for the later, rarer ranks (so rare words are kept).
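A minimal probe along those lines (the vocabulary size of 1000 is just an illustrative value):

from tensorflow.keras.preprocessing.sequence import make_sampling_table

# Keep-probabilities per word rank, assuming a Zipf-like frequency distribution.
table = make_sampling_table(size=1000)

print(table[:5])   # lowest ranks = most frequent words: the smallest keep-probabilities
print(table[-5:])  # highest ranks = rarest words: the largest values in the table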
This tends to improve word-vector quality by reserving more relative attention for medium- and low-frequency words, and not excessively overtraining/overweighting plentiful words.
So I have about 300-500 text articles that I would like to compare for similarity, to figure out which are related or duplicates; some articles might address the same topics without being identical. To tackle this I started experimenting with spaCy and its similarity function. The problem is that similarity only compares two documents at a time, so I think I would need to loop over every text and compare it to every other one, which is a very slow and memory-consuming process. Is there a way around this?
I don't know how you are going to go about comparing similarities between texts, but let's say that you are going to compare each one to another using Jaccard or cosine similarities.
Then, you could use the all-pairs similarity search proposed in this paper which has an implementation here. This algorithm is extremely fast, especially for such a small data size.
The all-pairs search returns pairs of documents and their similarity, so if you want to find a "family" of similar documents you will further need to apply a graph traversal like DFS. A Stack Overflow post on Python tuples uses adjacency lists and runs in O(n+m) time.
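As a sketch of that grouping step, here is a version that uses scikit-learn for the pairwise cosine similarities (rather than the all-pairs algorithm above) and an assumed threshold of 0.8:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["first article text ...", "second article text ...", "third article text ..."]

# Pairwise cosine similarities over TF-IDF vectors.
sims = cosine_similarity(TfidfVectorizer().fit_transform(docs))

THRESHOLD = 0.8  # assumed cutoff; tune it for your articles
seen, families = set(), []
for start in range(len(docs)):
    if start in seen:
        continue
    stack, family = [start], []
    # Iterative DFS over the "similar enough" graph.
    while stack:
        i = stack.pop()
        if i in seen:
            continue
        seen.add(i)
        family.append(i)
        stack.extend(j for j in range(len(docs))
                     if j not in seen and sims[i, j] >= THRESHOLD)
    families.append(family)

print(families)  # lists of document indices that belong together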
Here's an example that uses the all-pairs algorithm to try to find reposts in the Reddit jokes subreddit.
I am new to Python and I'm running fuzzywuzzy string-matching logic on a list with 2 million records. The code works and gives output, but it is extremely slow: in 3 hours it processes only 80 rows. I want to speed things up by making it process multiple rows at once.
If it helps, I am running it on my machine with 16 GB RAM and a 1.9 GHz dual-core CPU.
Below is the code I'm running.
import pandas as pd
from fuzzywuzzy import process

d = []
n = len(Africa_Company)  # original list with 2m string records

for i in range(1, n):
    choices = Africa_Company[i + 1:n]  # only compare against the entries after this one
    word = Africa_Company[i]
    output = None  # so the check below still works if extractOne raises
    try:
        output = process.extractOne(str(word), choices, score_cutoff=85)
    except Exception:
        print(word)  # to identify which string is throwing an exception
        print(i)     # to know how many rows are processed; can do without this also
    if output:
        d.append({'Company': Africa_Company[i],
                  'NewCompany': output[0],
                  'Score': output[1],
                  'Region': 'Africa'})
    else:
        d.append({'Company': Africa_Company[i],
                  'NewCompany': None,
                  'Score': None,
                  'Region': 'Africa'})

Africa_Corrected = pd.DataFrame(d)  # output data in a pandas dataframe
Thanks in advance!
This is a CPU-bound problem. By going parallel you can speed it up by a factor of two at most (because you have two cores). What you really should do is speed up the single-threaded performance. Levenshtein distance is quite slow, so there are lots of opportunities to speed things up.
1. Use pruning. Don't run the full fuzzywuzzy match between two strings if there is no way it can give a good result. Try to find a simple linear-time check to filter out irrelevant choices before the fuzzywuzzy match.
2. Consider indexing. Is there some way you can index your list? For example: if your matching is based on whole words, create a hashmap that maps words to strings, and only try to match against choices that have at least one word in common with your current string (see the sketch at the end of this answer).
3. Preprocessing. Is there some work done on the strings in every match that you can do once up front? If, for example, your Levenshtein implementation starts by creating sets out of your strings, consider creating all the sets first so you don't have to do the same work over and over in each match.
4. Is there some better algorithm to use? Maybe Levenshtein distance is not the best algorithm to begin with.
5. Is the implementation of Levenshtein distance you're using optimal? This goes back to step 3 (preprocessing). Are there other things you can do to speed up the runtime?
Multiprocessing will only speed things up by a constant factor (depending on the number of cores). Indexing can take you to a lower complexity class! So focus on pruning and indexing first, then on steps 3-5. Only when you have squeezed enough out of those steps should you consider multiprocessing.
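A minimal sketch of the word-index idea from step 2, assuming whitespace-separated company names (the sample names and the score cutoff of 85 are illustrative):

from collections import defaultdict
from fuzzywuzzy import process

companies = ['ACME HOLDINGS', 'ACME HOLDING LTD', 'ZETA MINING', 'ZETA MINES']

# Inverted index: word -> indices of entries containing that word.
index = defaultdict(set)
for i, name in enumerate(companies):
    for word in name.split():
        index[word].add(i)

def candidates(name, i):
    """Entries sharing at least one word with `name`, excluding the entry itself."""
    cand = set()
    for word in name.split():
        cand |= index[word]
    cand.discard(i)
    return [companies[j] for j in cand]

for i, name in enumerate(companies):
    cand = candidates(name, i)
    if cand:
        # Run the expensive fuzzy match only against the pruned candidate set.
        print(name, process.extractOne(name, cand, score_cutoff=85))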
I have a list of sentences (e.g. "This is an example sentence") and a glossary of terms (e.g. "sentence", "example sentence") and need to find all the terms that match the sentence with a cutoff on some Levenshtein ratio.
How can I do this fast enough? Splitting the sentences, using FTS to find words that appear in terms, and then filtering the terms by ratio works, but it's quite slow. Right now I'm using sphinxsearch + python-Levenshtein; are there better tools?
Would the reverse search (FTS matching terms within the sentence) be faster?
If speed is a real issue, and your glossary of terms is not going to be updated often compared to the number of searches you want to do, you could look into something like a Levenshtein automaton. I don't know of any Python libraries that support it, but if you really need it you could implement it yourself. Finding all possible paths will require some dynamic programming.
If you just need to get it done, loop over the glossary and test each term against each word in the string. That should give you an answer in polynomial time. If you're on a multicore processor, you might get some speedup by doing it in parallel.
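A minimal sketch of that brute-force loop, using the python-Levenshtein ratio already mentioned in the question; it compares each term against same-length word windows (so multi-word terms can match too), and the 0.8 cutoff is an assumed example value:

import Levenshtein

glossary = ['sentence', 'example sentence']
sentence = 'This is an example sentence'
CUTOFF = 0.8  # assumed similarity threshold

words = sentence.lower().split()
hits = []
for term in glossary:
    n = len(term.split())
    # Slide a window of n words over the sentence and compare it to the term.
    for i in range(len(words) - n + 1):
        window = ' '.join(words[i:i + n])
        if Levenshtein.ratio(term, window) >= CUTOFF:
            hits.append(term)
            break

print(hits)  # ['sentence', 'example sentence']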
I am trying to utilize k-nearest neighbors for the string similarity problem i.e. given a string and a knowledge base, I want to output k strings that are similar to my given string. Are there any tutorials that explain how to utilize kd-trees to efficiently do this k-nearest neighbor lookup for strings? The string length will not exceed more than 20 characters.
Probably one of the hottest blog posts I read a year or so ago: Levenshtein Automata. Take a look at that article. It provides not only a description of the algorithm but also code to follow. Technically it's not a k-d tree, but it is closely related to the string-matching and dictionary-correction algorithms one might encounter/use in the real world.
He also has another blog post about BK-trees, which are much better suited to fuzzy matching of strings and string lookups where there are misspellings. Here is another resource containing source code for a BK-tree (this one I can't verify for accuracy or proper implementation).
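For reference, a compact BK-tree sketch (my own, not the linked implementation) keyed on edit distance, with an assumed tolerance of 2 edits:

import Levenshtein  # python-Levenshtein, for the edit distance

class BKTree:
    """Minimal BK-tree for approximate string lookup."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # node = (word, {distance: child node})
        for word in it:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            d = Levenshtein.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]        # follow the existing edge at distance d
            else:
                node[1][d] = (word, {})  # attach a new child at distance d
                return

    def search(self, word, tolerance=2):
        """Return all stored words within `tolerance` edits of `word`."""
        results, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = Levenshtein.distance(word, node[0])
            if d <= tolerance:
                results.append(node[0])
            # Triangle inequality: only children at distance d +/- tolerance can match.
            for dist, child in node[1].items():
                if d - tolerance <= dist <= d + tolerance:
                    stack.append(child)
        return results

tree = BKTree(['hello', 'help', 'hell', 'shell', 'smell'])
print(tree.search('hells', tolerance=2))  # every stored word within 2 edits of 'hells'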