compare similarity of multiple texts using python

So I have about 300-500 text articles that I would like to compare for similarity, to figure out which are related or duplicates; some articles might address the same topics without being identical. To tackle this I started experimenting with spaCy and its similarity function. The problem is that similarity only compares two documents at a time, so I think I would need to loop over every text and compare it against every other one, which is a very slow and memory-consuming process. Is there a way around this?

I don't know how you are going to go about comparing similarities between texts, but let's say that you are going to compare each one to another using Jaccard or cosine similarities.
Then, you could use the all-pairs similarity search proposed in this paper which has an implementation here. This algorithm is extremely fast, especially for such a small data size.
The all-pairs search returns pairs of documents and their similarity, so if you want to find a "family" of similar documents you will further need to apply a graph traversal such as DFS. A Stack Overflow post on Python tuples shows how to do this with adjacency lists in O(n + m) time.
Here's an example that uses the all-pairs algorithm to try to find reposts in the Reddit jokes subreddit.
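For the original question's scale (a few hundred articles), even a brute-force version of this idea is fast. Here is a hedged sketch (not the paper's all-pairs algorithm) using scikit-learn TF-IDF vectors, a thresholded cosine-similarity matrix, and a DFS over an adjacency list to pull out families of related documents; the 0.8 threshold is just an illustrative value:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def group_similar(texts, threshold=0.8):
    """Return lists of indices whose pairwise cosine similarity exceeds threshold."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sim = cosine_similarity(tfidf)  # (n_docs, n_docs) similarity matrix
    n = sim.shape[0]
    # Adjacency list of documents above the similarity threshold.
    adj = {i: [j for j in range(n) if j != i and sim[i, j] >= threshold] for i in range(n)}
    # Depth-first search to collect each "family" of related documents.
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, group = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.append(node)
            stack.extend(adj[node])
        groups.append(group)
    return groups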

Related

Determining cosine similarity for large datasets

I am currently using a dataset of over 2.5 million images, where each image is compared against every other image, for use in a content-based recommendation engine.
I use the following code to calculate the cosine similarity from some precomputed embeddings:
from sklearn.metrics import pairwise_distances
cosine_similarity = 1 - pairwise_distances(embeddings, metric='cosine')
However, my issue is that I've estimated this similarity matrix would require around 11,000 GB of memory to create.
Are there any alternatives to getting a similarity metric between every data point in my dataset or is there another way to go about this whole process?
You have 2,500,000 entries, so the resulting matrix has 6.25e+12 entries. You need to ask yourself what you plan to do with this data, compute only what you need, and the storage will follow. Computing a cosine distance is almost free (it is literally a dot product), so you can always do it "on the fly" with no need to precompute; the question then really boils down to how much time/compute you can afford.
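As an illustration of that "on the fly" approach, here is a minimal sketch: normalise the embeddings once, then compute one row of the similarity matrix at a time and keep only the top-k neighbours (the function name and k are arbitrary choices, not from the question):

import numpy as np

def top_k_similar(embeddings, query_index, k=10):
    """Return (indices, cosine similarities) of the k nearest items to one row."""
    # L2-normalise so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_index]   # one row of the similarity matrix
    sims[query_index] = -np.inf       # exclude the item itself
    top = np.argpartition(-sims, k)[:k]
    top = top[np.argsort(-sims[top])]
    return top, sims[top]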
If you have a recommendation business problem using these 2.5 million images, you may want to check TensorFlow Recommenders, which basically uses 30% of the data for retrieval and lets you run a second ranking classifier on top of the initial model to explore more. This two-step approach is key under memory constraints and is already battle-tested by Instagram and others.

Fuzzy matching not accurate enough with TF-IDF and cosine similarity

I want to find similarities in a long list of strings. That is, for every string in the list, I need all similar strings from the same list. Earlier I used FuzzyWuzzy, which provided good accuracy with the results I wanted by using fuzz.partial_token_sort_ratio. The only problem with this is the time it took, since the list contains ~50k entries with strings of up to 40 characters. Time taken went up to 36 hours for 50k strings.
To improve on this I tried the rapidfuzz library, which reduced the time to around 12 hours while giving the same output as FuzzyWuzzy, inspired by an answer here. Later I tried TF-IDF and cosine similarity, which gave some fantastic time improvements using the string-grouper library, inspired by this blog. Closely investigating the results, the string-grouper method missed matches like 'DARTH VADER' and 'VADER' which were caught by FuzzyWuzzy and rapidfuzz. This is understandable given the way TF-IDF works; it seems to miss small strings altogether.
Is there any workaround to improve the matching of string-grouper in this example or improve the time taken by rapidfuzz? Any faster iteration methods? Or any other ways to make the problem work?
The data is preprocessed and contains all strings in CAPS without special characters or numbers.
Time taken per iteration is ~1s. Here is the code for rapidfuzz:
from rapidfuzz import process, fuzz

results = []
for index, row in df.iterrows():
    results.append(process.extract(row['names'], df['names'],
                                   scorer=fuzz.partial_token_set_ratio, score_cutoff=80))
Super fast solution, here is the code for string-grouper:
from string_grouper import match_strings
matches = match_strings(df['names'])
Some similar problems with FuzzyWuzzy are discussed here: Fuzzy string matching in Python.
Also in general, are there any other programming languages that I can shift to, like R which can maybe speed this up? Just curious...
Thanks for your help 😊
It is possible to change the minimum similarity with min_similarity and the size of the n-grams with ngram_size in string-grouper's match_strings function. For this specific example you could use a higher ngram_size, but that might cause you to miss other hits again.
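For example (parameter names as given above; the values are only a starting point, so check your string-grouper version's defaults):

from string_grouper import match_strings

matches = match_strings(
    df['names'],
    min_similarity=0.7,   # accept weaker matches than the default
    ngram_size=3,         # size of the character n-grams used for TF-IDF
)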
You should give tfidf-matcher a try, it didn't work for my specific use case but it might be a good fit for you.
tfidf-matcher worked wonderfully for me. No hassle, just one function to call; you can set how many n-grams you'd like to split the word into, the number of close matches you'd like, and a confidence value for each match. It's also fast enough: looking up a string in a dataset of around 230k words took around 3 seconds at most.
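If you would rather stay with scikit-learn, the same underlying idea (character n-gram TF-IDF plus a nearest-neighbour lookup) can be sketched without either library. This is not the tfidf-matcher API, just an illustration of the approach, and it assumes the same df['names'] column as above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

names = df['names'].tolist()
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
tfidf = vectorizer.fit_transform(names)

nn = NearestNeighbors(n_neighbors=5, metric='cosine').fit(tfidf)
distances, indices = nn.kneighbors(tfidf)   # 5 closest names for every name
# cosine similarity = 1 - distance; filter by a confidence cutoff as needed.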

Word clustering in python

How can I cluster only words in a given set of data? I have been going through a few algorithms online, such as k-means, but they seem to be aimed at document clustering rather than word clustering. Can anyone suggest a way to cluster only the words in a given data set?
Please bear with me, I am new to Python.
Since my last answer was indeed a false answer, because it described document clustering and not word clustering, here is the real answer.
What you are looking for is word2vec.
Indeed, word2vec is a Google tool based on deep learning that works really well. It transforms words into a vector representation, and therefore allows you to do many things with them.
For example, one of the things it does well is capture algebraic relations between words:
vector('puppy') - vector('dog') + vector('cat') is close to vector('kitten')
vector('king') - vector('man') + vector('woman') is close to vector('queen')
What this means is that it can, to some extent, capture the context of a word, and therefore it works really well for numerous applications.
When you have vectors instead of words, you can pretty much do anything you want. You can for example do a k-means clustering with a cosine distance as the measure of dissimilarity...
Hope this answers your question. You can read more about word2vec in various papers or websites if you'd like; I won't link them here since it is not the subject of the question.
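As a hedged sketch of that last suggestion (gensim 4 API assumed, with a toy corpus standing in for your data): train or load word vectors, L2-normalise them so that Euclidean k-means approximates clustering by cosine distance, and cluster the vocabulary.

from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

sentences = [["dog", "barks"], ["cat", "meows"], ["king", "queen", "throne"]]  # toy corpus
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

words = list(model.wv.index_to_key)
vectors = normalize(model.wv[words])   # unit vectors, so Euclidean ~ cosine

labels = KMeans(n_clusters=2, random_state=0).fit_predict(vectors)
for word, label in zip(words, labels):
    print(label, word)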
Word clustering will be really disappointing because the computer does not understand language.
You could use Levenshtein distance and then do hierarchical clustering.
But:
dog and fog have a distance of 1, i.e. are highly similar.
dog and cat have 3 out of 3 letters different.
So unless you can define a good measure of similarity, don't cluster words.
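If you do want to try it anyway, here is a small sketch of that pipeline (rapidfuzz for the edit distance, SciPy for the hierarchical clustering; the word list and the 1.5 cut-off are only illustrative), with the same caveat that edit distance is a weak notion of word similarity:

import numpy as np
from rapidfuzz.distance import Levenshtein
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

words = ["dog", "fog", "cat", "bat", "dot"]
dist = np.array([[Levenshtein.distance(a, b) for b in words] for a in words], dtype=float)

# squareform() converts the symmetric matrix into the condensed form linkage() expects.
tree = linkage(squareform(dist), method="average")
clusters = fcluster(tree, t=1.5, criterion="distance")
print(dict(zip(words, clusters)))   # words within average distance 1.5 share a label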

How do I use kd-trees for determining string similarity?

I am trying to use k-nearest neighbors for the string similarity problem: given a string and a knowledge base, I want to output the k strings that are most similar to my given string. Are there any tutorials that explain how to use k-d trees to do this k-nearest-neighbor lookup efficiently for strings? The string length will not exceed 20 characters.
Probably one of the hottest blog posts I read a year or so ago: Levenshtein Automata. Take a look at that article; it provides not only a description of the algorithm but also code to follow. Technically it's not a k-d tree, but it's closely related to the string-matching and dictionary-correction algorithms one might encounter and use in the real world.
He also has another blog post about BK-trees, which are much better suited to fuzzy matching of strings and string lookups where there are misspellings. Here is another resource containing source code for a BK-tree (this one I can't verify for accuracy or proper implementation).
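To make the BK-tree idea concrete, here is a minimal, illustrative sketch (not the blog post's code), using rapidfuzz for the edit distance. Each edge is labelled with an edit distance, and the triangle inequality lets a lookup skip whole subtrees whose edge label lies outside [d - tolerance, d + tolerance]:

from rapidfuzz.distance import Levenshtein

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})   # node = (word, {distance: child})
        for word in it:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            d = Levenshtein.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]    # walk down the matching edge
            else:
                node[1][d] = (word, {})
                return

    def search(self, word, tolerance):
        results, stack = [], [self.root]
        while stack:
            current, children = stack.pop()
            d = Levenshtein.distance(word, current)
            if d <= tolerance:
                results.append((d, current))
            # Only children whose edge label is within tolerance of d can match.
            stack.extend(child for dist, child in children.items()
                         if d - tolerance <= dist <= d + tolerance)
        return sorted(results)

tree = BKTree(["hello", "help", "hell", "shell", "smell"])
print(tree.search("hel", tolerance=1))   # -> [(1, 'hell'), (1, 'help')]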

passing text through a dictionary in Python

I currently have python code that compares two texts using the cosine similarity measure. I got the code here.
What I want to do is take the two texts and pass them through a dictionary (not a python dictionary, just a dictionary of words) first before calculating the similarity measure. The dictionary will just be a list of words, although it will be a large list. I know it shouldn't be hard and I could maybe stumble my way through something, but I would like it to be efficient too. Thanks.
If the dictionary fits in memory, use a Python set:
ok_words = set(["a", "b", "c", "e"])

def filter_words(words):
    return [word for word in words if word in ok_words]
If it doesn't fit in memory, you can use shelve.
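Tying this back to the question, a small sketch of the full flow might look like this (the word-list file name and the tokenisation are placeholders; plug the filtered tokens into the cosine-similarity code you already have):

def load_dictionary(path):
    # Load the word list into a set once; lookups are then O(1) on average.
    with open(path) as f:
        return set(line.strip().lower() for line in f if line.strip())

ok_words = load_dictionary("wordlist.txt")   # hypothetical file of allowed words

def filter_words(words):
    return [word for word in words if word.lower() in ok_words]

text1_tokens = filter_words("the cat sat on the mat".split())
text2_tokens = filter_words("a cat lay on a mat".split())
# ...then compute your cosine similarity on text1_tokens / text2_tokens.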
The structure you are trying to create is known as an inverted index. Here you can find some general information about it and snippets from Heaps and Mills's implementation. Unfortunately, I wasn't able to find its source, or any other efficient implementation. (Please leave a comment if you find one.)
If your goal isn't to create a library in pure Python, you can use PyLucene, a Python extension for accessing Lucene, which is in turn a very powerful search engine written in Java. Lucene implements an inverted index and can easily give you information on word frequency. It also supports a wide range of analyzers (parsers + stemmers) for a dozen languages.
(Also note that Lucene already has its own Similarity measure class.)
A few words about similarity and vector space models. The VSM is a very powerful abstraction, but your implementation suffers from several disadvantages. As the number of documents in your index grows, your co-occurrence matrix will become too big to fit in memory, and searching it will take a long time. To counter this, dimensionality reduction is used; in methods like LSA this is done by singular value decomposition. Also pay attention to techniques such as PLSA, which uses probability theory, and random indexing, which is the only incremental (and therefore the only appropriate for large indexes) VSM method.
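As a small illustration of the LSA-style reduction mentioned above (toy documents and an arbitrary component count), scikit-learn's TruncatedSVD over a TF-IDF matrix performs the SVD step:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "a cat lay on a rug", "stock markets fell sharply"]
tfidf = TfidfVectorizer().fit_transform(docs)

lsa = TruncatedSVD(n_components=2, random_state=0)   # keep only 2 latent dimensions
reduced = lsa.fit_transform(tfidf)

print(cosine_similarity(reduced))   # document similarities in the reduced LSA space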
