I currently have Python code that compares two texts using the cosine similarity measure. I got the code here.
What I want to do is take the two texts and pass them through a dictionary (not a Python dictionary, just a dictionary of words) first, before calculating the similarity measure. The dictionary will just be a list of words, although it will be a large list. I know it shouldn't be hard and I could probably stumble my way through something, but I would like it to be efficient too. Thanks.
If the dictionary fits in memory, use a Python set:
ok_words = {"a", "b", "c", "e"}

def filter_words(words):
    return [word for word in words if word in ok_words]
If it doesn't fit in memory, you can use the shelve module from the standard library.
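A rough sketch of the shelve variant (the file names "words.txt" and "ok_words.db" are just placeholders):

import shelve

# One-time build step: store each dictionary word as a key on disk.
with shelve.open("ok_words.db") as db:
    with open("words.txt") as f:
        for line in f:
            db[line.strip()] = True

def filter_words(words):
    # Membership tests now hit the on-disk shelf instead of RAM.
    with shelve.open("ok_words.db", flag="r") as db:
        return [word for word in words if word in db]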
The structure you are trying to create is known as an inverted index. Here you can find some general information about it and snippets from Heaps and Mills's implementation. Unfortunately, I wasn't able to find its source, or any other efficient implementation. (Please leave a comment if you find one.)
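For illustration only (this is not the Heaps and Mills implementation, just a toy version of the structure), an inverted index maps each term to the documents that contain it, along with the term frequency:

from collections import defaultdict, Counter

def build_inverted_index(docs):
    # term -> {doc_id: term frequency in that document}
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index

docs = ["the cat sat on the mat", "the dog sat"]
index = build_inverted_index(docs)
print(index["sat"])   # {0: 1, 1: 1}
print(index["the"])   # {0: 2, 1: 1}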
If you don't need to stay in pure Python, you can use PyLucene, a Python extension for accessing Lucene, which is in turn a very powerful search engine written in Java. Lucene implements an inverted index and can easily give you information on word frequency. It also supports a wide range of analyzers (parsers + stemmers) for dozens of languages.
(Also note that Lucene already has its own Similarity measure class.)
A few words about similarity and vector space models. The VSM is a very powerful abstraction, but your implementation suffers from several disadvantages. As the number of documents in your index grows, the co-occurrence matrix becomes too big to fit in memory, and searching it takes a long time. To counter this, dimensionality reduction is used. In methods like LSA this is done with Singular Value Decomposition. Also pay attention to techniques such as PLSA, which is probabilistic, and Random Indexing, which is the only incremental (and therefore the only one suitable for large indexes) VSM method.
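To make the LSA idea concrete, here is a minimal sketch of the dimension-reduction step, assuming scikit-learn is available (not part of the original answer): TF-IDF vectors reduced to a low-dimensional space with truncated SVD.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "the dog sat", "stocks fell sharply"]
tfidf = TfidfVectorizer().fit_transform(docs)             # docs x terms, sparse
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)   # docs x 2, dense
print(lsa.shape)   # each document is now a 2-dimensional vector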
So I have about 300-500 text articles that I would like to compare for similarity, to figure out which are related or duplicates; some articles might address the same topics without being identical. To tackle this I started experimenting with spaCy and its similarity function. The problem is that similarity only compares two documents at a time, so I think I would need to loop over every text and compare it to every other one, which is a very slow and memory-consuming process. Is there a way around this?
I don't know how you are going to go about comparing similarities between texts, but let's say that you are going to compare each one to another using Jaccard or cosine similarities.
Then, you could use the all-pairs similarity search proposed in this paper which has an implementation here. This algorithm is extremely fast, especially for such a small data size.
The all-pairs search returns two documents and their similarity, so if you want to find a "family" of similar documents, you will further need to apply a graph traversal like DFS. A Stack Overflow post on Python tuples uses adjacency lists and runs in O(n+m) time.
Here's an example that uses the all-pairs algorithm to try to find reposts in the reddit Jokes subreddit.
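If you just want something that works at this data size without a quadratic Python loop, here is a rough sketch (my own, assuming scikit-learn rather than the paper's algorithm) that vectorizes the articles with TF-IDF and computes all pairwise cosine similarities in one call:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = ["first article text ...", "second article ...", "third one ..."]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(articles)
sim = cosine_similarity(tfidf)          # (n_docs, n_docs) similarity matrix

# Pairs above a threshold are duplicate / related candidates.
threshold = 0.8
candidates = [(i, j, sim[i, j])
              for i in range(sim.shape[0])
              for j in range(i + 1, sim.shape[0])
              if sim[i, j] >= threshold]
print(candidates)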
In a spelling error detection task, I use a marisa_trie data structure for my lexicon, with Python 3.5.
Short question
How do I add an element to a marisa_trie?
Context
The idea is: if a word is in my lexicon, it is correct. If it is not in my lexicon, it is probably incorrect. But I computed word frequencies over the whole document, and if a word's frequency is high enough, I want to save it, on the assumption that a frequent word is probably correct.
In that case, how do I add this new word to my marisa_trie.Trie lexicon (without having to build a new trie every time)?
Thank you :)
marisa_trie.Trie implements an immutable trie, so the answer to your question is: it is not possible.
You might want to try a similar Python package called datrie, which supports modification and relatively fast queries (its PyPI page lists some benchmarks against the builtin dict).
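A rough sketch of how that could look with datrie (the alphabet, words, and frequency threshold below are made up for illustration):

import string
import datrie

# The trie needs to know the alphabet its keys may contain.
lexicon = datrie.Trie(string.ascii_lowercase)
for word in ("cat", "car", "cart"):
    lexicon[word] = True

def maybe_learn(word, freq, threshold=5):
    # Add an out-of-lexicon word once it is seen often enough.
    if word not in lexicon and freq >= threshold:
        lexicon[word] = True

maybe_learn("card", freq=7)
print("card" in lexicon)   # True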
I have a list of sentences (e.g. "This is an example sentence") and a glossary of terms (e.g. "sentence", "example sentence") and need to find all the terms that match the sentence with a cutoff on some Levenshtein ratio.
How can I do it fast enough? Splitting sentences, using FTS to find words that appear in terms, and filtering terms by ratio works, but it's quite slow. Right now I'm using sphinxsearch + python-Levenshtein; are there better tools?
Would the reverse search (FTS matching terms in the sentence) be faster?
If speed is a real issue, and if your glossary of terms is not going to be updated often compared to the number of searches you want to do, you could look into something like a Levenshtein automaton. I don't know of any Python libraries that support it, but if you really need it you could implement it yourself. Finding all possible paths will require some dynamic programming.
If you just need to get it done, just loop over the glossary and test each one against each word in the string. That should give you an answer in polynomial time. If you're on a multicore processor, you might get some speedup by doing it in parallel.
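A rough sketch of that brute-force loop, reusing python-Levenshtein's ratio() that you already have (the cutoff and the windowing over the sentence are just illustrative choices):

import Levenshtein

def matching_terms(sentence, glossary, cutoff=0.8):
    # Slide a window the size of each term over the sentence words
    # and keep terms whose best window ratio passes the cutoff.
    words = sentence.lower().split()
    hits = []
    for term in glossary:
        n = len(term.split())
        windows = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        best = max((Levenshtein.ratio(term.lower(), w) for w in windows), default=0.0)
        if best >= cutoff:
            hits.append((term, best))
    return hits

print(matching_terms("This is an exmaple sentence",
                     ["sentence", "example sentence", "paragraph"]))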
Are there any libraries for computing semantic similarity scores for a pair of sentences ?
I'm aware of WordNet's semantic database, and how I can generate the score for 2 words, but I'm looking for libraries that do all pre-processing tasks like Porter stemming, stop word removal, etc., on whole sentences and output a score for how related the two sentences are.
I found a work in progress that's written using the .NET framework that computes the score using an array of pre-processing steps.
Is there any project that does this in python?
I'm not looking for the sequence of operations that would help me find the score (as is asked for here)
I'd love to implement each stage on my own, or glue functions from different libraries so that it works for sentence pairs, but I need this mostly as a tool to test inferences on data.
EDIT: I was considering using NLTK and computing the score for every pair of words iterated over the two sentences, and then draw inferences from the standard deviation of the results, but I don't know if that's a legitimate estimate of similarity. Plus, that'll take a LOT of time for long strings.
Again, I'm looking for projects/libraries that already implement this intelligently. Something that lets me do this:
import amazing_semsim_package
str1='Birthday party ruined as cake explodes'
str2='Grandma mistakenly bakes cake using gunpowder'
>>similarity(str1,str2)
>>0.889
The best package I've seen for this is Gensim, found at the Gensim homepage. I've used it many times, and overall been very happy with its ease of use; it is written in Python and has an easy-to-follow tutorial to get you started, which compares 9 strings. It can be installed via pip, so you won't have a lot of hassle getting it installed, I hope.
Which scoring algorithm you use depends heavily on the context of your problem, but I'd suggest starting off with the LSI functionality if you want something basic. (That's what the tutorial walks you through.)
If you go through the gensim tutorial, it will walk you through comparing two strings using the similarities module. This will let you see how your strings compare to each other, or to some other string, based on the text they contain.
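As a compressed version of what the tutorial covers (the documents, tokenization, and num_topics here are only illustrative):

from gensim import corpora, models, similarities

documents = ["Birthday party ruined as cake explodes",
             "Grandma mistakenly bakes cake using gunpowder",
             "Stock markets rally after surprise rate cut"]

texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

query = dictionary.doc2bow("cake explodes at the party".lower().split())
print(index[lsi[query]])   # similarity of the query to each document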
If you're interested in the science behind how it works, check out this paper.
Unfortunately, I cannot help you with Python, but you may take a look at my old project, which uses dictionaries to do semantic comparisons between sentences (it could later be ported to Python using vector-space analysis). It should be just a few hours of coding to translate it from Java to Python.
https://sourceforge.net/projects/semantics/
AFAIK the most powerful NLP library for Python is NLTK: http://nltk.org/
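If you do end up rolling your own on top of NLTK, here is a rough sketch of the word-pair idea from the question's EDIT (WordNet-based, illustrative only, and indeed slow for long texts):

from itertools import product
from nltk.corpus import wordnet as wn

def word_sim(w1, w2):
    # Best WordNet path similarity over all synset pairs (0.0 if none).
    scores = [s1.path_similarity(s2)
              for s1, s2 in product(wn.synsets(w1), wn.synsets(w2))]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else 0.0

def sentence_sim(s1, s2):
    # For each word in s1, take its best match in s2, then average.
    words1, words2 = s1.lower().split(), s2.lower().split()
    best = [max(word_sim(a, b) for b in words2) for a in words1]
    return sum(best) / len(best)

print(sentence_sim("grandma bakes a cake", "birthday cake explodes"))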
I am trying to utilize k-nearest neighbors for the string similarity problem i.e. given a string and a knowledge base, I want to output k strings that are similar to my given string. Are there any tutorials that explain how to utilize kd-trees to efficiently do this k-nearest neighbor lookup for strings? The string length will not exceed more than 20 characters.
Probably one of the hottest blog posts I read a year or so ago: Levenshtein Automata. Take a look at that article. It provides not only a description of the algorithm but also code to follow. Technically, it's not a kd-tree, but it's closely related to the string matching and dictionary correction algorithms one might encounter/use in the real world.
He also has another blog post about BK-trees, which are much better at fuzzy matching for strings and string lookups where there are misspellings. Here is another resource containing source code for a BK-tree (though I can't verify its accuracy or proper implementation).
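For reference, a minimal BK-tree sketch in Python (my own toy version, not the code from either blog post):

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})     # node = (word, {distance: child node})
        for word in it:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return                  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_dist):
        # Return (distance, word) pairs within max_dist of the query.
        results, stack = [], [self.root]
        while stack:
            node_word, children = stack.pop()
            d = edit_distance(word, node_word)
            if d <= max_dist:
                results.append((d, node_word))
            # Triangle inequality: only children whose edge label lies in
            # [d - max_dist, d + max_dist] can contain matches.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree(["book", "books", "cake", "boo", "cape", "cart"])
print(tree.search("bo", 2))   # [(1, 'boo'), (2, 'book')]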