How to implement a fast spellchecker in Python with Pandas?

I am working on a text mining problem and need to extract all mentions of certain keywords. For example, given the list:
list_of_keywords = ['citalopram', 'trazodone', 'aspirin']
I need to find all occurrences of the keywords in a text. That could be easily done with Pandas (assuming my text is read in from a csv file):
import pandas as pd
df_text = pd.read_csv('text.csv')
df_text['matches'] = df_text['text'].str.findall('|'.join(list_of_keywords))  # assuming the text is in a column named 'text'
However, there are spelling mistakes in the text and sometimes my keywords will be written as:
'citalopram' as 'cetalopram'
or
'trazodone' as 'trazadon'
Searching on the web, I found some suggestions for how to implement a spell checker, but I am not sure where to insert it, and I reckon it may slow down the search in the case of a very large text.
As another option, it has been suggested to use a regex wildcard inserted at the potential locations of confusion (written conceptually):
.findall('c*t*l*pr*m')
However, I am not convinced that I can capture all possible problematic cases. I tried some out-of-the-box spell checkers, but my texts are somewhat specific and I need a spell checker that 'knows' my domain (the medical domain).
QUESTION
Is there any efficient way to extract keywords from a text including spelling mistakes?

You are right, you cannot capture all possible misspellings with regular expressions.
You do, however, have options. You could:
Use a trie. A lot of autocomplete and spell-checking solutions use tries; however, most of them operate word by word, not on the text as a whole.
That being said, what you really want is a fuzzy text extractor, since you just want to match alternate/slightly wrong spellings rather than correct those spellings in the text. So you have more options here as well.
Computational genomics has the same challenge: searching for patterns of base pairs in long sequences while allowing a certain amount of mismatch in the matched text. So an approximate matching solution like the ones outlined here will help. Those slides make excellent use of the pigeonhole principle to do what you need, and the code is open source too!
If you want something a lot less complex, just run an edit distance filter on all the words of the doc and admit only words with an edit distance of k or less.
To expand on what I mean by edit distance:
(Images/code borrowed from the slides linked above; the slides are free for anyone to use, i.e. no license.)
Let us first examine the simpler concept of Hamming distance:
def hammingDistance(x, y):
    assert len(x) == len(y)
    nmm = 0
    for i in range(len(x)):
        if x[i] != y[i]:
            nmm += 1
    return nmm
Hamming distance returns the number of positions at which two equal-length strings differ, i.e. the number of characters that must be substituted to make them equal.
But what happens when the strings are not equal length?
Use edit distance, which is the number of characters that must be substituted/inserted/deleted to make two strings equal.
Hamming distance now becomes the base case for a recursive algorithm:
def edDistRecursive(x, y):
    if len(x) == 0: return len(y)
    if len(y) == 0: return len(x)
    delt = 1 if x[-1] != y[-1] else 0
    diag = edDistRecursive(x[:-1], y[:-1]) + delt
    vert = edDistRecursive(x[:-1], y) + 1
    horz = edDistRecursive(x, y[:-1]) + 1
    return min(diag, vert, horz)
Just call the function above on what you think the word should match (perhaps after first looking it up in a trie). You can even memoize the solution to make it faster, as there is a high probability of overlapping subproblems; a sketch follows.
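To tie this back to the original question, here is a minimal sketch of the edit-distance filter idea using memoization via functools.lru_cache. It assumes the text has already been tokenized on whitespace and that an edit distance of 2 is an acceptable threshold; both are illustrative choices, not requirements.
from functools import lru_cache

list_of_keywords = ['citalopram', 'trazodone', 'aspirin']

@lru_cache(maxsize=None)
def edit_distance(x, y):
    # Memoized version of the recursive edit distance above.
    if len(x) == 0:
        return len(y)
    if len(y) == 0:
        return len(x)
    delt = 1 if x[-1] != y[-1] else 0
    return min(edit_distance(x[:-1], y[:-1]) + delt,
               edit_distance(x[:-1], y) + 1,
               edit_distance(x, y[:-1]) + 1)

def fuzzy_matches(text, keywords, k=2):
    # Keep every word of the text that is within edit distance k of some keyword.
    return [word for word in text.lower().split()
            for kw in keywords
            if edit_distance(word, kw) <= k]

print(fuzzy_matches('patient was prescribed cetalopram and trazadon', list_of_keywords))
# ['cetalopram', 'trazadon']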

Related

Is there a way to force SymSpell Python to return more than one correction recommendation?

I'm using the symspellpy module in Python for query correction. It is really useful and fast, but I'm having an issue with it.
Is there a way to force SymSpell to return more than one recommendation for a correction? I need to analyse the candidates to pick the best correction for my application.
I'm calling Symspell like this:
suggestions = sym_spell.lookup(query, VERBOSITY_ALL, max_edit_distance=3)
Example of what I'm trying to do:
query = "resende". The return that I want ["resende", "rezende"]. What the method returns ["resende"]. Note that both "resende" and "rezende" are in my dictionary.
Merely a typo. Change the underscore in Verbosity_ALL to a dot: Verbosity.ALL.
The three verbosity options are CLOSEST, TOP and ALL:
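With the typo fixed, the call looks like this. This is only a sketch against symspellpy 6.x as described above; the dictionary loading step is omitted and the query value is taken from the question.
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=3)
# ... load your dictionary here (load_dictionary or create_dictionary_entry) ...

suggestions = sym_spell.lookup("resende", Verbosity.ALL, max_edit_distance=3)
for suggestion in suggestions:
    print(suggestion.term, suggestion.distance, suggestion.count)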
A couple of other things about SymSpell ...
Four algorithm choices, described here:
Supported edit distance algorithm choices.
LEVENSHTEIN = 0 Levenshtein algorithm
DAMERAU_OSA = 1 Damerau optimal string alignment algorithm (default)
LEVENSHTEIN_FAST = 2 Fast Levenshtein algorithm
DAMERAU_OSA_FAST = 3 Fast Damerau optimal string alignment algorithm
DAMERAU_OSA   # highest count/frequency wins when using Verbosity.ALL and distances are tied(?)
LEVENSHTEIN   # lowest edit distance wins (fewest changes needed)
To change from the default, overwrite it with one of them:
from symspellpy.editdistance import DistanceAlgorithm
sym_spell._distance_algorithm = DistanceAlgorithm.LEVENSHTEIN
Output object details
word = 'something'
matches = sym_spell.lookup(word, Verbosity.ALL, max_edit_distance=2)
for match in matches:  # each match has .term, .distance, .count
    print(f'{word} -> {match.term} {match.distance} {match.count}')
Using collections Counter() with SymSpell instead of loading words from file
Currently (Apr 2022) SymSpell can only read the dictionary of valid words from a file. However, a method can be added inside symspellpy.py so that it can read from a collections.Counter() output dict, or any other dictionary of word: count pairs. It is a quick hack, but it works for my purposes ...
def load_Counter_dictionary(self, counts_each):
    for key, count in counts_each.items():
        self.create_dictionary_entry(key, count)
You can then drop the use of load_dictionary() in favour of something like this instead ...
sym_spell.load_Counter_dictionary(Counter(words_list))
The reason I resorted to this is that a million-plus-record csv file was already loaded into a pandas dataframe containing a column of codes (think words), some occurring in large numbers (likely correct) along with outliers to be corrected, plus a column already holding the count of each. So rather than saving the counts dict to a file (expensive) and having SymSpell reload it, this is direct and efficient.
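If you would rather not patch symspellpy.py at all, the same effect can be had from the outside by looping over your counts and calling create_dictionary_entry directly. This is a sketch under the assumption that the dataframe is called df and has 'code' and 'count' columns; adjust the names to your data.
from symspellpy import SymSpell

sym_spell = SymSpell()
# One create_dictionary_entry call per distinct code, using its precomputed count.
for code, count in zip(df['code'], df['count']):
    sym_spell.create_dictionary_entry(code, int(count))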

Fuzzy finding poker flop in Python

Given a list of poker flops and a str as a target:
target = '5c6d2d'
flops = ['5s4d3s', '6s4d2d', '6s5d3s', '6s4s2d']
I am trying to find the closest match to the target. I am currently using fuzzywuzzy.process.extract, but sometimes this doesn't return the desired match. Also (more importantly), it doesn't account for ranks properly, because the ranks of the face cards are represented by letters, so 9c9d9s is scored as more similar to 2c2d2h than to TcTdTh. Is there a clever way of parsing the target flop to do this with a simple algorithm? Or would it be best to try and train a machine learning model for this?
Side note in case it's relevant: my fuzzywuzzy uses the pure-python SequenceMatcher as I don't have the privileges to install Visual Studio 14 for python-Levenshtein.
EDIT: To clarify, by closest match I mean the closest in terms of flop texture, i.e. the flop that is strategically the most similar (I understand this is somewhat of a difficult qualification). The initial list of examples I gave is actually not too clear, so here is another example with fuzzywuzzy:
>>> target = '8c8h5s'
>>> flops = ['2d2s6c', '7c5s5d', '8s8d7h', '4h3s3d']
>>> matches = process.extract(target, flops)
>>> print(matches)
[('7c5s5d', 50), ('8s8d7h', 50), ('4h3s3d', 33), ('2d2s6c', 17)]
For my purposes '8s8d7h' should score better than '7c5s5d'. So the matching of ranks should be given priority over the matching of suits.
What do you mean by closest match? Do you mean simply the smallest number of differing characters?
If so, here's how you could do it:
def find_closest(main_flop, flops):
    best_match = ''
    max_matching = 0
    for flop in flops:
        matching_characters = 0
        for i, character in enumerate(flop):
            if main_flop[i] == character:
                matching_characters += 1
        if matching_characters > max_matching:
            max_matching = matching_characters
            best_match = flop
    return best_match
On the other hand, if you want to find the "closest" flop in terms of how similar the equities it gives to a full range of hands are, then you will have to build something quite complex, or use a library like https://github.com/worldveil/deuces or https://poker.readthedocs.io/en/latest/
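If all you need is the behaviour described in the edit (rank matches outweigh suit matches), a simple weighted overlap score may be enough. The sketch below treats ranks and suits as multisets and uses arbitrary weights of 2 and 1; it is an illustration of the idea, not a real measure of flop texture.
def flop_similarity(target, flop, rank_weight=2, suit_weight=1):
    # Compare two 3-card flops such as '8c8h5s', counting shared ranks and suits.
    def split(cards):
        return sorted(cards[0::2]), sorted(cards[1::2])  # ranks, suits

    def multiset_overlap(a, b):
        b = list(b)
        count = 0
        for x in a:
            if x in b:
                b.remove(x)
                count += 1
        return count

    t_ranks, t_suits = split(target)
    f_ranks, f_suits = split(flop)
    return (rank_weight * multiset_overlap(t_ranks, f_ranks)
            + suit_weight * multiset_overlap(t_suits, f_suits))

target = '8c8h5s'
flops = ['2d2s6c', '7c5s5d', '8s8d7h', '4h3s3d']
print(max(flops, key=lambda f: flop_similarity(target, f)))  # '8s8d7h'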

Spark: Generate Map Word to List of Similar Words - Need Better Performance

I am working with DNA sequence alignment, and I have a performance issue.
I need to create a dict that maps a word (a sequence of a set length) to a list of all words that are similar as decided by a separate function.
Right now, I am doing the following:
all_words_rdd = sc.parallelize([''.join(word) for word in itertools.product(all_letters, repeat=WORD_SIZE)], PARALLELISM)
all_similar_word_pairs_map = (all_words_rdd.cartesian(all_words_rdd)
                              .filter(lambda (word1, word2), scoring_matrix=scoring_matrix, threshold_value=threshold_value:
                                      areWordsSimilar((word1, word2), scoring_matrix, threshold_value))
                              .groupByKey()
                              .mapValues(set)
                              .collectAsMap())
Where areWordsSimilar obviously calculates whether the words reach a set similarity threshold.
However, this is horribly slow. It works fine with words of length 3, but once I go any higher it slows down exponentially (as you might expect). It also starts complaining about the task size being too big (again, not surprising)
I know the cartesian join is a really inefficient way to do this, but I'm not sure how to approach it otherwise.
I was thinking of starting with something like this:
all_words_rdd = (sc.parallelize(xrange(0, len(all_letters) ** WORD_SIZE))
.repartition(PARALLELISM)
...
)
This would let me split the calculation across multiple nodes. However, how do I calculate this? I was thinking about doing something with bases and inferring the letter using the modulo operator (i.e. in base of len(all_letters), num % 2 = all_letters[0], num % 3 = all_letters[1], etc).
However, this sounds horribly complicated, so I was wondering if anybody had a better way.
Thanks in advance.
EDIT
I understand that I cannot reduce the exponential complexity of the problem, that is not my goal. My goal is to break up the complexity across multiple nodes of execution by having each node perform part of the calculation. However, to do this I need to be able to derive a DNA word from a number using some process.
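For reference, the base-conversion idea mentioned in the question (deriving a word from an integer) is just repeated divmod in base len(alphabet). A minimal sketch, with the alphabet and word size as placeholders:
def index_to_word(index, alphabet="ATGC", word_size=3):
    # Map an integer in [0, len(alphabet) ** word_size) to a fixed-length word.
    letters = []
    for _ in range(word_size):
        index, remainder = divmod(index, len(alphabet))
        letters.append(alphabet[remainder])
    return "".join(reversed(letters))

print(index_to_word(0))   # 'AAA'
print(index_to_word(63))  # 'CCC'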
Generally speaking, even without the driver-side code it looks like a hopeless task: the size of the sequence set grows exponentially and you simply cannot win against that. Depending on how you plan to use this data, there is most likely a better approach out there.
If you still want to go with this, you can start by splitting the kmer generation between the driver and the workers:
from itertools import product

def extend_kmer(n, kmer="", alphabet="ATGC"):
    """
    >>> list(extend_kmer(2))[:4]
    ['AA', 'AT', 'AG', 'AC']
    """
    tails = product(alphabet, repeat=n)
    for tail in tails:
        yield kmer + "".join(tail)

def generate_kmers(k, seed_size, alphabet="ATGC"):
    """
    >>> kmers = generate_kmers(6, 3, "ATGC").collect()
    >>> len(kmers)
    4096
    >>> sorted(kmers)[0]
    'AAAAAA'
    """
    seed = sc.parallelize([x for x in extend_kmer(seed_size, "", alphabet)])
    return seed.flatMap(lambda kmer: extend_kmer(k - seed_size, kmer, alphabet))
k = ... # Integer
seed_size = ... # Integer <= k
kmers = generate_kmers(k, seed_size) # RDD kmers
The simplest optimization you can do when it comes to searching is to drop cartesian and use a local generation:
from difflib import SequenceMatcher

def is_similar(x, y):
    """Dummy similarity check
    >>> is_similar("AAAAA", "AAAAT")
    True
    >>> is_similar("AAAAA", "TTTTTT")
    False
    """
    return SequenceMatcher(None, x, y).ratio() > 0.75

def find_similar(kmer, f=is_similar, alphabet="ATGC"):
    """
    >>> kmer, similar = find_similar("AAAAAA")
    >>> sorted(similar)[:5]
    ['AAAAAA', 'AAAAAC', 'AAAAAG', 'AAAAAT', 'AAAACA']
    """
    candidates = ("".join(x) for x in product(alphabet, repeat=len(kmer)))
    return (kmer, {candidate for candidate in candidates if f(kmer, candidate)})

similar_map = kmers.map(find_similar)  # map, not flatMap: find_similar already returns one (kmer, set) pair
It is still an extremely naive approach but it doesn't require expensive data shuffling.
The next thing you can try is to improve the search strategy. It can be done either locally as above or globally using joins.
In both cases you need a smarter approach than checking all possible kmers. The first thing that comes to mind is to use seed kmers taken from a given word. In local mode these can be used as a starting point for candidate generation; in global mode, as a join key (optionally combined with hashing). A rough sketch of the global variant follows.
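Here is one hedged sketch of the join-key variant, reusing the kmers RDD and is_similar from above. Each kmer is keyed by its non-overlapping seeds, kmers sharing a seed at the same position become candidate pairs, and only those candidates are scored. The seed size and the assumption that one mismatch is the target tolerance are illustrative.
def seeds(kmer, seed_size):
    # Emit ((position, seed), kmer) keys for non-overlapping seeds of the kmer.
    return [((i, kmer[i:i + seed_size]), kmer)
            for i in range(0, len(kmer) - seed_size + 1, seed_size)]

seed_size = 3  # two seeds of size 3 on 6-mers: any pair within 1 mismatch shares a seed

keyed = kmers.flatMap(lambda kmer: seeds(kmer, seed_size))
candidate_pairs = (keyed.join(keyed)            # pairs of kmers sharing a seed at the same position
                        .values()
                        .filter(lambda pair: pair[0] != pair[1])
                        .distinct())
similar_map = (candidate_pairs.filter(lambda pair: is_similar(pair[0], pair[1]))
                              .groupByKey()
                              .mapValues(set)
                              .collectAsMap())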

check if two words are related to each other

I have two lists: one, the interests of the user; and second, the keywords about a book. I want to recommend the book to the user based on his given interests list. I am using the SequenceMatcher class of Python library difflib to match similar words like "game", "games", "gaming", "gamer", etc. The ratio function gives me a number between [0,1] stating how similar the 2 strings are. But I got stuck at one example where I calculated the similarity between "looping" and "shooting". It comes out to be 0.6667.
for interest in self.interests:
    for keyword in keywords:
        s = SequenceMatcher(None, interest, keyword)
        match_freq = s.ratio()
        if match_freq >= self.limit:
            # print interest, keyword, match_freq
            final_score += 1
            break
Is there any other way to perform this kind of matching in Python?
Firstly, a word can have many senses, so when you try to find similar words you might need some word-sense disambiguation: http://en.wikipedia.org/wiki/Word-sense_disambiguation
Given a pair of words, if we take the most similar pair of senses as the gauge of whether two words are similar, we can try this:
from nltk.corpus import wordnet as wn
from itertools import product

wordx, wordy = "cat", "dog"
sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)

maxscore = 0
for i, j in product(sem1, sem2):
    score = i.wup_similarity(j)  # Wu-Palmer Similarity; can be None for unrelated senses
    if score is not None and score > maxscore:
        maxscore = score
There are other similarity functions that you can use: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html. The only problem is when you encounter words that are not in WordNet; in that case I suggest you fall back on difflib.
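A small sketch of that fallback, wrapping the WordNet score above and dropping to difflib's ratio when either word has no synsets (the 0.0 default for empty score lists is an arbitrary choice):
from difflib import SequenceMatcher
from itertools import product
from nltk.corpus import wordnet as wn

def similarity(wordx, wordy):
    # Best Wu-Palmer score over all sense pairs; difflib ratio if WordNet lacks either word.
    sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)
    if not sem1 or not sem2:
        return SequenceMatcher(None, wordx, wordy).ratio()
    scores = [i.wup_similarity(j) for i, j in product(sem1, sem2)]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else 0.0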
At first, I thought of using regular expressions to perform additional tests that would discriminate the matches with a low ratio. That can be a solution for a specific problem like the one with words ending in -ing, but it is only a limited case and there can be numerous other cases, each needing its own specific treatment.
Then I thought we could look for an additional criterion to eliminate words that are not semantically related but whose letter-similarity ratio is just high enough to be detected as matching, while at the same time still catching genuinely related terms whose ratio is low only because they are short.
Here's a possibility:
from difflib import SequenceMatcher

interests = ('shooting', 'gaming', 'looping')
keywords = ('loop', 'looping', 'game')

s = SequenceMatcher(None)
limit = 0.50

for interest in interests:
    s.set_seq2(interest)
    for keyword in keywords:
        s.set_seq1(keyword)
        b = s.ratio() >= limit and len(s.get_matching_blocks()) == 2
        print('%10s %-10s %f %s' % (interest, keyword,
                                    s.ratio(),
                                    '** MATCH **' if b else ''))
    print()
gives
  shooting loop       0.333333
  shooting looping    0.666667
  shooting game       0.166667

    gaming loop       0.000000
    gaming looping    0.461538
    gaming game       0.600000 ** MATCH **

   looping loop       0.727273 ** MATCH **
   looping looping    1.000000 ** MATCH **
   looping game       0.181818
Note this from the doc:
SequenceMatcher computes and caches detailed information about the
second sequence, so if you want to compare one sequence against many
sequences, use set_seq2() to set the commonly used sequence once and
call set_seq1() repeatedly, once for each of the other sequences.
That's because SequenceMatcher is based on edit distance or something similar. Semantic similarity is more suitable for your case, or a hybrid of the two.
Take a look at the NLTK package (code example) since you are using Python, and maybe this paper.
People using C++ can check this open source project for reference.

Efficient way of resolving unknown words to known words?

I am designing a text processing program that will generate a list of keywords from a long itemized text document and combine entries for words that are similar in meaning. There are metrics out there for this; however, I now have the issue of dealing with words that are not in the dictionary I am using.
I am currently using nltk and python, but my issues here are of a much more abstract nature. Given a word that is not in the dictionary, what would be an efficient way of resolving it to a word that is within your dictionary? My only current solution involves running through the words in the dictionary and picking the word with the shortest Levenshtein distance (edit distance) from the input word.
Obviously this is a very slow and impractical method, and I don't actually need the absolute best match from within the dictionary, just so long as it is a contained word and it is pretty close. Efficiency is more important for me in the solution, but a basic level of accuracy would also be needed.
Any ideas on how to generally resolve some unknown word to a known one in a dictionary?
Looks like you need a spelling corrector to match words in your dictionary.
The code below works and is taken directly from this blog post by Peter Norvig: http://norvig.com/spell-correct.html
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
big.txt is your dictionary containing known words.
Your task sounds like it's really just non-word spelling correction, so a relatively straightforward solution would be to use an existing spell checker like aspell with a custom dictionary.
A quick and dirty approach would be to just use a phonetic mapping like metaphone (which is one of the algorithms used by aspell). For each possible code derived from your dictionary, choose a representative word (e.g., the most frequent word in the group) to suggest as the correction, and pick a default correction for the case where no matches are found. But you'd probably get better results using aspell.
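To illustrate the phonetic-bucketing idea (this is not aspell itself), here is a sketch that uses the metaphone implementation from the jellyfish package; the library choice, the example counts and the "most frequent word in the bucket wins" rule are all assumptions.
from collections import Counter, defaultdict
import jellyfish

def build_phonetic_index(word_counts):
    # Map each metaphone code to the most frequent dictionary word sharing that code.
    buckets = defaultdict(list)
    for word, count in word_counts.items():
        buckets[jellyfish.metaphone(word)].append((count, word))
    return {code: max(entries)[1] for code, entries in buckets.items()}

def correct(word, index, default=None):
    return index.get(jellyfish.metaphone(word), default)

word_counts = Counter({'citalopram': 120, 'trazodone': 80, 'aspirin': 300})
index = build_phonetic_index(word_counts)
print(correct('cetalopram', index))  # 'citalopram', if the two codes agree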
If you do want to calculate edit distances, you can do it relatively quickly by storing the dictionary and possible edit operations in tries, see Brill and Moore (2000). If you have a decent-sized corpus of spelling errors and their corrections and can implement Brill and Moore's whole approach, you would probably beat aspell by quite a bit, but it sounds like aspell (or any spell checker that lets you create your own dictionary) is sufficient for your task.
Hopefully this answer is not too vague:
1) It sounds like you might need to look at your tokenisation and phrase-chunking layers first. This is where you should discard symbolic phrase chunks before submitting them to any fuzzy spell checking.
2) I would still recommend edit distance to come up with alternatives to any 'mis-spelt' tokens after that, but this may return a list of equally close possibilities.
3) When you have your list of possibilities, you could then use co-occurrence algorithms to select the most likely candidate from the list. I only have a Java example of some software that could help (http://www.linguatools.de/disco/disco_en.html#was). You can submit a word, and it will return the definitive co-occurring words for that word. You can then compare this list to the context of your 'mis-spelt' word, and select the candidate with the most overlap from all the potential edit-distance words.
I do not see a reason to use Levenshtein distance to find a word similar in meaning. LD looks only at form (you want to map "bus" to "truck", not to "bush").
The correct solution depends on what you want to do next.
Unless you really need the information in those unknown words, I would simply map all of them to a single generic "UNKNOWN_WORD" item.
Obviously you can cluster the unknown words by their context and other features (say, whether they start with a capital letter). For context clustering: since you are interested in meaning, I would use a larger window for those words (say +/- 50 words) and probably a simple bag-of-words model. Then you simply find the known word whose vector in this space is closest to the unknown word using some distance metric (say, cosine). Let me know if you need more information about this.
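A rough sketch of that bag-of-words/cosine idea follows; the window size, the tokenization and the use of raw counts are all placeholder choices.
from collections import Counter
import math

def context_vector(tokens, position, window=50):
    # Bag-of-words context around the token at `position`.
    left = max(0, position - window)
    return Counter(tokens[left:position] + tokens[position + 1:position + 1 + window])

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def closest_known(unknown_vec, known_vectors):
    # known_vectors: dict mapping known words to their (possibly averaged) context vectors.
    return max(known_vectors, key=lambda w: cosine(unknown_vec, known_vectors[w]))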
