I am designing a text processing program that will generate a list of keywords from a long, itemized text document and combine entries for words that are similar in meaning. There are similarity metrics out there; however, I now face the issue of dealing with words that are not in the dictionary I am using.
I am currently using nltk and python, but my issue here is of a much more abstract nature. Given a word that is not in a dictionary, what would be an efficient way of resolving it to a word that is within your dictionary? My only current solution involves running through the words in the dictionary and picking the word with the shortest Levenshtein distance (edit distance) from the input word.
Obviously this is a very slow and impractical method, and I don't actually need the absolute best match from within the dictionary, as long as the result is a word contained in the dictionary and reasonably close. Efficiency is more important to me in the solution, but a basic level of accuracy is also needed.
Any ideas on how to generally resolve some unknown word to a known one in a dictionary?
Looks like you need a spelling corrector to match words in your dictionary.
The code below works and is taken directly from this blog post by Peter Norvig: http://norvig.com/spell-correct.html
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
big.txt is your dictionary containing known words.
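For example, once NWORDS has been trained from big.txt, a quick sanity check might look like this (the exact suggestions naturally depend on what your dictionary contains):

correct('speling')   # e.g. 'spelling'
correct('amoung')    # e.g. 'among'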
Your task sounds like it's really just non-word spelling correction, so a relatively straightforward solution would be to use an existing spell checker like aspell with a custom dictionary.
A quick and dirty approach would be to just use a phonetic mapping like metaphone (which is one of the algorithms used by aspell). For each possible code derived from your dictionary, choose a representative word (e.g., the most frequent word in the group) to suggest as the correction, and pick a default correction for the case where no matches are found. But you'd probably get better results using aspell.
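A rough sketch of that grouping step, using a simplified Soundex-style key instead of full metaphone just to keep it self-contained (word_counts is assumed to map your dictionary words to their corpus frequencies):

import itertools

def phonetic_key(word):
    # Simplified Soundex-style key: keep the first letter, map remaining
    # consonants to digit classes, drop vowels/h/w/y, collapse repeated digits.
    codes = {}
    for letters, digit in (('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                           ('l', '4'), ('mn', '5'), ('r', '6')):
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    digits = [codes.get(ch, '') for ch in word[1:]]
    collapsed = [d for d, _ in itertools.groupby(digits) if d]
    return (word[0] + ''.join(collapsed))[:4]

def build_phonetic_map(word_counts):
    groups = {}
    for w in word_counts:
        groups.setdefault(phonetic_key(w), []).append(w)
    # Representative for each code = the most frequent word sharing that code.
    return dict((k, max(ws, key=word_counts.get)) for k, ws in groups.items())

def phonetic_correct(word, phonetic_map, default=None):
    # default is returned when no dictionary word shares the unknown word's code.
    return phonetic_map.get(phonetic_key(word), default)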
If you do want to calculate edit distances, you can do it relatively quickly by storing the dictionary and the possible edit operations in tries; see Brill and Moore (2000). If you have a decent-sized corpus of spelling errors and their corrections and can implement Brill and Moore's whole approach, you would probably beat aspell by quite a bit, but it sounds like aspell (or any spell checker that lets you create your own dictionary) is sufficient for your task.
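To give a flavour of the trie idea (a generic sketch that walks a dictionary trie while maintaining one row of the Levenshtein matrix per node and pruning hopeless branches; this is not Brill and Moore's full error model):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search(root, word, max_cost):
    """Return (dictionary_word, distance) pairs within max_cost edits of word."""
    first_row = range(len(word) + 1)
    results = []
    for ch, child in root.children.items():
        _search(child, ch, word, first_row, results, max_cost)
    return results

def _search(node, ch, word, prev_row, results, max_cost):
    columns = len(word) + 1
    current_row = [prev_row[0] + 1]
    for col in range(1, columns):
        insert_cost = current_row[col - 1] + 1
        delete_cost = prev_row[col] + 1
        replace_cost = prev_row[col - 1] + (word[col - 1] != ch)
        current_row.append(min(insert_cost, delete_cost, replace_cost))
    if current_row[-1] <= max_cost and node.word is not None:
        results.append((node.word, current_row[-1]))
    if min(current_row) <= max_cost:  # prune: no completion can get cheaper
        for next_ch, child in node.children.items():
            _search(child, next_ch, word, current_row, results, max_cost)

Calling search(build_trie(dictionary_words), 'unknwon', max_cost=2), where dictionary_words stands for your word list, would then return the in-dictionary words within two edits, from which you can take the closest.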
Hopefully this answer is not too vague:
1) It sounds like you might need to look at your tokenisation and phrase chunking layers first. This is where you should discard symbolic phrase chunks before submitting them to any fuzzy spell checking.
2) I would still recommend edit distance to come up with alternatives to any 'mis-spelt' tokens after that, but this may return a list of equally close possibilities.
3) When you have your list of possibilities, you could then use co-occurrence algorithms to select the most likely candidate from the list. I only have a Java example of some software that could help ( http://www.linguatools.de/disco/disco_en.html#was ). You can submit a word, and it will return the definitive co-occurring words for that word. You can then compare this list to the context of your 'mis-spelt' word, and select the candidate with the most overlap from all the potential edit-distance words.
I do not see a reason to use Levenshtein distance to find a word similar in meaning. LD looks at form, not meaning: it would map "bus" to "bush", whereas you presumably want to map it to "truck".
The correct solution depends on what you want to do next.
Unless you really need the information in those unknown words, I would simply map all of them to a single generic "UNKNOWN_WORD" item.
Obviously you can cluster the unknown words by their context and other features (say, whether they start with a capital letter). For context clustering: since you are interested in meaning, I would use a larger window for those words (say +/- 50 words) and probably use a simple bag-of-words model. Then you simply find the known word whose vector in this space is closest to the unknown word, using some distance metric (say, cosine). Let me know if you need more information about this.
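A minimal sketch of that last idea (the window size and helper names are illustrative only):

import math
from collections import Counter

def context_vectors(tokens, window=50):
    # Bag-of-words context: count the words appearing within +/- window positions.
    vectors = {}
    for i, w in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vectors.setdefault(w, Counter()).update(context)
    return vectors

def cosine(c1, c2):
    num = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    den = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def closest_known(unknown_word, vectors, known_words):
    # Map the unknown word to the known word whose context vector is closest.
    return max(known_words, key=lambda w: cosine(vectors[unknown_word], vectors[w]))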
I was trying to compare a set of strings to an already defined set of strings.
For example, you want to find the addressee of a letter whose text was digitized via OCR.
There is an array of addresses, which has dictionaries as elements.
Each element, which is unique, contains ID, Name, Street, ZIP Code and City. This list will be 1000 entries long.
Since OCR-scanned text can be inaccurate, we need to find the best-matching candidates by comparing the strings from the letter with the list that contains the addresses.
The text is 750 words long. We reduce the number of words with an appropriate filter function, which first splits on whitespace, strips extra whitespace from each element, deletes all words shorter than 5 characters and removes duplicates; the resulting list is 200 words long.
Since each addressee has 4 strings (Name, Street, ZIP Code and City) and the remaining letter is 200 words long, my comparison has to run 4 * 1000 * 200 = 800'000 times.
I have used Python with moderate success: matches have correctly been found. However, the algorithm takes a long time to process many letters (up to 50 hrs per 1500 letters). List comprehensions have been applied. Is there a way to implement multithreading correctly (and not unnecessarily)? What if this application needs to run on a low-spec server? My 6-core CPU does not complain about such tasks, but I do not know how much time it would take to process a lot of documents on a small AWS instance.
>> len(addressees)
1000
>> addressees[0]
{"Name": "John Doe", "Zip": 12345, "Street": "Boulevard of broken dreams 2", "City": "Stockholm"}
>> letter[:5] # already filtered
["Insurance", "Taxation", "Identification", "1592212", "St0ckhlm", "Mozart"]
>> from difflib import SequenceMatcher
>> def get_similarity_per_element(addressee, letter):
       """compare the similarity of each word in the letter with the addressee's fields"""
       ratios = []
       for l in letter:
           for a in addressee.values():
               # str() so numeric fields such as the ZIP code can be compared too
               ratios.append(int(100 * SequenceMatcher(None, str(a), l).ratio()))  # using ints for faster arithmetic
       return max(ratios)
>> get_similarity_per_element(addressees[0], letter[:5]) # percentage of the most matching word in the letter with anything from the addressee
82
>> # then use this method to find all addressees with the max matching ratio
>> # if only one is greater than the others -> Done
>> # if more than one, but fewer than 3 are equal -> Interactive Prompt -> Done
>> # else -> mark as not sortable -> Done.
I expected faster processing for each document (1 minute max), not 50 hrs per 1500 letters. I am sure this is the bottleneck, since the other tasks work fast and flawlessly.
Is there a better (faster) way to do this?
A few quick tips:
1) Let me know how long it takes when you use quick_ratio() or real_quick_ratio() instead of ratio()
2) Invert the order of the loops and use set_seq2 and set_seq1 so that SequenceMatcher reuses information:
for a in addressee.values():
    s = SequenceMatcher()
    s.set_seq2(str(a))
    for l in letter:
        s.set_seq1(l)
        ratios.append(int(100 * s.ratio()))
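Putting the two tips together (a rough sketch; it assumes you only care about ratios at or above your limit, so the cheap upper bound quick_ratio() can be used to skip most of the expensive ratio() calls):

from difflib import SequenceMatcher

def best_match_ratio(addressee, letter, limit=0.7):
    best = 0.0
    s = SequenceMatcher(None)
    for a in addressee.values():
        s.set_seq2(str(a))               # seq2 is cached, so set it once per address field
        for l in letter:
            s.set_seq1(l)
            if s.quick_ratio() < limit:  # upper bound on ratio(): safe to skip
                continue
            best = max(best, s.ratio())
    return int(100 * best)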
But a better solution would be something like what @J_H describes below.
You want to recognize inputs that are similar to dictionary words, e.g. "St0ckholm" -> "Stockholm". Transposition typos should be handled. Ok.
Possibly you would prefer to set autojunk=False. But a quadratic or cubic algorithm sounds like trouble if you're in a hurry.
Consider the Anagram Problem, where you're asked if an input word and a dictionary word are anagrams of one another. The straightforward solution is to compare the sorted strings for equality. Let's see if we can adapt that idea into a suitable data structure for your problem.
Pre-process your dictionary words into canonical keys that are easily looked up, and hang a list of one or more words off of each key. Use sorting to form the key. So for example we would have:
'dgo' -> ['dog', 'god']
Store this map sorted by key.
Given an input word, you want to know if exactly that word appears in the dictionary, or if a version with limited edit distance appears in the dictionary. Sort the input word and probe the map for 1st entry greater or equal to that. Retrieve the (very short) list of candidate words and evaluate the distance between each of them and your input word. Output the best match. This happens very quickly.
For fuzzier matching, use both the 1st and 2nd entries >= target, plus the preceding entry, so you have a larger candidate set. Also, so far this approach is sensitive to deletion of "small" letters like "a" or "b", due to ascending sorting. So additionally form keys with descending sort, and probe the map for both types of key.
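A minimal sketch of that lookup, assuming difflib's ratio() as the final tie-breaking distance (any edit-distance function would do):

import bisect
from difflib import SequenceMatcher

def build_index(dictionary_words):
    """Map each canonical (sorted-letter) key to the words that produce it."""
    index = {}
    for w in dictionary_words:
        index.setdefault(''.join(sorted(w.lower())), []).append(w)
    return index, sorted(index)

def lookup(word, index, keys):
    """Probe the preceding entry and the first two entries >= the target key."""
    target = ''.join(sorted(word.lower()))
    pos = bisect.bisect_left(keys, target)
    candidates = []
    for k in keys[max(pos - 1, 0):pos + 2]:
        candidates.extend(index[k])
    if not candidates:
        return None
    return max(candidates, key=lambda c: SequenceMatcher(None, word, c).ratio())

index, keys = build_index(["Stockholm", "Boulevard", "John", "Doe"])
lookup("St0ckholm", index, keys)   # -> 'Stockholm' for this toy dictionary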
If you're willing to pip install packages, consider import soundex, which deliberately discards information from words, or import fuzzywuzzy.
I work on a text mining problem and need to extract all mentions of certain keywords. For example, given the list:
list_of_keywords = ['citalopram', 'trazodone', 'aspirin']
I need to find all occurrences of the keywords in a text. That could be easily done with Pandas (assuming my text is read in from a csv file):
import pandas as pd
df_text = pd.read_csv('text.csv')
df_text['matches'] = df_text['text'].str.findall('|'.join(list_of_keywords))  # assuming the text lives in a column named 'text'
However, there are spelling mistakes in the text and some times my keywords will be written as:
'citalopram' as 'cetalopram'
or
'trazodone' as 'trazadon'
Searching the web, I found some suggestions on how to implement a spell checker, but I am not sure where to insert it, and I reckon it may slow down the search in the case of a very large text.
As another option, it has been suggested to use wildcards in a regex, inserted at the potential locations of confusion (conceptually written):
.findall('c*t*l*pr*m')
However, I am not convinced that I can capture all possible problematic cases this way. I tried some out-of-the-box spell checkers, but my texts are somewhat specific and I need a spell checker that 'knows' my domain (the medical domain).
QUESTION
Is there any efficient way to extract keywords from a text including spelling mistakes?
You are right, you cannot capture all possible misspellings with regular expressions.
You do however have options.
You could
Use a trie. A lot of autocomplete and spell-checking solutions use tries. However, most of them operate on a word-by-word basis, not on the text as a whole.
That being said, what you really want is a fuzzy text extractor, as you just want to match alternate/slightly wrong spellings and not correct those spellings in the text. So you have more options here as well.
Computational genomics has this challenge where they want to search for patterns of base pairs in long sequences, allowing for a certain amount of mismatch in the matched text. So an approximate matching solution like the ones outlined here will help. Those slides make excellent use of the pigeonhole principle to do what you need, and the code is open source too!
If you want something a lot less complex, just run an edit distance filter on all the words of the doc and admit only words with an edit distance of k or less.
To expand on what I mean by Edit Distance
(Images/code borrowed from the slides linked above; the slides are free for anyone to use, i.e. no license restrictions)
Let us examine a simpler concept of Hamming Distance
def hammingDistance(x, y):
    assert len(x) == len(y)
    nmm = 0
    for i in xrange(0, len(x)):
        if x[i] != y[i]:
            nmm += 1
    return nmm
Hamming distance returns the number of positions at which 2 equal-length strings differ, i.e. the number of characters that must be substituted to make them equal.
But what happens when the strings are not equal length?
Use edit distance, which is the number of characters that must be substituted/inserted/deleted in 2 strings to make them equal.
Hamming distance becomes the base case for a recursive algorithm now
def edDistRecursive(x, y):
    if len(x) == 0: return len(y)
    if len(y) == 0: return len(x)
    delt = 1 if x[-1] != y[-1] else 0
    diag = edDistRecursive(x[:-1], y[:-1]) + delt
    vert = edDistRecursive(x[:-1], y) + 1
    horz = edDistRecursive(x, y[:-1]) + 1
    return min(diag, vert, horz)
Just call the function above on what you think the word would/should match to (maybe by first looking it up in a trie). You can even memoize the solution to make it faster, as there is a high probability of overlapping subproblems.
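A rough sketch of the memoized version plus the filter described earlier (the helper names are mine, not from the slides):

def edDistMemo(x, y):
    memo = {}
    def d(i, j):
        if i == 0: return j
        if j == 0: return i
        if (i, j) not in memo:
            delt = 1 if x[i - 1] != y[j - 1] else 0
            memo[(i, j)] = min(d(i - 1, j - 1) + delt,  # substitution / match
                               d(i - 1, j) + 1,          # deletion
                               d(i, j - 1) + 1)          # insertion
        return memo[(i, j)]
    return d(len(x), len(y))

def fuzzy_mentions(doc_words, keywords, k=2):
    """Keep only (word, keyword) pairs within k edits of each other."""
    return [(w, kw) for w in doc_words for kw in keywords if edDistMemo(w, kw) <= k]

For example, fuzzy_mentions(['cetalopram', 'trazadon'], list_of_keywords, k=2) pairs both misspellings with their intended keywords.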
I have two lists: one, the interests of the user; and second, the keywords about a book. I want to recommend the book to the user based on his given interests list. I am using the SequenceMatcher class of the Python library difflib to match similar words like "game", "games", "gaming", "gamer", etc. The ratio function gives me a number in [0,1] stating how similar the 2 strings are. But I got stuck at one example where I calculated the similarity between "looping" and "shooting": it comes out to be 0.6667.
for interest in self.interests:
    for keyword in keywords:
        s = SequenceMatcher(None, interest, keyword)
        match_freq = s.ratio()
        if match_freq >= self.limit:
            #print interest, keyword, match_freq
            final_score += 1
            break
Is there any other way to perform this kind of matching in Python?
Firstly, a word can have many senses, and when you try to find similar words you might need some word-sense disambiguation: http://en.wikipedia.org/wiki/Word-sense_disambiguation.
Given a pair of words, if we take the most similar pair of senses as the gauge of whether two words are similar, we can try this:
from nltk.corpus import wordnet as wn
from itertools import product

wordx, wordy = "cat", "dog"
sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)

maxscore = 0
for i, j in product(sem1, sem2):
    score = i.wup_similarity(j)  # Wu-Palmer Similarity; can be None for some sense pairs
    if score is not None and score > maxscore:
        maxscore = score
There are other similarity functions that you can use: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html. The only problem is when you encounter words that are not in WordNet; then I suggest you fall back on difflib.
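A hedged sketch of that combined approach (fall back to difflib only when WordNet has nothing to say; None scores from wup_similarity are filtered out):

from difflib import SequenceMatcher
from itertools import product
from nltk.corpus import wordnet as wn

def word_similarity(wordx, wordy):
    sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)
    if sem1 and sem2:
        scores = [i.wup_similarity(j) for i, j in product(sem1, sem2)]
        scores = [s for s in scores if s is not None]
        if scores:
            return max(scores)
    # Unknown to WordNet: fall back to surface similarity.
    return SequenceMatcher(None, wordx, wordy).ratio()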
At first, I thought of using regular expressions to perform additional tests to discriminate the matchings with a low ratio. That can handle a specific problem like the one happening with words ending in "ing", but that is only a limited case and there can be numerous other cases, each needing its own specific treatment.
Then I thought we could try to find an additional criterion that eliminates pairs that are not semantically related but whose letter-similarity ratio is high enough for them to be detected as matching,
while at the same time still catching genuinely related pairs whose ratio is low simply because the words are short.
Here's a possibility
from difflib import SequenceMatcher

interests = ('shooting', 'gaming', 'looping')
keywords = ('loop', 'looping', 'game')

s = SequenceMatcher(None)
limit = 0.50

for interest in interests:
    s.set_seq2(interest)
    for keyword in keywords:
        s.set_seq1(keyword)
        b = s.ratio() >= limit and len(s.get_matching_blocks()) == 2
        print '%10s %-10s %f %s' % (interest, keyword,
                                    s.ratio(),
                                    '** MATCH **' if b else '')
    print
gives
  shooting loop       0.333333
  shooting looping    0.666667
  shooting game       0.166667

    gaming loop       0.000000
    gaming looping    0.461538
    gaming game       0.600000 ** MATCH **

   looping loop       0.727273 ** MATCH **
   looping looping    1.000000 ** MATCH **
   looping game       0.181818
Note this from the doc:
SequenceMatcher computes and caches detailed information about the
second sequence, so if you want to compare one sequence against many
sequences, use set_seq2() to set the commonly used sequence once and
call set_seq1() repeatedly, once for each of the other sequences.
That's because SequenceMatcher is based on edit distance or something similar. Semantic similarity is more suitable for your case, or a hybrid of the two.
Take a look at the NLTK package (code example) since you are using Python, and maybe this paper.
People using C++ can check this open-source project for reference.
I have two sets of strings (A and B), and I want to know all pairs of strings a in A and b in B where a is a substring of b.
The first step of coding this was the following:
for a in A:
    for b in B:
        if a in b:
            print (a, b)
However, I wanted to know: is there a more efficient way to do this with regular expressions (e.g. instead of checking if a in b, check whether the regexp '.*' + a + '.*' matches b)? I thought that maybe using something like this would let me cache the Knuth-Morris-Pratt failure function for each a. Also, using a list comprehension for the inner for b in B loop will likely give a pretty big speedup (and a nested list comprehension may be even better).
I'm not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I'm more concerned with the constant (I just need to do this for several pairs of A and B sets, and I don't want it to run all week).
Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!
Edit:
Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:
import collections

prefix_table = collections.defaultdict(set)
for k, b in enumerate(B):
    for i in xrange(len(b) - 10 + 1):
        j = i + 10
        prefix_table[b[i:j]].add(k)

for a in A:
    if len(a) >= 10:
        for k in prefix_table[a[:10]]:
            # check if a is in b
            # (the 10-mer match is necessary, but not sufficient)
            if a in B[k]:
                print (a, B[k])
    else:
        for k in xrange(len(B)):
            # a is too small to use the table; check if
            # a is in any b
            if a in B[k]:
                print (a, B[k])
Of course you can easily write this as a list comprehension:
[(a, b) for a in A for b in B if a in b]
This might slightly speed up the loop, but don't expect too much. I doubt using regular expressions will help in any way with this one.
Edit: Here are some timings:
import itertools
import timeit
import re
import collections

with open("/usr/share/dict/british-english") as f:
    A = [s.strip() for s in itertools.islice(f, 28000, 30000)]
    B = [s.strip() for s in itertools.islice(f, 23000, 25000)]

def f():
    result = []
    for a in A:
        for b in B:
            if a in b:
                result.append((a, b))
    return result

def g():
    return [(a, b) for a in A for b in B if a in b]

def h():
    res = [re.compile(re.escape(a)) for a in A]
    return [(a, b) for a in res for b in B if a.search(b)]

def ninjagecko():
    d = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i, j in itertools.combinations(range(len(b) + 1), 2):
            d[b[i:j]].add(k)
    return [(a, B[k]) for a in A for k in d[a]]

print "Nested loop", timeit.repeat(f, number=1)
print "List comprehension", timeit.repeat(g, number=1)
print "Regular expressions", timeit.repeat(h, number=1)
print "ninjagecko", timeit.repeat(ninjagecko, number=1)
Results:
Nested loop [0.3641810417175293, 0.36279606819152832, 0.36295199394226074]
List comprehension [0.362030029296875, 0.36148500442504883, 0.36158299446105957]
Regular expressions [1.6498990058898926, 1.6494300365447998, 1.6480278968811035]
ninjagecko [0.06402897834777832, 0.063711881637573242, 0.06389307975769043]
Edit 2: Added a variant of the algorithm suggested by ninjagecko to the timings. You can see it is much better than all the brute force approaches.
Edit 3: Used sets instead of lists to eliminate the duplicates. (I did not update the timings -- they remained essentially unchanged.)
Let's assume your words are bounded at a reasonable size (let's say 10 letters). Do the following to achieve linear(!) time complexity, that is, O(A+B):
Initialize a hashtable or trie.
For each string b in B:
    For every substring of that string:
        Add the substring to the hashtable/trie (this is no worse than 55*O(B)=O(B)), with metadata of which string it belonged to.
For each string a in A:
    Do an O(1) query to your hashtable/trie to find all B-strings it is in; yield those.
(As of writing this answer, no response yet if OP's "words" are bounded. If they are unbounded, this solution still applies, but there is a dependency of O(maxwordsize^2), though actually it's nicer in practice since not all words are the same size, so it might be as nice as O(averagewordsize^2) with the right distribution. For example if all the words were of size 20, the problem size would grow by a factor of 4 more than if they were of size 10. But if sufficiently few words were increased from size 10->20, then the complexity wouldn't change much.)
edit: https://stackoverflow.com/q/8289199/711085 is actually a theoretically better answer. I was looking at the linked Wikipedia page before that answer was posted, and was thinking "linear in the string size is not what you want", and only later realized it's exactly what you want. Your intuition to build a regexp (Aword1|Aword2|Aword3|...) is correct since the finite-automaton which is generated behind the scenes will perform matching quickly IF it supports simultaneous overlapping matches, which not all regexp engines might. Ultimately what you should use depends on if you plan to reuse the As or Bs, or if this is just a one-time thing. The above technique is much easier to implement but only works if your words are bounded (and introduces a DoS vulnerability if you don't reject words above a certain size limit), but may be what you are looking for if you don't want the Aho-Corasick string matching finite automaton or similar, or it is unavailable as a library.
A very fast way to search for a lot of strings is to make use of a finite automaton (so your guess of a regexp was not that far off), namely the Aho-Corasick string matching machine, which is used in tools like grep, virus scanners and the like.
First it compiles the strings you want to search for (in your case the words in A) into a finite-state automaton with a failure function (see the paper from '75 if you are interested in details). This automaton then reads the input string(s) and outputs all found search strings (probably you want to modify it a bit, so that it outputs the string in which the search string was found as well).
This method has the advantage that it searches for all search strings at the same time and thus needs to look at every character of the input string(s) only once (linear complexity)!
There are implementations of the Aho-Corasick pattern matcher on PyPI, but I haven't tested them, so I can't say anything about the performance, usability or correctness of these implementations.
EDIT: I tried this implementation of the Aho-Corasick automaton and it is indeed the fastest of the suggested methods so far, and also easy to use:
import pyahocorasick

def aho(A, B):
    t = pyahocorasick.Trie()
    for a in A:
        t.add_word(a, a)
    t.make_automaton()
    return [(s, b) for b in B for (i, res) in t.iter(b) for s in res]
One thing I observed, though, was that when testing this implementation with @SvenMarnach's script it yielded slightly fewer results than the other methods, and I am not sure why. I wrote a mail to the author; maybe he will figure it out.
There are specialized index structures for this, see for example
http://en.wikipedia.org/wiki/Suffix_tree
You'd build a suffix-tree or something similar for B, then use A to query it.
Edit: New question. Given that I suspect this problem is more difficult than I originally thought (potentially NP-hard), I have a different question: what is a useful algorithm that can get close to maximizing the number of letters used?
I was inspired to write a program based on the card game Scrabble Slam! I'm new to algorithms, however, and can't think of an efficient way of solving this problem:
You start with a string containing a valid English 4-letter word. You can place one letter on that word at a time such that by placing the letter you form a new dictionary-valid word. The letters available to you are the letters of the alphabet, each usable at most once.
For example, if the starting word was "sand", this could be a valid sequence of moves:
sand --> sane --> sine --> line --> lins --> pins --> fins , etc.
The goal is to maximize the number of words in a sequence by using as many letters of the alphabet (without using a letter more than once).
My algorithm can't find the longest sequence, just a guess at what the longest might be.
First, it gets a list of all words that can be formed by changing one letter of the starting word. That list is called 'adjacentList'. It then looks through all the words in adjacentList and finds which of those words have the most adjacent words. Whichever word has the most adjacent words is the one it chooses to turn the starting word into.
For example, the word sane can be turned into 28 other words, the word sine can be turned into 27 other words, the word line can be turned into 30 other words; each of these choices is made to maximize the likelihood that more and more words can be spelled.
What would be the best way to go about solving this problem? What sort of data structure would be optimal? How can I improve my code to make it more efficient and less verbose?
##Returns a list of all adjacent words. Lists contain tuples of the adjacent word and the letter that
##makes the difference between those two words.
def adjacentWords(userWord):
    adjacentList, exactMatches, wrongMatches = list(), list(), str()
    for dictWords in allWords:
        for a, b in zip(userWord, dictWords):
            if a == b: exactMatches.append(a)
            else: wrongMatches = b
        if len(exactMatches) == 3:
            adjacentList.append((dictWords, wrongMatches))
        exactMatches, wrongMatches = list(), list()
    return adjacentList
    #return [dictWords for dictWords in allWords if len([0 for a,b in zip(userWord, dictWords) if a==b]) == 3]

def adjacentLength(content):
    return (len(adjacentWords(content[0])), content[0], content[1])

#Find a word that can be turned into the most other words by changing one letter
def maxLength(adjacentList, startingLetters):
    return max(adjacentLength(content) for content in adjacentList if content[1] in startingLetters)

def main():
    startingWord = "sand"
    startingLetters = "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z".replace(" ", "").split(',')
    existingWords = list()
    while True:
        adjacentList = adjacentWords(startingWord)
        letterChoice = maxLength(adjacentList, startingLetters)
        if letterChoice[1] not in existingWords:
            print "Going to use letter: " + str(letterChoice[2]) + " to spell the word " + str(letterChoice[1]) + " with " + str(letterChoice[0]) + " possibilities."
            existingWords.append(letterChoice[1])
            startingLetters.remove(str(letterChoice[2]))
            startingWord = letterChoice[1]

main()
@Parseltongue:
You might find that although there is an optimal solution for the given task, there is no known way to find it efficiently. This task belongs to the NP class of problems.
Consider this:
sand --> [?]and
at this point you have to iterate through ALL possible combinations of [?]and to find out which words match the criteria. In the worst case, with no heuristics, that is 26 iterations.
sand --> band, hand, land, wand
Now, take every found word and iterate the second letter etc.
You can see, that the amount of iterations you have to make grows exponentially.
This is in some manner very similar to the problem of chess. No matter how powerful your computer is, it can't predict all possible moves, because there are too many combinations.
A friend and I actually developed a similar program (in Java, though) in a lab assignment for a data structures and algorithms course. The task was to create the shortest chain of words from one word to another.
We built up a graph of all available words where one node was connected to another iff they differed by only one letter. Using some kind of hashtable, such as a dictionary, is required for speed. We actually removed one letter from the words we used as keys, thus taking advantage of the fact that connected words then had the same hash; a rough sketch of the idea is included after this answer.
For finding the shortest chain, we just used a Breadth-first search. However, finding the shortest path in a graph is easier than the longest. See longest path problem for more details.
To solve the main problem, you could just iterate through all words. However, if speed is important, it is often better to try to find a good specialized heuristic algorithm.
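A minimal sketch of the graph construction and the breadth-first search (my own reconstruction in Python, not the original Java code; allWords stands for the question's word list):

from collections import defaultdict, deque

def build_graph(words):
    # Bucket words by each "one letter removed" key; words sharing a bucket
    # differ only in that one position.
    buckets = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            buckets[w[:i] + '_' + w[i + 1:]].add(w)
    graph = defaultdict(set)
    for bucket in buckets.values():
        for w in bucket:
            graph[w] |= bucket - {w}
    return graph

def shortest_chain(graph, start, goal):
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]] - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

Calling shortest_chain(build_graph(allWords), "sand", "fins") would return one shortest sequence of single-letter changes, if such a chain exists.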