I'm looking for a way to produce a numerical probability value for single words which share a common root word/meaning.
Users will generate content using words like "dancer," "dancing," "danced."
If "dancer" is submitted 30 times, and dancing 5 times, I need just one value "dance:35" which catches all of them.
But when users also submit words like "accordance," it should not effect my "dance" count, but instead, add to a sperate count along with words like "according" and "accordingly."
Also, I don't have a pre-defined list of root words to look out for. I'll need to create it dynamically based on that user-generated content.
So my question is, what's probably the best way to pull this off? I'm sure there will be no perfect solution, but I figure someone on here can probably come up with a better way than me.
My idea so far is to assume that most meaningful words will have at least 3 or 4 letters. So for every word I encounter whose length is greater than 4, I trim it down to 4 ("dancers" becomes "danc"), check my list of words to see if I've encountered it before; if so, I increment its count, and if not, I add it to the list, and repeat.
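To make that concrete, here is a minimal sketch of the prefix idea (the 4-character cutoff and the sample words are just the assumptions described above):

from collections import defaultdict

counts = defaultdict(int)

def add_word(word):
    # Trim anything longer than 4 characters down to its first 4 characters,
    # then count occurrences of that key.
    key = word.lower()
    if len(key) > 4:
        key = key[:4]
    counts[key] += 1

for w in ["dancer", "dancing", "danced", "accordance"]:
    add_word(w)

print(dict(counts))  # {'danc': 3, 'acco': 1}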
I see there are some similar questions on here, but I haven't found any answer that considers roots AND that I can implement in Python. Answers seem to be for one or the other.
You don't need a Python wrapper for a Java lib, nltk's got Snowball! :)
>>> from nltk.stem import SnowballStemmer as SS
>>> stemmer = SS('english')
>>> stemmer.stem('dance')
u'danc'
>>> stemmer.stem('danced')
u'danc'
>>> stemmer.stem('dancing')
u'danc'
>>> stemmer.stem('dancer')
u'dancer'
>>> stemmer.stem('accordance')
u'accord'
Stemming isn't always going to give you exact roots, but it's a great start.
Following is an example of using stems. I'm building a dictionary of stem: (word, count) while choosing the shortest word possible for each stem. So ['dancing', 'danced', 'dances', 'dance', 'dancer'] converts to {'danc': ('dance', 4), 'dancer': ('dancer', 1)}
Example Code: (text taken from http://en.wikipedia.org/wiki/Dance)
import re
from nltk.stem import SnowballStemmer as SS
text = """Dancing has evolved many styles. African dance is interpretative.
Ballet, ballroom (such as the waltz), and tango are classical styles of dance
while square dancing and the electric slide are forms of step dances.
More recently evolved are breakdancing and other forms of street dance,
often associated with hip hop culture.
Every dance, no matter what style, has something in common.
It not only involves flexibility and body movement, but also physics.
If the proper physics are not taken into consideration, injuries may occur."""
#extract words
words = [word.lower() for word in re.findall(r'\w+',text)]
stemmer = SS('english')
counts = dict()
#count stems and extract shortest words possible
for word in words:
    stem = stemmer.stem(word)
    if stem in counts:
        shortest, count = counts[stem]
        if len(word) < len(shortest):
            shortest = word
        counts[stem] = (shortest, count + 1)
    else:
        counts[stem] = (word, 1)
#convert {key: (word, count)} to [(word, count, key)] for convenient sort and print
output = [wordcount + (root,) for root, wordcount in counts.items()]
#trick to sort output by count (descending) & word (alphabetically)
output.sort(key=lambda x: (-x[1], x[0]))
for item in output:
    print '%s:%d (Root: %s)' % item
Output:
dance:7 (Root: danc)
and:4 (Root: and)
are:4 (Root: are)
of:3 (Root: of)
style:3 (Root: style)
the:3 (Root: the)
evolved:2 (Root: evolv)
forms:2 (Root: form)
has:2 (Root: has)
not:2 (Root: not)
physics:2 (Root: physic)
african:1 (Root: african)
also:1 (Root: also)
as:1 (Root: as)
associated:1 (Root: associ)
ballet:1 (Root: ballet)
ballroom:1 (Root: ballroom)
body:1 (Root: bodi)
breakdancing:1 (Root: breakdanc)
---truncated---
I wouldn't recommend lemmatization for your specific needs:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('dance')
'dance'
>>> lmtzr.lemmatize('dancer')
'dancer'
>>> lmtzr.lemmatize('dancing')
'dancing'
>>> lmtzr.lemmatize('dances')
'dance'
>>> lmtzr.lemmatize('danced')
'danced'
Substrings are not a good idea because they will always fail at some point, and often fail miserably.
fixed-length: pseudo-words 'dancitization' and 'dancendence' will match 'dance' at 4 and 5 characters respectively.
ratio: a low ratio will return fakes (like above)
ratio: a high ratio will not match enough (e.g. 'running')
But with stemming you get:
>>> stemmer.stem('dancitization')
u'dancit'
>>> stemmer.stem('dancendence')
u'dancend'
>>> #since dancitization gives us dancit, let's try dancization to get danc
>>> stemmer.stem('dancization')
u'dancize'
>>> stemmer.stem('dancation')
u'dancat'
Which is an impressive no-match result for the stem "danc". Even considering 'dancer' does not stem to 'danc', the accuracy is pretty high overall.
I hope this helps you get started.
What you are looking for is also known as the stems (more technical perspective than the linguistic "root") of the words. You are right in assuming that there is no perfect solution, all approaches will either have imperfect analysis or lack coverage.
Basically the best way to go is either with a word-list that contains stems, or a stemming algorithm.
Check the first answer here to get a python-based solution:
How do I do word Stemming or Lemmatization?
I am using Snowball in all my Java-based projects and it works perfectly for my purposes (it is very fast, too, and covers a wide range of languages). It also seems to have a Python wrapper:
http://snowball.tartarus.org/download.php
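For completeness, a minimal sketch of what using that wrapper might look like, assuming the snowballstemmer package from PyPI is installed (pip install snowballstemmer); the NLTK route shown in the other answer works just as well:

import snowballstemmer

stemmer = snowballstemmer.stemmer('english')
print(stemmer.stemWords(['dance', 'danced', 'dancing', 'accordance']))
# expected: ['danc', 'danc', 'danc', 'accord'], the same stems as the NLTK session above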
Related
I am new to Python; apologies for a simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words
I have a text variable which contains a lot of text information.
I did
test = text.split()
sorted(test)
As a result, I receive a list which starts with symbols like $ and numbers.
How do I get only the words and print the first N of them?
I'm assuming by "word", you mean strings that consist of only alphabetical characters. In such a case, you can use .filter to first get rid of the unwanted strings, turn it into a set, sort it and then print your stuff.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case-
For now, we'll be going for this regex - ^[A-Za-z']+$, which means the string must only contain alphabets and ', you may add more to this regex according to what you deem as "words". Read more on regexes here.
We'll be using re.match instead of .isalpha this time.
import re

WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind, however, that this gets tricky when you have a string like hi! What's your name?. Here hi! and name? are words, except they are not fully alphabetic. The trick is to split in such a way that you get hi instead of hi! and name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question
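For what it's worth, a rough sketch of that kind of split, reusing the letters-and-apostrophes assumption from the regex above: re.findall pulls word-like tokens out of the text instead of splitting on whitespace, so trailing punctuation never sticks to the words.

import re

text = "hi! What's your name?"
# Extract word-like tokens rather than splitting on spaces.
print(re.findall(r"[A-Za-z']+", text))  # ['hi', "What's", 'your', 'name']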
I am a newbie here; apologies for mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
test2 = sorted([j for j in test if j.isalpha()])
print(test2[:5])
You can slice the sorted list to get the elements up to position 5:
sorted(test)[:5]
or if looking only for words
sorted([i for i in test if i.isalpha()])[:5]
or by regex:
sorted([i for i in test if re.search(r"[a-zA-Z]", i)])[:5]
By slicing a list you get all elements up to a specific index, in this case 5.
I have two list of names (strings) that look like this:
executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
analysts = ['Justin Post', 'Some Dude', 'Some Chick']
I need to find where those names occur in a list of strings that looks like this:
str = ['Justin Post - Bank of America',
"Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
'Brian Olsavsky - Amazon.com',
"Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
"I'll just remind you that the units those do not count",
"In-stock is very strong, especially as we head into the holiday period.",
'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores.
The reason I need to do this is so that I can concatenate the conversation strings together (which are separated by the names). How would I go about doing this efficiently?
I looked at some similar questions and tried the solutions to no avail, such as this:
if any(x in str for x in executives):
    print('yes')
and this ...
match = next((x for x in executives if x in str), False)
match
I'm not sure if that is what you are looking for:
executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
text = ['Justin Post - Bank of America',
"Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
'Brian Olsavsky - Amazon.com',
"Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
"I'll just remind you that the units those do not count",
"In-stock is very strong, especially as we head into the holiday period.",
'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]
result = [s for s in text if any(ex in s for ex in executives)]
print(result)
output:
['Brian Olsavsky - Amazon.com']
str = ['Justin Post - Bank of America',
"Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
'Brian Olsavsky - Amazon.com',
"Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
"I'll just remind you that the units those do not count",
"In-stock is very strong, especially as we head into the holiday period.",
'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores"]
executives = ['Brian Olsavsky', 'Justin', 'Some Guy', 'Some Lady']
As an addition, if you need the exact location, you could use this:
print([[i, str.index(q), q.index(i)] for i in executives for q in str if i in q ])
this outputs
[['Brian Olsavsky', 3, 0], ['Justin', 0, 0], ['Justin', 4, 11], ['Justin', 9, 5]]
TLDR
This answer is focusing on efficiency. Use the other answers if it is not a key issue. If it is, make a dict from the corpus you are searching in, then use this dict to find what you are looking for.
#import stuff we need later
import string
import random
import numpy as np
import time
import matplotlib.pyplot as plt
Create example corpus
First we create a list of strings which we will search in.
Create random words, by which I mean random sequences of characters, with lengths drawn from a Poisson distribution, using this function:
def poissonlength_words(lam_word): #generating words, length chosen from a Poisson distrib
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(np.random.poisson(lam_word))])
(lam_word being the parameter of the Poisson distribution.)
Let's create number_of_sentences variable-length sentences from these words (by "sentence" I mean a list of the randomly generated words separated by spaces).
The length of the sentences can also be drawn from a Poisson distribution.
lam_word=5
lam_sentence=1000
number_of_sentences = 10000
sentences = [' '.join([poissonlength_words(lam_word) for _ in range(np.random.poisson(lam_sentence))])
             for x in range(number_of_sentences)]
sentences[0] now will start like this:
tptt lxnwf iem fedg wbfdq qaa aqrys szwx zkmukc...
Let's also create names, which we will search for. Let these names be bigrams. The first name (i.e. the first element of the bigram) will be n characters long, the last name (the second bigram element) will be m characters long, and both will consist of random characters:
def bigramgen(n,m):
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(n)])+' '+\
           ''.join([random.choice(string.ascii_lowercase) for _ in range(m)])
The task
Let's say we want to find sentences where bigrams such as ab c appears. We don't want to find dab c or ab cd, only where ab c stands alone.
To test how fast a method is, let's find an ever-increasing number of bigrams, and measure the elapsed time. The number of bigrams we search for can be, for example:
number_of_bigrams_we_search_for = [10,30,50,100,300,500,1000,3000,5000,10000]
Brute force method
Just loop through each bigram, loop through each sentence, use in to find matches. Meanwhile, measure elapsed time with time.time().
bruteforcetime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    start = time.time()
    for bigram in bigrams:
        #the core of the brute force method starts here
        reslist=[]
        for sentencei, sentence in enumerate(sentences):
            if ' '+bigram+' ' in sentence:
                reslist.append([bigram,sentencei])
        #and ends here
    end = time.time()
    bruteforcetime.append(end-start)
bruteforcetime will hold the number of seconds necessary to find 10, 30, 50 ... bigrams.
Warning: this might take a long time for high number of bigrams.
The sort your stuff to make it quicker method
Let's create an empty set for every word appearing in any of the sentences (using dict comprehension):
worddict={word:set() for sentence in sentences for word in sentence.split(' ')}
To each of these sets, add the index of each word in which it appears:
for sentencei, sentence in enumerate(sentences):
    for wordi, word in enumerate(sentence.split(' ')):
        worddict[word].add(sentencei)
Note that we only do this once, no matter how many bigrams we search later.
Using this dictionary, we can search for the sentences where each part of the bigram appears. This is very fast, since a dict lookup is very fast. We then take the intersection of these sets: when we search for ab c, we get the set of sentence indexes in which ab and c both appear.
for bigram in bigrams:
    reslist=[]
    setlist = [worddict[gram] for gram in bigram.split(' ')]
    intersection = set.intersection(*setlist)
    for candidate in intersection:
        if bigram in sentences[candidate]:
            reslist.append([bigram, candidate])
Let's put the whole thing together, and measure the time elapsed:
logtime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    start_time=time.time()
    worddict={word:set() for sentence in sentences for word in sentence.split(' ')}
    for sentencei, sentence in enumerate(sentences):
        for wordi, word in enumerate(sentence.split(' ')):
            worddict[word].add(sentencei)
    for bigram in bigrams:
        reslist=[]
        setlist = [worddict[gram] for gram in bigram.split(' ')]
        intersection = set.intersection(*setlist)
        for candidate in intersection:
            if bigram in sentences[candidate]:
                reslist.append([bigram, candidate])
    end_time=time.time()
    logtime.append(end_time-start_time)
Warning: this might take a long time for high number of bigrams, but less than the brute force method.
Results
We can plot how much time each method took.
plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')
Or, plotting the y axis on a log scale:
plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.yscale('log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')
Giving us the plots:
Making the worddict dictionary takes a lot of time, which is a disadvantage when searching for a small number of names. There is a point, however, where the corpus is big enough and the number of names we are searching for is high enough that this cost is offset by the speed of searching in the dictionary, compared to the brute force method. So, if these conditions are satisfied, I recommend using this method.
(Notebook available here.)
I am working on a problem where I get a lot of words with their frequency of occurrence listed. Here is a sample of what I get:
drqsQAzaQ:1
OnKxnXecCINJ:1
QoGzQpg:1
cordially:1
Sponsorship:1
zQnpzQou:1
Thriving:1
febrero:1
rzaye:1
VseKEX:1
contributed:1
SNfXQoWV:1
hRwzmPR:1
Happening:1
TzJYAMWAQUIJTkWYBX:1
DYeUIqf:1
formats:1
eiizh:1
wIThY:1
infonewsletter:8
BusinessManager:10
MailScanner:12
As you can see, words like 'cordially' are actual English words, while words like 'infonewsletter' are not real English words by themselves, but we can see that they are made of English parts and mean something. However, words like 'OnKxnXecCINJ' do not mean anything (actually they are words from another charset, but I am ignoring them in my exercise and sticking to English); I can discard them as junk.
What would be the best method in Python to detect and eliminate such junk words from a given dictionary such as the one above?
I tried examining each word using nltk.corpus.words.words(), but it is killing my performance as my data set is very huge. Moreover, I am not certain whether this will give me True for words like 'infonewsletter'.
Please help.
Thanks,
Mahesh.
If the words are from completely different script within Unicode like CJK characters or Greek, Cyrillic, Thai, you could use unicodedata.category to see if they're letters to begin with (category starts with L):
>>> import unicodedata
>>> unicodedata.category('a')
'Ll'
>>> unicodedata.category('E')
'Lu'
>>> unicodedata.category('中')
'Lo'
>>> [unicodedata.category(i).startswith('L') for i in 'aE中,']
[True, True, True, False]
Then you can use the unicodedata.name to see that they're Latin letters:
>>> 'LATIN' in unicodedata.name('a')
True
>>> 'LATIN' in unicodedata.name('中')
False
Presumably it is not an English-language word if it has non-Latin letters in it.
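Putting the two checks together, a minimal sketch of such a filter could look like this (the helper name is my own, not from any library):

import unicodedata

def all_latin_letters(word):
    # Keep a word only if every character is a letter (category starts with 'L')
    # and that letter is a Latin one; you may also want to allow apostrophes or hyphens.
    return all(
        unicodedata.category(ch).startswith('L') and 'LATIN' in unicodedata.name(ch)
        for ch in word
    )

print(all_latin_letters('cordially'))  # True
print(all_latin_letters('中'))          # False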
Otherwise, you could use a letter bigram/trigram classifier to find out if there is a high probability these are English words. For example, OnKxnXecCINJ contains Kxn, a trigram that is very unlikely to appear in any single English word or in any concatenation of two words.
You can build one yourself from the corpus by splitting words into character trigrams, or you can use any of the existing libraries like langdetect or langid or so.
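For illustration, a home-grown version of that trigram check might look like the following sketch; it assumes the NLTK words corpus is downloaded, and the 0.2 threshold is an arbitrary number to tune, not a recommendation:

import nltk

def char_trigrams(word):
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}

# Collect every character trigram that occurs in known English words (built once).
english_trigrams = set()
for w in nltk.corpus.words.words():
    english_trigrams |= char_trigrams(w)

def looks_english(word, max_unknown_ratio=0.2):
    # Reject words most of whose trigrams never occur in English words.
    grams = char_trigrams(word)
    if not grams:
        return True  # too short to judge
    unknown = sum(1 for g in grams if g not in english_trigrams)
    return unknown / float(len(grams)) <= max_unknown_ratio

print(looks_english('infonewsletter'))  # expected True
print(looks_english('OnKxnXecCINJ'))    # expected False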
Also, make sure the corpus is a set, for fast in operations. Only when the classifier says a word is very probably English but the word is not found in the set should you treat it like 'infonewsletter', i.e. a concatenation of several words: split it recursively into smaller chunks and check that each part is found in the corpus.
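A minimal sketch of that recursive splitting (again assuming the NLTK words corpus; the function name and the 3-letter minimum are my own assumptions):

import nltk

vocab = set(w.lower() for w in nltk.corpus.words.words())

def split_into_known_words(word, vocab, min_len=3):
    # Try to split a concatenation like 'infonewsletter' into corpus words.
    # Returns a list of parts, or None if no complete split exists.
    word = word.lower()
    if not word:
        return []
    for i in range(len(word), min_len - 1, -1):
        prefix = word[:i]
        if prefix in vocab:
            rest = split_into_known_words(word[i:], vocab, min_len)
            if rest is not None:
                return [prefix] + rest
    return None

print(split_into_known_words('infonewsletter', vocab))
# e.g. ['info', 'newsletter'], provided both parts are in the corpus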
Thank you. I am trying out this approach. However, I have a question. I have a word 'vdgutumvjaxbpz'. I know this is junk. I wrote some code to get all grams of this word, 4-gram and higher. This was the result:
['vdgu', 'dgut', 'gutu', 'utum', 'tumv', 'umvj', 'mvja', 'vjax', 'jaxb', 'axbp', 'xbpz', 'vdgut', 'dgutu', 'gutum', 'utumv', 'tumvj', 'umvja', 'mvjax', 'vjaxb', 'jaxbp', 'axbpz', 'vdgutu', 'dgutum', 'gutumv', 'utumvj', 'tumvja', 'umvjax', 'mvjaxb', 'vjaxbp', 'jaxbpz', 'vdgutum', 'dgutumv', 'gutumvj', 'utumvja', 'tumvjax', 'umvjaxb', 'mvjaxbp', 'vjaxbpz', 'vdgutumv', 'dgutumvj', 'gutumvja', 'utumvjax', 'tumvjaxb', 'umvjaxbp', 'mvjaxbpz', 'vdgutumvj', 'dgutumvja', 'gutumvjax', 'utumvjaxb', 'tumvjaxbp', 'umvjaxbpz', 'vdgutumvja', 'dgutumvjax', 'gutumvjaxb', 'utumvjaxbp', 'tumvjaxbpz', 'vdgutumvjax', 'dgutumvjaxb', 'gutumvjaxbp', 'utumvjaxbpz', 'vdgutumvjaxb', 'dgutumvjaxbp', 'gutumvjaxbpz', 'vdgutumvjaxbp', 'dgutumvjaxbpz', 'vdgutumvjaxbpz']
Now, I compared each gram result with nltk.corpus.words.words() and found the intersection of the 2 sets.
import nltk

vocab = nltk.corpus.words.words()
vocab = set(w.lower().strip() for w in vocab)

def GetGramsInVocab(listOfGrams, vocab):
    text_vocab = set(w.lower() for w in listOfGrams if w.isalpha())
    common = text_vocab & vocab
    return list(common)
However, the intersection contains 'utum', whereas I expected it to be empty.
Also,
print("utum" in vocab)
returned True.
This does not make sense to me. I peeked into the vocabulary and found 'utum' in a few words like autumnian and metascutum
However, 'utum' is not a word by itself, and I expected nltk to return False. Is there a more accurate corpus I can check against that would do whole-word comparisons?
Also, I did a simple set operations test:
set1 = {"cutums" "acutum"}
print("utum" in set1)
This returned False as expected.
I guess I am confused as to why the code says 'utum' is present in the nltk words corpus.
Thanks,
Mahesh.
I never really dealt with NLP but had an idea about NER which should NOT have worked and somehow DOES work exceptionally well in one case. I do not understand why it works, why it doesn't work elsewhere, or whether it can be extended.
The idea was to extract names of the main characters in a story through:
Building a dictionary for each word
Filling for each word a list with the words that appear right next to it in the text
Finding for each word a word with the max correlation of lists (meaning that the words are used similarly in the text)
Given one name of a character in the story, the words that are used like it should be character names as well (bogus, that is what should not work, but since I never dealt with NLP until this morning I started the day naive)
I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
Though it filters for upper-case words (and receives "Alice" as the word to cluster around), originally there are ~500 upper-case words, and it's still pretty spot on as far as main characters go.
It does not work that well with other characters and in other stories, though gives interesting results.
Any idea whether this approach is usable or extendable, or why it works at all in this story for "Alice"?
Thanks!
#English Name recognition
import re
import sys
import random
from string import upper
def mimic_dict(filename):
    dict = {}
    f = open(filename)
    text = f.read()
    f.close()
    prev = ""
    words = text.split()
    for word in words:
        m = re.search("\w+", word)
        if m == None:
            continue
        word = m.group()
        if not prev in dict:
            dict[prev] = [word]
        else:
            dict[prev] = dict[prev] + [word]
        prev = word
    return dict

def main():
    if len(sys.argv) != 2:
        print 'usage: ./main.py file-to-read'
        sys.exit(1)
    dict = mimic_dict(sys.argv[1])
    upper = []
    for e in dict.keys():
        if len(e) > 1 and e[0].isupper():
            upper.append(e)
    print len(upper), upper
    exclude = ["ME","Yes","English","Which","When","WOULD","ONE","THAT","That","Here","and","And","it","It","me"]
    exclude = [x for x in exclude if dict.has_key(x)]
    for s in exclude:
        del dict[s]
    scores = {}
    for key1 in dict.keys():
        max = 0
        for key2 in dict.keys():
            if key1 == key2: continue
            a = dict[key1]
            k = dict[key2]
            diff = []
            for ia in a:
                if ia in k and ia not in diff:
                    diff.append(ia)
            if len(diff) > max:
                max = len(diff)
                scores[key1] = (key2, max)
    dictscores = {}
    names = []
    for e in scores.keys():
        if scores[e][0] == "Alice" and e[0].isupper():
            names.append(e)
    print len(names), names

if __name__ == '__main__':
    main()
From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".
The difficulty in NER (at least for English) is not finding the names; it's detecting their full extent (the "March Hare" example); detecting them even at the start of a sentence, where all words are capitalized; classifying them as person/organisation/location/etc.
Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect
[ORG Microsoft] CEO [PER Steve Ballmer]
What you are doing is building a distributional thesaurus: finding words which are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but it means they are in a way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear in similar contexts because they share some characteristics (e.g. they all speak, walk, cry, etc.; the details of Alice in Wonderland escape me), they are more likely to be retrieved. It turns out that whether or not a word is capitalised is a very useful piece of information when working out if something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.
Have a look at the following papers to get an idea of what people do with distributional semantics:
Lin 1998
Grefenstette 1994
Schuetze 1998
To put your idea in the terminology used in these papers, Step 2 is building a context vector for each word from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).
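For reference, the Jaccard coefficient over two such neighbour lists is just the size of their intersection divided by the size of their union; a minimal sketch (the example neighbour lists are made up):

def jaccard(context_a, context_b):
    # size of the intersection divided by size of the union of the two neighbour sets
    a, b = set(context_a), set(context_b)
    if not a and not b:
        return 0.0
    return len(a & b) / float(len(a | b))

print(jaccard(['said', 'thought', 'went'], ['said', 'cried', 'went']))  # 0.5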
As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran it against a hand-annotated corpus you would find it is very bad at identifying the boundaries of named entities, and it does not even attempt to guess whether they are people, places or organisations... Nevertheless, it is a great first attempt at NLP; keep it up!
Say I have a list of movie names with misspellings and small variations like this -
"Pirates of the Caribbean: The Curse of the Black Pearl"
"Pirates of the carribean"
"Pirates of the Caribbean: Dead Man's Chest"
"Pirates of the Caribbean trilogy"
"Pirates of the Caribbean"
"Pirates Of The Carribean"
How do I group or find such sets of words, preferably using python and/or redis?
Have a look at "fuzzy matching". Some great tools in the thread below that calculates similarities between strings.
I'm especially fond of the difflib module
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
You might notice that similar strings have long parts in common, for example:
"Bla bla bLa" and "Bla bla bRa" => the common part is "Bla bla ba" (notice the third word)
To find such common parts you may use a dynamic programming algorithm. A closely related measure is the Levenshtein distance (the distance between very similar strings is small, and between more different strings it is bigger): http://en.wikipedia.org/wiki/Levenshtein_distance
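For concreteness, here is a minimal sketch of the classic dynamic-programming formulation of Levenshtein distance (not optimized; dedicated libraries are faster):

def levenshtein(a, b):
    # Row-by-row dynamic programming: prev holds the distances for a shortened a.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("Pirates of the Caribbean", "Pirates Of The Carribean"))  # small number => similar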
Also, for quick performance you may try to adapt the Soundex algorithm: http://en.wikipedia.org/wiki/Soundex
So after calculating the distance between all your strings, you have to cluster them. The simplest way is k-means (but it requires you to define the number of clusters). If you don't actually know the number of clusters, you have to use hierarchical clustering. Note that the number of clusters in your situation is the number of different movie titles + 1 (for totally badly spelled strings).
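Not a replacement for proper clustering, but as a quick sketch of the grouping step, a greedy one-pass grouping using difflib's ratio can illustrate the idea (my own simplification; the 0.75 threshold is an arbitrary assumption to tune):

from difflib import SequenceMatcher

def group_titles(titles, threshold=0.75):
    # Put each title into the first group whose representative is similar enough,
    # otherwise start a new group.
    groups = []
    for title in titles:
        for group in groups:
            if SequenceMatcher(None, title.lower(), group[0].lower()).ratio() >= threshold:
                group.append(title)
                break
        else:
            groups.append([title])
    return groups

print(group_titles(["Pirates of the Caribbean", "Pirates Of The Carribean", "Harry Potter"]))
# the two Pirates titles end up in the same group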
I believe there are in fact two distinct problems.
The first is spell correction. There is a Python implementation here:
http://norvig.com/spell-correct.html
The second is more functional. Here is what I'd do after the spell correction: I would make a relation function.
related(sentence1, sentence2) holds if and only if sentence1 and sentence2 have rare common words. By rare, I mean words other than (the, what, is, etc...). You can take a look at the TF/IDF system to determine whether two documents are related using their words. Just googling a bit, I found this:
https://code.google.com/p/tfidf/
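To illustrate the TF/IDF idea, here is a minimal sketch using scikit-learn (my own choice of library, not something from the links above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the carribean",
    "Pirates of the Caribbean: Dead Man's Chest",
    "Some unrelated movie",
]

# Weight words by TF/IDF so frequent words like "the" and "of" count for less,
# then compare titles by the cosine similarity of their vectors.
vectors = TfidfVectorizer().fit_transform(titles)
print(cosine_similarity(vectors))  # pairwise similarity matrix; related titles score higher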
To add another tip to Fredrik's answer, you could also take inspiration from search-engine-like code, such as this one:
def dosearch(terms, searchtype, case, adddir, files = []):
    found = []
    if files != None:
        titlesrch = re.compile('>title<.*>/title<')
        for file in files:
            title = ""
            if not (file.lower().endswith("html") or file.lower().endswith("htm")):
                continue
            filecontents = open(BASE_DIR + adddir + file, 'r').read()
            titletmp = titlesrch.search(filecontents)
            if titletmp != None:
                title = filecontents.strip()[titletmp.start() + 7:titletmp.end() - 8]
            filecontents = remove_tags(filecontents)
            filecontents = filecontents.lstrip()
            filecontents = filecontents.rstrip()
            if dofind(filecontents, case, searchtype, terms) > 0:
                found.append(title)
                found.append(file)
    return found
Source and more information: http://www.zackgrossbart.com/hackito/search-engine-python/
Regards,
Max
One approach would be to pre-process all the strings before you compare them: convert all to lowercase and standardize whitespace (e.g., replace any run of whitespace with a single space). If punctuation is not important to your end goal, you can remove all punctuation characters as well.
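A minimal sketch of that pre-processing (Python 3 string methods; the helper name is my own):

import re
import string

def normalize(title):
    # Lowercase, strip punctuation, and collapse runs of whitespace.
    title = title.lower()
    title = title.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', title).strip()

print(normalize("Pirates  Of The Carribean!"))  # 'pirates of the carribean'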
Levenshtein distance is commonly used to determine the similarity of two strings; this should help you group strings which differ by small spelling errors.