What is a good strategy to group similar words? - python

Say I have a list of movie names with misspellings and small variations like this -
"Pirates of the Caribbean: The Curse of the Black Pearl"
"Pirates of the carribean"
"Pirates of the Caribbean: Dead Man's Chest"
"Pirates of the Caribbean trilogy"
"Pirates of the Caribbean"
"Pirates Of The Carribean"
How do I group or find such sets of words, preferably using python and/or redis?

Have a look at "fuzzy matching". The thread linked below has some great tools that calculate similarities between strings.
I'm especially fond of the difflib module
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
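Applied to the movie titles in the question, a minimal sketch of a greedy grouping with difflib.get_close_matches might look like this; the 0.6 cutoff and the case-folding are assumptions you would tune for your data:
from difflib import get_close_matches

titles = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the carribean",
    "Pirates of the Caribbean: Dead Man's Chest",
    "Pirates of the Caribbean trilogy",
    "Pirates of the Caribbean",
    "Pirates Of The Carribean",
]

groups = {}  # lower-cased representative -> list of original variants
for title in titles:
    # compare against the representatives collected so far
    match = get_close_matches(title.lower(), list(groups), n=1, cutoff=0.6)
    if match:
        groups[match[0]].append(title)
    else:
        groups[title.lower()] = [title]

for rep, variants in groups.items():
    print(rep, "->", variants)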

You might notice that similar strings share a large common subsequence, for example:
"Bla bla bLa" and "Bla bla bRa" => the common part is "Bla bla ba" (notice the third word)
To find the common part you can use a dynamic programming algorithm. A closely related measure is the Levenshtein distance (the distance between very similar strings is small, and between more different strings it is larger) - http://en.wikipedia.org/wiki/Levenshtein_distance.
For faster performance you may also try to adapt the Soundex algorithm - http://en.wikipedia.org/wiki/Soundex.
After calculating the distances between all your strings, you have to cluster them. The simplest way is k-means (but it requires you to choose the number of clusters in advance). If you don't know the number of clusters, you have to use hierarchical clustering. Note that in your situation the number of clusters is the number of distinct movie titles + 1 (for completely misspelled strings).
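As a rough, hypothetical sketch of the distance-then-cluster idea, here is a plain dynamic-programming Levenshtein distance combined with a naive greedy grouping; the threshold of 5 edits is an assumption, and a real hierarchical clustering library (e.g. scipy) would be a better fit for anything non-trivial:
def levenshtein(a, b):
    # classic dynamic-programming edit distance between strings a and b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def group_by_distance(strings, threshold=5):
    # greedy single-link grouping: join a string to the first existing
    # group whose representative is within `threshold` edits
    groups = []  # list of lists; the first element is the representative
    for s in strings:
        for g in groups:
            if levenshtein(s.lower(), g[0].lower()) <= threshold:
                g.append(s)
                break
        else:
            groups.append([s])
    return groups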

I believe there are in fact two distinct problems.
The first is spell correction. You can find a Python implementation here:
http://norvig.com/spell-correct.html
The second is more functional. Here is what I'd do after the spell correction: build a relation function.
related(sentence1, sentence2) if and only if sentence1 and sentence2 share rare words. By rare, I mean words other than (the, what, is, etc.). You can take a look at the TF/IDF scheme to determine whether two documents are related through their words. Just googling a bit I found this:
https://code.google.com/p/tfidf/
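As a minimal sketch of that TF/IDF relatedness idea, here is a version using scikit-learn (assuming it is installed; the 0.2 similarity threshold is an arbitrary assumption to tune):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the carribean",
    "Pirates of the Caribbean: Dead Man's Chest",
]

# TF/IDF automatically down-weights very common words like "the" and "of"
tfidf = TfidfVectorizer().fit_transform(docs)
similarity = cosine_similarity(tfidf)

def related(i, j, threshold=0.2):
    # hypothetical relation: documents i and j share enough rare words
    return similarity[i, j] >= threshold

print(related(0, 1), related(0, 2))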

To add another tip to Fredrik's answer, you could also take inspiration from search-engine-like code, such as this:
def dosearch(terms, searchtype, case, adddir, files = []):
    found = []
    if files != None:
        titlesrch = re.compile('>title<.*>/title<')
        for file in files:
            title = ""
            if not (file.lower().endswith("html") or file.lower().endswith("htm")):
                continue
            filecontents = open(BASE_DIR + adddir + file, 'r').read()
            titletmp = titlesrch.search(filecontents)
            if titletmp != None:
                title = filecontents.strip()[titletmp.start() + 7:titletmp.end() - 8]
            filecontents = remove_tags(filecontents)
            filecontents = filecontents.lstrip()
            filecontents = filecontents.rstrip()
            if dofind(filecontents, case, searchtype, terms) > 0:
                found.append(title)
                found.append(file)
    return found
Source and more information: http://www.zackgrossbart.com/hackito/search-engine-python/
Regards,
Max

One approach would be to pre-process all the strings before you compare them: convert everything to lowercase and standardize whitespace (e.g., replace any run of whitespace with a single space). If punctuation is not important to your end goal, you can remove all punctuation characters as well.
Levenshtein distance is commonly used to determine the similarity of two strings; this should help you group strings which differ only by small spelling errors.
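A minimal sketch of that normalization step (lowercasing, stripping punctuation and collapsing whitespace are assumptions to adapt to your data):
import re
import string

def normalize(title):
    # lowercase, drop punctuation, and collapse runs of whitespace
    title = title.lower()
    title = title.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', title).strip()

print(normalize("Pirates Of The  Carribean!"))  # -> "pirates of the carribean"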

Related

auto-correct the words from the list in python

I want to auto-correct the words which are in my list.
Say I have a list
kw = ['tiger','lion','elephant','black cat','dog']
I want to check if these words appear in my sentences. If they are wrongly spelled, I want to correct them. I don't intend to touch any words other than those in the given list.
Now I have list of str
s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs"]
Expected output:
['tiger','lion',None,'dog']
My Efforts:
import difflib
op = [difflib.get_close_matches(i,kw,cutoff=0.5) for i in s]
print(op)
My Output:
[[], [], [], ['dog']]
The problem with the above code is that I want to compare the entire sentence, and my kw list can have more than one word (up to 4-5 words).
If I lower the cutoff value, it starts returning words which it should not.
So even if I plan to create bigrams and trigrams from the given sentence, it would consume a lot of time.
So is there a way to implement this?
I have explored a few more libraries like autocorrect, hunspell, etc., but with no success.
You could implement something based on Levenshtein distance.
It's interesting to note elasticsearch's implementation: https://www.elastic.co/guide/en/elasticsearch/guide/master/fuzziness.html
Clearly, bieber is a long way from beaver—they are too far apart to be
considered a simple misspelling. Damerau observed that 80% of human
misspellings have an edit distance of 1. In other words, 80% of
misspellings could be corrected with a single edit to the original
string.
Elasticsearch supports a maximum edit distance, specified with the
fuzziness parameter, of 2.
Of course, the impact that a single edit has on a string depends on
the length of the string. Two edits to the word hat can produce mad,
so allowing two edits on a string of length 3 is overkill. The
fuzziness parameter can be set to AUTO, which results in the following
maximum edit distances:
0 for strings of one or two characters
1 for strings of three, four, or five characters
2 for strings of more than five characters
I like to use pyxDamerauLevenshtein myself.
pip install pyxDamerauLevenshtein
So you could do a simple implementation like:
from pyxdameraulevenshtein import damerau_levenshtein_distance

keywords = ['tiger','lion','elephant','black cat','dog']

def correct_sentence(sentence):
    new_sentence = []
    for word in sentence.split():
        budget = 2
        n = len(word)
        if n < 3:
            budget = 0
        elif 3 <= n < 6:
            budget = 1
        if budget:
            for keyword in keywords:
                if damerau_levenshtein_distance(word, keyword) <= budget:
                    new_sentence.append(keyword)
                    break
            else:
                new_sentence.append(word)
        else:
            new_sentence.append(word)
    return " ".join(new_sentence)
Just make sure you use a better tokenizer or this will get messy, but you get the point. Also note that this is unoptimized, and will be really slow with a lot of keywords. You should implement some kind of bucketing to not match all words with all keywords.
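As a rough sketch of that bucketing idea (assuming the pyxdameraulevenshtein package from above is installed), you could index the keywords by length so that each word is only compared against keywords whose length is within its edit budget:
from collections import defaultdict
from pyxdameraulevenshtein import damerau_levenshtein_distance

keywords = ['tiger', 'lion', 'elephant', 'black cat', 'dog']

# bucket keywords by length once, up front
by_length = defaultdict(list)
for kw in keywords:
    by_length[len(kw)].append(kw)

def candidates(word, budget):
    # only keywords whose length differs from `word` by at most `budget`
    # can possibly be within `budget` edits, so skip every other bucket
    n = len(word)
    for length in range(n - budget, n + budget + 1):
        for kw in by_length.get(length, []):
            if damerau_levenshtein_distance(word, kw) <= budget:
                yield kw

print(list(candidates('tyger', 1)))  # -> ['tiger']
For larger keyword lists, a BK-tree or an n-gram index would prune the comparisons even further.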
Here is one way using difflib.SequenceMatcher. The SequenceMatcher class lets you measure sentence similarity with its ratio method; you only need to provide a suitable threshold and keep the words whose ratio falls above it:
def find_similar_word(s, kw, thr=0.5):
    from difflib import SequenceMatcher
    out = []
    for i in s:
        f = False
        for j in i.split():
            for k in kw:
                if SequenceMatcher(a=j, b=k).ratio() > thr:
                    out.append(k)
                    f = True
                if f:
                    break
            if f:
                break
        else:
            out.append(None)
    return out
Output
find_similar_word(s, kw)
['tiger', 'lion', None, 'dog']
Although this is slightly different from your expected output (it is a list of lists instead of a list of strings), I think it is a step in the right direction. The reason I chose this method is that you can have multiple corrections per sentence. That is why I added another example sentence.
import difflib
import itertools
kw = ['tiger','lion','elephant','black cat','dog']
s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs", "A tyger is different from a doog"]
op = [[difflib.get_close_matches(j,kw,cutoff=0.5) for j in i.split()] for i in s]
op = [list(itertools.chain(*o)) for o in op]
print(op)
The generated output is:
[['tiger'], ['lion'], [], ['dog'], ['tiger', 'dog']]
The trick is to split all the sentences along the whitespaces.

Python Text processing (str.contains)

I am using str.contains for text analytics in Pandas. For the sentence "My latest Data job was an Analyst", I want to match a combination of the words "Data" and "Analyst", but at the same time I want to specify the number of words allowed between the two words in the combination (here there are 2 words between "Data" and "Analyst"). Currently I am using (DataFile.XXX.str.contains('job') & DataFile.XXX.str.contains('Analyst')) to get the counts for "job Analyst".
How can I specify the number of words between the two words in the str.contains syntax?
Thanks in advance
You can't. At least, not in a simple or standardized way.
Even the basics, like how you define a "word," are a lot more complex than you probably imagine. Both word parsing and lexical proximity (e.g. "are two words within distance D of one another in sentence s?") are the realm of natural language processing (NLP). NLP and proximity searches are not part of basic Pandas, nor of Python's standard string processing. You could import something like NLTK, the Natural Language Toolkit, to solve this problem in a general way, but that's a whole 'nother story.
Let's look at a simple approach. First you need a way to parse a string into words. The following is rough by NLP standards, but will work for simpler cases:
import re

def parse_words(s):
    """
    Simple parser to grab English words from string.
    CAUTION: A simplistic solution to a hard problem.
    Many possibly-important edge- and corner-cases
    not handled. Just one example: Hyphenated words.
    """
    return re.findall(r"\w+(?:'[st])?", s, re.I)
E.g.:
>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']
Then you need a way to find all the indices in a list where a target word is found:
def list_indices(target, seq):
    """
    Return all indices in seq at which the target is found.
    """
    indices = []
    cursor = 0
    while True:
        try:
            index = seq.index(target, cursor)
        except ValueError:
            return indices
        else:
            indices.append(index)
            cursor = index + 1
And finally a decision making wrapper:
def words_within(target_words, s, max_distance, case_insensitive=True):
    """
    Determine if the two target words are within max_distance positions of one
    another in the string s.
    """
    if len(target_words) != 2:
        raise ValueError('must provide 2 target words')
    # fold case for case insensitivity
    if case_insensitive:
        s = s.casefold()
        target_words = [tw.casefold() for tw in target_words]
        # for Python 2, replace `casefold` with `lower`
    # parse words and establish their logical positions in the string
    words = parse_words(s)
    target_indices = [list_indices(t, words) for t in target_words]
    # words not present
    if not target_indices[0] or not target_indices[1]:
        return False
    # compute all combinations of distance for the two words
    # (there may be more than one occurrence of a word in s)
    actual_distances = [i2 - i1 for i2 in target_indices[1] for i1 in target_indices[0]]
    # answer whether the minimum observed distance is <= our specified threshold
    return min(actual_distances) <= max_distance
So then:
>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True
>>> words_within(["think", 'moment'], s, 2)
False
The only thing left to do is map that back to Pandas:
import pandas as pd

df = pd.DataFrame({'desc': [
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.',
]})

df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
This is basically how you'd solve the problem. Keep in mind, it's a rough and simplistic solution. Some simply-posed questions are not simply-answered. NLP questions are often among them.

Python: Finding out if certain words in a list are actual English words or close to English words

I am working on a problem where I get a lot of words with their frequency of occurrence listed. Here is a sample of what I get:
drqsQAzaQ:1
OnKxnXecCINJ:1
QoGzQpg:1
cordially:1
Sponsorship:1
zQnpzQou:1
Thriving:1
febrero:1
rzaye:1
VseKEX:1
contributed:1
SNfXQoWV:1
hRwzmPR:1
Happening:1
TzJYAMWAQUIJTkWYBX:1
DYeUIqf:1
formats:1
eiizh:1
wIThY:1
infonewsletter:8
BusinessManager:10
MailScanner:12
As you can see, words like 'cordially' are actual English words, while words like 'infonewsletter' are not actual English words by themselves, but we can see that they are actually in English and mean something. However, words like 'OnKxnXecCINJ' do not mean anything (actually they are words from another charset, but I am ignoring them in my exercise and sticking to English) - I can discard them as junk
What would be the best method in Python to detect and eliminate such junk words from a given dictionary such as the one above?
I tried examining each word using nltk.corpus.words.words(), but it is killing my performance as my data set is very huge. Moreover, I am not certain whether this will give me True for words like 'infonewsletter'.
Please help.
Thanks,
Mahesh.
If the words are from a completely different script within Unicode, like CJK characters or Greek, Cyrillic, or Thai, you could use unicodedata.category to see whether they're letters to begin with (the category starts with L):
>>> import unicodedata
>>> unicodedata.category('a')
'Ll'
>>> unicodedata.category('E')
'Lu'
>>> unicodedata.category('中')
'Lo'
>>> [unicodedata.category(i).startswith('L') for i in 'aE中,']
[True, True, True, False]
Then you can use unicodedata.name to check whether they're Latin letters:
>>> 'LATIN' in unicodedata.name('a')
True
>>> 'LATIN' in unicodedata.name('中')
False
Presumably it is not an English-language word if it has non-Latin letters in it.
Otherwise, you could use a letter bigram/trigram classifier to find out if there is a high probability these are English words. For example, OnKxnXecCINJ contains Kxn, a trigram that is very unlikely to appear in any single English word, or even in any concatenation of two words.
You can build one yourself from the corpus by splitting words into character trigrams, or you can use any of the existing libraries like langdetect or langid.
Also, make sure the corpus is stored as a set for fast `in` operations. Only after the classifier says there is a high probability the word is English, and the word still fails to be found in the set, consider that it is like 'infonewsletter' - a concatenation of several words; split it recursively into smaller chunks and check that each part is found in the corpus.
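A hypothetical sketch of that pipeline, assuming the nltk words corpus is available (via nltk.download('words')); the "every trigram must be known" test and the minimum chunk length of 3 are simplifying assumptions:
from nltk.corpus import words

vocab = set(w.lower() for w in words.words())
# every character trigram that occurs in some corpus word
known_trigrams = set(w[i:i + 3] for w in vocab for i in range(len(w) - 2))

def looks_english(word):
    # reject words containing a character trigram never seen in the corpus
    word = word.lower()
    return all(word[i:i + 3] in known_trigrams for i in range(len(word) - 2))

def splits_into_words(word, min_len=3):
    # True if the word is in the corpus or can be split into corpus words
    if word in vocab:
        return True
    return any(word[:i] in vocab and splits_into_words(word[i:], min_len)
               for i in range(min_len, len(word) - min_len + 1))

for w in ['cordially', 'infonewsletter', 'OnKxnXecCINJ']:
    print(w, looks_english(w), splits_into_words(w.lower()))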
Thank you. I am trying out this approach. However, I have a question. I have a word 'vdgutumvjaxbpz'. I know this is junk. I wrote some code to get all grams of this word, 4-gram and higher. This was the result:
['vdgu', 'dgut', 'gutu', 'utum', 'tumv', 'umvj', 'mvja', 'vjax', 'jaxb', 'axbp', 'xbpz', 'vdgut', 'dgutu', 'gutum', 'utumv', 'tumvj', 'umvja', 'mvjax', 'vjaxb', 'jaxbp', 'axbpz', 'vdgutu', 'dgutum', 'gutumv', 'utumvj', 'tumvja', 'umvjax', 'mvjaxb', 'vjaxbp', 'jaxbpz', 'vdgutum', 'dgutumv', 'gutumvj', 'utumvja', 'tumvjax', 'umvjaxb', 'mvjaxbp', 'vjaxbpz', 'vdgutumv', 'dgutumvj', 'gutumvja', 'utumvjax', 'tumvjaxb', 'umvjaxbp', 'mvjaxbpz', 'vdgutumvj', 'dgutumvja', 'gutumvjax', 'utumvjaxb', 'tumvjaxbp', 'umvjaxbpz', 'vdgutumvja', 'dgutumvjax', 'gutumvjaxb', 'utumvjaxbp', 'tumvjaxbpz', 'vdgutumvjax', 'dgutumvjaxb', 'gutumvjaxbp', 'utumvjaxbpz', 'vdgutumvjaxb', 'dgutumvjaxbp', 'gutumvjaxbpz', 'vdgutumvjaxbp', 'dgutumvjaxbpz', 'vdgutumvjaxbpz']
Now, I compared each gram result with nltk.corpus.words.words() and found the intersection of the 2 sets.
vocab = nltk.corpus.words.words()
vocab = set(w.lower().strip() for w in vocab)

def GetGramsInVocab(listOfGrams, vocab):
    text_vocab = set(w.lower() for w in listOfGrams if w.isalpha())
    common = text_vocab & vocab
    return list(common)
However, the intersection contains 'utum', whereas I expected it to be NULL.
Also,
print("utum" in vocab)
returned true.
This does not make sense to me. I peeked into the vocabulary and found 'utum' in a few words like autumnian and metascutum
However, 'utum' is not a word by itself and I expected nltk to return false. Is there a more accurate corpus I can check against that would do whole word comparisons?
Also, I did a simple set operations test:
set1 = {"cutums" "acutum"}
print("utum" in set1)
This returned False as expected.
I guess I am confused as to why the code says 'utum' is present in the nltk words corpus.
Thanks,
Mahesh.

Probability count/frequency of related words?

I'm looking for a way to produce a numerical probability value for single words which share a common root word/meaning.
Users will generate content using words like "dancer," "dancing," "danced."
If "dancer" is submitted 30 times, and dancing 5 times, I need just one value "dance:35" which catches all of them.
But when users also submit words like "accordance," it should not affect my "dance" count, but instead add to a separate count along with words like "according" and "accordingly."
Also, I don't have a pre-defined list of root words to look out for. I'll need to create it dynamically based on that user-generated content.
So my question is, what's probably the best way to pull this off? I'm sure there will be no perfect solution, but I figure someone on here can probably come up with a better way than me.
My idea so far is to assume that most meaningful words will have at least 3 or 4 letters. So for every word I encounter whose length is greater than 4, trim it down to 4 ("dancers" becomes "danc"), check my list of words to see if I've encountered it before; if so, increment its count, and if not, add it to that list. Repeat.
I see there are some similar questions on here. But I haven't found any answer which considers roots AND I can implement in python. Answers seem to be for one or the other.
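Taken literally, the prefix-trimming idea described above might look like the following sketch (the 4-character cutoff is the question's own assumption; the answers below explain why stemming is a better fit):
from collections import Counter

def prefix_counts(words, prefix_len=4):
    # count words by a fixed-length prefix, as described in the question
    counts = Counter()
    for w in words:
        w = w.lower()
        key = w[:prefix_len] if len(w) > prefix_len else w
        counts[key] += 1
    return counts

print(prefix_counts(['dancer', 'dancing', 'danced', 'accordance', 'according']))
# e.g. Counter({'danc': 3, 'acco': 2}); any unrelated word starting with
# 'danc' or 'acco' would also get lumped in, which is the weakness the
# answers point out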
You don't need a Python wrapper for a Java lib, nltk's got Snowball! :)
>>> from nltk.stem import SnowballStemmer as SS
>>> stemmer = SS('english')
>>> stemmer.stem('dance')
u'danc'
>>> stemmer.stem('danced')
u'danc'
>>> stemmer.stem('dancing')
u'danc'
>>> stemmer.stem('dancer')
u'dancer'
>>> stemmer.stem('accordance')
u'accord'
Stemming isn't always going to give you exact roots, but it's a great start.
Following is an example of using stems. I'm building a dictionary of stem: (word, count) while choosing the shortest word possible for each stem. So ['dancing', 'danced', 'dances', 'dance', 'dancer'] converts to {'danc': ('dance', 4), 'dancer': ('dancer', 1)}
Example Code: (text taken from http://en.wikipedia.org/wiki/Dance)
import re
from nltk.stem import SnowballStemmer as SS

text = """Dancing has evolved many styles. African dance is interpretative.
Ballet, ballroom (such as the waltz), and tango are classical styles of dance
while square dancing and the electric slide are forms of step dances.
More recently evolved are breakdancing and other forms of street dance,
often associated with hip hop culture.
Every dance, no matter what style, has something in common.
It not only involves flexibility and body movement, but also physics.
If the proper physics are not taken into consideration, injuries may occur."""

#extract words
words = [word.lower() for word in re.findall(r'\w+',text)]

stemmer = SS('english')
counts = dict()

#count stems and extract shortest words possible
for word in words:
    stem = stemmer.stem(word)
    if stem in counts:
        shortest,count = counts[stem]
        if len(word) < len(shortest):
            shortest = word
        counts[stem] = (shortest,count+1)
    else:
        counts[stem]=(word,1)

#convert {key: (word, count)} to [(word, count, key)] for convenient sort and print
output = [wordcount + (root,) for root,wordcount in counts.items()]

#trick to sort output by count (descending) & word (alphabetically)
output.sort(key=lambda x: (-x[1],x[0]))

for item in output:
    print '%s:%d (Root: %s)' % item
Output:
dance:7 (Root: danc)
and:4 (Root: and)
are:4 (Root: are)
of:3 (Root: of)
style:3 (Root: style)
the:3 (Root: the)
evolved:2 (Root: evolv)
forms:2 (Root: form)
has:2 (Root: has)
not:2 (Root: not)
physics:2 (Root: physic)
african:1 (Root: african)
also:1 (Root: also)
as:1 (Root: as)
associated:1 (Root: associ)
ballet:1 (Root: ballet)
ballroom:1 (Root: ballroom)
body:1 (Root: bodi)
breakdancing:1 (Root: breakdanc)
---truncated---
I wouldn't recommend lemmatization for your specific needs:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('dance')
'dance'
>>> lmtzr.lemmatize('dancer')
'dancer'
>>> lmtzr.lemmatize('dancing')
'dancing'
>>> lmtzr.lemmatize('dances')
'dance'
>>> lmtzr.lemmatize('danced')
'danced'
Substrings are not a good idea because they will always fail at a certain point, and many times fail miserably.
fixed-length: the pseudo-words 'dancitization' and 'dancendence' will match 'dance' at 4 and 5 characters respectively.
ratio: a low ratio will return fakes (like the ones above)
ratio: a high ratio will not match enough (e.g. 'running')
But with stemming you get:
>>> stemmer.stem('dancitization')
u'dancit'
>>> stemmer.stem('dancendence')
u'dancend'
>>> #since dancitization gives us dancit, let's try dancization to get danc
>>> stemmer.stem('dancization')
u'dancize'
>>> stemmer.stem('dancation')
u'dancat'
Which is an impressive no-match result for the stem "danc". Even considering 'dancer' does not stem to 'danc', the accuracy is pretty high overall.
I hope this helps you get started.
What you are looking for are the stems of the words (a more technical term than the linguistic "root"). You are right in assuming that there is no perfect solution; all approaches will either have imperfect analysis or lack coverage.
Basically the best way to go is either with a word-list that contains stems, or a stemming algorithm.
Check the first answer here to get a python-based solution:
How do I do word Stemming or Lemmatization?
I am using Snowball in all my Java-based projects and it works perfectly for my purposes (it is very fast, too, and covers a wide range of languages). It also seems to have a Python wrapper:
http://snowball.tartarus.org/download.php

NER naive algorithm

I never really dealt with NLP, but I had an idea about NER which should NOT have worked and somehow DOES work exceptionally well in one case. I do not understand why it works, why it doesn't work elsewhere, or whether it can be extended.
The idea was to extract names of the main characters in a story through:
Building a dictionary for each word
Filling for each word a list with the words that appear right next to it in the text
Finding for each word a word with the max correlation of lists (meaning that the words are used similarly in the text)
Given one name of a character in the story, the words that are used like it should be character names as well (bogus, that is what should not work, but since I never dealt with NLP until this morning I started the day naive)
I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
Though it filters for upper case words (and receives "Alice" as the word to cluster around), originally there are ~500 upper case words, and it's still pretty spot on as far as main characters go.
It does not work that well with other characters and in other stories, though gives interesting results.
Any idea whether this approach is usable or extendable, or why it works at all in this story for "Alice"?
Thanks!
#English Name recognition
import re
import sys
import random
from string import upper

def mimic_dict(filename):
    dict = {}
    f = open(filename)
    text = f.read()
    f.close()
    prev = ""
    words = text.split()
    for word in words:
        m = re.search("\w+",word)
        if m == None:
            continue
        word = m.group()
        if not prev in dict:
            dict[prev] = [word]
        else :
            dict[prev] = dict[prev] + [word]
        prev = word
    return dict

def main():
    if len(sys.argv) != 2:
        print 'usage: ./main.py file-to-read'
        sys.exit(1)

    dict = mimic_dict(sys.argv[1])
    upper = []
    for e in dict.keys():
        if len(e) > 1 and e[0].isupper():
            upper.append(e)
    print len(upper),upper

    exclude = ["ME","Yes","English","Which","When","WOULD","ONE","THAT","That","Here","and","And","it","It","me"]
    exclude = [ x for x in exclude if dict.has_key(x)]
    for s in exclude :
        del dict[s]

    scores = {}
    for key1 in dict.keys():
        max = 0
        for key2 in dict.keys():
            if key1 == key2 : continue
            a = dict[key1]
            k = dict[key2]
            diff = []
            for ia in a:
                if ia in k and ia not in diff:
                    diff.append( ia)
            if len(diff) > max:
                max = len(diff)
                scores[key1]=(key2,max)

    dictscores = {}
    names = []
    for e in scores.keys():
        if scores[e][0]=="Alice" and e[0].isupper():
            names.append(e)
    print len(names), names

if __name__ == '__main__':
    main()
From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".
The difficulty in NER (at least for English) is not finding the names; it's detecting their full extent (the "March Hare" example); detecting them even at the start of a sentence, where all words are capitalized; classifying them as person/organisation/location/etc.
Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect
[ORG Microsoft] CEO [PER Steve Ballmer]
What you are doing is building a distributional thesaurus: finding words which are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but it means they are in a way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear in similar contexts because they share some characteristics (e.g. they all speak, walk, cry, etc. - the details of Alice in Wonderland escape me), they are more likely to be retrieved. It turns out that whether a word is capitalised or not is a very useful piece of information when working out if something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.
Have a look at the following papers to get an idea of what people do with distributional semantics:
Lin 1998
Grefenstette 1994
Schuetze 1998
To put your idea in the terminology used in these papers, Step 2 is building a context vector for the word from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).
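For reference, a minimal sketch of the Jaccard coefficient over the neighbour lists that your Step 2 builds (a hypothetical stand-in for the diff-counting loop in your code):
def jaccard(context_a, context_b):
    # Jaccard coefficient between two context word lists:
    # |intersection| / |union|
    a, b = set(context_a), set(context_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# e.g. neighbour words collected for two words from a tiny text
print(jaccard(['said', 'was', 'went'], ['said', 'went', 'smiled']))  # 0.5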
As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran this against a hand-annotated corpus, you would find it is very bad at identifying the boundaries of named entities, and it does not even attempt to guess whether they are people, places or organisations... Nevertheless, it is a great first attempt at NLP, keep it up!
