NER naive algorithm - python

I have never really dealt with NLP, but I had an idea about NER which should NOT have worked and somehow DOES work exceptionally well in one case. I do not understand why it works, why it doesn't work elsewhere, or whether it can be extended.
The idea was to extract the names of the main characters in a story by:
Building a dictionary with an entry for each word
Filling, for each word, a list of the words that appear right next to it in the text
Finding, for each word, the word whose list overlaps most with its own (meaning the two words are used similarly in the text)
Assuming that, given the name of one character in the story, the words used like it should be character names as well (bogus, and that is what should not have worked, but since I had never dealt with NLP until this morning I started the day naive)
I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
Though it filters for upper-case words (and receives "Alice" as the word to cluster around), there are originally ~500 upper-case words, and the result is still pretty spot on as far as the main characters go.
It does not work that well for other characters or in other stories, though it gives interesting results.
Any idea whether this approach is usable or extendable, or why it works at all in this story for "Alice"?
Thanks!
#English Name recognition
import re
import sys
import random
from string import upper

def mimic_dict(filename):
    dict = {}
    f = open(filename)
    text = f.read()
    f.close()
    prev = ""
    words = text.split()
    for word in words:
        m = re.search(r"\w+", word)
        if m is None:
            continue
        word = m.group()
        if not prev in dict:
            dict[prev] = [word]
        else:
            dict[prev] = dict[prev] + [word]
        prev = word
    return dict
def main():
    if len(sys.argv) != 2:
        print 'usage: ./main.py file-to-read'
        sys.exit(1)
    dict = mimic_dict(sys.argv[1])
    upper = []
    for e in dict.keys():
        if len(e) > 1 and e[0].isupper():
            upper.append(e)
    print len(upper), upper
    exclude = ["ME","Yes","English","Which","When","WOULD","ONE","THAT","That","Here","and","And","it","It","me"]
    exclude = [x for x in exclude if dict.has_key(x)]
    for s in exclude:
        del dict[s]
    scores = {}
    for key1 in dict.keys():
        max = 0
        for key2 in dict.keys():
            if key1 == key2: continue
            a = dict[key1]
            k = dict[key2]
            diff = []
            for ia in a:
                if ia in k and ia not in diff:
                    diff.append(ia)
            if len(diff) > max:
                max = len(diff)
                scores[key1] = (key2, max)
    dictscores = {}
    names = []
    for e in scores.keys():
        if scores[e][0] == "Alice" and e[0].isupper():
            names.append(e)
    print len(names), names

if __name__ == '__main__':
    main()

From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".
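To make "proper evaluation" concrete: you would compare the extracted list against a hand-annotated gold list and report precision and recall. A minimal sketch of that comparison, with a made-up gold list purely for illustration:
predicted = {"Mouse", "Latitude", "William", "Rabbit", "Queen", "Hare"}
gold = {"Mouse", "White Rabbit", "Queen of Hearts", "March Hare", "Alice"}  # hypothetical annotations

true_positives = predicted & gold
precision = len(true_positives) / len(predicted)  # fraction of retrieved items that are correct
recall = len(true_positives) / len(gold)          # fraction of correct items that were retrieved
print("precision=%.2f recall=%.2f" % (precision, recall))
# exact-match scoring is harsh on spans: "Hare" gets no credit for "March Hare"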
The difficulty in NER (at least for English) is not finding the names; it's detecting their full extent (the "March Hare" example), detecting them even at the start of a sentence where all words are capitalized, and classifying them as person/organisation/location/etc.
Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect
[ORG Microsoft] CEO [PER Steve Ballmer]
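For comparison, this kind of tagging is what a trained NER system gives you out of the box. A minimal sketch, assuming spaCy and its small English model (en_core_web_sm) are installed; this is an illustration, not part of the original discussion:
import spacy

# assumes the model was fetched once with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Microsoft CEO Steve Ballmer announced the deal.")
for ent in doc.ents:
    # each entity carries its full text span and a type label
    print(ent.text, ent.label_)
# expected (roughly): Microsoft ORG / Steve Ballmer PERSON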

What you are doing is building a distributional thesaurus: finding words which are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but it means they are in some way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear in similar contexts because they share some characteristics (e.g. they all speak, walk, cry, etc. -- the details of Alice in Wonderland escape me), they are more likely to be retrieved. It turns out that whether a word is capitalised or not is a very useful piece of information when working out if something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.
Have a look at the following papers to get an idea of what people do with distributional semantics:
Lin 1998
Grefenstette 1994
Schuetze 1998
To put your idea in the terminology used in these papers, Step 2 is building a context vector for each word from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).
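For concreteness, the Jaccard coefficient over two context sets is just the size of the intersection divided by the size of the union. A small sketch (illustrative only; the context lists are made up and this is not the poster's code):
def jaccard(context_a, context_b):
    """Jaccard coefficient: intersection size divided by union size."""
    a, b = set(context_a), set(context_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# hypothetical context lists, in the spirit of dict["Alice"] and dict["Queen"] above
alice_context = ["said", "thought", "was", "went", "replied"]
queen_context = ["said", "shouted", "was", "went"]
print(jaccard(alice_context, queen_context))  # 0.5: 3 shared words out of 6 distinct ones
The posted code effectively scores by the raw overlap (the intersection size) without normalising by the union, which tends to favour the neighbours of frequent words.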
As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran this against a hand-annotated corpus you would find it is very bad at identifying the boundaries of named entities, and it does not even attempt to guess whether they are people, places or organisations... Nevertheless, it is a great first attempt at NLP, keep it up!

Related

How to classify derived words that share meaning as the same tokens?

I would like to count unrelated words in an article, but I have trouble grouping words of the same meaning that are derived from one another.
For instance, I would like gasoline and gas to be treated as the same token in sentences like The price of gasoline has risen. and "Gas" is a colloquial form of the word gasoline in North American English. Conversely, in BE the term would be "petrol". Therefore, if these two sentences comprised the entire article, the count for gas (or gasoline) would be 3 (petrol would not be counted).
I have tried using NLTK's stemmers and lemmatizers, but to no avail. Most seem to reproduce gas as gas and gasoline as gasolin, which is not helpful for my purposes at all. I understand that this is the usual behaviour. I have checked out a thread that seems a little similar; however, the answers there are not completely applicable to my case, as I require the words to be derived from one another.
How can I treat derived words of the same meaning as the same token in order to count them together?
I propose a two-step approach:
First, find synonyms by comparing word embeddings (only non-stopwords). This should remove similarly written words that mean something else, such as gasoline and gaseous.
Then, check whether the synonyms share part of their stem: essentially, whether "gas" is in "gasolin" or the other way around. This should suffice because you only compare your synonym candidates.
import spacy
import itertools
from nltk.stem.porter import *

threshold = 0.6

#compare the stems of the synonyms
stemmer = PorterStemmer()
def compare_stems(a, b):
    if stemmer.stem(a) in stemmer.stem(b):
        return True
    if stemmer.stem(b) in stemmer.stem(a):
        return True
    return False

candidate_synonyms = {}
#add a candidate to the candidate dictionary of sets
def add_to_synonym_dict(a, b):
    if a not in candidate_synonyms:
        if b not in candidate_synonyms:
            candidate_synonyms[a] = {a, b}
            return
        a, b = b, a
    candidate_synonyms[a].add(b)

nlp = spacy.load('en_core_web_lg')
text = u'The price of gasoline has risen. "Gas" is a colloquial form of the word gasoline in North American English. Conversely in BE the term would be petrol. A gaseous state has nothing to do with oil.'
words = nlp(text)

#compare every word with every other word, if they are similar
for a, b in itertools.combinations(words, 2):
    #check if one of the word pairs are stopwords or punctuation
    if a.is_stop or b.is_stop or a.is_punct or b.is_punct:
        continue
    if a.similarity(b) > threshold:
        if compare_stems(a.text.lower(), b.text.lower()):
            add_to_synonym_dict(a.text.lower(), b.text.lower())

print(candidate_synonyms)
#output: {'gasoline': {'gas', 'gasoline'}}
Then you can count your synonym candidates based on their appearances in the text.
Note: I chose the synonym threshold of 0.6 arbitrarily. You would probably want to test which threshold suits your task. Also, my code is just a quick and dirty example; this could be done a lot more cleanly.
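For the counting step mentioned above, a minimal sketch along these lines could work (illustrative only; it builds on the candidate_synonyms dictionary and the spaCy Doc words from the snippet):
from collections import Counter

# map every synonym back to one canonical form, e.g. 'gas' -> 'gasoline'
canonical = {}
for head, group in candidate_synonyms.items():
    for synonym in group:
        canonical[synonym] = head

counts = Counter()
for token in words:  # 'words' is the spaCy Doc from above
    if token.is_stop or token.is_punct:
        continue
    lowered = token.text.lower()
    counts[canonical.get(lowered, lowered)] += 1

print(counts['gasoline'])  # 'gas' and 'gasoline' are now counted together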

Matching two-word variants with each other if they don't match alphabetically

I'm doing an NLP project with my university, collecting data on words in Icelandic that exist both spelled with an i and with a y (they sound the same in Icelandic, fyi) where both variants are actual words but do not mean the same thing. Examples of this include leyti (an approximation in time) and leiti (a grassy hill), or kirkja (church) and kyrkja (choke). I have a dataset of 2 million words.
I have already collected two wordlists, one of which includes words spelled with a y and one which includes the same words spelled with an i (although they don't seem to match up completely, as the y-list is a bit longer, but that's a separate issue). My problem is that I want to end up with pairs of words like leyti - leiti, kyrkja - kirkja, etc. But, as y is much later in the alphabet than i, it's no good just sorting the lists and pairing them up that way. I also tried zipping the lists while checking the first few letters to see if I can find a match, but that leaves out all words that have y or i as the first letter. Do you have a suggestion on how I might implement this?
Try something like this:
s = "trydfydfgfay"
l = list(s)
candidateWords = []
for idx, c in enumerate(l):
if c=='y':
newList = l.copy()
newList[idx] = "i"
candidateWord = "".join(newList)
candidateWords.append(candidateWord)
print(candidateWords)
#['tridfydfgfay', 'trydfidfgfay', 'trydfydfgfai']
#look up these words to see if they are real words
I do not think this is a programming challenge; it looks more like an NLP challenge in itself. Spelling variations are often a hurdle one faces during preprocessing.
I would suggest that you use an edit-distance based approach to identify word pairs while allowing for some variation. Specifically for the kind of problem you have described above, I would recommend the "Jaro-Winkler distance". This method gives a higher similarity score to word pairs that show variations between specific character pairs, say y and i.
All these approaches are implemented in the Jellyfish library.
You may also have a look at the fuzzywuzzy package. Hope this helps.
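For illustration, a rough sketch of the pairing with Jellyfish (the word lists are placeholders, and depending on your jellyfish version the function is named jaro_winkler or jaro_winkler_similarity):
import jellyfish

y_words = ["leyti", "kyrkja"]            # placeholder lists; substitute your own wordlists
i_words = ["leiti", "kirkja", "karfa"]

pairs = []
for yw in y_words:
    # pick the i-spelled word with the highest Jaro-Winkler similarity
    best = max(i_words, key=lambda iw: jellyfish.jaro_winkler_similarity(yw, iw))
    if jellyfish.jaro_winkler_similarity(yw, best) > 0.85:  # threshold chosen arbitrarily
        pairs.append((yw, best))

print(pairs)  # expected: [('leyti', 'leiti'), ('kyrkja', 'kirkja')]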
So this accomplishes my task; it's kind of an easy, not-that-pretty solution I suppose, but it works:
wordlist = open("data.txt", "r", encoding='utf-8')
y_words = open("y_wordlist.txt", "w+", encoding='utf-8')
all_words = []
y_words = []
for word in wordlist:
word = word.lower()
all_words.append(word)
for word in all_words:
if "y" in word:
y_words.append(word)
word_dict = {}
for word in y_words:
newwith1y = word.replace("y", "i",1)
newwith2y = word.replace("y", "i",2)
newyback = word[::-1].replace("y", "i",1)
newyback = newyback[::-1]
word_dict[word] = newwith1y
word_dict[word] = newwith2y
word_dict[word] = newyback
for key, value in word_dict.items():
if value in all_words:
y_wordlist.write(key)
y_wordlist.write(" - ")
y_wordlist.write(value)
y_wordlist.write("\n")

Optimizing the process of finding word association strengths from an input text

I have written the following (crude) code to find the association strengths among the words in a given piece of text.
import re
import numpy as np   # needed for np.zeros / np.triu_indices_from below
import pandas as pd  # needed for the DataFrames below

## The first paragraph of Wikipedia's article on itself - you can try with other pieces of text with preferably more words (to produce more meaningful word pairs)
text = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger.[10] Sanger coined its name,[11][12] as a portmanteau of wiki[notes 3] and 'encyclopedia'. Initially an English-language encyclopedia, versions in other languages were quickly developed. With 5,748,461 articles,[notes 4] the English Wikipedia is the largest of the more than 290 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 40 million articles in 301 different languages[14] and by February 2014 it had reached 18 billion page views and nearly 500 million unique visitors per month.[15] In 2005, Nature published a peer review comparing 42 science articles from Encyclopadia Britannica and Wikipedia and found that Wikipedia's level of accuracy approached that of Britannica.[16] Time magazine stated that the open-door policy of allowing anyone to edit had made Wikipedia the biggest and possibly the best encyclopedia in the world and it was testament to the vision of Jimmy Wales.[17] Wikipedia has been criticized for exhibiting systemic bias, for presenting a mixture of 'truths, half truths, and some falsehoods',[18] and for being subject to manipulation and spin in controversial topics.[19] In 2017, Facebook announced that it would help readers detect fake news by suitable links to Wikipedia articles. YouTube announced a similar plan in 2018."
text = re.sub("[\[].*?[\]]", "", text) ## Remove brackets and anything inside it.
text=re.sub(r"[^a-zA-Z0-9.]+", ' ', text) ## Remove special characters except spaces and dots
text=str(text).lower() ## Convert everything to lowercase
## Can add other preprocessing steps, depending on the input text, if needed.
from nltk.corpus import stopwords
import nltk
stop_words = stopwords.words('english')
desirable_tags = ['NN'] # We want only nouns - can also add 'NNP', 'NNS', 'NNPS' if needed, depending on the results
word_list = []
for sent in text.split('.'):
for word in sent.split():
'''
Extract the unique, non-stopword nouns only
'''
if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
word_list.append(word)
'''
Construct the association matrix, where we count 2 words as being associated
if they appear in the same sentence.
Later, I'm going to define associations more properly by introducing a
window size (say, if 2 words seperated by at most 5 words in a sentence,
then we consider them to be associated)
'''
table = np.zeros((len(word_list),len(word_list)), dtype=int)
for sent in text.split('.'):
for i in range(len(word_list)):
for j in range(len(word_list)):
if word_list[i] in sent and word_list[j] in sent:
table[i,j]+=1
df = pd.DataFrame(table, columns=word_list, index=word_list)
# Count the number of occurrences of each word from word_list in the text
all_words = pd.DataFrame(np.zeros((len(df), 2)), columns=['Word', 'Count'])
all_words.Word = df.index
for sent in text.split('.'):
count=0
for word in sent.split():
if word in word_list:
all_words.loc[all_words.Word==word,'Count'] += 1
# Sort the word pairs in decreasing order of their association strengths
df.values[np.triu_indices_from(df, 0)] = 0 # Make the upper triangle values 0
assoc_df = pd.DataFrame(columns=['Word 1', 'Word 2', 'Association Strength (Word 1 -> Word 2)'])
for row_word in df:
for col_word in df:
'''
If Word1 occurs 10 times in the text, and Word1 & Word2 occur in the same sentence 3 times,
the association strength of Word1 and Word2 is 3/10 - Please correct me if this is wrong.
'''
assoc_df = assoc_df.append({'Word 1': row_word, 'Word 2': col_word,
'Association Strength (Word 1 -> Word 2)': df[row_word][col_word]/all_words[all_words.Word==row_word]['Count'].values[0]}, ignore_index=True)
assoc_df.sort_values(by='Association Strength (Word 1 -> Word 2)', ascending=False)
This produces the word associations like so:
Word 1 Word 2 Association Strength (Word 1 -> Word 2)
330 wiki encyclopedia 3.0
895 encyclopadia found 1.0
1317 anyone edit 1.0
754 peer science 1.0
755 peer encyclopadia 1.0
756 peer britannica 1.0
...
...
...
However, the code contains a lot of for loops, which hurts its running time. Especially the last part (sorting the word pairs in decreasing order of their association strengths) consumes a lot of time, as it computes the association strengths of n^2 word pairs/combinations, where n is the number of words we are interested in (those in word_list in my code above).
So, the following are what I would like some help on:
How do I vectorize the code, or otherwise make it more efficient?
Instead of producing n^2 combinations/pairs of words in the last step, is there any way to prune some of them before producing them? I am going to prune some of the useless/meaningless pairs by inspection after they are produced anyway.
Also, I know this does not fall within the purview of a coding question, but I would love to know if there's any mistake in my logic, especially when calculating the word association strengths.
Since you asked about your specific code, I will not go into alternate libraries. I will be mostly concentrating on points 1) and 2) of your question:
Instead of iterating through the whole word list twice (i and j), you can already cut the processing time by roughly half by iterating j only between i + 1 and the end of the list. This removes duplicate pairs (index 24 and 42 as well as index 42 and 24) and the identical pair (index 42 and 42).
for sent in text.split('.'):
    for i in range(len(word_list)):
        for j in range(i+1, len(word_list)):
            if word_list[i] in sent and word_list[j] in sent:
                table[i, j] += 1
Be careful with this, though: the in operator will also match partial words (like and in hand).
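One simple way around the partial-word matching is to test membership against the sentence's set of words rather than the raw string; a tiny sketch (not from the original answer):
sentence = "a bird in the hand"
print("and" in sentence)               # True, because 'and' is a substring of 'hand'
print("and" in set(sentence.split()))  # False, membership is now checked per whole word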
Of course, you could also remove the j iteration completely by first filtering for all words in your word list and then pairing them afterward:
word_list = set()  # Using a set instead of a list makes lookups faster since this is a hashed structure

for sent in text.split('.'):
    for word in sent.split():
        '''
        Extract the unique, non-stopword nouns only
        '''
        if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
            word_list.add(word)

(...)

for sent in text.split('.'):
    found_words = [word for word in sent.split() if word in word_list]  # list comprehensions are usually faster than pure for loops
    # If you want to count duplicate words, then leave the whole line below out.
    found_words = tuple(frozenset(found_words))  # make every word unique using a set and then iterable by index again by converting it into a tuple.
    for i in range(len(found_words)):
        for j in range(i+1, len(found_words)):
            table[i, j] += 1
In general, you should really think about using external libraries for most of this, though. As some of the comments on your question already pointed out, splitting on . may give you wrong results; the same goes for splitting on whitespace, for example with words that are separated by a - or followed by a ,.
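As one example of such a library, NLTK ships sentence and word tokenizers that handle most of this punctuation for you. A sketch, assuming the punkt tokenizer data has been downloaded:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt')  # one-time download of the tokenizer models
sample = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger. Sanger coined its name."
for sentence in sent_tokenize(sample):
    print(word_tokenize(sentence))
# tokens such as '15,' and 'Sanger.' are split cleanly into words and punctuation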

Python Text processing (str.contains)

I am using str.contains for text analytics in Pandas. For the sentence "My latest Data job was an Analyst", I want a combination of the words "Data" & "Analyst", but at the same time I want to specify the number of words allowed between the two words in the combination (here there are 2 words between "Data" and "Analyst"). Currently I am using (DataFile.XXX.str.contains('job') & DataFile.XXX.str.contains('Analyst')) to get the counts for "job Analyst".
How can I specify the number of words between the 2 words in the str.contains syntax?
Thanks in advance
You can't. At least, not in a simple or standardized way.
Even the basics, like how you define a "word," are a lot more complex than you probably imagine. Both word parsing and lexical proximity (e.g. "are two words within distance D of one another in sentence s?") are the realm of natural language processing (NLP). NLP and proximity searches are not part of basic Pandas, nor of Python's standard string processing. You could import something like NLTK, the Natural Language Toolkit, to solve this problem in a general way, but that's a whole 'nother story.
Let's look at a simple approach. First you need a way to parse a string into words. The following is rough by NLP standards, but will work for simpler cases:
import re

def parse_words(s):
    """
    Simple parser to grab English words from a string.
    CAUTION: A simplistic solution to a hard problem.
    Many possibly-important edge- and corner-cases
    are not handled. Just one example: hyphenated words.
    """
    return re.findall(r"\w+(?:'[st])?", s, re.I)
E.g.:
>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']
Then you need a way to find all the indices in a list where a target word is found:
def list_indices(target, seq):
    """
    Return all indices in seq at which the target is found.
    """
    indices = []
    cursor = 0
    while True:
        try:
            index = seq.index(target, cursor)
        except ValueError:
            return indices
        else:
            indices.append(index)
            cursor = index + 1
And finally, a decision-making wrapper:
def words_within(target_words, s, max_distance, case_insensitive=True):
    """
    Determine if the two target words are within max_distance positions of one
    another in the string s.
    """
    if len(target_words) != 2:
        raise ValueError('must provide 2 target words')

    # fold case for case insensitivity
    if case_insensitive:
        s = s.casefold()
        target_words = [tw.casefold() for tw in target_words]
        # for Python 2, replace `casefold` with `lower`

    # parse words and establish their logical positions in the string
    words = parse_words(s)
    target_indices = [list_indices(t, words) for t in target_words]

    # words not present
    if not target_indices[0] or not target_indices[1]:
        return False

    # compute all combinations of distance for the two words
    # (there may be more than one occurrence of a word in s)
    actual_distances = [i2 - i1 for i2 in target_indices[1] for i1 in target_indices[0]]

    # answer whether the minimum observed distance is <= our specified threshold
    return min(actual_distances) <= max_distance
So then:
>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True
>>> words_within(["think", 'moment'], s, 2)
False
The only thing left to do is map that back to Pandas:
import pandas as pd

df = pd.DataFrame({'desc': [
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.'
]})

df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
This is basically how you'd solve the problem. Keep in mind, it's a rough and simplistic solution. Some simply-posed questions are not simply-answered. NLP questions are often among them.

How can I make this code work faster? (searching a large corpus of text for long words)

In Python, I have created a text generator that acts on certain parameters, but my code is slow most of the time and performs below my expectations. I expect one sentence every 3-4 minutes, but it fails to comply if the database it works on is large (I use Project Gutenberg's 18-book corpus, and I will create my custom corpus and add further books, so performance is vital). The algorithm and the implementation are below:
ALGORITHM
1- Enter the trigger sentence -only once, at the beginning of the program-
2- Get the longest word in the trigger sentence
3- Find all the sentences of the corpus that contain the word at step2
4- Randomly select one of those sentences
5- Get the sentence (named sentA to resolve the ambiguity in the description) that follows the sentence picked at step 4, so long as sentA is longer than 40 characters
6- Go to step 2, now the trigger sentence is the sentA of step5
IMPLEMENTATION
from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence:")  #get input sentence from user

previousLongestWord = ""
listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = []  #all sentences in the related corpus
sentenceAppender = ""
longestWord = ""

#this function is not mine, code courtesy of Dave Kirby, found on the internet about sorting list without duplication speed tricks
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]

def findLongestWord(longestWord):
    if(listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
        longestWord = sortedSetOfValidWords[-2]
        if(listOfWords.count(longestWord) == 1):
            longestWord = sortedSetOfValidWords[-3]

doappend = corpusSentences.append

def appending():
    for mysentence in listOfSents:  #sentences are organized into array so they can actually be read word by word.
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)

appending()

sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):
    for sentence in corpusSentences:
        if sentence.count(longestWord):  #if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)

def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):
    while(len(corpusSentences[sentenceIndex + 1]) < 40):  #in case the next sentence is shorter than 40 characters, pick another trigger sentence
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)

while len(triggerSentence) > 0:  #run the loop as long as you get a trigger sentence
    sentencesContainingLongestWord = []  #all the sentences that include the longest word are to be inserted into this set
    setOfValidWords = []  #set for words in a sentence that exists in a corpus
    split_str = triggerSentence.split()  #split the sentence into words
    setOfValidWords = [word for word in split_str if listOfWords.count(word)]
    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key=len))
    longestWord = sortedSetOfValidWords[-1]
    findLongestWord(longestWord)
    previousLongestWord = longestWord
    getSentence(longestWord, sentencesContainingLongestWord)
    triggerSentence = choice(sentencesContainingLongestWord)
    sentenceIndex = corpusSentences.index(triggerSentence)
    lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)
    triggerSentence = corpusSentences[sentenceIndex + 1]  #get the sentence that is next to the previous trigger sentence
    print triggerSentence
    print "\n"
    corpusSentences.remove(triggerSentence)  #in order to view the sentence index numbers, you can remove this one so index numbers are concurrent with actual gutenberg numbers

print "End of session, please rerun the program"
#initiated once the while loop exits, so that the program ends without errors
The computer I run the code on is a bit old, dual-core CPU was bought in Feb. 2006 and 2x512 RAM was bought in Sept. 2004 so I'm not sure if my implementation is bad or the hardware is a reason for the slow runtime. Any ideas on how I can rescue this from its hazardous form ? Thanks in advance.
I think my first advice must be: Think carefully about what your routines do, and make sure the name describes that. Currently you have things like:
arraySorter which neither deals with arrays nor sorts (it's an implementation of nub)
findLongestWord which counts things or selects words by criteria not present in the algorithm description, yet ends up doing nothing at all because longestWord is a local variable (argument, as it were)
getSentence which appends an arbitrary number of sentences onto a list
appending which sounds like it might be a state checker, but operates only through side effects
considerable confusion between local and global variables, for instance the global variable sentenceAppender is never used, nor is it an actor (for instance, a function) like the name suggests
For the task itself, what you really need are indices. It might be overkill to index every word - technically you should only need index entries for words that occur as the longest word of a sentence. Dictionaries are your primary tool here, and the second tool is lists. Once you have those indices, looking up a random sentence containing any given word takes only a dictionary lookup, a random.choice, and a list lookup. Perhaps a few list lookups, given the sentence length restriction.
This example should prove a good object lesson that modern hardware or optimizers like Psyco do not solve algorithmic problems.
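To illustrate the index idea described above: a rough sketch (hypothetical names, not the answerer's code) that indexes every word for simplicity, assuming the NLTK gutenberg corpus is available:
from collections import defaultdict
from random import choice
from nltk.corpus import gutenberg

# build the indices once, up front
sentences = [" ".join(sent) for sent in gutenberg.sents()]

word_to_sentence_ids = defaultdict(list)
for i, sent in enumerate(gutenberg.sents()):
    for word in set(sent):
        word_to_sentence_ids[word].append(i)

def random_sentence_containing(word):
    # a dictionary lookup, a random.choice and a list lookup, as described above
    ids = word_to_sentence_ids.get(word)
    if not ids:
        return None
    return sentences[choice(ids)]

print(random_sentence_containing("Alice"))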
Maybe Psyco speeds up the execution?
