I'm running the chunk of code under "3. Word to Vectors Integration" from this NLP tutorial. It uses spaCy's lexemes to calculate the most similar words to whatever word you give it - I'm using it to try to find the nearest synonyms to a given word. However, if you replace "apple" with a word like "look", you get a lot of related words, but not synonyms (examples: pretty, there, over, etc.). I was thinking of modifying the code to also filter by part of speech, so that I could get just verbs in the output and work from there. To do that, I'd need to use tokens so I can use token.pos_, since that attribute isn't available for lexemes. Does anyone know a way to take the output (the list called "others" in the code) and change it from a lexeme to a token? I was reading over spaCy's documentation for lexemes here, but I haven't been able to find anything about converting them.
I've also tried adding a section of code to the end of the other person's code:
from numpy import dot
from numpy.linalg import norm
import spacy
from spacy.lang.en import English

nlp = English()
parser = spacy.load('en_core_web_md')

my_word = u'calm'
# Get the lexeme for my_word (the variable is still named 'apple' from the tutorial)
apple = parser.vocab[my_word]

# Cosine similarity function
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))

others = list({w for w in parser.vocab if w.has_vector and w.orth_.islower()
               and w.lower_ != my_word})
print("done listing")

# sort by similarity score
others.sort(key=lambda w: cosine(w.vector, apple.vector))
others.reverse()

for word in others[:10]:
    print(word.orth_)
The part I added:
b = ""
for word in others[:10]:
a = str(word) + ' '
b += a
doc = nlp(b)
print(doc)
token = doc[0]
counter = 1
while counter < 50:
token += doc[counter]
counter += 1
print(token)
This is the output error:
token += doc[counter]
TypeError: unsupported operand type(s) for +=: 'spacy.tokens.token.Token' and 'spacy.tokens.token.Token'
<spacy.lexeme.Lexeme object at 0x000002920ABFAA68> <spacy.lexeme.Lexeme object at 0x000002920BD56EE8>
Does anyone have any suggestions to fix what I did or another way to change the lexeme to a token? Thank you!
You have to create a Doc to get a Token, as the Token doesn't own any data -- it's just a view. So, you could do something like:
from spacy.tokens import Doc
doc = Doc(lex.vocab, words=[lex.orth_])
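For the part-of-speech filtering goal, another option is to run the most similar words back through the full pipeline so that token.pos_ is populated. Below is a minimal sketch, reusing parser and others from the code above; note that tags assigned to words with no sentence context are only a rough guide.
candidates = [w.orth_ for w in others[:50]]

# Processing the words with the full en_core_web_md pipeline gives Tokens with .pos_ set
doc = parser(" ".join(candidates))

# Keep only the candidates the tagger labels as verbs
verbs = [token.text for token in doc if token.pos_ == "VERB"]
print(verbs[:10])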
I have some doubts regarding n-grams.
Specifically, I would like to extract 2-grams, 3-grams and 4-grams from the following column:
Sentences
For each topic, we will explore the words occuring in that topic and its relative weight.
We will check where our test document would be classified.
For each document we create a dictionary reporting how many
words and how many times those words appear.
Save this to ‘bow_corpus’, then check our selected document earlier.
To do this, I used the following function
def n_grams(lines , min_length=2, max_length=4):
    lenghts=range(min_length,max_length+1)
    ngrams={length:collections.Counter() for length in lengths)
    queue= collection.deque(maxlen=max_length)
but it does not work, since I get None as output.
Can you please tell me what is wrong with the code?
Your ngrams dictionary has empty Counter() objects because you don't pass anything to count. There are also a few other problems:
Function names can't include - in Python.
collection.deque is invalid, I think you wanted to call collections.deque()
I think there are better options to fix your code than using the collections library. Two of them are as follows:
You might fix your function using list comprehension:
def n_grams(lines, min_length=2, max_length=4):
    tokens = lines.split()
    ngrams = dict()
    for n in range(min_length, max_length + 1):
        ngrams[n] = [tokens[i:i+n] for i in range(len(tokens)-n+1)]
    return ngrams
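For example, on a short string this returns each n-gram as a list of tokens (a hypothetical input, just to illustrate the output shape):
result = n_grams("we will check our test document", min_length=2, max_length=3)
print(result[2][:2])  # [['we', 'will'], ['will', 'check']]
print(result[3][0])   # ['we', 'will', 'check']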
Or you might use NLTK, which supports tokenization and n-grams natively:
from nltk import ngrams
from nltk.tokenize import word_tokenize

def n_grams(lines, min_length=2, max_length=4):
    tokens = word_tokenize(lines)
    # use a different name for the result so it doesn't shadow nltk's ngrams,
    # and wrap in list() because nltk.ngrams returns a generator
    grams = {n: list(ngrams(tokens, n)) for n in range(min_length, max_length + 1)}
    return grams
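Used the same way, the NLTK version yields the n-grams as tuples (again a hypothetical input, and it assumes the punkt tokenizer data has been downloaded):
result = n_grams("we will check our test document", min_length=2, max_length=3)
print(result[2][:2])  # [('we', 'will'), ('will', 'check')]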
I'm looking for faster alternatives to NLTK to analyze big corpora and do basic things like calculating frequencies, PoS tagging, etc. spaCy seems great and easy to use in many ways, but I can't find any built-in function to count the frequency of a specific word, for example. I've looked at the spaCy documentation, but I can't find a straightforward way to do it. Am I missing something?
What I would like would be the NLTK equivalent of:
tokens.count("word") #where tokens is the tokenized text in which the word is to be counted
In NLTK, the above code would tell me that in my text, the word "word" appears X number of times.
Note that I've come across the count_by function, but it doesn't seem to do what I'm looking for.
I use spaCy for frequency counts in corpora quite often. This is what I usually do:
import spacy
nlp = spacy.load("en_core_web_sm")
list_of_words = ['run', 'jump', 'catch']
def word_count(string):
    words_counted = 0
    my_string = nlp(string)
    for token in my_string:
        # actual word
        word = token.text
        # lemma
        lemma_word = token.lemma_
        # part of speech
        word_pos = token.pos_
        if lemma_word in list_of_words:
            words_counted += 1
            print(lemma_word)
    return words_counted

sentence = "I ran, jumped, and caught the ball."
words_counted = word_count(sentence)
print(words_counted)
The Python standard library includes collections.Counter for this kind of purpose. You haven't said whether this answer suits your case.
from collections import Counter
text = "Lorem Ipsum is simply dummy text of the ...."
freq = Counter(text.split())
print(freq)
>>> Counter({'the': 6, 'Lorem': 4, 'of': 4, 'Ipsum': 3, 'dummy': 2 ...})
print(freq['Lorem'])
>>> 4
Alright, just to give some time reference, I have used this script:
import random, timeit
from collections import Counter
def loadWords():
    with open('corpora.txt', 'w') as corpora:
        randWords = ['foo', 'bar', 'life', 'car', 'wrong',
                     'right', 'left', 'plain', 'random', 'the']
        for i in range(100000000):
            corpora.write(randWords[random.randint(0, 9)] + " ")

def countWords():
    with open('corpora.txt', 'r') as corpora:
        content = corpora.read()
        myDict = Counter(content.split())
        print("foo: ", myDict['foo'])
print(timeit.timeit(loadWords, number=1))
print(timeit.timeit(countWords, number=1))
Results:
149.01646934738716
foo: 9998872
18.093295297389773
Still, I am not sure whether this is fast enough for your use case.
Updating with this answer, as this is the page I found when searching for an answer to this specific problem. I find that this is an easier solution than the ones provided before, and that it only uses spaCy.
As you mentioned, the spaCy Doc object has the built-in method Doc.count_by. From what I understand of your question, it does what you ask for, but it is not obvious.
It counts the occurrences of a given attribute and returns a dictionary with the attribute's hash (an integer) as the key and the count as the value.
Solution
First of all we need to import ORTH from spacy.attrs. ORTH is the exact verbatim text of a token. We also need to load the model and provide a text.
import spacy
from spacy.attrs import ORTH
nlp = spacy.load("en_core_web_sm")
doc = nlp("apple apple orange banana")
Then we create a dictionary of word counts
count_dict = doc.count_by(ORTH)
You could count by other attributes like LEMMA; just import the attribute you wish to use (see the short LEMMA sketch at the end of this answer).
If we look at the dictionary, we will see that it contains the hash for the lexeme and the word count.
count_dict
Results:
{8566208034543834098: 2, 2208928596161743350: 1, 2525716904149915114: 1}
We can get the text for the word if we look up the hash in the vocab.
nlp.vocab.strings[8566208034543834098]
Returns
'apple'
With this we can create a simple function that takes the search word and a count dict created with the Doc.count_by method.
def get_word_count(word, count_dict):
return count_dict[nlp.vocab.strings[word]]
If we run the function with our search word 'apple' and the count dict we created earlier
get_word_count('apple', count_dict)
We get:
2
https://spacy.io/api/doc#count_by
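As noted above, the same pattern works for other attributes; here is a small sketch counting by LEMMA instead (assuming the same nlp pipeline and a slightly different text, so the lemmatizer has something to collapse):
from spacy.attrs import LEMMA

doc = nlp("apple apples orange banana")
lemma_counts = doc.count_by(LEMMA)

# look the hashes back up in the vocab, e.g. {'apple': 2, 'orange': 1, 'banana': 1}
print({nlp.vocab.strings[key]: count for key, count in lemma_counts.items()})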
Is it possible to get concordance for a phrase in NLTK?
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_loc = "c://temp//text//"
files = ".*\.txt"
read_corpus = PlaintextCorpusReader(corpus_loc, files)
corpus = nltk.Text(read_corpus.words())
test = nltk.TextCollection(corpus_loc)
corpus.concordance("claim")
for example the above returns
on okay okay okay i can give you the claim number and my information and
decide on the shop okay okay so the claim number is xxxx - xx - xxxx got
and now if I try corpus.concordance("claim number") it does not work... I do have code that does this using the .partition() method and some further coding on top of it... but I'm wondering if it's possible to do the same using concordance.
According to this issue it is not (yet) possible to search for multiple words with the concordance() function.
If you read the discussion under the very issue that #b3000 dug up, you'll see that strangely enough, multi-word concordance is in fact available-- but only in the graphical concordance tool, which you can start up like this:
>>> from nltk.app import concordance
>>> concordance()
I munged together this solution...
def n_concordance_tokenised(text, phrase, left_margin=5, right_margin=5):
    # concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/
    phraseList = phrase.split(' ')

    c = nltk.ConcordanceIndex(text.tokens, key=lambda s: s.lower())

    # Find the offsets for each token in the phrase
    offsets = [c.offsets(x) for x in phraseList]
    offsets_norm = []
    # For each token in the phrase list, rebase its offsets to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x - i for x in offsets[i]])

    # We have found the offset of a phrase if the rebased values intersect
    # --
    # http://stackoverflow.com/a/3852792/454773
    # the intersection method takes an arbitrary number of arguments
    # result = set(d[0]).intersection(*d[1:])
    # --
    intersects = set(offsets_norm[0]).intersection(*offsets_norm[1:])

    concordance_txt = [text.tokens[max(offset - left_margin, 0):offset + len(phraseList) + right_margin]
                       for offset in intersects]

    outputs = [''.join([x + ' ' for x in con_sub]) for con_sub in concordance_txt]
    return outputs

def n_concordance(txt, phrase, left_margin=5, right_margin=5):
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)

    return n_concordance_tokenised(text, phrase,
                                   left_margin=left_margin, right_margin=right_margin)
n_concordance_tokenised(text1,'monstrous size')
>> [u'one was of a most monstrous size . ... This came towards ',
u'; for Whales of a monstrous size are oftentimes cast up dead ']
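For completeness, text1 in the example above is the Moby Dick sample from nltk.book; the same function also works on the Text object built in the question. A quick sketch, reusing the corpus variable from the question's code:
from nltk.book import text1  # Moby Dick, the text used in the example output above

print(n_concordance_tokenised(text1, 'monstrous size'))

# the same call works on the Text built from the question's corpus reader
print(n_concordance_tokenised(corpus, 'claim number'))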
I'm working on a lemmatizer using python, NLTK and the WordNetLemmatizer.
Here is a random example that outputs what I was expecting:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective
Output: 'bad'
lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb
Output: 'worse'
Well, everything here is fine. The behaviour is the same with other adjectives like 'better' (an irregular form) or 'older' (note that the same test with 'elder' will never output 'old', but I guess WordNet is not an exhaustive list of all existing English words).
My question comes when trying the word 'further':
lem.lemmatize('further', pos=wordnet.ADJ)  # as an adjective
Output: 'further'
lem.lemmatize('further', pos=wordnet.ADV)  # as an adverb
Output: 'far'
This is the exact opposite of the behaviour for the word 'worse'!
Can anybody explain to me why? Is it a bug in the WordNet synset data, or does it come from my misunderstanding of English grammar?
Please excuse me if the question has already been answered; I've searched on Google and SO, but with the keyword "further" I can't find anything relevant, only noise, because of the popularity of this word...
Thank you in advance,
Romain G.
WordNetLemmatizer uses the ._morphy function to access a word's lemmas (see http://www.nltk.org/_modules/nltk/stem/wordnet.html); it returns the possible lemmas and picks the one with the minimum length.
def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word
And the ._morphy function applies rules iteratively to get a lemma; the rules keep reducing the length of the word and substituting affixes according to MORPHOLOGICAL_SUBSTITUTIONS. It then checks whether the reduced form is itself a word in the database:
def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
However, if the word is in the list of exceptions, it will return a fixed value kept in the exception map; see _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html:
def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
Going back to your example, worse -> bad and further -> far CANNOT be achieved from the rules, so they have to come from the exception lists. Since it's an exception list, there are bound to be inconsistencies.
The exception lists are kept in ~/nltk_data/corpora/wordnet/adj.exc and ~/nltk_data/corpora/wordnet/adv.exc.
From adv.exc:
best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard
From adj.exc:
...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...
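A quick way to see this for yourself is to call wordnet._morphy directly; a small check, assuming the standard WordNet data in nltk_data is installed:
from nltk.corpus import wordnet

# 'further' as an adverb is listed in adv.exc, so the exception entry supplies 'far'
print(wordnet._morphy('further', wordnet.ADV))   # includes 'far'
# 'further' as an adjective is not in adj.exc, so it comes back unchanged
print(wordnet._morphy('further', wordnet.ADJ))   # includes 'further'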
Assuming I have two small dictionaries
posList=['interesting','novel','creative','state-of-the-art']
negList=['outdated','straightforward','trivial']
I have a new word, say "innovative", which is outside my knowledge, and I am trying to figure out its sentiment by finding its synonyms via the NLTK function. If those synonyms fall outside my small dictionaries, I recursively call the NLTK function to find the synonyms of the synonyms from the last round.
The start input could be like this:
from nltk.corpus import wordnet
innovative = wordnet.synsets('innovative')
for synset in innovative:
    print synset
    print synset.lemmas
It produces the output like this
Synset('advanced.s.03')
[Lemma('advanced.s.03.advanced'), Lemma('advanced.s.03.forward-looking'), Lemma('advanced.s.03.innovative'), Lemma('advanced.s.03.modern')]
Synset('innovative.s.02')
[Lemma('innovative.s.02.innovative'), Lemma('innovative.s.02.innovational'), Lemma('innovative.s.02.groundbreaking')]
Clearly, 'advanced', 'forward-looking', 'modern', 'innovational', and 'groundbreaking' are new words that are not in my dictionary, so now I should use them as the starting point and call the synsets function again, until no new lemma words appear.
Can anyone give me demo code for extracting these lemma words from the Synsets and keeping them in a set structure?
I think it involves the re module in Python, but I am quite new to Python. Another point I need to address is that I need adjectives only, so only the 's' and 'a' symbols as in Lemma('advanced.s.03.modern'), not 'v' (verb) or 'n' (noun).
Later I would like to calculate a similarity score between a new word and any dictionary word, so I need to define the measure. This is difficult since adjectives are not arranged hierarchically and, to my knowledge, no measure is readily available. Can anyone advise?
You can get the synonyms of the synonyms as follows.
(Please note that the code uses the WordNet functions of the NodeBox Linguistics library, because it offers easier access to WordNet.)
import en  # the NodeBox Linguistics library is imported as the `en` module

def get_remote_synonyms(s, pos):
    if pos == 'a':
        syns = en.adjective.senses(s)
        if syns:
            allsyns = sum(syns, [])
            # if there are multiple senses, take only the most frequent two
            if len(syns) >= 2:
                syns = syns[0] + syns[1]
            else:
                syns = syns[0]
        else:
            return []
        remote = []
        for syn in syns:
            newsyns = en.adjective.senses(syn)
            remote.extend([r for r in newsyns[0] if r not in allsyns])
        return [unicode(i) for i in list(set(remote))]
As far as I know, all semantic measurement functions of the NLTK are based on the hypernym / hyponym hierarchy, so that they cannot be applied to adjectives. Besides, I found a lot of synonyms to be missing in WordNet if you compare its results with the results from a thesaurus like thesaurus.com.
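For the set-collection part of the question, the adjective lemma names can also be gathered with NLTK's own wordnet interface, without regular expressions. A minimal sketch ('a' and 's' are WordNet's adjective and satellite-adjective markers):
from nltk.corpus import wordnet

def adjective_lemmas(word):
    # collect lemma names from all adjective senses ('a' and 's') of `word` into a set
    lemmas = set()
    for synset in wordnet.synsets(word):
        if synset.pos() in ('a', 's'):
            lemmas.update(lemma.name() for lemma in synset.lemmas())
    return lemmas

print(adjective_lemmas('innovative'))
# e.g. a set like {'advanced', 'forward-looking', 'innovative', 'modern', 'innovational', 'groundbreaking'}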