I am using NLTK to generate n-grams from sentences, first removing the given stop words. However, nltk.pos_tag() is extremely slow, taking up to 0.6 seconds per sentence on my CPU (Intel i7).
The output:
['The first time I went, and was completely taken by the live jazz band and atmosphere, I ordered the Lobster Cobb Salad.']
0.620481014252
["It's simply the best meal in NYC."]
0.640982151031
['You cannot go wrong at the Red Eye Grill.']
0.644664049149
The code:
for sentence in source:
    nltk_ngrams = None
    if stop_words is not None:
        start = time.time()
        sentence_pos = nltk.pos_tag(word_tokenize(sentence))
        print time.time() - start
        filtered_words = [word for (word, pos) in sentence_pos if pos not in stop_words]
    else:
        filtered_words = ngrams(sentence.split(), n)
Is this really that slow or am I doing something wrong here?
Use pos_tag_sents for tagging multiple sentences:
>>> import time
>>> from nltk.corpus import brown
>>> from nltk import pos_tag
>>> from nltk import pos_tag_sents
>>> sents = brown.sents()[:10]
>>> start = time.time(); pos_tag(sents[0]); print time.time() - start
0.934092998505
>>> start = time.time(); [pos_tag(s) for s in sents]; print time.time() - start
9.5061340332
>>> start = time.time(); pos_tag_sents(sents); print time.time() - start
0.939551115036
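Applied to the question's pipeline, a minimal sketch could look like this (assuming source is the list of sentences and stop_words is the set of POS tags to drop, as in the question):
from nltk import pos_tag_sents, word_tokenize

tokenized = [word_tokenize(sentence) for sentence in source]
tagged = pos_tag_sents(tokenized)  # the tagger is loaded once for all sentences
filtered = [[word for (word, pos) in sent if pos not in stop_words]
            for sent in tagged]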
nltk pos_tag is defined as:
from nltk.tag.perceptron import PerceptronTagger
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)
so each call to pos_tag instantiates a new PerceptronTagger, which accounts for much of the computation time. You can save this time by creating the tagger once and calling tagger.tag directly:
from nltk.tag.perceptron import PerceptronTagger
tagger=PerceptronTagger()
sentence_pos = tagger.tag(word_tokenize(sentence))
If you are looking for another POS tagger with fast performance in Python, you might want to try RDRPOSTagger. For example, on English POS tagging, the tagging speed is 8K words/second for a single-threaded Python implementation on a Core 2 Duo 2.4 GHz machine. You can get faster tagging speed by simply using the multi-threaded mode. RDRPOSTagger obtains very competitive accuracy in comparison to state-of-the-art taggers and now supports pre-trained models for 40 languages. See the experimental results in this paper.
Related
As part of pre-processing for a text classification model, I have added stopword removal and lemmatization steps, using the NLTK library. The code is below:
import re
import pandas as pd
import nltk; nltk.download("all")
from nltk.corpus import stopwords; stop = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
# Stopwords removal
def remove_stopwords(entry):
    sentence_list = [word for word in entry.split() if word not in stopwords.words("english")]
    return " ".join(sentence_list)

df["Description_no_stopwords"] = df.loc[:, "Description"].apply(lambda x: remove_stopwords(x))
# Lemmatization
lemmatizer = WordNetLemmatizer()

def punct_strip(string):
    s = re.sub(r'[^\w\s]', ' ', string)
    return s

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_rows(entry):
    sentence_list = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in punct_strip(entry).split()]
    return " ".join(sentence_list)

df["Description - lemmatized"] = df.loc[:, "Description_no_stopwords"].apply(lambda x: lemmatize_rows(x))
The problem is that, when I pre-process a dataset with 27k entries (my test set), it takes 40-45 seconds for stopwords removal and just as long for lemmatization. By contrast, model evaluation only takes 2-3 seconds.
How can I re-write the functions to optimise computation speed? I have read something about vectorization, but the example functions were much simpler than the ones that I have reported, and I wouldn't know how to do it in this case.
A similar question was asked here, and it suggests caching the stopwords.words("english") object. In your remove_stopwords method you are rebuilding that list every time you evaluate an entry, so you can definitely improve there. Regarding your lemmatizer, as mentioned here, you can also cache its results to improve performance. I imagine your pandas operations are also quite expensive; you might consider converting the dataframe into an array or dictionary, iterating over that, and converting it back to a dataframe afterwards if you need one.
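Putting those suggestions together, a minimal sketch (assuming Python 3 and the get_wordnet_pos helper from the question) could look like this:
import re
from functools import lru_cache
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop = set(stopwords.words("english"))   # built once, reused for every row
lemmatizer = WordNetLemmatizer()

def remove_stopwords(entry):
    # membership test against a prebuilt set instead of rebuilding the list per entry
    return " ".join(word for word in entry.split() if word not in stop)

@lru_cache(maxsize=None)
def lemmatize_word(word):
    # get_wordnet_pos is the helper from the question; caching means each
    # distinct word is POS-tagged and lemmatized only once
    return lemmatizer.lemmatize(word, get_wordnet_pos(word))

def lemmatize_rows(entry):
    return " ".join(lemmatize_word(word) for word in re.sub(r"[^\w\s]", " ", entry).split())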
Is there a more efficient way of doing this?
My code reads a text file and extracts all nouns.
import nltk

File = open(fileName)                   # open file
lines = File.read()                     # read all lines
sentences = nltk.sent_tokenize(lines)   # tokenize into sentences
nouns = []                              # empty list to hold all nouns

for sentence in sentences:
    for word, pos in nltk.pos_tag(nltk.word_tokenize(str(sentence))):
        if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'):
            nouns.append(word)
How do I reduce the time complexity of this code? Is there a way to avoid using the nested for loops?
Thanks in advance!
If you are open to options other than NLTK, check out TextBlob. It extracts all nouns and noun phrases easily:
>>> from textblob import TextBlob
>>> txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the inter
actions between computers and human (natural) languages."""
>>> blob = TextBlob(txt)
>>> print(blob.noun_phrases)
[u'natural language processing', 'nlp', u'computer science', u'artificial intelligence', u'computational linguistics']
import nltk
lines = 'lines is some string of words'
# function to test if something is a noun
is_noun = lambda pos: pos[:2] == 'NN'
# do the nlp stuff
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print nouns
>>> ['lines', 'string', 'words']
Useful tip: list comprehensions are often a faster way to build a list than appending (or inserting) elements inside a for loop.
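A quick timing sketch, independent of NLTK, to illustrate the difference (numbers will vary by machine):
import timeit

comp = "[x for x in range(1000) if x % 2]"
loop = (
    "out = []\n"
    "for x in range(1000):\n"
    "    if x % 2:\n"
    "        out.append(x)"
)
print(timeit.timeit(comp, number=10000))  # list comprehension
print(timeit.timeit(loop, number=10000))  # append inside a for loop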
You can achieve good results using nltk, Textblob, SpaCy or any of the many other libraries out there. These libraries will all do the job but with different degrees of efficiency.
import nltk
from textblob import TextBlob
import spacy
nlp = spacy.load('en')
nlp1 = spacy.load('en_core_web_lg')
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
On my Windows 10 HP laptop (i5, 2 cores / 4 logical processors, 8 GB RAM), in a Jupyter notebook, I ran some comparisons and here are the results.
For TextBlob:
%%time
print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])
And the output is
>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 8.01 ms #average over 20 iterations
For nltk:
%%time
print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])
And the output is
>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 7.09 ms #average over 20 iterations
For spacy:
%%time
print([ent.text for ent in nlp(txt) if ent.pos_ == 'NOUN'])
And the output is
>>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 30.19 ms #average over 20 iterations
It seems nltk and TextBlob are reasonably fast, and this is to be expected since they store nothing else about the input text, txt. SpaCy is way slower. One more thing: spaCy missed the noun NLP, while nltk and TextBlob got it. I would shoot for nltk or TextBlob unless there is something else I wish to extract from the input txt.
Check out a quick start for spaCy here.
Check out some basics about TextBlob here. Check out the NLTK HOWTOs here.
import nltk
lines = 'lines is some string of words'
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if(pos[:2] == 'NN')]
print (nouns)
Just simplified a bit more.
I'm not an NLP expert, but I think you're pretty close already, and there likely isn't a way to improve on the time complexity of these loops, which already visit each tagged word only once.
Recent versions of NLTK have a built-in function that does what you're doing by hand, nltk.tag.pos_tag_sents, and it also returns a list of lists of tagged words.
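A minimal sketch applying pos_tag_sents to the code in the question (fileName as in the question; the noun check is folded into a comprehension):
import nltk

with open(fileName) as f:
    text = f.read()

sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
tagged_sentences = nltk.pos_tag_sents(sentences)   # one tagger load for all sentences

nouns = [word
         for sentence in tagged_sentences
         for (word, pos) in sentence
         if pos.startswith('NN')]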
Your code has no redundancy: You read the file once and visit each sentence, and each tagged word, exactly once. No matter how you write your code (e.g., using comprehensions), you will only be hiding the nested loops, not skipping any processing.
The only potential for improvement is in its space complexity: Instead of reading the whole file at once, you could read it in increments. But since you need to process a whole sentence at a time, it's not as simple as reading and processing one line at a time; so I wouldn't bother unless your files are whole gigabytes long; for short files it's not going to make any difference.
In short, your loops are fine. There are one or two things in your code that you could clean up (e.g., the if clause that matches the POS tags), but it's not going to change anything efficiency-wise.
I am new to Python and NLTK so please bear with me. I wish to find the sense of a word in the context of a sentence. I am using the Lesk WSD algorithm but it is giving different outputs every time I run it. I know that Lesk has some level of inaccuracy. But, I think a POS tag will increase accuracy.
The Lesk algorithm takes a POS tag as an argument, but it expects 'n', 's', 'v' rather than the 'NN', 'VBP' or other tags output by the pos_tag() function. I would like to know how to tag words in the form 'n', 's', 'v', or whether there is a method by which I can convert 'NN', 'VBP' and the other tags into 'n', 's', 'v', so I can pass them to the lesk(context_sentence, word, pos_tag) function.
I am calculating the sentiment score of every word using SentiWordNet afterwards.
from nltk.wsd import lesk
from nltk import word_tokenize
import nltk, re, pprint
from nltk.corpus import sentiwordnet as swn
def word_sense():
    sent = word_tokenize("He should be happy.")
    word = "be"
    pos = "v"
    score = lesk(sent, word, pos)
    print(score)
    print(str(score), type(score))
    set1 = re.findall("'([^']*)'", str(score))[0]
    print(set1)
    bank = swn.senti_synset(str(set1))
    print(bank)

word_sense()
nltk.wsd.lesk does not return a score; it returns the predicted Synset:
>>> from nltk.corpus import wordnet as wn
>>> from nltk.corpus import sentiwordnet as swn
>>> from nltk import word_tokenize
>>> from nltk.wsd import lesk
>>> sent = word_tokenize("He should be happy".lower())
>>> lesk(sent, 'be', 'v')
Synset('equal.v.01')
lesk is not perfect, it should only be used as a baseline system for WSD.
Although this is nice:
>>> ss = str(lesk(sent, 'be', 'v'))
>>> re.findall("'([^']*)'",ss)
['equal.v.01']
There's a simpler way to get the synset identifier:
>>> lesk(sent, 'be', 'v').name()
u'equal.v.01'
Then you can do:
>>> swn.senti_synset(lesk(sent, 'be', 'v').name())
SentiSynset('equal.v.01')
To convert a POS tag to a WordNet POS, you can simply follow Converting POS tags from TextBlob into Wordnet compatible inputs.
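For reference, a minimal mapping sketch (the helper name penn_to_wn is mine, not from a library):
from nltk.corpus import wordnet as wn

def penn_to_wn(tag):
    # Map a Penn Treebank tag from pos_tag() to the single-letter POS
    # that lesk() and WordNet expect ('a', 'v', 'n', 'r').
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return None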
This question already has answers here:
Word sense disambiguation in NLTK Python
(6 answers)
Closed 9 years ago.
I'm developing a simple NLP project in which, given a text and a word, I want to find the most likely sense of that word in the text.
Is there any implementation of WSD algorithms in Python? It's not quite clear whether there is something in NLTK that can help me. I'd be happy even with a naive implementation like the Lesk algorithm.
I've read similar questions like Word sense disambiguation in NLTK Python, but they give nothing but a reference to the NLTK book, which does not go into the WSD problem in much depth.
In short: https://github.com/alvations/pywsd
In long: there are endless techniques used for WSD, ranging from mind-blowing machine learning techniques that require lots of GPU power to simply using the information in WordNet, or even just using frequencies; see http://dl.acm.org/citation.cfm?id=1459355 .
Let's start with a simple Lesk algorithm that allows optional stemming; see http://en.wikipedia.org/wiki/Lesk_algorithm:
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer
from itertools import chain
bank_sents = ['I went to the bank to deposit my money',
'The river bank was full of dead fishes']
plant_sents = ['The workers at the industrial plant were overworked',
'The plant was no longer bearing flowers']
ps = PorterStemmer()
def lesk(context_sentence, ambiguous_word, pos=None, stem=True, hyperhypo=True):
    max_overlaps = 0; lesk_sense = None
    context_sentence = context_sentence.split()
    for ss in wn.synsets(ambiguous_word):
        # If POS is specified, skip synsets with a different POS.
        if pos and ss.pos != pos:
            continue
        lesk_dictionary = []
        # Includes definition.
        lesk_dictionary += ss.definition.split()
        # Includes lemma_names.
        lesk_dictionary += ss.lemma_names
        # Optional: includes lemma_names of hypernyms and hyponyms.
        if hyperhypo == True:
            lesk_dictionary += list(chain(*[i.lemma_names for i in ss.hypernyms()+ss.hyponyms()]))
        if stem == True:  # Matching exact words causes sparsity, so let's match stems.
            lesk_dictionary = [ps.stem(i) for i in lesk_dictionary]
            context_sentence = [ps.stem(i) for i in context_sentence]
        overlaps = set(lesk_dictionary).intersection(context_sentence)
        if len(overlaps) > max_overlaps:
            lesk_sense = ss
            max_overlaps = len(overlaps)
    return lesk_sense
print "Context:", bank_sents[0]
answer = lesk(bank_sents[0],'bank')
print "Sense:", answer
print "Definition:",answer.definition
print
print "Context:", bank_sents[1]
answer = lesk(bank_sents[1],'bank','n')
print "Sense:", answer
print "Definition:",answer.definition
print
print "Context:", plant_sents[0]
answer = lesk(plant_sents[0],'plant','n', True)
print "Sense:", answer
print "Definition:",answer.definition
print
Other than Lesk-like algorithms, there are different similarity measures that people have tried; a good, somewhat outdated but still useful survey: http://acl.ldc.upenn.edu/P/P97/P97-1008.pdf
You can try getting the first sense for each word using the WordNet incorporated in NLTK, using this short code:
from nltk.corpus import wordnet as wn
def get_first_sense(word, pos=None):
    if pos:
        synsets = wn.synsets(word, pos)
    else:
        synsets = wn.synsets(word)
    return synsets[0]
best_synset = get_first_sense('bank')
print '%s: %s' % (best_synset.name, best_synset.definition)
best_synset = get_first_sense('set','n')
print '%s: %s' % (best_synset.name, best_synset.definition)
best_synset = get_first_sense('set','v')
print '%s: %s' % (best_synset.name, best_synset.definition)
Will print:
bank.n.01: sloping land (especially the slope beside a body of water)
set.n.01: a group of things of the same kind that belong together and are so used
put.v.01: put into a certain place or abstract location
This works surprisingly well, as the first sense often dominates the other senses.
For WSD in Python you can try the WordNet bindings in NLTK or the Gensim library. The building blocks are there, but developing the complete algorithm is probably up to you.
For instance, using WordNet you can implement a Simplified Lesk algorithm, as described in the Wikipedia entry.
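As a starting point, here is a minimal Simplified Lesk sketch on top of the NLTK WordNet bindings (the function name is mine; newer NLTK versions expose definition() and examples() as methods):
from nltk.corpus import wordnet as wn

def simplified_lesk(context_sentence, ambiguous_word, pos=None):
    # Pick the sense whose gloss and examples share the most words with the context.
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, 0
    for ss in wn.synsets(ambiguous_word, pos=pos):
        signature = set(ss.definition().lower().split())
        for example in ss.examples():
            signature |= set(example.lower().split())
        overlap = len(context & signature)
        if overlap > best_overlap:
            best_sense, best_overlap = ss, overlap
    return best_sense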
I can get the sense offset from a Princeton WordNet sense as marked in the NLTK corpus library:
[in]:'dog.n.01'
>>> from nltk.corpus import wordnet as wn
>>> ss = wn.synset('dog.n.01')
>>> offset = str(ss.offset).zfill(8)+"-"+ss.pos
>>> print offset
[out]:'02084071-n'
That offset is similar to the convention used in http://casta-net.jp/~kuribayashi/cgi-bin/wn-multi.cgi?synset=02084071-n&lang=eng
How can I do the reverse without looping through the whole WordNet corpus? That is:
[in]: '02084071-n'
[out]: 'dog.n.01' or Synset('dog.n.01')
I could do this, but it's just way too slow and wastes many redundant cycles:
[in]: '02084071-n'
in_offset, in_pos = "02084071-n".split("-")
from nltk.corpus import wordnet as wn
nltk_ss = [i for i in wn.all_synsets() if i.offset == int(in_offset) and i.pos == in_pos][0]
print nltk_ss
[out]: Synset('dog.n.01')
Unfortunately, you cannot do a reverse lookup without iterating over the corpus at least once (as you have shown). The only thing I can suggest is to keep the synsets in a dictionary if you are going to look them up by offset multiple times.
>>> senseIdToSynset = {s.offset:s for s in wn.all_synsets()}
>>> senseIdToSynset[2084071]
Synset('dog.n.01')
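One caveat: offsets are only guaranteed unique within a part of speech (a noun and a verb synset can share the same number), so it may be safer to key on (offset, pos). A sketch, using the same attribute-style API as above (newer NLTK versions expose these as offset() and pos() methods instead):
>>> key_to_synset = {(s.offset, s.pos): s for s in wn.all_synsets()}
>>> key_to_synset[(2084071, 'n')]
Synset('dog.n.01')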