How can I lemmatize a list of sentences in Python?
from nltk.stem.wordnet import WordNetLemmatizer
a = ['i like cars', 'cats are the best']
lmtzr = WordNetLemmatizer()
lemmatized = [lmtzr.lemmatize(word) for word in a]
print(lemmatized)
This is what I've tried, but it gives me back the same sentences. Do I need to tokenize the words first for this to work properly?
TL;DR:
pip3 install -U pywsd
Then:
>>> from pywsd.utils import lemmatize_sentence
>>> text = 'i like cars'
>>> lemmatize_sentence(text)
['i', 'like', 'car']
>>> lemmatize_sentence(text, keepWordPOS=True)
(['i', 'like', 'cars'], ['i', 'like', 'car'], ['n', 'v', 'n'])
>>> text = 'The cat likes cars'
>>> lemmatize_sentence(text, keepWordPOS=True)
(['The', 'cat', 'likes', 'cars'], ['the', 'cat', 'like', 'car'], [None, 'n', 'v', 'n'])
>>> text = 'The lazy brown fox jumps, and the cat likes cars.'
>>> lemmatize_sentence(text)
['the', 'lazy', 'brown', 'fox', 'jump', ',', 'and', 'the', 'cat', 'like', 'car', '.']
Otherwise, take a look at what the function in pywsd does:
Tokenizes the string
Runs the POS tagger and maps the tags to the WordNet POS tagset
Attempts to stem
Finally calls the lemmatizer with the POS and/or stems (see the sketch after the link below)
See https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L129
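If you'd rather stay within NLTK, here is a minimal sketch of roughly those steps, i.e. tokenize, POS-tag, map the Penn Treebank tags to WordNet POS constants, then lemmatize. The helper names here are my own, and pywsd's actual implementation handles more edge cases (e.g. stemming fallbacks):
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet POS constant; default to noun.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def lemmatize_sentence_nltk(sentence):
    # Tokenize, tag, then lemmatize each token with its mapped POS.
    return [lemmatizer.lemmatize(word, penn_to_wordnet(tag))
            for word, tag in pos_tag(word_tokenize(sentence))]

print(lemmatize_sentence_nltk('The cat likes cars'))
# ['The', 'cat', 'like', 'car']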
You must lemmatize each word separately; instead, you are lemmatizing whole sentences. Here is a corrected code fragment:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize
sents = ['i like cars', 'cats are the best']
lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in word_tokenize(s)]
              for s in sents]
print(lemmatized)
#[['i', 'like', 'car'], ['cat', 'are', 'the', 'best']]
You can also get better results if you first do POS tagging and then provide the POS information to the lemmatizer.
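A minimal sketch of that idea, assuming the standard NLTK tagger data is installed (the tag_map dict is my own shorthand for mapping Penn Treebank tags to WordNet POS codes):
from nltk import pos_tag, word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
sents = ['i like cars', 'cats are the best']

# First letter of the Penn Treebank tag -> WordNet POS; default to noun.
tag_map = {'J': 'a', 'V': 'v', 'R': 'r', 'N': 'n'}

lemmatized = [[lmtzr.lemmatize(word, tag_map.get(tag[0], 'n'))
               for word, tag in pos_tag(word_tokenize(s))]
              for s in sents]
print(lemmatized)
# e.g. [['i', 'like', 'car'], ['cat', 'be', 'the', 'best']]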
I have a function:
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc), min_len=2) if word not in stop_words] for doc in texts]
My input is a list with a tokenized sentence:
input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
Assuming that stop_words contains the words 'this', 'is', 'an', 'of', and 'my', the output I would like to get is:
desired_output = ['example', 'input']
However, the actual output that I'm getting now is:
actual_output = [[], [], [], ['example'], [], [], ['input']]
How can I adjust my code to get this output?
There are two solutions to your problem:
Solution 1:
Your remove_stopwords expects a list of documents to work properly, so wrap your input in another list:
input = [['This', 'is', 'an', 'example', 'of', 'my', 'input']]
Solution 2:
You change your remove_stopwords function to work on a single document
def remove_stopwords(text):
    return [word for word in simple_preprocess(str(text), min_len=2) if word not in stop_words]
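Then calling it directly on your tokenized sentence should give the flat list you want. Note that gensim's simple_preprocess lowercases the tokens, so this relies on stop_words being lowercase, as in your example:
input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
print(remove_stopwords(input))
# ['example', 'input']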
If there is no specific reason to keep your current approach, you can use the code below to remove stopwords.
def remove_stopwords(text):
    # Build the filtered list inside the function so repeated calls start fresh.
    wordsFiltered = []
    for w in text:
        if w not in stop_words:
            wordsFiltered.append(w)
    return wordsFiltered

input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
stop_words = ['This', 'is', 'an', 'of', 'my']
print(remove_stopwords(input))
Output:
['example', 'input']
I have sentences, and I want to replace only the pronouns, nouns, and adjectives with their corresponding POS tags.
For example, my sentence is:
"I am going to the most beautiful city, Islamabad"
and I want the result:
"PRP am VBG to the most JJ NN, NNP".
TL;DR
>>> from nltk import pos_tag, word_tokenize
>>> sent = "I am going to the most beautiful city, Islamabad"
>>> wanted_tags = ['PRP', 'NN', 'JJ', 'VB', 'RB']  # tag prefixes to replace; not defined in the original snippet, inferred from the output below
>>> [pos if any(p for p in wanted_tags if pos.startswith(p)) else word for word, pos in pos_tag(word_tokenize(sent))]
['PRP', 'VBP', 'VBG', 'to', 'the', 'RBS', 'JJ', 'NN', ',', 'NNP']
>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words('english')
>>> [pos if any(p for p in wanted_tags if pos.startswith(p)) and word not in stoplist else word for word, pos in pos_tag(word_tokenize(sent))]
['PRP', 'am', 'VBG', 'to', 'the', 'most', 'JJ', 'NN', ',', 'NNP']
I'm processing text that I need to break up into a list of sentence tokens, which are themselves broken down into word tokens. For example:
raw_text = "the cat in the hat. green eggs and ham. one fish two fish."
I also have a list of stopwords that I want to remove from the text:
stopwords = ['the', 'and', 'in']
I'm doing the list comprehension using the nltk module:
from nltk import sent_tokenize, word_tokenize
sentence_tokens = [word_tokenize(sentence) for sentence in sent_tokenize(raw_text)]
This yields the following:
[['the', 'cat', 'in', 'the', 'hat', '.'], ['green', 'eggs', 'and', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]
I can filter out the stopwords with nested for loops:
for sentences in sentence_tokens:
    for word in sentences:
        if word in stopwords:
            sentences.remove(word)
What I'm having trouble doing is combining these all into a single list comprehension so it's a bit cleaner. Any advice? Thanks!
Make stopwords a set; you can then use a list comprehension to filter out the words in each sublist that appear in that set:
stopwords = {'the', 'and', 'in'}
l = [['the', 'cat', 'in', 'the', 'hat', '.'], ['green', 'eggs', 'and', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]
l[:] = [[word for word in sub if word not in stopwords] for sub in l]
Output:
[['cat', 'hat', '.'], ['green', 'eggs', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]
Using l[:] means we mutate the original object/list in place. Broken up into a for loop, it looks like this:
# For each sublist in l:
for sub in l:
    # Keep each word only if it is not in stopwords.
    sub[:] = [word for word in sub if word not in stopwords]
Your own code also has a bug: you should never remove elements from a list while iterating over it. You would need to iterate over a copy, or you could iterate in reverse with reversed:
for sentences in l:
    for word in reversed(sentences):
        if word in stopwords:
            sentences.remove(word)
When you remove elements while moving left to right, the remaining items shift down, so the iterator's position no longer corresponds to the element it was originally pointing at, and subsequent checks can skip elements or remove the wrong ones.
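A tiny demonstration of that effect: removing while iterating forward shifts the remaining items left, so the element immediately after each removed one is never examined.
words = ['the', 'the', 'cat', 'in', 'in', 'hat']
stopwords = {'the', 'in'}

for word in words:        # iterating over and mutating the same list
    if word in stopwords:
        words.remove(word)

print(words)
# ['the', 'cat', 'in', 'hat'] -- the second 'the' and the second 'in' were skipped and survive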
Tip: NLTK is not required for this task; simple Python logic will do.
Here is a cleaner way to remove stopwords from text. I'm using Python 2.7 here.
When you want a string instead of list of words:
raw_text = "the cat in the hat. green eggs and ham. one fish two fish."
stopwords = ['the', 'and', 'in']
clean_text = " ".join(word for word in raw_text.split() if word not in stopwords)
When you want a list of words:
raw_text = "the cat in the hat. green eggs and ham. one fish two fish."
stopwords = ['the', 'and', 'in']
clean_list = [word for word in raw_text.split() if word not in stopwords]
I want to tokenize all currency symbols using NLTK's regexp tokenizer.
For example this is my sentence:
The price of it is $5.00.
The price of it is RM5.00.
The price of it is €5.00.
I used this pattern of regex:
pattern = r'''(['()""\w]+|\.+|\?+|\,+|\!+|\$?\d+(\.\d+)?%?)'''
tokenize_list = nltk.regexp_tokenize(sentence, pattern)
But as we can see, it only handles $.
I tried to use \p{Sc} as explained in What is regex for currency symbol? but it is still not working for me.
Try padding the numbers with spaces (which separates them from the currency symbol), then tokenize:
>>> import re
>>> from nltk import word_tokenize
>>> sents = """The price of it is $5.00.
... The price of it is RM5.00.
... The price of it is €5.00.""".split('\n')
>>>
>>> for sent in sents:
...     numbers_in_sent = re.findall(r"[-+]?\d+[\.]?\d*", sent)
...     for num in numbers_in_sent:
...         sent = sent.replace(num, ' ' + num + ' ')
...     print word_tokenize(sent)
...
['The', 'price', 'of', 'it', 'is', '$', '5.00', '.']
['The', 'price', 'of', 'it', 'is', 'RM', '5.00', '.']
['The', 'price', 'of', 'it', 'is', '\xe2\x82\xac', '5.00', '.']
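For reference, under Python 3 the same idea works with print() as a function, and the euro sign prints as a proper string rather than escaped bytes. This is my own Python 3 rendering of the snippet above, not the original:
import re
from nltk import word_tokenize

sent = 'The price of it is €5.00.'
# Pad each number with spaces so the currency symbol becomes its own token.
for num in re.findall(r'[-+]?\d+\.?\d*', sent):
    sent = sent.replace(num, ' ' + num + ' ')
print(word_tokenize(sent))
# ['The', 'price', 'of', 'it', 'is', '€', '5.00', '.']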
I want to lemmatize this text, but it only lemmatizes the nouns; I need it to lemmatize the verbs as well.
>>> import nltk, re, string
>>> from nltk.stem import WordNetLemmatizer
>>> from urllib import urlopen
>>> url="https://raw.githubusercontent.com/evandrix/nltk_data/master/corpora/europarl_raw/english/ep-00-01-17.en"
>>> raw = urlopen(url).read()
>>> raw ="".join(l for l in raw if l not in string.punctuation)
>>> tokens=nltk.word_tokenize(raw)
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lem = [lemmatizer.lemmatize(t) for t in tokens]
>>> lem[:20]
['Resumption', 'of', 'the', 'session', 'I', 'declare', 'resumed', 'the', 'session', 'of', 'the', 'European', 'Parliament', 'adjourned', 'on', 'Friday', '17', 'December', '1999', 'and']
Here a verb like "resumed" is supposed to become "resume". Can you tell me what I should do to lemmatize the whole text?
Use the pos parameter of WordNetLemmatizer:
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('resumed')
'resumed'
>>> wnl.lemmatize('resumed', pos='v')
u'resume'
Here's the complete code, using the pos_tag function:
>>> from nltk import word_tokenize, pos_tag
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> txt = """Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period ."""
>>> [wnl.lemmatize(i,j[0].lower()) if j[0].lower() in ['a','n','v'] else wnl.lemmatize(i) for i,j in pos_tag(word_tokenize(txt))]
['Resumption', 'of', 'the', 'session', 'I', 'declare', u'resume', 'the', 'session', 'of', 'the', 'European', 'Parliament', u'adjourn', 'on', 'Friday', '17', 'December', '1999', ',', 'and', 'I', 'would', 'like', 'once', 'again', 'to', 'wish', 'you', 'a', 'happy', 'new', 'year', 'in', 'the', 'hope', 'that', 'you', u'enjoy', 'a', 'pleasant', 'festive', 'period', '.']