How to lemmatize a list of sentences - python

How can I lemmatize a list of sentences in Python?
from nltk.stem.wordnet import WordNetLemmatizer
a = ['i like cars', 'cats are the best']
lmtzr = WordNetLemmatizer()
lemmatized = [lmtzr.lemmatize(word) for word in a]
print(lemmatized)
This is what I've tried, but it gives me back the same sentences. Do I need to tokenize the words first for it to work properly?

TL;DR:
pip3 install -U pywsd
Then:
>>> from pywsd.utils import lemmatize_sentence
>>> text = 'i like cars'
>>> lemmatize_sentence(text)
['i', 'like', 'car']
>>> lemmatize_sentence(text, keepWordPOS=True)
(['i', 'like', 'cars'], ['i', 'like', 'car'], ['n', 'v', 'n'])
>>> text = 'The cat likes cars'
>>> lemmatize_sentence(text, keepWordPOS=True)
(['The', 'cat', 'likes', 'cars'], ['the', 'cat', 'like', 'car'], [None, 'n', 'v', 'n'])
>>> text = 'The lazy brown fox jumps, and the cat likes cars.'
>>> lemmatize_sentence(text)
['the', 'lazy', 'brown', 'fox', 'jump', ',', 'and', 'the', 'cat', 'like', 'car', '.']
Otherwise, take a look at what the lemmatize_sentence function in pywsd does:
1. Tokenizes the string
2. Uses the POS tagger and maps the tags to the WordNet POS tagset
3. Attempts to stem
4. Finally calls the lemmatizer with the POS and/or stems
See https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L129
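If you'd rather stay with plain NLTK, here is a minimal sketch of those same steps (minus pywsd's stemming fallback; the helper name penn_to_wn is made up for this example):
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def penn_to_wn(tag):
    # map a Penn Treebank tag to a WordNet POS constant, defaulting to noun
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def lemmatize_sentence_nltk(text):
    # tokenize -> POS-tag -> map tags to WordNet -> lemmatize with the POS
    return [lemmatizer.lemmatize(word, penn_to_wn(tag))
            for word, tag in pos_tag(word_tokenize(text))]

print(lemmatize_sentence_nltk('The cat likes cars'))
# ['The', 'cat', 'like', 'car']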

You must lemmatize each word separately; instead, you are lemmatizing whole sentences. Here is a corrected code fragment:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize
sents = ['i like cars', 'cats are the best']
lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in word_tokenize(s)]
              for s in sents]
print(lemmatized)
#[['i', 'like', 'car'], ['cat', 'are', 'the', 'best']]
You can also get better results if you first do POS tagging and then provide the POS information to the lemmatizer.
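For instance, the default lemmatizer call treats every word as a noun, so a verb like 'are' is left unchanged; passing pos='v' fixes that (a small illustration, not part of the original answer):
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
print(lmtzr.lemmatize('are'))           # 'are' -- treated as a noun by default
print(lmtzr.lemmatize('are', pos='v'))  # 'be'  -- correct verb lemma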

Related

How to only return actual tokens, rather than empty variables when tokenizing?

I have a function:
def remove_stopwords(text):
    return [[word for word in simple_preprocess(str(doc), min_len = 2) if word not in stop_words] for doc in texts]
My input is a list with a tokenized sentence:
input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
Assume that stop_words contains the words 'this', 'is', 'an', 'of' and 'my'; the output I would like to get is:
desired_output = ['example', 'input']
However, the actual output that I'm getting now is:
actual_output = [[], [], [], ['example'], [], [], ['input']]
How can I adjust my code to get this output?
There are two solutions to your problem:
Solution 1:
Your remove_stopwords expects a list of documents, so modify your input like this:
input = [['This', 'is', 'an', 'example', 'of', 'my', 'input']]
Solution 2:
Change your remove_stopwords function to work on a single document:
def remove_stopwords(text):
    return [word for word in simple_preprocess(str(text), min_len=2) if word not in stop_words]
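Assuming gensim's simple_preprocess and the stop_words from your question, calling the fixed function on your single tokenized sentence should then give the desired output (a quick sketch, not tested against your exact setup):
from gensim.utils import simple_preprocess

stop_words = ['this', 'is', 'an', 'of', 'my']

def remove_stopwords(text):
    return [word for word in simple_preprocess(str(text), min_len=2)
            if word not in stop_words]

print(remove_stopwords(['This', 'is', 'an', 'example', 'of', 'my', 'input']))
# ['example', 'input']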
If there is no specific reason to stick with your own code, you can also use the code below to remove stopwords:
def remove_stopwords(text):
    wordsFiltered = []
    for w in text:
        if w not in stop_words:
            wordsFiltered.append(w)
    return wordsFiltered

input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
stop_words = ['This', 'is', 'an', 'of', 'my']
print(remove_stopwords(input))
Output:
['example', 'input']

Replacing only pronouns, nouns, verbs and adjectives in a sentence with their corresponding tags, how could I do it efficiently in Python?

I have sentences and I want to replace only the pronouns, nouns, verbs and adjectives with their corresponding POS tags.
For example, my sentence is:
"I am going to the most beautiful city, Islamabad"
and I want the result:
"PRP am VBG to the most JJ NN, NNP".
TL;DR
>>> from nltk import pos_tag, word_tokenize
>>> sent = "I am going to the most beautiful city, Islamabad"
>>> wanted_tags = ['PRP', 'VB', 'RB', 'JJ', 'NN']  # assumed tag prefixes to replace; reproduces the output below
>>> [pos if any(p for p in wanted_tags if pos.startswith(p)) else word for word, pos in pos_tag(word_tokenize(sent))]
['PRP', 'VBP', 'VBG', 'to', 'the', 'RBS', 'JJ', 'NN', ',', 'NNP']
>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words('english')
>>> [pos if any(p for p in wanted_tags if pos.startswith(p)) and word not in stoplist else word for word, pos in pos_tag(word_tokenize(sent))]
['PRP', 'am', 'VBG', 'to', 'the', 'most', 'JJ', 'NN', ',', 'NNP']

Python: how to do this list comprehension with multiple nested lists?

I'm processing text that I need to break up into a list of sentence tokens, which are themselves broken down into word tokens. For example:
raw_text = "the cat in the hat. green eggs and ham. one fish two fish."
I also have a list of stopwords that I want to remove from the text:
stopwords = ['the', 'and', 'in']
I'm doing the list comprehension using the nltk module:
from nltk import sent_tokenize, word_tokenize
sentence_tokens = [word_tokenize(sentence) for sentence in sent_tokenize(raw_text)]
This yields the following:
[['the', 'cat', 'in', 'the', 'hat', '.'], ['green', 'eggs', 'and', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]
I can filter out the stopwords with nested for loops:
for sentences in sentence_tokens:
    for word in sentences:
        if word in stopwords:
            sentences.remove(word)
What I'm having trouble doing is combining these all into a single list comprehension so it's a bit cleaner. Any advice? Thanks!
Make stopwords a set; you can then use a list comprehension to filter out the words from each sublist that are in the set of stopwords:
stopwords = {'the', 'and', 'in'}
l = [['the', 'cat', 'in', 'the', 'hat', '.'], ['green', 'eggs', 'and', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]
l[:] = [[word for word in sub if word not in stopwords] for sub in l]
Output:
[['cat', 'hat', '.'], ['green', 'eggs', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]
Using l[:] means we mutate the original list object in place. Broken up into a for loop, it looks like this:
# for each sublist in l
for sub in l:
    # for each word in the sublist, keep it only if it is not in stopwords
    sub[:] = [word for word in sub if word not in stopwords]
Your own code also has a bug: you should never remove elements from a list while iterating over it. You would need to make a copy, or iterate in reverse using reversed:
for sentences in l:
    for word in reversed(sentences):
        if word in stopwords:
            sentences.remove(word)
When you remove an element while iterating from the left, the positions of the remaining elements shift, so the iterator can skip elements or end up removing the wrong ones on later passes.
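A tiny illustration of that skipping effect (just for demonstration):
words = ['the', 'the', 'cat']
for w in words:
    if w == 'the':
        words.remove(w)

print(words)  # ['the', 'cat'] -- the second 'the' was skipped, not removed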
Tip: NLTK is not required for this task; plain Python will do.
Here is a cleaner way to remove stopwords from text. I'm using Python 2.7 here.
When you want a string instead of list of words:
raw_text = "the cat in the hat. green eggs and ham. one fish two fish."
stopwords = ['the', 'and', 'in']
clean_text = " ".join(word for word in raw_text.split() if word not in stopwords)
When you want a list of words:
raw_text = "the cat in the hat. green eggs and ham. one fish two fish."
stopwords = ['the', 'and', 'in']
clean_list = [word for word in raw_text.split() if word not in stopwords]

How to tokenize all currency symbols using Regex in python?

I want to tokenize all the currency symbols using NLTK's regexp tokenizer.
For example, these are my sentences:
The price of it is $5.00.
The price of it is RM5.00.
The price of it is €5.00.
I used this regex pattern:
pattern = r'''(['()""\w]+|\.+|\?+|\,+|\!+|\$?\d+(\.\d+)?%?)'''
tokenize_list = nltk.regexp_tokenize(sentence, pattern)
But as we can see, it only handles the $ symbol.
I tried to use \p{Sc} as explained in "What is regex for currency symbol?", but it still doesn't work for me.
Try padding the numbers with spaces to separate them from the currency symbols, then tokenize:
>>> import re
>>> from nltk import word_tokenize
>>> sents = """The price of it is $5.00.
... The price of it is RM5.00.
... The price of it is €5.00.""".split('\n')
>>>
>>> for sent in sents:
... numbers_in_sent = re.findall("[-+]?\d+[\.]?\d*", sent)
... for num in numbers_in_sent:
... sent = sent.replace(num, ' '+num+' ')
... print word_tokenize(sent)
...
['The', 'price', 'of', 'it', 'is', '$', '5.00', '.']
['The', 'price', 'of', 'it', 'is', 'RM', '5.00', '.']
['The', 'price', 'of', 'it', 'is', '\xe2\x82\xac', '5.00', '.']
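(The '\xe2\x82\xac' above is just Python 2's byte representation of the € sign.) Alternatively, \p{Sc} does work if you use the third-party regex package instead of the built-in re module, which is what nltk.regexp_tokenize relies on and which has no \p{...} support. A rough sketch, with a pattern made up for this example:
import regex  # pip install regex -- supports Unicode property classes like \p{Sc}

sents = ["The price of it is $5.00.",
         "The price of it is RM5.00.",
         "The price of it is €5.00."]

# currency symbol | number | run of letters | any other non-space character
pattern = regex.compile(r"\p{Sc}|\d+(?:\.\d+)?|[^\W\d_]+|[^\w\s]")

for sent in sents:
    print(pattern.findall(sent))
# ['The', 'price', 'of', 'it', 'is', '$', '5.00', '.']
# ['The', 'price', 'of', 'it', 'is', 'RM', '5.00', '.']
# ['The', 'price', 'of', 'it', 'is', '€', '5.00', '.']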

Python NLTK: how to lemmatize text including verbs in English?

I want to lemmatize this text, but it only lemmatizes the nouns; I need to lemmatize the verbs as well.
>>> import nltk, re, string
>>> from nltk.stem import WordNetLemmatizer
>>> from urllib import urlopen
>>> url="https://raw.githubusercontent.com/evandrix/nltk_data/master/corpora/europarl_raw/english/ep-00-01-17.en"
>>> raw = urlopen(url).read()
>>> raw ="".join(l for l in raw if l not in string.punctuation)
>>> tokens=nltk.word_tokenize(raw)
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lem = [lemmatizer.lemmatize(t) for t in tokens]
>>> lem[:20]
['Resumption', 'of', 'the', 'session', 'I', 'declare', 'resumed', 'the', 'session', 'of', 'the', 'European', 'Parliament', 'adjourned', 'on', 'Friday', '17', 'December', '1999', 'and']
Here a verb like 'resumed' is supposed to become 'resume'. Can you tell me what I should do to lemmatize the whole text?
Use the pos parameter of WordNetLemmatizer:
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('resumed')
'resumed'
>>> wnl.lemmatize('resumed', pos='v')
u'resume'
Here's the complete code, using the pos_tag function:
>>> from nltk import word_tokenize, pos_tag
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> txt = """Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period ."""
>>> [wnl.lemmatize(i,j[0].lower()) if j[0].lower() in ['a','n','v'] else wnl.lemmatize(i) for i,j in pos_tag(word_tokenize(txt))]
['Resumption', 'of', 'the', 'session', 'I', 'declare', u'resume', 'the', 'session', 'of', 'the', 'European', 'Parliament', u'adjourn', 'on', 'Friday', '17', 'December', '1999', ',', 'and', 'I', 'would', 'like', 'once', 'again', 'to', 'wish', 'you', 'a', 'happy', 'new', 'year', 'in', 'the', 'hope', 'that', 'you', u'enjoy', 'a', 'pleasant', 'festive', 'period', '.']
