Iterate and Lemmatize List - python

I'm a newbie and struggling with what I'm sure is a simple task.
I have a list of words taken from POS tagging:
words = ['drink', 'drinking']
And I want to lemmatize them and then process them (using set?) to ultimately refine my list to:
refined_list = ['drink']
However, I"m stuck on the next step of lemmatization - my method still returns the following:
refinded_list = ['drink', 'drinking']
I tried to reference this but can't figure out what to import so 'lmtzr' works or how to get it to work.
Here's my code so far:
import nltk
words = ['drink', 'drinking']
WNlemma = nltk.WordNetLemmatizer()
refined_list = [WNlemma.lemmatize(t) for t in words]
print(refined_list)
Thank you for helping me.

You need to set pos tag parameter from lemmatize as VERB. By default it is NOUN.
So it considers everything as NOUN even if you pass the VERB.
import nltk
words = ['drink', 'drinking']
WNlemma = nltk.WordNetLemmatizer()
refined_list = [WNlemma.lemmatize(t, pos='v') for t in words]
print(refined_list)
Output:
['drink', 'drink']

Related

word tokenization takes too much time to run

I use Pythainlp package to tokenize my Thai language data for doing sentiment analysis.
first, I build a function to add new words set and tokenize it
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie
from pythainlp import word_tokenize
def text_tokenize(Mention):
new_words = {'คนละครึ่ง', 'ยืนยันตัวตน', 'เติมเงิน', 'เราชนะ', 'เป๋าตัง', 'แอปเป๋าตัง'}
words = new_words.union(thai_words())
custom_dictionary_trie = dict_trie(words)
dataa = word_tokenize(Mention, custom_dict=custom_dictionary_trie, keep_whitespace=False)
return dataa
after that I apply it within my text_process function which including remove punctuation and stop words.
puncuations = '''.?!,;:-_[]()'/<>{}\##$&%~*ๆฯ'''
from pythainlp import word_tokenize
def text_process(Mention):
final = "".join(u for u in Mention if u not in puncuations and ('ๆ', 'ฯ'))
final = text_tokenize(final)
final = " ".join(word for word in final)
final = " ".join(word for word in final.split() if word.lower not in thai_stopwords)
return final
dff['text_tokens'] = dff['Mention'].apply(text_process)
dff
the point is it takes too long to run this function. it took 17 minutes and still not finished. I tried to replace
final = text_tokenize(final) with final = word_tokenize(final)
and it took just 2 minutes but I can't no longer use it because I need to add new custom dictionary. I know there is something wrong but really don't know how to fix it
I am new to python and nlp so please help.
Ps. sorry for my broken English
I am not familiar with Thai language, but assume that for tokenization you can also use language agnostic tokenization tools.
If you want to perform word tokenization, try the example below:
from nltk.tokenize import word_tokenize
s = '''This is the text I want to tokenize'''
word_tokenize(s)
>>> ['This', 'is', 'the', 'text', 'I', 'want', 'to', 'tokenize']

How to get n-grams from a column in pandas dataframe

I have some doubts regarding n-grams.
Specifically, I would like to extract 2-grams, 3-grams and 4-grams from the following column:
Sentences
For each topic, we will explore the words occuring in that topic and its relative weight.
We will check where our test document would be classified.
For each document we create a dictionary reporting how many
words and how many times those words appear.
Save this to ‘bow_corpus’, then check our selected document earlier.
To do this, I used the following function
def n_grams(lines , min_length=2, max_length=4):
lenghts=range(min_length,max_length+1)
ngrams={length:collections.Counter() for length in lengths)
queue= collection.deque(maxlen=max_length)
but it does not work since I got None as output.
Can you please tell me what is wrong in the code?
Your ngrams dictionary has empty Counter() objects because you don't pass anything to count. There are also a few other problems:
Function names can't include - in Python.
collection.deque is invalid, I think you wanted to call collections.deque()
I think there are better options to fix your code than using collections library. Two of them are as follows:
You might fix your function using list comprehension:
def n_grams(lines, min_length=2, max_length=4):
tokens = lines.split()
ngrams = dict()
for n in range(min_length, max_length + 1):
ngrams[n] = [tokens[i:i+n] for i in range(len(tokens)-n+1)]
return ngrams
Or you might use nltk which supports tokenization and n-grams natively.
from nltk import ngrams
from nltk.tokenize import word_tokenize
def n_grams(lines, min_length=2, max_length=4):
tokens = word_tokenize(lines)
ngrams = {n: ngrams(tokens, n) for n in range(min_length, max_length + 1)}
return ngrams

Built-in function to get the frequency of one word with spaCy?

I'm looking for faster alternatives to NLTK to analyze big corpora and do basic things like calculating frequencies, PoS tagging etc... SpaCy seems great and easy to use in many ways, but I can't find any built-in function to count the frequency of a specific word for example. I've looked at the spaCy documentation, but I can't find a straightforward way to do it. Am I missing something?
What I would like would be the NLTK equivalent of:
tokens.count("word") #where tokens is the tokenized text in which the word is to be counted
In NLTK, the above code would tell me that in my text, the word "word" appears X number of times.
Note that I've come by the count_by function, but it doesn't seem to do what I'm looking for.
I use spaCy for frequency counts in corpora quite often. This is what I usually do:
import spacy
nlp = spacy.load("en_core_web_sm")
list_of_words = ['run', 'jump', 'catch']
def word_count(string):
words_counted = 0
my_string = nlp(string)
for token in my_string:
# actual word
word = token.text
# lemma
lemma_word = token.lemma_
# part of speech
word_pos = token.pos_
if lemma_word in list_of_words:
words_counted += 1
print(lemma_word)
return words_counted
sentence = "I ran, jumped, and caught the ball."
words_counted = word_count(sentence)
print(words_counted)
Python stdlib includes collections.Counter for this kind of purpose. You have not given me an answer if this answer suits your case.
from collections import Counter
text = "Lorem Ipsum is simply dummy text of the ...."
freq = Counter(text.split())
print(freq)
>>> Counter({'the': 6, 'Lorem': 4, 'of': 4, 'Ipsum': 3, 'dummy': 2 ...})
print(freq['Lorem'])
>>> 4
Alright just to give some time reference, I have used this script,
import random, timeit
from collections import Counter
def loadWords():
with open('corpora.txt', 'w') as corpora:
randWords = ['foo', 'bar', 'life', 'car', 'wrong',\
'right', 'left', 'plain', 'random', 'the']
for i in range(100000000):
corpora.write(randWords[random.randint(0, 9)] + " ")
def countWords():
with open('corpora.txt', 'r') as corpora:
content = corpora.read()
myDict = Counter(content.split())
print("foo: ", myDict['foo'])
print(timeit.timeit(loadWords, number=1))
print(timeit.timeit(countWords, number=1))
Results,
149.01646934738716
foo: 9998872
18.093295297389773
Still I am not sure if this is enough for you.
Updating with this answer as this is the page I found when searching for an answer for this specific problem. I find that this is an easier solution than the ones provided before and that only uses spaCy.
As you mentioned spaCy Doc object has the built in method Doc.count_by. From what I understand of your question it does what you ask for but it is not obvious.
It counts the occurances of an given attribute and returns a dictionary with the attributes hash as key in integer form and the counts.
Solution
First of all we need to import ORTH from spacy.attr. ORTH is the exact verbatim text of a token. We also need to load the model and provide a text.
import spacy
from spacy.attrs import ORTH
nlp = spacy.load("en_core_web_sm")
doc = nlp("apple apple orange banana")
Then we create a dictionary of word counts
count_dict = doc.count_by(ORTH)
You could count by other attributes like LEMMA, just import the attribute you wish to use.
If we look at the dictionary we will se that it contains the hash for the lexeme and the word count.
count_dict
Results:
{8566208034543834098: 2, 2208928596161743350: 1, 2525716904149915114: 1}
We can get the text for the word if we look up the hash in the vocab.
nlp.vocab.strings[8566208034543834098]
Returns
'apple'
With this we can create a simple function that takes the search word and a count dict created with the Doc.count_by method.
def get_word_count(word, count_dict):
return count_dict[nlp.vocab.strings[word]]
If we run the function with our search word 'apple' and the count dict we created earlier
get_word_count('apple', count_dict)
We get:
2
https://spacy.io/api/doc#count_by

Replacement by synsets in Python pattern packatge

My goal is to create a system that will be able to take any random text, extract sentences, remove punctuations, and then, on the bare sentence (one of them), to randomly replace NN or VB tagged words with their meronym, holonym or synonim as well as with a similar word from a WordNet synset. There is a lot of work ahead, but I have a problem at the very beginning.
For this I use pattern and TextBlob packages. This is what I have done so far...
from pattern.web import URL, plaintext
from pattern.text import tokenize
from pattern.text.en import wordnet
from textblob import TextBlob
import string
s = URL('http://www.fangraphs.com/blogs/the-fringe-five-baseballs-most-compelling-fringe-prospects-35/#more-157570').download()
s = plaintext(s, keep=[])
secam = (tokenize(s, punctuation=""))
simica = secam[15].strip(string.punctuation)
simica = simica.replace(",", "")
simica = TextBlob(simica)
simicaTg = simica.words
synsimica = wordnet.synsets(simicaTg[3])[0]
djidja = synsimica.hyponyms()
Now everything works the way I want but when I try to extract the i.e. hyponym from this djidja variable it proves to be impossible since it is a Synset object, and I can't manipulate it anyhow.
Any idea how to extract a the very word that is reported in hyponyms list (i.e. print(djidja[2]) displays Synset(u'bowler')...so how to extract only 'bowler' from this)?
Recall that a synset is just a list of words marked as synonyms. Given a sunset, you can extract the words that form it:
from pattern.text.en import wordnet
s = wordnet.synsets('dog')[0] # a word can belong to many synsets, let's just use one for the sake of argument
print(s.synonyms)
This outputs:
Out[14]: [u'dog', u'domestic dog', u'Canis familiaris']
You can also extract hypernims and hyponyms:
print(s.hypernyms())
Out[16]: [Synset(u'canine'), Synset(u'domestic animal')]
print(s.hypernyms()[0].synonyms)
Out[17]: [u'canine', u'canid']

How to remove stop words using nltk or python

I have a dataset from which I would like to remove stop words.
I used NLTK to get a list of stop words:
from nltk.corpus import stopwords
stopwords.words('english')
Exactly how do I compare the data to the list of stop words, and thus remove the stop words from the data?
from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
You could also do a set diff, for example:
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
To exclude all type of stop-words including nltk stop-words, you could do something like this:
from stop_words import get_stop_words
from nltk.corpus import stopwords
stop_words = list(get_stop_words('en')) #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)
output = [w for w in word_list if not w in stop_words]
I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:
filtered_word_list = word_list[:] #make a copy of the word_list
for word in word_list: # iterate over word_list
if word in stopwords.words('english'):
filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword
There's a very simple light-weight python package stop-words just for this sake.
Fist install the package using:
pip install stop-words
Then you can remove your words in one line using list comprehension:
from stop_words import get_stop_words
filtered_words = [word for word in dataset if word not in get_stop_words('english')]
This package is very light-weight to download (unlike nltk), works for both Python 2 and Python 3 ,and it has stop words for many other languages like:
Arabic
Bulgarian
Catalan
Czech
Danish
Dutch
English
Finnish
French
German
Hungarian
Indonesian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian
Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):
STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text
Use textcleaner library to remove stopwords from your data.
Follow this link:https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds
Follow these steps to do so with this library.
pip install textcleaner
After installing:
import textcleaner as tc
data = tc.document(<file_name>)
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default
Use above code to remove the stop-words.
Although the question is a bit old, here is a new library, which is worth mentioning, that can do extra tasks.
In some cases, you don't want only to remove stop words. Rather, you would want to find the stopwords in the text data and store it in a list so that you can find the noise in the data and make it more interactive.
The library is called 'textfeatures'. You can use it as follows:
! pip install textfeatures
import textfeatures as tf
import pandas as pd
For example, suppose you have the following set of strings:
texts = [
"blue car and blue window",
"black crow in the window",
"i see my reflection in the window"]
df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df
Now, call the stopwords() function and pass the parameters you want:
tf.stopwords(df,"text","stopwords") # extract stop words
df[["text","stopwords"]].head() # give names to columns
The result is going to be:
text stopwords
0 blue car and blue window [and]
1 black crow in the window [in, the]
2 i see my reflection in the window [i, my, in, the]
As you can see, the last column has the stop words included in that docoument (record).
you can use this function, you should notice that you need to lower all the words
from nltk.corpus import stopwords
def remove_stopwords(word_list):
processed_word_list = []
for word in word_list:
word = word.lower() # in case they arenet all lower cased
if word not in stopwords.words("english"):
processed_word_list.append(word)
return processed_word_list
using filter:
from nltk.corpus import stopwords
# ...
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if not w in stop_words]
filtered_sentence = []
for w in word_tokens:
if w not in stop_words:
filtered_sentence.append(w)
print(word_tokens)
print(filtered_sentence)
I will show you some example
First I extract the text data from the data frame (twitter_df) to process further as following
from nltk.tokenize import word_tokenize
tweetText = twitter_df['text']
Then to tokenize I use the following method
from nltk.tokenize import word_tokenize
tweetText = tweetText.apply(word_tokenize)
Then, to remove stop words,
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tweetText = tweetText.apply(lambda x:[word for word in x if word not in stop_words])
tweetText.head()
I Think this will help you
In case your data are stored as a Pandas DataFrame, you can use remove_stopwords from textero that use the NLTK stopwords list by default.
import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])

Categories

Resources