Removing stop words that are not in the NLTK library in Python

I have been trying to remove stopwords from a CSV file, including some custom words that are not in the NLTK stopword list, but when I generate the new dataframe with an additional column that is supposed to be "cleaned", I still see some of those words in it. I am not sure what is wrong with my code, but here it is:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = stopwords.words('english')
print(len(stop_words))
stop_words.extend(["consist", "feature", "site", "mound", "medium", "density", "enclosure"])

lemma = WordNetLemmatizer()

def clean_review(review_text):
    # review_text = re.sub(r'http\S+', '', review_text)
    review_text = re.sub('[^a-zA-Z]', ' ', str(review_text))
    review_text = str(review_text).lower()
    review_text = word_tokenize(review_text)
    review_text = [word for word in review_text if word not in stop_words]
    # review_text = [stemmer.stem(i) for i in review_text]
    review_text = [lemma.lemmatize(word=w, pos='v') for w in review_text]
    review_text = [i for i in review_text if len(i) > 2]
    review_text = ' '.join(review_text)
    return review_text

filename['New_Column'] = filename['Column'].apply(clean_review)

You are lemmatizing the text after removing the stopwords, which is sometimes fine.
But you may have words that only match your stopword list after they have been lemmatized.
See this example:
>>> import nltk
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> print(lemmatizer.lemmatize("sites"))
site
>>>
So your script does not remove "sites": the word is not in your stopword list when the filter runs, and by the time lemmatization has turned it into "site" the stopword removal has already happened. Filtering the stopwords after lemmatizing fixes this.
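A minimal sketch of the reordered clean_review, assuming the same imports, stop_words list and lemma object as in the question, with the stopword filter moved after the lemmatization step (keeping the question's pos='v' choice):
def clean_review(review_text):
    review_text = re.sub('[^a-zA-Z]', ' ', str(review_text)).lower()
    review_text = word_tokenize(review_text)
    # lemmatize first, so e.g. "sites" becomes "site" before the filter runs
    review_text = [lemma.lemmatize(word=w, pos='v') for w in review_text]
    # now remove both the NLTK and the custom stopwords
    review_text = [word for word in review_text if word not in stop_words]
    review_text = [i for i in review_text if len(i) > 2]
    return ' '.join(review_text)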

Related

nltk.lemmatizer doesn't work for even a simple input text

Sorry guys, I'm new to NLP and I'm trying to apply the NLTK lemmatizer to the whole input text, but it doesn't seem to work even for a simple sentence.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer
def word_lemmatizer(text):
    lemmatizer = WordNetLemmatizer()
    lem_text = "".join([lemmatizer.lemmatize(w) for w in text])
    return lem_text
word_lemmatizer("bats flying balls")
[Output]:
'bats flying balls'
When lemmatizing, you need to pass one word at a time from the given text; in your code, individual characters are being passed to the lemmatizer. So try this:
def word_lemmatizer(text):
    lemmatizer = WordNetLemmatizer()
    lem_text = " ".join([lemmatizer.lemmatize(w) for w in text.split(" ")])  # edited code here
    return lem_text
word_lemmatizer("bats flying balls")
# Output:
'bat flying ball'
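If the input contains punctuation, splitting on a single space leaves tokens like "balls." unchanged. A small variant, only a sketch, that uses NLTK's word_tokenize instead of str.split so punctuation is separated from the words before lemmatizing:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def word_lemmatizer(text):
    lemmatizer = WordNetLemmatizer()
    # tokenize instead of splitting on spaces, so punctuation becomes its own token
    tokens = word_tokenize(text)
    return " ".join(lemmatizer.lemmatize(w) for w in tokens)

print(word_lemmatizer("bats flying balls"))  # 'bat flying ball'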

How can I remove suffixes using the nltk package in Python?

I changed "stem" to "suff", but it went wrong. Does anybody know how to use the suff function?
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import arlstem
for j in range(0, len(contextsWorld_1)):
    a = stemmer.stem(contextsWorld_1[j])
    contextsWorld_1_1.append(a)
print(contextsWorld_1_1)
suff is a method of the arlstem module, which is designed for stemming Arabic text.
If the goal is to remove the suffixes of English words, we want to stem them with something like the English SnowballStemmer. Here is a minimal example:
from nltk.stem.snowball import SnowballStemmer
words = ["running", "jumping", "playing"]
stemmed_words = []
stemmer = SnowballStemmer(language="english")
for word in words:
    a = stemmer.stem(word)
    stemmed_words.append(a)
print(stemmed_words)
Output:
['run', 'jump', 'play']

How to convert token list into wordnet lemma list using nltk?

I have a list of tokens extracted from a PDF source. I am able to preprocess the text and tokenize it, but I want to loop through the tokens and convert each token in the list to its lemma in the WordNet corpus. So, my token list looks like this:
['0000', 'Everyone', 'age', 'remembers', 'Þ', 'rst', 'heard', 'contest', 'I', 'sitting', 'hideout', 'watching', ...]
There are no lemmas for words like 'Everyone', '0000', 'Þ' and many more, which I need to eliminate. But for words like 'age', 'remembers', 'heard', etc., the token list is supposed to look like:
['age', 'remember', 'hear', ...]
I am checking the synonyms through this code:
syns = wn.synsets("heard")
print(syns[0].lemmas()[0].name())
At this point I have created the function clean_text() in Python for preprocessing. It looks like this:
def clean_text(text):
    # Eliminating punctuations
    text = "".join([word for word in text if word not in string.punctuation])
    # tokenizing
    tokens = re.split("\W+", text)
    # lemmatizing and removing stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # converting token list into synset
    syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
    return text
I am getting the error:
syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
AttributeError: 'list' object has no attribute 'lower'
How can I get the lemma list for each token?
The full code:
import string
import re
from wordcloud import WordCloud
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import wordnet
import PyPDF4
import matplotlib
import numpy as np
from PIL import Image
stopwords = nltk.corpus.stopwords.words('english')
moreStopwords = ['clin97803078874365pallr1indd'] # additional stopwords to be removed manually.
wn = nltk.WordNetLemmatizer()
data = PyPDF4.PdfFileReader(open('ReadyPlayerOne.pdf', 'rb'))
pageData = ''
for page in data.pages:
    pageData += page.extractText()
# print(pageData)

def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split("\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    syns = [text.lemmas()[0].name() for text in wordnet.synsets(text)]
    return syns

print(clean_text(pageData))
You are calling wordnet.synsets(text) with a list of words (check what text is at that point), but it should be called with a single word.
The preprocessing inside wordnet.synsets tries to apply .lower() to its argument, hence the error (AttributeError: 'list' object has no attribute 'lower').
Below is a working version of clean_text with a fix for this problem:
import string
import re
import nltk
from nltk.corpus import wordnet
stopwords = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()
def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split("\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    lemmas = []
    for token in text:
        lemmas += [synset.lemmas()[0].name() for synset in wordnet.synsets(token)]
    return lemmas
text = "The grass was greener."
print(clean_text(text))
Returns:
['grass', 'Grass', 'supergrass', 'eatage', 'pot', 'grass', 'grass', 'grass', 'grass', 'grass', 'denounce', 'green', 'green', 'green', 'green', 'fleeceable']
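If you only want one lemma per token, as in the desired ['age', 'remember', 'hear'] output from the question, a possible variant (just a sketch, with a hypothetical helper name) keeps the first synset's first lemma and falls back to the token itself when WordNet has no synset for it:
from nltk.corpus import wordnet

def first_lemma_per_token(tokens):
    out = []
    for token in tokens:
        synsets = wordnet.synsets(token)
        # take the first synset's first lemma, or keep the token if WordNet has no entry for it
        out.append(synsets[0].lemmas()[0].name() if synsets else token)
    return out

print(first_lemma_per_token(['age', 'remembers', 'heard']))  # expected: ['age', 'remember', 'hear']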

Removing stopwords from strings: why do I have an infinite loop?

I'm trying to apply a stopword list to a text, but first I need to take some words off the list. The problem is that when I apply it to my text, the result is an infinite loop:
stop_words = set(stopwords.words('french'))
negation = ['n', 'pas', 'ne']
remove_words = [word for word in stop_words if word not in negation]
stopwords_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, remove_words)))
replace_stopwords = stopwords_regex.sub('', text)
print(replace_stopwords)
It's difficult to give an example because it works with a single phrase, but with a collection of many strings the stopwords are removed and the program never finishes.
You can first tokenize your corpus using nltk.RegexpTokenizer, then remove your modified stopwords:
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
content_french = ("John Richard Bond explique pas le rôle de l'astronomie.")
# initialize & apply tokenizer
toknizer = RegexpTokenizer(r'''\w'|\w+|[^\w\s]''')
content_french_token = toknizer.tokenize(content_french)
# initialize & modify stopwords
stop_words = set(stopwords.words('french'))
negation = {'n', 'pas', 'ne'}
stop_words = set([wrd for wrd in stop_words if wrd not in negation])
# modify your text
content_french = " ".join([wrd for wrd in content_french_token if wrd not in stop_words])
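Since the question mentions a collection of many strings, here is a rough sketch, assuming the texts sit in a plain Python list named texts (a hypothetical name), that wraps the same tokenize-and-filter step in a small function and applies it to each string:
from nltk import RegexpTokenizer
from nltk.corpus import stopwords

tokenizer = RegexpTokenizer(r'''\w'|\w+|[^\w\s]''')
negation = {'n', 'pas', 'ne'}
stop_words = set(stopwords.words('french')) - negation  # keep the negation words

def remove_stopwords(text):
    # tokenize, drop the modified stopwords, and rebuild the string
    return " ".join(w for w in tokenizer.tokenize(text) if w not in stop_words)

texts = ["John Richard Bond explique pas le rôle de l'astronomie.",
         "Il ne faut pas oublier la négation."]
cleaned = [remove_stopwords(t) for t in texts]
print(cleaned)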

How to remove stop words using nltk or python

I have a dataset from which I would like to remove stop words.
I used NLTK to get a list of stop words:
from nltk.corpus import stopwords
stopwords.words('english')
Exactly how do I compare the data to the list of stop words, and thus remove the stop words from the data?
from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
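Note that the condition above calls stopwords.words('english') once for every word in word_list. For larger datasets it is usually faster to build the stopword list once and, as several answers below do, turn it into a set; a small sketch of that variant:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # build the lookup set once
filtered_words = [word for word in word_list if word not in stop_words]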
You could also do a set diff, for example:
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
To exclude all type of stop-words including nltk stop-words, you could do something like this:
from stop_words import get_stop_words
from nltk.corpus import stopwords
stop_words = list(get_stop_words('en')) #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)
output = [w for w in word_list if w not in stop_words]
I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:
filtered_word_list = word_list[:]  # make a copy of the word_list
for word in word_list:  # iterate over word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stopword
There's a very simple, lightweight Python package, stop-words, made just for this purpose.
First install the package using:
pip install stop-words
Then you can remove your words in one line using list comprehension:
from stop_words import get_stop_words
filtered_words = [word for word in dataset if word not in get_stop_words('english')]
This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and it has stop words for many other languages, such as:
Arabic
Bulgarian
Catalan
Czech
Danish
Dutch
English
Finnish
French
German
Hungarian
Indonesian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian
Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):
STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text
Use the textcleaner library to remove stopwords from your data.
Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds
Follow these steps to do so with this library.
pip install textcleaner
After installing:
import textcleaner as tc
data = tc.document(<file_name>)
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default
Use the above code to remove the stop words.
Although the question is a bit old, here is a new library worth mentioning that can do extra tasks.
In some cases, you don't want only to remove stop words; rather, you want to find the stop words in the text data and store them in a list, so that you can see the noise in the data and explore it interactively.
The library is called 'textfeatures'. You can use it as follows:
! pip install textfeatures
import textfeatures as tf
import pandas as pd
For example, suppose you have the following set of strings:
texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"]
df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df
Now, call the stopwords() function and pass the parameters you want:
tf.stopwords(df,"text","stopwords") # extract stop words
df[["text","stopwords"]].head() # give names to columns
The result is going to be:
text stopwords
0 blue car and blue window [and]
1 black crow in the window [in, the]
2 i see my reflection in the window [i, my, in, the]
As you can see, the last column has the stop words included in that document (record).
You can use this function; note that you need to lowercase all the words:
from nltk.corpus import stopwords
def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercased
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
using filter:
from nltk.corpus import stopwords
# ...
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
filtered_sentence = []
for w in word_tokens:
if w not in stop_words:
filtered_sentence.append(w)
print(word_tokens)
print(filtered_sentence)
I will show you an example.
First I extract the text data from the data frame (twitter_df) to process further, as follows:
from nltk.tokenize import word_tokenize
tweetText = twitter_df['text']
Then, to tokenize it, I use the following method:
from nltk.tokenize import word_tokenize
tweetText = tweetText.apply(word_tokenize)
Then, to remove stop words:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tweetText = tweetText.apply(lambda x:[word for word in x if word not in stop_words])
tweetText.head()
I think this will help you.
In case your data is stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopword list by default:
import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
