I have a dataset from which I would like to remove stop words.
I used NLTK to get a list of stop words:
from nltk.corpus import stopwords
stopwords.words('english')
Exactly how do I compare the data to the list of stop words, and thus remove the stop words from the data?
from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
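Note that membership tests against a plain list are linear, so for a large word_list it can help to build the stop word set once before filtering; a minimal sketch of the same comprehension with a set:
from nltk.corpus import stopwords

stop_set = set(stopwords.words('english'))  # build the lookup once; set membership checks are O(1)
filtered_words = [word for word in word_list if word not in stop_set]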
You could also do a set difference, for example (note that this discards duplicate words and word order):
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
To exclude all types of stop words, including the NLTK stop words, you could do something like this:
from stop_words import get_stop_words
from nltk.corpus import stopwords
stop_words = list(get_stop_words('en')) #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)
output = [w for w in word_list if w not in stop_words]
I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:
from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove the word from the copy if it is a stopword
There's a very simple, lightweight Python package, stop-words, for exactly this purpose.
First, install the package using:
pip install stop-words
Then you can remove your words in one line using list comprehension:
from stop_words import get_stop_words
filtered_words = [word for word in dataset if word not in get_stop_words('english')]
This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and it has stop words for many other languages, such as:
Arabic
Bulgarian
Catalan
Czech
Danish
Dutch
English
Finnish
French
German
Hungarian
Indonesian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian
Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text
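A quick usage sketch of the same idea, with a made-up example string:
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = "this is a sample sentence about removing the stop words"
text = ' '.join([word for word in text.split() if word not in STOPWORDS])
print(text)  # "sample sentence removing stop words"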
Use the textcleaner library to remove stopwords from your data.
Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds
Follow these steps to do so with this library.
pip install textcleaner
After installing:
import textcleaner as tc
data = tc.document(<file_name>)
# you can also pass a list of sentences to the document class constructor.
data.remove_stpwrds()  # inplace is set to False by default
Use the code above to remove the stop words.
Although the question is a bit old, here is a new library worth mentioning that can do extra tasks.
In some cases, you don't only want to remove stop words; rather, you want to find the stop words in the text data and store them in a list, so that you can see the noise in the data and work with it more interactively.
The library is called 'textfeatures'. You can use it as follows:
! pip install textfeatures
import textfeatures as tf
import pandas as pd
For example, suppose you have the following set of strings:
texts = [
"blue car and blue window",
"black crow in the window",
"i see my reflection in the window"]
df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df
Now, call the stopwords() function and pass the parameters you want:
tf.stopwords(df,"text","stopwords") # extract stop words
df[["text","stopwords"]].head() # give names to columns
The result is going to be:
text stopwords
0 blue car and blue window [and]
1 black crow in the window [in, the]
2 i see my reflection in the window [i, my, in, the]
As you can see, the last column has the stop words included in that document (record).
You can use this function; note that you need to lowercase all the words:
from nltk.corpus import stopwords

def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lower cased
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
Using filter:
from nltk.corpus import stopwords
# ...
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print(word_tokens)
print(filtered_sentence)
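For reference, the printed output should look roughly like this (note that 'This' survives, because the stop word list is all lower case):
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']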
I will show you an example.
First, I extract the text data from the data frame (twitter_df) for further processing, as follows:
from nltk.tokenize import word_tokenize
tweetText = twitter_df['text']
Then, to tokenize, I use the following method:
from nltk.tokenize import word_tokenize
tweetText = tweetText.apply(word_tokenize)
Then, to remove stop words,
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tweetText = tweetText.apply(lambda x:[word for word in x if word not in stop_words])
tweetText.head()
I think this will help you.
In case your data are stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopwords list by default.
import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
Related
Hi Stack Overflow community!
Long-time reader but first-time poster. I'm currently trying my hand at NLP, and after reading a few forum posts touching on this topic, I can't seem to get the lemmatizer to work properly (function pasted below). Comparing my original text to the preprocessed text, all the cleaning steps work as expected, except the lemmatization. I've even tried specifying the part of speech 'v', so that words aren't treated as nouns by default and verbs are reduced to their base form (ex: turned -> turn, are -> be, reading -> read), but this doesn't seem to be working.
Appreciate another set of eyes and feedback - thanks!
# key imports
import nltk
import pandas as pd
import numpy as np
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer
import contractions

# cleaning functions
def to_lower(text):
    '''
    Convert text to lowercase
    '''
    return text.lower()

def remove_punct(text):
    return ''.join(c for c in text if c not in punctuation)

def remove_stopwords(text):
    '''
    Removes stop words which don't have meaning (ex: is, the, a, etc.)
    '''
    additional_stopwords = ['app']
    stop_words = set(stopwords.words('english')) - set(['not', 'out', 'in'])
    stop_words = stop_words.union(additional_stopwords)
    return ' '.join([w for w in nltk.word_tokenize(text) if w not in stop_words])

def fix_contractions(text):
    '''
    Expands contractions
    '''
    return contractions.fix(text)

# preprocessing pipeline
def preprocess(text):
    # convert to lower case
    lower_text = to_lower(text)
    sentence_tokens = sent_tokenize(lower_text)
    word_list = []
    for each_sent in sentence_tokens:
        # fix contractions
        clean_text = fix_contractions(each_sent)
        # remove punctuation
        clean_text = remove_punct(clean_text)
        # filter out stop words
        clean_text = remove_stopwords(clean_text)
        # get base form of word
        wnl = WordNetLemmatizer()
        for part_of_speech in ['v']:
            lemmatized_word = wnl.lemmatize(clean_text, part_of_speech)
        # split the sentence into word tokens
        word_tokens = word_tokenize(lemmatized_word)
        for i in word_tokens:
            word_list.append(i)
    return word_list
# lemmatize not properly working to get base form of word
# ex: 'turned' still remains 'turned' without returning base form 'turn'
# ex: 'running' still remains 'running' without getting base form 'run'
sample_data = posts_with_text['post_text'].head(5)
print(sample_data)
sample_data.apply(preprocess)
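For what it's worth, WordNetLemmatizer.lemmatize operates on a single word, so passing the whole cleaned sentence string through it leaves the text essentially unchanged. A minimal sketch of lemmatizing token by token with the 'v' part of speech (using a made-up cleaned sentence) looks like this:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

wnl = WordNetLemmatizer()
clean_text = "they turned and are reading"  # hypothetical cleaned sentence
# lemmatize each token individually instead of the whole sentence string
lemmatized_tokens = [wnl.lemmatize(tok, 'v') for tok in word_tokenize(clean_text)]
print(lemmatized_tokens)  # e.g. ['they', 'turn', 'and', 'be', 'read']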
This program is meant to find similarities between sentences and words, and how similar they are via their synonyms.
I have downloaded NLTK.
When I first coded it, it ran with no errors, but after some days, running the program gives me this error: AttributeError: 'list' object has no attribute '_all_hypernyms'
The error is caused by wn.wup_similarity.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
database=[]
main_sen=[]
words=[]
range_is=[0.78]
word_check=[0.1]
main_sentence="the world in ending with the time is called hello"
database_word=["known","complete"]
stopwords = stopwords.words('english')
words = word_tokenize(main_sentence)
filtered_sentences = []
for word in words:
    if word not in stopwords:
        filtered_sentences.append(word)
print(filtered_sentences)

for databasewords in database_word:
    database.append(wn.synsets(databasewords)[0])
print(database)

for sentences in filtered_sentences:
    main_sen.append(wn.synsets(sentences))
print(main_sen)

# Error is in below lines
for data in database:
    for sen in main_sen:
        word_check.append(wn.wup_similarity(data, sen))
        if word_check > range_is:
            count = +1
print(count)
I am not sure what you are trying to achieve with this code, but I can see why it is breaking. The problem is in this line:
word_check.append(wn.wup_similarity(data, sen))
You are passing lists of synsets, which is a problem because the function cannot handle a list of synsets; it expects individual synsets.
So if you want to compare all the synsets that are created, you will need an additional for loop to extract each synset from the list.
However, if you want to compare the first synset of the first word with the first synset of the second word, you can simply do this:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn

database = []
main_sen = []
words = []
range_is = [0.78]
word_check = [0.1]

main_sentence = "the world in ending with the time is called hello"
database_word = ["known", "complete"]

stopwords = stopwords.words('english')
words = word_tokenize(main_sentence)

filtered_sentences = []
for word in words:
    if word not in stopwords:
        filtered_sentences.append(word)
print(filtered_sentences)

for databasewords in database_word:
    database.append(wn.synsets(databasewords)[0])
print(database)

for sentences in filtered_sentences:
    main_sen.append(wn.synsets(sentences)[0])
print(main_sen)

# Error is in below lines
for data in database:
    for sen in main_sen:
        word_check.append(wn.wup_similarity(data, sen))
        if word_check > range_is:
            count = +1
print(count)
Also, you need to be careful with the if statement word_check > range_is, because there you are comparing a list rather than a value; you should check the similarity outcome itself:
if wn.wup_similarity(data, sen) > range_is:
count = +1
I'm trying to apply a stopwords list to a text, but first I need to take some words off this list. The problem is that when I apply it to my text, the result is an infinite loop.
stop_words = set(stopwords.words('french'))
negation = ['n', 'pas', 'ne']
remove_words = [word for word in stop_words if word not in negation]
stopwords_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, remove_words)))
replace_stopwords = stopwords_regex.sub('', text)
print(replace_stopwords)
It's difficult to give an example because with one phrase it works, but with a collection of many strings, the stopwords are removed but the program never stops.
You can first tokenize your corpus using nltk.RegexpTokenizer, then remove your modified stopwords:
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
content_french = ("John Richard Bond explique pas le rĂ´le de l'astronomie.")
# initialize & apply tokenizer
tokenizer = RegexpTokenizer(r'''\w'|\w+|[^\w\s]''')
content_french_token = tokenizer.tokenize(content_french)
# initialize & modify stopwords
stop_words = set(stopwords.words('french'))
negation = {'n', 'pas', 'ne'}
stop_words = set([wrd for wrd in stop_words if wrd not in negation])
# modify your text
content_french = " ".join([wrd for wrd in content_french_token if wrd not in stop_words])
I am preprocessing a text and want to remove common stopwords in german. This works almost fine with the following code [final_wordlist as example data]:
from nltk.corpus import stopwords
final_wordlist =['Status', 'laufende', 'Projekte', 'bei', 'Stand', 'Ende', 'diese', 'Bei']
stopwords_ger = stopwords.words('german')
filtered_words = [w for w in final_wordlist if w not in stopwords_ger]
print(filtered_words)
This yields:
['Status', 'laufende', 'Projekte', 'Stand', 'Ende', 'Bei']
But as you can see, the upper-case 'Bei' is not removed (as it should be), since the stopwords from nltk are all lower case. Is there an easy way to remove all stopwords case-insensitively?
Try this: filtered_words = [w for w in final_wordlist if w.lower() not in stopwords_ger]
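Applied to the example list from the question, a minimal sketch of this looks like the following (same names as above):
from nltk.corpus import stopwords

final_wordlist = ['Status', 'laufende', 'Projekte', 'bei', 'Stand', 'Ende', 'diese', 'Bei']
stopwords_ger = stopwords.words('german')

# lowercase each token before checking it against the (all lower-case) stopword list
filtered_words = [w for w in final_wordlist if w.lower() not in stopwords_ger]
print(filtered_words)  # expected: ['Status', 'laufende', 'Projekte', 'Stand', 'Ende']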
I am new to Python and am just trying to write a simple program that would stem the words in a string and output an array/list that contains the stemmed words, but I can't seem to get them into a single array. This is the code I have, and I have included the output. Thanks in advance for any help!
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
new_text = "My two friends are getting married tomorrow and I could not be
more excited for them"
words = word_tokenize(new_text)
for w in words:
    stems = [ps.stem(w)]
    print(stems)
My output:
['My']
['two']
['friend']
['are']
['get']
['marri']
['tomorrow']
['and']
['I']
['could']
['not']
['be']
['more']
['excit']
['for']
['them']
You can use a list comprehension instead.
print([ps.stem(w) for w in words])
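If you prefer an explicit loop, build one list and append to it instead of re-creating a one-element list on each iteration; a minimal sketch, assuming the same words as above:
stems = []
for w in words:
    stems.append(ps.stem(w))  # collect every stem in a single list
print(stems)
# the per-word stems from the question, collected into one list:
# ['My', 'two', 'friend', 'are', 'get', 'marri', 'tomorrow', 'and', 'I', 'could', 'not', 'be', 'more', 'excit', 'for', 'them']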