I am new to Python's nltk.
Currently, I have a program that runs word_tokenize on a sentence. The resulting tokens are then processed to correct the capitalization of some nouns. That part works fine; now I want to turn the processed tokens back into a sentence. I could do this with a loop, adding a space after each token, but that breaks for words like "it's", "I'm", and "don't", because word_tokenize stores them as separate tokens. Joining with spaces turns them into "it 's", "I 'm", "don 't", and so on.
Is there an nltk function that converts the tokens back into a sentence properly?
nltk has a TreebankWordDetokenizer, which can reconstruct a sentence from a list of tokens:
from nltk import word_tokenize
tokens = word_tokenize("I'm happy because it's a good book")
print(tokens)
#['I', "'m", 'happy', 'because', 'it', "'s", 'a', 'good', 'book']
from nltk.tokenize.treebank import TreebankWordDetokenizer
reconstructedSentence = TreebankWordDetokenizer().detokenize(tokens)
print(reconstructedSentence)
#I'm happy because it's a good book
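Putting this together with the capitalization step from the question, a minimal sketch might look like the following (the proper-noun set and the title-casing rule are hypothetical stand-ins for your own correction logic):

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Tokens as word_tokenize would produce them for "i'm reading harry potter"
tokens = ['i', "'m", 'reading', 'harry', 'potter']

# Hypothetical capitalization fix: title-case a known set of proper nouns
proper_nouns = {'harry', 'potter'}
fixed = [t.title() if t in proper_nouns else t for t in tokens]

# The detokenizer re-attaches contraction tokens such as "'m"
sentence = TreebankWordDetokenizer().detokenize(fixed)
print(sentence)
```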
This is my code. I want to remove all the stopwords from the sentence, but I still get the word 'i' in the output.
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
en_stops=set(stopwords)
x='I am a good boy. I always pay by debts'
[item.lower().rstrip() for item in x.split() if item not in en_stops]
Output I get:
['i', 'good', 'boy.', 'i', 'always', 'pay', 'debts']
NLTK's stop words are all lowercase, so you need to convert your words to lowercase as well before doing the membership check. You can change the last line of your code snippet to make it work:
[item.rstrip() for item in x.lower().split() if item not in en_stops]
Update:
As suggested in the comments, for more robustness we can use the built-in tokenizers instead of str.split() to take care of punctuation. In that case the code snippet would look something like this:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize
stopwords = stopwords.words('english')
en_stops=set(stopwords)
x = 'I am a good boy. I always pay by debts'
tokenized_sentences = list()
exclusion_set = en_stops.union(string.punctuation)
for sent in sent_tokenize(x):
    tokenized_sentences.append([word for word in word_tokenize(sent.lower()) if word not in exclusion_set])
The tokenized sentences would look like this:
[['good', 'boy'], ['always', 'pay', 'debts']]
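One more note on the lowercase fix: if you'd rather keep the original casing in the output, you can lowercase only inside the membership test. A self-contained sketch (en_stops here is a small illustrative subset of the NLTK English list, so no corpus download is needed):

```python
# Illustrative subset of stopwords.words('english')
en_stops = {'i', 'am', 'a', 'by'}

x = 'I am a good boy. I always pay by debts'
# Lowercase only for the membership check; kept items stay as-is
filtered = [item for item in x.split() if item.lower() not in en_stops]
print(filtered)  # ['good', 'boy.', 'always', 'pay', 'debts']
```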
I've reviewed the official documentation of the text_to_word_sequence method in Keras.
The code listed in the documentation is:
keras.preprocessing.text.text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
I was wondering if there is a way to add something to the filters parameter that would also remove stopwords (such as those from the nltk list of stopwords), i.e.
from nltk.corpus import stopwords
stopwords.words('english')
I am aware that we can also remove stopwords via regular expressions (using a compiled pattern, _sre.SRE_Pattern), as below:
import re
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
phrase = pattern.sub('', phrase)
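For reference, here is a self-contained version of that regex approach, using a small inline subset of the stopword list in place of nltk so it runs without corpus downloads:

```python
import re

# Illustrative subset of stopwords.words('english')
stop_words = ['the', 'is', 'in']
pattern = re.compile(r'\b(' + '|'.join(stop_words) + r')\b\s*', flags=re.IGNORECASE)

# Each whole stopword (and its trailing whitespace) is deleted
phrase = pattern.sub('', 'The cat is in the hat!!!')
print(phrase)  # cat hat!!!
```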
My minimal verifiable example is:
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
text_to_word_sequence("The cat is in the hat!!!")
Output: ['the', 'cat', 'is', 'in', 'the', 'hat']
I would like the output to be :
['cat', 'hat']
My question is this:
Is there a way, using the filters parameter of the text_to_word_sequence method, to automatically filter out stopwords along with the special characters it filters out by default, such as by passing a compiled pattern (_sre.SRE_Pattern)?
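As far as I can tell from the documentation, filters is a string of individual characters to strip, not whole words, so stopwords have to be removed in a second pass over the returned word list. A sketch (with a small illustrative stopword subset in place of the full nltk list):

```python
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# Illustrative subset of stopwords.words('english')
stop_words = {'the', 'is', 'in'}

# text_to_word_sequence strips punctuation and lowercases; stopwords
# are then filtered out in a separate list comprehension
words = text_to_word_sequence("The cat is in the hat!!!")
result = [w for w in words if w not in stop_words]
print(result)  # ['cat', 'hat']
```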
This question already has answers here:
How to prevent splitting specific words or phrases and numbers in NLTK?
A string can be tokenized, with unnecessary stopwords removed, using nltk.tokenize. But how can I keep a phrase that contains stopwords as a single token, while removing the other stopwords?
For example:
Input: Trump is the President of the United States.
Output: ['Trump','President of the United States']
How can I get the result that just removes 'is' and first 'the' but doesn't remove 'of' and second 'the'?
You can use nltk's Multi-Word Expression Tokenizer (MWETokenizer), which allows you to merge multi-word expressions into single tokens. You can create a lexicon of multi-word expressions and add entries to it like this:
from nltk.tokenize import MWETokenizer
mwetokenizer = MWETokenizer([('President','of','the','United','States')], separator=' ')
mwetokenizer.add_mwe(('President','of','France'))
Note that MWETokenizer takes a list of tokens as input and re-tokenizes it. So, first tokenize the sentence, e.g. with word_tokenize(), and then feed it into the MWETokenizer:
from nltk.tokenize import word_tokenize
sentence = "Trump is the President of the United States, and Macron is the President of France."
mwetokenized_sentence = mwetokenizer.tokenize(word_tokenize(sentence))
# ['Trump', 'is', 'the', 'President of the United States', ',', 'and', 'Macron', 'is', 'the', 'President of France', '.']
Then, filter out stop-words to get the final filtered tokenized sentence:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_sentence = [token for token in mwetokenizer.tokenize(word_tokenize(sentence)) if token not in stop_words]
print(filtered_sentence)
Output:
['Trump', 'President of the United States', ',', 'Macron', 'President of France', '.']
I'm attempting to make a counter which uses a list of POS trigrams to check against a large list of trigrams and find their frequency.
My code so far is as follows:
from nltk import trigrams
from nltk.tokenize import wordpunct_tokenize
from nltk import bigrams
from collections import Counter
import nltk
text= ["This is an example sentence."]
trigram_top= ['PRP', 'MD', 'VB']
for words in text:
    tokens = wordpunct_tokenize(words)
    tags = nltk.pos_tag(tokens)
    trigram_list = trigrams(tags)
    list_tri = Counter(t for t in trigram_list if t in trigram_top)
    print list_tri
I get an empty counter back. How do I mend this?
In an earlier version I did get data back, but it kept counting up for every iteration (in the real program, text is a collection of different files).
Does anyone have an idea?
Let's put some print statements in there to debug:
from nltk import trigrams
from nltk.tokenize import wordpunct_tokenize
from nltk import bigrams
from collections import Counter
import nltk
text= ["This is an example sentence."]
trigram_top= ['PRP', 'MD', 'VB']
for words in text:
    tokens = wordpunct_tokenize(words)
    print tokens
    tags = nltk.pos_tag(tokens)
    print tags
    list_tri = Counter(t[0] for t in tags if t[1] in trigram_top)
    print list_tri
#['This', 'is', 'an', 'example', 'sentence', '.']
#[('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'), ('sentence', 'NN'), ('.', '.')]
#Counter()
Note that the list= part was redundant, and I've changed the generator to take just the word instead of the POS tag.
We can see that none of the POS tags directly match your trigram_top, so you may want to amend your comparison check to cater for VB/VBZ and the like. One possibility would be changing the line to:
list_tri=Counter (t[0] for t in tags if t[1].startswith(tuple(trigram_top)))
# Counter({'is': 1})
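If the goal is to count whole tag trigrams rather than individual words, a sketch in the same spirit could look like this (the tagged sentence is hypothetical, standing in for nltk.pos_tag output, and startswith covers variants such as VB/VBZ):

```python
from collections import Counter

# Hypothetical nltk.pos_tag output for "You should go now"
tags = [('You', 'PRP'), ('should', 'MD'), ('go', 'VB'), ('now', 'RB')]
trigram_top = ('PRP', 'MD', 'VB')

# Compare each tag trigram to trigram_top position by position
tag_seq = [tag for _, tag in tags]
tri_counts = Counter(
    t for t in zip(tag_seq, tag_seq[1:], tag_seq[2:])
    if all(a.startswith(b) for a, b in zip(t, trigram_top))
)
print(tri_counts)  # Counter({('PRP', 'MD', 'VB'): 1})
```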
I have a dataset from which I would like to remove stop words.
I used NLTK to get a list of stop words:
from nltk.corpus import stopwords
stopwords.words('english')
Exactly how do I compare the data to the list of stop words, and thus remove the stop words from the data?
from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
You could also do a set difference, for example:
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
Note that a set difference does not preserve word order or duplicates.
To exclude all types of stop words, including the nltk stop words, you could do something like this:
from stop_words import get_stop_words
from nltk.corpus import stopwords
stop_words = list(get_stop_words('en')) #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)
output = [w for w in word_list if w not in stop_words]
I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:
filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stopword
There's a very simple, light-weight Python package, stop-words, made just for this.
First install the package using:
pip install stop-words
Then you can remove your words in one line using a list comprehension:
from stop_words import get_stop_words
filtered_words = [word for word in dataset if word not in get_stop_words('english')]
This package is very light-weight to download (unlike nltk), works for both Python 2 and Python 3, and it has stop words for many other languages, such as:
Arabic
Bulgarian
Catalan
Czech
Danish
Dutch
English
Finnish
French
German
Hungarian
Indonesian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian
Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):
STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text
Use the textcleaner library to remove stopwords from your data.
Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds
Follow these steps to do so with this library.
pip install textcleaner
After installing:
import textcleaner as tc
data = tc.document(<file_name>)
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default
Use the above code to remove the stop words.
Although the question is a bit old, here is a new library worth mentioning that can do extra tasks.
In some cases, you don't only want to remove stop words. Rather, you would want to find the stop words in the text data and store them in a list, so that you can find the noise in the data and make it more interactive.
The library is called 'textfeatures'. You can use it as follows:
! pip install textfeatures
import textfeatures as tf
import pandas as pd
For example, suppose you have the following set of strings:
texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"]
df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df
Now, call the stopwords() function and pass the parameters you want:
tf.stopwords(df,"text","stopwords") # extract stop words
df[["text","stopwords"]].head() # display both columns
The result is going to be:
text stopwords
0 blue car and blue window [and]
1 black crow in the window [in, the]
2 i see my reflection in the window [i, my, in, the]
As you can see, the last column has the stop words included in that document (record).
You can use this function; note that you need to lowercase all the words.
from nltk.corpus import stopwords
def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercased
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
Using filter():
from nltk.corpus import stopwords
# ...
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
filtered_sentence = []
for w in word_tokens:
if w not in stop_words:
filtered_sentence.append(w)
print(word_tokens)
print(filtered_sentence)
I will show you an example.
First, I extract the text data from the data frame (twitter_df) for further processing, as follows:
from nltk.tokenize import word_tokenize
tweetText = twitter_df['text']
Then, to tokenize, I use the following method:
from nltk.tokenize import word_tokenize
tweetText = tweetText.apply(word_tokenize)
Then, to remove stop words,
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tweetText = tweetText.apply(lambda x:[word for word in x if word not in stop_words])
tweetText.head()
I think this will help you.
In case your data are stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopwords list by default.
import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])