Trying to remove my own stopwords using join/split - python

I am trying to remove stopwords from df['Sentences'] because I need to plot the word frequencies.
My sample output looks like this:
...        13
London     12
holiday    11
photo       7
.           7
..
walk        1
dogs        1
I have built my own stopword list, my_dict, and I would like to use it to remove those stopwords from the column. What I have done so far is as follows:
import matplotlib.pyplot as plt
df['Sentences'] = df['Sentences'].apply(lambda x: ' '.join([item for item in x.split() if item not in my_dict]))
w_freq=df.Sentences.str.split(expand=True).stack().value_counts()
Although this does not give me any error, the stopwords and punctuation are still there. Also, I would like not to alter the column itself, but just look at the results for a short analysis (for example, by working on a copy of the original column).
How can I remove them?

Let's assume you have this dataframe with this really interesting conversation.
df = pd.DataFrame({'Sentences': ['Hello, how are you?',
                                 'Hello, I am fine. Have you watched the news',
                                 'Not really the news ...']})
print (df)
Sentences
0 Hello, how are you?
1 Hello, I am fine. Have you watched the news
2 Not really the news ...
Now, to remove the punctuation and the stopwords in my_dict, you can do it like this:
my_dict = ['a','i','the','you', 'am', 'are', 'have']
s = (df['Sentences'].str.lower()                 # lowercase to prevent any case problems
       .str.replace(r'[^\w\s]+', '', regex=True) # remove the punctuation (regex=True is needed in recent pandas)
       .str.split(' ')                           # create a list of words (note: splitting on ' ' can leave an empty-string token where trailing punctuation was stripped)
       .explode()                                # create a row per word of the lists
       .value_counts()                           # count occurrences
     )
s = s[~s.index.isin(my_dict)]  # remove the stopwords
print (s)  # you can see you don't have punctuation or stopwords
news 2
hello 2
watched 1
fine 1
not 1
really 1
how 1
1
Name: Sentences, dtype: int64
This might not be the fastest way, though.
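To address the part of the question about not altering the original column: the pipeline above never assigns back to df['Sentences'], so the DataFrame stays untouched. If you also want to keep a cleaned copy around, a minimal sketch (Sentences_clean is just an illustrative name, not from the original answer):
# Work on a copy so df['Sentences'] itself is never modified
clean = (df['Sentences'].str.lower()
           .str.replace(r'[^\w\s]+', '', regex=True))
df['Sentences_clean'] = clean.apply(
    lambda x: ' '.join(w for w in x.split() if w not in my_dict))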

Related

Most efficient way to remove list elements from column

I have a large DataFrame df with 40k+ rows:
words
0 coffee in the morning
1 good morning
2 hello my name is
3 hello world
...
I have a list of English stopwords that I want to remove from the column, which I do as follows:
df["noStopwords"] = df["words"].apply(lambda x: [i for i in x if i not in stopwords])
noStopwords
0 coffee morning
1 good morning
2 hello name
3 hello world
...
This works but takes too long. Is there an efficient way of doing this?
EDIT:
print(stopwords)
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across', ...]
Not sure if it's any faster, but you could try str.replace. Since words is actually a list, we have to join and split as well.
import re

pattern = rf"\b(?:{'|'.join(map(re.escape, stopwords))})\b"
df['noStopwords'] = (df.words
                       .str.join(' ')
                       .str.replace(pattern, '', regex=True)
                       .str.split(r'\s+')
                     )
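Another angle on the speed question, as a sketch: keep the original list comprehension but turn stopwords into a set first, so each membership test is O(1) instead of a linear scan over the list. This assumes df['words'] holds lists of lowercase tokens, as in the question.
stopword_set = set(stopwords)  # O(1) membership tests
df['noStopwords'] = [[w for w in row if w not in stopword_set]
                     for row in df['words']]
Since no strings are joined or re-split, this is often faster than the apply version on 40k+ rows.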

Find matching similar keywords in Python Dataframe

joined_Gravity1.head()
Comments
____________________________________________________
0 Why the old Pike/Lyrik?
1 This is good
2 So clean
3 Looks like a Decoy
Input: type(joined_Gravity1)
Output: pandas.core.frame.DataFrame
The following code allows me to select strings that contain keywords: "ender"
joined_Gravity1[joined_Gravity1["Comments"].str.contains("ender", na=False)]
Output:
Comments
___________________________
194 We need a new Sender 😂
7 What about the sender
179 what about the sender?😏
How to revise the code to include words similar to 'Sender' such as 'snder','bnder'?
I don't see a reason why regex=True inside the contains function wouldn't work here.
joined_Gravity1[joined_Gravity1["Comments"].str.contains(pat="ender|snder|bnder", na=False, regex=True)]
I have used "ender|snder|bnder" only. You can make a list of all such words, say list_words, and pass pat='|'.join(list_words) to the contains function above.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
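For example, a small sketch of that list_words idea (the word list here is only illustrative):
list_words = ['ender', 'snder', 'bnder']
pat = '|'.join(list_words)
joined_Gravity1[joined_Gravity1["Comments"].str.contains(pat=pat, na=False, regex=True)]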
There can be a massive number of possible letter combinations in such words. What you are trying to do is a fuzzy match between two strings. I can recommend the following:
#!pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process
word = 'sender'
others = ['bnder', 'snder', 'sender', 'hello']
process.extractBests(word, others)
[('sender', 100), ('snder', 91), ('bnder', 73), ('hello', 18)]
Based on this you can decide which threshold to choose, and then mark the ones above the threshold as a match (using the code you used above).
Here is a method to do this for your exact problem statement with a function:
df = pd.DataFrame(['hi there i am a sender',
                   'I dont wanna be a bnder',
                   'can i be the snder?',
                   'i think i am a nerd'], columns=['text'])

# s = sentence, w = match word, t = match threshold
def get_match(s, w, t):
    # Match each word in the sentence against the target word and
    # report True if any word scores above the threshold ratio.
    ss = process.extractBests(w, s.split())
    return any(i[1] > t for i in ss)

df['match'] = df['text'].apply(get_match, w='sender', t=70)
print(df)
print(df)
text match
0 hi there i am a sender True
1 I dont wanna be a bnder True
2 can i be the snder? True
3 i think i am a nerd False
Tweak the t value from 70 to 80 if you want a more exact match, or lower it for a more relaxed match.
Finally, you can filter on the match column:
df[df['match']==True][['text']]
text
0 hi there i am a sender
1 I dont wanna be a bnder
2 can i be the snder?
from difflib import get_close_matches

def closeMatches(patterns, word):
    print(get_close_matches(word, patterns))

# patterns must be a sequence of strings, not a DataFrame
patterns = joined_Gravity1["Comments"].tolist()
word = 'Sender'
closeMatches(patterns, word)
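Worth noting: get_close_matches also accepts n (the maximum number of matches returned) and cutoff (a similarity threshold between 0 and 1), so the strictness can be tuned. A small illustration on the words from the fuzzywuzzy answer above:
from difflib import get_close_matches

# A looser cutoff catches misspellings like 'snder' and 'bnder'
get_close_matches('sender', ['bnder', 'snder', 'sender', 'hello'], n=5, cutoff=0.6)
# -> ['sender', 'snder', 'bnder']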

Python Pandas handle special characters in strings

I am writing a function which I want to apply to a dataframe later.
def get_word_count(text, df):
    # text is a lowercase list of words
    # df is a dataframe with 2 columns: word and count
    # this function updates the word counts
    #f = open('stopwords.txt', 'r')
    #stopwords = f.read()
    stopwords = 'in the and an - '
    for word in text:
        if word not in stopwords:
            if df['word'].str.contains(word).any():
                df.loc[df['word'] == word, 'count'] = df['count'] + 1
            else:
                df.loc[0] = [word, 1]
                df.index = df.index + 1
    return df
Then I check it:
word_df=pd.DataFrame(columns=['word','count'])
sentence1='[first] - missing "" in the text [first] word'.split()
y=get_word_count(sentence1, word_df)
sentence2="error: wrong word in the [second] text".split()
y=get_word_count(sentence2, word_df)
y
I get the following results:
Word Count
[first] 2
missing 1
"" 1
text 2
word 2
error: 1
wrong 1
So where is [second] from sentence2?
If I omit one of the square brackets, I get an error message. How do I handle words with special characters? Note that I don't want to get rid of them, because then I would miss "" in sentence1.
The problem comes from the line:
if df['word'].str.contains(word).any():
This reports whether any entry in the word column contains the given word, and str.contains interprets its argument as a regular expression. [second] is therefore a character class matching any single one of the letters s, e, c, o, n, d, so it matches [first]. The exact-equality update on the next line then finds no row equal to [second], and the word is silently dropped.
For a quick fix, I changed the line to:
if word in df['word'].tolist():
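An alternative sketch that keeps str.contains but sidesteps the regex interpretation: pass regex=False so the brackets are matched literally (or escape the pattern with re.escape):
# Treat the word as a literal string, not a regex pattern
if df['word'].str.contains(word, regex=False).any():
    df.loc[df['word'] == word, 'count'] = df['count'] + 1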
Growing a DataFrame row by row in a loop like that is not recommended; you should do something like this instead (note that sentence1 and sentence2 are already lists, so there is nothing left to split):
stopwords = 'in the and an - '
words = sentence1 + sentence2  # both are already lists of words
df = pd.DataFrame(words, columns=['Words'])
df = df.groupby(by=['Words'])['Words'].size().reset_index(name='counts')
df = df[~df['Words'].isin(stopwords.split())]
print(df)
Words counts
0 "" 1
2 [first] 2
3 [second] 1
4 error: 1
6 missing 1
7 text 2
9 word 2
10 wrong 1
I have rebuilt it in a way that lets you add sentences and watch the frequencies grow:
from collections import Counter
from collections import defaultdict
import pandas as pd
def terms_frequency(corpus, stop_words=None):
    '''
    Takes in text and returns a pandas DataFrame of word frequencies.
    '''
    corpus_ = corpus.split()
    # remove stop words; splitting the stopword string avoids accidentally
    # dropping words that merely appear as substrings of it
    stop_words = set(stop_words.split()) if stop_words else set()
    terms = [word for word in corpus_ if word not in stop_words]
    terms_freq = pd.DataFrame.from_dict(Counter(terms), orient='index').reset_index()
    terms_freq = (terms_freq.rename(columns={'index': 'word', 0: 'count'})
                            .sort_values('count', ascending=False))
    terms_freq.reset_index(inplace=True)
    terms_freq.drop('index', axis=1, inplace=True)
    return terms_freq

def get_sentence(sentence, storage, stop_words=None):
    storage['sentences'].append(sentence)
    corpus = ' '.join(s for s in storage['sentences'])
    return terms_frequency(corpus, stop_words)

# tests
STOP_WORDS = 'in the and an - '
storage = defaultdict(list)

S1 = '[first] - missing "" in the text [first] word'
print(get_sentence(S1, storage, STOP_WORDS))

print('\nNext S2')
S2 = 'error: wrong word in the [second] text'
print(get_sentence(S2, storage, STOP_WORDS))

Stopword removal with pandas

I would like to remove stopwords from a column of a data frame.
Inside the column there is text which needs to be splitted.
For example my data frame looks like this:
ID Text
1 eat launch with me
2 go outside have fun
I want to apply stopword removal to the text column, so the text needs to be split first.
I tried this:
for item in cached_stop_words:
    if item in df_from_each_file[['text']]:
        print(item)
        df_from_each_file['text'] = df_from_each_file['text'].replace(item, '')
So my output should be like this:
ID Text
1 eat launch
2 go fun
It means stopwords have been deleted.
but it does not work correctly. I also tried it the other way around, turning my data frame column into a Series and looping through that, but that did not work either.
Thanks for your help.
replace by itself isn't a good fit here, because you want to perform partial string replacement; you want regex-based replacement.
One simple solution, when you have a manageable number of stop words, is using str.replace.
import re

p = re.compile("({})".format('|'.join(map(re.escape, cached_stop_words))))
df['Text'] = df['Text'].str.lower().str.replace(p, '', regex=True)  # regex=True is required for pattern objects in recent pandas
df
ID Text
0 1 eat launch
1 2 outside have fun
If performance is important, use a list comprehension.
cached_stop_words = set(cached_stop_words)
df['Text'] = [' '.join([w for w in x.lower().split() if w not in cached_stop_words])
              for x in df['Text'].tolist()]
df
ID Text
0 1 eat launch
1 2 outside have fun

Extracting a word and its prior 10 word context to a dataframe in Python

I'm fairly new to Python (2.7), so forgive me if this is a ridiculously straightforward question. I wish (i) to extract all the words ending in -ing from a text that has been tokenized with the NLTK library and (ii) to extract the 10 words preceding each word thus extracted. I then wish (iii) to save these to file as a dataframe of two columns that might look something like:
Word        PreviousContext
starting    stood a moment, as if in a troubled reverie; then
seeming     of it retraced our steps. But Elijah passed on, without
purchasing  a sharp look-out upon the hands: Bildad did all the
I know how to do (i), but am not sure how to go about doing (ii)-(iii). Any help would be greatly appreciated and acknowledged. So far I have:
>>> import bs4
>>> import nltk
>>> from nltk import word_tokenize
>>> url = "http://www.gutenberg.org/files/766/766-h/766-h.htm"
>>> import urllib
>>> response = urllib.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> tokens = word_tokenize(raw)
>>> for w in tokens:
...     if w.endswith("ing"):
...         print(w)
...
padding
padding
encoding
having
heading
wearying
dismissing
going
nothing
reading etc etc etc..
After the code line:
>>> tokens = word_tokenize(raw)
use the below code to generate words with their context:
>>> context = {}
>>> for i, w in enumerate(tokens):
...     if w.endswith("ing"):
...         # note: tokens[i:i+10] keeps the word itself plus the 9 tokens that
...         # FOLLOW it (matching the output below); use tokens[max(0, i-10):i]
...         # for the 10 tokens that precede it. Slicing never raises near the
...         # end of the list, so no try/except is needed.
...         context[w] = tokens[i:i+10]
...
>>> fp = open('dataframes', 'w')  # save results in this file
>>> fp.write('Word' + '\t\t' + 'PreviousContext\n')
>>> for word in context:
...     fp.write(word + '\t\t' + ' '.join(context[word]) + '\n')
...
>>> fp.close()
>>> fp = open('dataframes', 'r')
>>> for line in fp.readlines()[:10]:  # first 10 lines of generated file
...     print line
...
Word PreviousContext
raining raining , and I saw more fog and mud in
bidding bidding him good night , if he were yet sitting
growling growling old Scotch Croesus with great flaps of ears ?
bright-looking bright-looking bride , I believe ( as I could not
hanging hanging up in the shop&mdash ; went down to look
scheming scheming and devising opportunities of being alone with her .
muffling muffling her hands in it , in an unsettled and
bestowing bestowing them on Mrs. Gummidge. She was with him all
adorning adorning , the perfect simplicity of his manner , brought
Two things to note:
nltk treats punctuation as separate tokens, so punctuation marks count as separate words.
I've used a dictionary to store words with their context, so the order of the words is not preserved, and a word that occurs more than once keeps only the context of its last occurrence.
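If every occurrence matters, a small variation (a sketch, not part of the original answer) is to collect (word, context) pairs in a list instead of a dict:
# A list of (word, context) tuples keeps duplicates and original order
context = []
for i, w in enumerate(tokens):
    if w.endswith("ing"):
        context.append((w, tokens[max(0, i - 10):i]))  # the 10 preceding tokens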
Say you have all your words in a list of words:
>>> words
['abc', 'def', 'gdi', 'asd', 'ew', 'd', 'ew', 'fdsa', 'dsa', 'aing', 'e', 'f', 'dsa', 'fe', 'dfa', 'e', 'd', 'fe', 'asd', 'fe', 'ting']
I would put them into a series and grab indices of relevant words:
words = pandas.Series(words)
idx = pandas.Series(words[words.apply(lambda x: x.endswith('ing'))].index)
>>> idx
0 9
1 20
dtype: int64
Now the values of idx are the indices of words ending in 'ing' in our original Series. Next we need to turn these values into ranges:
starts = idx - 10
ends = idx
Now we can index into the original series with these ranges (first though, clip with a lower bound of 0, in case an 'ing' word appears less than 10 words into a list):
starts = starts.clip(0)
df = pandas.DataFrame([{'word': words[e],
                        'Previous': ' '.join(words[s:e])}
                       for s, e in zip(starts, ends)])
>>> df
Previous word
0 abc def gdi asd ew d ew fdsa dsa aing
1 e f dsa fe dfa e d fe asd fe ting
Not exactly a one-liner, but it works.
Note: the reason 'aing' has only 9 words in its corresponding column is that it appeared fewer than 10 words into the fake list I made.
If you are asking how to do this algorithmically: to begin, I would maintain a queue of the previous 10 words at all times, plus a dataframe whose first column holds words ending in 'ing' and whose second column holds the 10 words preceding each such word (a minimal deque-based sketch follows below).
At the start of your program the queue is empty; for the first 10 words, simply enqueue each word. After that, each time you move forward in your loop, enqueue the current word and dequeue one, so the queue stays at size 10.
At each iteration, check whether the word ends with 'ing'. If it does, add a row to your dataframe with the word as the first item and the current state of the queue as the second.
In the end, you have a dataframe whose first column contains the words ending in 'ing' and whose second column contains the 10 words preceding each of them.
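Here is a minimal sketch of that queue idea using collections.deque, whose maxlen argument handles the dequeuing automatically; tokens is assumed to be the token list from the question:
from collections import deque

import pandas as pd

window = deque(maxlen=10)  # holds at most the 10 most recent tokens
rows = []
for w in tokens:
    if w.endswith('ing'):
        # snapshot the window BEFORE appending w, so w is not its own context
        rows.append({'Word': w, 'PreviousContext': ' '.join(window)})
    window.append(w)

df = pd.DataFrame(rows, columns=['Word', 'PreviousContext'])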
