Stopword removal with pandas - python

I would like to remove stopwords from a column of a data frame.
Inside the column there is text which needs to be split.
For example my data frame looks like this:
ID Text
1 eat launch with me
2 go outside have fun
I want to apply stopword removal to the text column, so the text needs to be split first.
I tried this:
for item in cached_stop_words:
    if item in df_from_each_file[['text']]:
        print(item)
        df_from_each_file['text'] = df_from_each_file['text'].replace(item, '')
So my output should be like this:
ID Text
1 eat launch
2 go fun
It means the stopwords have been deleted.
But my code does not work correctly. I also tried it the other way around, converting my data frame to a series and looping through that, but it also did not work.
Thanks for your help.

replace (by itself) isn't a good fit here, because you want to perform partial string replacement. You want regex-based replacement.
One simple solution, when you have a manageable number of stop words, is using str.replace.
import re

p = re.compile("({})".format('|'.join(map(re.escape, cached_stop_words))))
df['Text'] = df['Text'].str.lower().str.replace(p, '', regex=True)
df
ID Text
0 1 eat launch
1 2 outside have fun
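One caveat: the pattern above matches substrings, so a stop word like "me" would also be stripped out of the middle of longer words. A hedged variant, assuming the same cached_stop_words, anchors the alternation on word boundaries and tidies the leftover spaces:
p = re.compile(r"\b({})\b".format('|'.join(map(re.escape, cached_stop_words))))
df['Text'] = (df['Text'].str.lower()
                        .str.replace(p, '', regex=True)          # remove whole stop words only
                        .str.replace(r'\s+', ' ', regex=True)    # collapse doubled spaces left behind
                        .str.strip())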
If performance is important, use a list comprehension.
cached_stop_words = set(cached_stop_words)
df['Text'] = [' '.join([w for w in x.lower().split() if w not in cached_stop_words])
              for x in df['Text'].tolist()]
df
ID Text
0 1 eat launch
1 2 outside have fun

Related

How to extract the top N rows from a dataframe with most frequent occurrences of a word in a list?

I have a Python dataframe with multiple rows and columns, a sample of which I have shared below -
DocName  Content
Doc1     Hi how you are doing ? Hope you are well. I hear the food is great!
Doc2     The food is great. James loves his food. You not so much right ?
Doc3     Yeah he is alright.
I also have a list of 100 words as follows -
list = [food, you, ....]
Now, I need to extract the top N rows with the most frequent occurrences of each word from the list in the "Content" column. For the given sample of data,
"food" occurs twice in Doc2 and once in Doc1.
"you" occurs twice in Doc 1 and once in Doc 2.
Hence, desired output is :
[food:[doc2, doc1], you:[doc1, doc2], .....]
where N = 2 (top 2 rows having the most frequent occurrence of each word)
I have tried something as follows but unsure how to move further -
list = [food, you, ....]
result = []
for word in list:
    result.append(df.Content.apply(lambda row: sum([row.count(word)])))
How can I implement an efficient solution to the above requirement in Python ?
Second attempt (initially I misunderstood your requirements): with df being your dataframe, you could try something like:
words = ["food", "you"]
n = 2 # Number of top docs
res = (
    df
    .assign(Content=df["Content"].str.casefold().str.findall(r"\w+"))
    .explode("Content")
    .loc[lambda df: df["Content"].isin(set(words))]
    .groupby("DocName").value_counts().rename("Counts")
    .sort_values(ascending=False).reset_index(level=0)
    .assign(DocName=lambda df: df["DocName"] + "_" + df["Counts"].astype("str"))
    .groupby(level=0).agg({"DocName": list})
    .assign(DocName=lambda df: df["DocName"].str[:n])
    .to_dict()["DocName"]
)
The first 3 lines in the pipeline extract the relevant words, one per row. For the sample that looks like:
DocName Content
0 Doc1 you
0 Doc1 you
0 Doc1 food
1 Doc2 food
1 Doc2 food
1 Doc2 you
The next lines count the words per doc (.groupby and .value_counts), sort the result by the counts in descending order (.sort_values), and append the count to the doc names. For the sample:
DocName Counts
Content
you Doc1_2 2
food Doc2_2 2
food Doc1_1 1
you Doc2_1 1
Then .groupby the words (index) and put the respective docs in a list via .agg, and restrict the list to the n first items (.str[:n]). For the sample:
DocName
Content
food [Doc2_2, Doc1_1]
you [Doc1_2, Doc2_1]
Finally dumping the result in a dictionary.
Result for the sample dataframe
DocName Content
0 Doc1 Hi how you are doing ? Hope you are well. I hear the food is great!
1 Doc2 The food is great. James loves his food. You not so much right ?
2 Doc3 Yeah he is alright.
is
{'food': ['Doc2_2', 'Doc1_1'], 'you': ['Doc1_2', 'Doc2_1']}
It seems like this problem can be broken down into two sub-problems:
Get the frequency of words per "Content" cell
For each word in the list, extract the top N rows
Luckily, the first sub-problem has many neat approaches, as shown here. TL;DR: use the collections library to do a frequency count; or, if you aren't allowed to import libraries, call ".split()" and count in a loop. But again, there are many potential solutions.
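As a minimal sketch of that first sub-problem (assuming the sample dataframe df from the question and collections.Counter), one way to get per-row counts is:
from collections import Counter
import pandas as pd

words = ["food", "you"]  # stand-in for the 100-word list from the question

# One Counter of word frequencies per row of the "Content" column.
counts_per_row = df["Content"].str.lower().str.findall(r"\w+").apply(Counter)

# Frequency of each target word in each row; a Counter returns 0 for missing keys.
freq = pd.DataFrame({w: counts_per_row.map(lambda c: c[w]) for w in words})
freq.index = df["DocName"].values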
The second sub-problem is a bit trickier. From our first solution, what we have now is a dictionary of frequency counts, per row. To get to our desired answer, the naive method would be to "query" every dictionary for the word in question.
E.g. run
doc1.dict["food"]
doc2.dict["food"]
...
and compare the results in order.
There should be enough to get going, and also opportunity to find more streamlined/elegant solutions. Best of luck!
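For the second sub-problem, continuing the hedged freq sketch above, the N most frequent documents per word can then be read off with nlargest:
n = 2

# For each word, the n document names with the highest count, most frequent first.
top_docs = {w: freq[w].nlargest(n).index.tolist() for w in words}
# e.g. {'food': ['Doc2', 'Doc1'], 'you': ['Doc1', 'Doc2']}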

NLP preprocessing remove all words in string not found in my list

I've made a list of important_words, and I have a dataframe with a column df['reviews'] that has one string of review text per row (thousands of rows). I want to update 'reviews' by removing everything that is not in the important_words list from the string, like the opposite of removing stop words, so that I am only left with the important_words in every review (row) of the df.
Also, later in my starter code I tokenize and normalize the column df['reviews']; it seems like applying a method to this column should make everything easier, since punctuation removal and lowercasing have already been applied. I'll try whichever method someone can share, thanks.
important_words = ['actor', 'action', 'awesome']
df['reviews'][1] = 'The actor, in the action movie was awesome'
df['reviews'][2] = 'The action movie was not good'
....
df['tokenized_normalized_reviews'][1] = [the,actor,in,the,action,movie,was,awesome]
df['tokenized_normalized_reviews'][2] = [the, action, movie, was, not, good]
I want:
df['review_important_words'][1] = 'actor, action, awesome'
df['review_important_words'][2] = 'action'
< either str or applied to the tokenized column>
df['reviews'] = df['reviews'].apply(lambda x: ' '.join([word for word in x.split() if word in (important_words)]))
You can do it like this using pandas. Applying the function would make it work for all the elements of this column.
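If you prefer to start from the already tokenized column you mention, a similar sketch (column names assumed from your question) keeps only the important tokens per review:
important_words = set(['actor', 'action', 'awesome'])  # set membership is fast for long lists

# Keep only the important tokens per review and join them as in the desired output.
df['review_important_words'] = df['tokenized_normalized_reviews'].apply(
    lambda tokens: ', '.join(w for w in tokens if w in important_words))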

remove words starting with "#" in a column from a dataframe

I have a dataframe called tweetscrypto and I am trying to remove all the words from the column "text" starting with the character "#" and gather the result in a new column "clean_text". The rest of the words should stay exactly the same:
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(filter(lambda x:x[0]!='#', x.split()))
it does not seem to work. Can somebody help?
Thanks in advance
Use str.replace to remove the strings starting with #.
Sample Data
text
0 News via #livemint: #RBI bars banks from links
1 Newsfeed from #oayments_source: How Africa
2 is that bitcoin? not my thing
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(\#\w+.*?)', "", regex=True)
Still, you can match # without escaping it, as noted by @baxx:
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(#\w+.*?)', "", regex=True)
clean_text
0 News via : bars banks from links
1 Newsfeed from : How Africa
2 is that bitcoin? not my thing
In this case it might be better to define a method rather than using a lambda for mainly readability purposes.
def clean_text(X):
    X = X.split()
    X_new = [x for x in X if not x.startswith("#")]
    return ' '.join(X_new)

tweetscrypto['clean_text'] = tweetscrypto['text'].apply(clean_text)

Trying to remove my own stopwords using join/split

I am trying to remove stopwords in df['Sentences'] as I would need to plot it.
My sample is
... 13
London 12
holiday 11
photo 7
. 7
..
walk 1
dogs 1
I have built my own dictionary and I would like to use it to remove the stop_words from that list.
What I have done is as follows:
import matplotlib.pyplot as plt
df['Sentences'] = df['Sentences'].apply(lambda x: ' '.join([item for item in x.split() if item not in my_dict]))
w_freq=df.Sentences.str.split(expand=True).stack().value_counts()
Although it does not give me any error, the stopwords and punctuation are still there. Also, I would like to not alter the column, but just look at the results for a short analysis (for example, by creating a copy of the original column).
How could I remove them?
Let's assume you have this dataframe with this really interesting conversation.
df = pd.DataFrame({'Sentences': ['Hello, how are you?',
                                 'Hello, I am fine. Have you watched the news',
                                 'Not really the news ...']})
print (df)
Sentences
0 Hello, how are you?
1 Hello, I am fine. Have you watched the news
2 Not really the news ...
Now, to remove the punctuation and the stopwords from my_dict, you can do it like this:
my_dict = ['a','i','the','you', 'am', 'are', 'have']
s = (df['Sentences'].str.lower()                  # to prevent any case problem
     .str.replace(r'[^\w\s]+', '', regex=True)    # remove the punctuation
     .str.split(' ')                              # create a list of words
     .explode()                                   # create a row per word of the lists
     .value_counts()                              # get occurrences
     )
s = s[~s.index.isin(my_dict)] #remove the the stopwords
print (s) #you can see you don't have punctuation nor stopwords
news 2
hello 2
watched 1
fine 1
not 1
really 1
how 1
1
Name: Sentences, dtype: int64
This might not be the fastest way, though.
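If speed ever becomes a concern, a hedged alternative (same df and my_dict as above) is to count everything in one pass with collections.Counter and drop the stop words afterwards:
from collections import Counter

# Count every word in the column in one pass, then discard the stop words.
cleaned = df['Sentences'].str.lower().str.replace(r'[^\w\s]+', '', regex=True)
counts = Counter(w for sentence in cleaned for w in sentence.split())
for word in my_dict:
    counts.pop(word, None)
print(counts.most_common())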

Replace specific words by user dictionary and others by 0

So I have a review dataset having reviews like
Simply the best. I bought this last year. Still using. No problems
faced till date.Amazing battery life. Works fine in darkness or broad
daylight. Best gift for any book lover.
(This is from the original dataset, I have removed all punctuation and have all lower case in my processed dataset)
What I want to do is replace some words by 1(as per my dictionary) and others by 0.
My dictionary is
dict = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}
I want my output like:
0010000000000001000000000100000
I have used this code:
df['newreviews'] = df['reviews'].map(dict).fillna("0")
This always returns 0 as output. I did not want this so I took 1s and 0s as strings, but despite that I'm getting the same result.
Any suggestions how to solve this?
First, don't use dict as a variable name, because it shadows a Python builtin. Then use a list comprehension with get to replace non-matched values with 0.
Notice:
If the data look like date.Amazing, with no space after the punctuation, it is necessary to replace the punctuation with whitespace.
df = pd.DataFrame({'reviews':['Simply the best. I bought this last year. Still using. No problems faced till date.Amazing battery life. Works fine in darkness or broad daylight. Best gift for any book lover.']})
d = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}
df['reviews'] = df['reviews'].str.replace(r'[^\w\s]+', ' ', regex=True).str.lower()
df['newreviews'] = [''.join(d.get(y, '0') for y in x.split()) for x in df['reviews']]
Alternative:
df['newreviews'] = df['reviews'].apply(lambda x: ''.join(d.get(y, '0') for y in x.split()))
print (df)
reviews \
0 simply the best i bought this last year stil...
newreviews
0 0011000000000001000000000100000
You can do:
# clean the sentence
import re
sent = re.sub(r'\.','',sent)
# convert to list
sent = sent.lower().split()
# get values from dict using comprehension
new_sent = ''.join([str(1) if x in mydict else str(0) for x in sent])
print(new_sent)
'001100000000000000000000100000'
You can do it by
df.replace(repl, regex=True, inplace=True)
where df is your dataframe and repl is your dictionary.
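A minimal, hedged illustration of that idea with the question's dictionary d, assuming the reviews have already been lowercased and stripped of punctuation as in the answers above (note it only substitutes the dictionary words; the remaining words would still need to be mapped to 0 in a separate step, e.g. with the list comprehension shown earlier):
# With regex=True every key is treated as a pattern; wrapping the keys in \b word
# boundaries (an extra assumption, not part of the answer) avoids substituting
# inside longer words.
repl = {r'\b{}\b'.format(k): v for k, v in d.items()}
df['newreviews'] = df['reviews'].replace(repl, regex=True)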
