Most efficient way to remove list elements from column - python

I have a large DataFrame df with 40k+ rows:
words
0 coffee in the morning
1 good morning
2 hello my name is
3 hello world
...
I have a list of English stopwords that I want to remove from the column, which I do as follows:
df["noStopwords"] = df["words"].apply(lambda x: [i for i in x if i not in stopwords])
noStopwords
0 coffee morning
1 good morning
2 hello name
3 hello world
...
This works but takes too long. Is there an efficient way of doing this?
EDIT:
print(stopwords)
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across', ...]

Not sure if it's any faster, but you could try str.replace. Since words is actually a list, we also have to join and split:
import re
pattern = rf"\b(?:{'|'.join(map(re.escape, stopwords))})\b"
df['noStopwords'] = (df.words
                       .str.join(' ')
                       .str.replace(pattern, '', regex=True)
                       .str.strip()
                       .str.split(r'\s+')
                     )
Note the .str.strip() before splitting: removing a leading or trailing stopword would otherwise leave whitespace that splits into empty strings.
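Before reaching for regex, note that the original apply approach is often slow simply because stopwords is a list, so every membership test scans it linearly. A minimal sketch of the same apply with a set (the four-row frame is made up to stand in for the real data):

```python
import pandas as pd

df = pd.DataFrame({"words": [["coffee", "in", "the", "morning"],
                             ["good", "morning"],
                             ["hello", "my", "name", "is"],
                             ["hello", "world"]]})
stopwords = ["in", "the", "my", "is"]

# Membership tests against a list are O(n) per word; a set makes
# them O(1), which is usually the biggest win on 40k+ rows.
stopword_set = set(stopwords)
df["noStopwords"] = df["words"].apply(
    lambda ws: [w for w in ws if w not in stopword_set])
print(df["noStopwords"].tolist())
```

This keeps the per-row list structure intact, so no join/split round-trip is needed.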

Related

Check if a substring is in a string python

I have two dataframes. I need to check whether each word from the first df appears as a substring in the strings of the second df, and get a list of the words that occur in the second df.
First df(word):
word
apples
dog
cat
cheese
Second df(sentence):
sentence
apples grow on a tree
...
I love cheese
I tried this one:
tru = []
for i in word['word']:
    if i in sentence['sentence'].values:
        tru.append(i)
And this one:
tru = []
for i in word['word']:
    if sentence['sentence'].str.contains(i):
        tru.append(i)
I expect to get a list like ['apples',..., 'cheese']
One possible way is to use Series.str.extractall:
import re
import pandas as pd
df_word = pd.Series(["apples", "dog", "cat", "cheese"])
df_sentence = pd.Series(["apples grow on a tree", "i love cheese"])
pattern = f"({'|'.join(df_word.apply(re.escape))})"
matches = df_sentence.str.extractall(pattern)
matches
Output:
                0
  match
0 0        apples
1 0        cheese
You can then convert the results to a list:
matches[0].unique().tolist()
Output:
['apples', 'cheese']
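One caveat with this pattern: without word boundaries, "cat" would also match inside "catalog" or "cats". A variant with \b anchors (the third sentence is made up for illustration):

```python
import re
import pandas as pd

df_word = pd.Series(["apples", "dog", "cat", "cheese"])
df_sentence = pd.Series(["apples grow on a tree",
                         "i love cheese",
                         "catalog of cats"])  # would falsely match "cat" without \b

# \b anchors restrict matches to whole words.
pattern = rf"\b({'|'.join(df_word.map(re.escape))})\b"
matches = df_sentence.str.extractall(pattern)
result = matches[0].unique().tolist()
print(result)  # ['apples', 'cheese']
```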

How do I replace multiple words within each row of a column that contains full sentences?

I have a data frame (let's call it 'littletext') that has a column with sentences within each row. I also have another table (let's call it 'littledict') that I would like to use as a reference by which to find and replace words and/or phrases within each row of 'littletext'.
Here are my two data frames. I am hard-coding them in this example but will load them as csv files in "real life":
raw_text = {
"text": ["Hello, world!", "Hello, how are you?", "This world is funny!"],
"col2": [0,1,1]}
littletext = pd.DataFrame(raw_text, index = pd.Index(['A', 'B', 'C'], name='letter'), columns = pd.Index(['text', 'col2'], name='attributes'))
raw_dict = {
"key": ["Hello", "This", "funny"],
"replacewith": ["Hi", "That", "hilarious"]}
littledict = pd.DataFrame(raw_dict, index = pd.Index(['a','b','c'], name='letter'), columns = pd.Index(['key', 'replacewith'], name='attributes'))
print(littletext) # ignore 'col2' since it is irrelevant in this discussion
text col2
A Hello, world! 0
B Hello, how are you? 1
C This world is funny! 1
print(littledict)
key replacewith
a Hello Hi
b This That
c funny hilarious
I would like to have 'littletext' modified as per below wherein Python will look at more than one word within each sentence of my 'littletext' table (dataframe) and replace multiple words, acting on all rows. The final product should be that 'Hello' has been replaced by 'Hi' in lines A and B, and 'That' was replaced with 'This' and 'funny' was replaced with 'hilarious', both within line C:
text col2
A Hi, world! 0
B Hi, how are you? 1
C That world is hilarious! 1
Here are two attempts that I have tried, neither of which works. They are not generating errors; they are just not modifying 'littletext' as I described above. Attempt #1 'technically' works but is inefficient and therefore useless for large-scale jobs, because I would have to anticipate and program every possible sentence I would need to replace. Attempt #2 simply does not change anything at all.
My two Attempts that do NOT work are:
Attempt #1: this is not helpful because to use it, I would have to program entire sentences to replace other sentences, which is pointless:
littletext['text'].replace({'Hello, world!': 'Hi there, world.', 'This world is funny!': 'That world is hilarious'})
Attempt #1 returns:
Out[125]:
0 Hi there, world.
1 Hello, how are you?
2 That world is hilarious
Name: text, dtype: object
Attempt #2: this attempt is closer to the mark but returns no changes whatsoever:
for key in littledict:
    littletext = littletext.replace(key, littledict[key])
Attempt #2 returns:
text col2
0 Hello, world! 0
1 Hello, how are you? 1
2 This world is funny! 1
I have scoured the internet, including Youtube, Udemy, etc., but to no avail. Numerous 'tutorial' sites only cover individual text examples, not entire columns of sentences like the example I am showing and are therefore useless in scaling up to industry-size projects. I am hoping someone can graciously shed light on this since this kind of text manipulation is commonplace in many industry settings.
My humble thanks and appreciation to anyone who can help!!
Convert littledict to a dict so you can generate a regex from its keys, then use the regex in .str.replace() with a callable replacement:
s = dict(zip(littledict.key, littledict.replacewith))
littletext['text'].str.replace('|'.join(s), lambda x: s[x.group()], regex=True)
0 Hi, world!
1 Hi, how are you?
2 That world is hilarious!
Name: text, dtype: object
You were pretty close with the first attempt. You can create the dictionary from littledict with key as the index and use regex=True.
print(littletext['text']
      .replace(littledict.set_index('key')['replacewith'].to_dict(),
               regex=True))
0 Hi, world!
1 Hi, how are you?
2 That world is hilarious!
Name: text, dtype: object
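Both answers share one subtlety: a plain alternation like Hello|This|funny will also rewrite those letters inside longer words (e.g. the "This" in "Thistle"). A sketch combining the dict approach with \b word boundaries, assuming the same littletext/littledict frames:

```python
import re
import pandas as pd

littletext = pd.DataFrame({"text": ["Hello, world!",
                                    "Hello, how are you?",
                                    "This world is funny!"]})
littledict = pd.DataFrame({"key": ["Hello", "This", "funny"],
                           "replacewith": ["Hi", "That", "hilarious"]})

mapping = dict(zip(littledict.key, littledict.replacewith))
# \b anchors keep keys from matching inside longer words;
# re.escape guards against keys containing regex metacharacters.
pattern = r"\b(" + "|".join(map(re.escape, mapping)) + r")\b"
out = littletext["text"].str.replace(pattern,
                                     lambda m: mapping[m.group()],
                                     regex=True)
print(out.tolist())
```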

Trying to remove my own stopwords using join/split

I am trying to remove stopwords in df['Sentences'] as I would need to plot it.
My sample is
... 13
London 12
holiday 11
photo 7
. 7
..
walk 1
dogs 1
I have built my own dictionary and I would like to use it to remove the stop_words from that list.
What I have done is as follows:
import matplotlib.pyplot as plt
df['Sentences'] = df['Sentences'].apply(lambda x: ' '.join([item for item in x.split() if item not in my_dict]))
w_freq=df.Sentences.str.split(expand=True).stack().value_counts()
Although it does not give me any error, the stopwords and punctuation are still there. Also, I would like to leave the column unaltered and just look at the results for a short analysis (for example, by creating a copy of the original column).
How could I remove them?
Let's assume you have this dataframe with this really interesting conversation.
df = pd.DataFrame({'Sentences':['Hello, how are you?',
'Hello, I am fine. Have you watched the news',
'Not really the news ...']})
print (df)
Sentences
0 Hello, how are you?
1 Hello, I am fine. Have you watched the news
2 Not really the news ...
Now, to remove the punctuation and the stopwords in my_dict, you can do it like this:
my_dict = ['a', 'i', 'the', 'you', 'am', 'are', 'have']
s = (df['Sentences'].str.lower()                   # prevent any case problems
     .str.replace(r'[^\w\s]+', '', regex=True)     # remove the punctuation
     .str.split(' ')                               # create a list of words
     .explode()                                    # create a row per word of the lists
     .value_counts()                               # get occurrences
    )
s = s[~s.index.isin(my_dict)]  # remove the stopwords
print(s)  # no punctuation or stopwords left
news 2
hello 2
watched 1
fine 1
not 1
really 1
how 1
1
Name: Sentences, dtype: int64
This might not be the fastest way, though.
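Since the question also asks to leave the original column untouched, one way, under the same assumptions, is to build the cleaned data as a separate Series; splitting with str.split() (no separator argument) also avoids the empty-string token visible in the output above:

```python
import pandas as pd

my_dict = ['a', 'i', 'the', 'you', 'am', 'are', 'have']
df = pd.DataFrame({'Sentences': ['Hello, how are you?',
                                 'Hello, I am fine. Have you watched the news',
                                 'Not really the news ...']})

stop = set(my_dict)
# Build a separate Series, leaving df['Sentences'] untouched.
cleaned = (df['Sentences'].str.lower()
           .str.replace(r'[^\w\s]+', '', regex=True)
           .str.split()   # no separator: runs of spaces yield no '' tokens
           .apply(lambda ws: [w for w in ws if w not in stop]))
w_freq = cleaned.explode().value_counts()
print(w_freq.to_dict())
```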

Count number of different rows in which each word appears

I have a Pandas DataFrame (or a Series, given that I'm just using one column) that contains strings. I also have a list of words. For each word in this list, I want to check how many different rows it appears in at least once. For example:
words = ['hi', 'bye', 'foo', 'bar']
df = pd.Series(["hi hi hi bye foo",
"bye bye bye bye",
"bar foo hi bar",
"hi bye foo bar"])
In this case, the output should be
0 hi 3
1 bye 3
2 foo 3
3 bar 2
Because "hi" appears in three different rows (1st, 3rd and 4th), "bar" appears in two (3rd and 4th), and so on.
I came up with the following way to do this:
word_appearances = {}
for word in words:
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances.update({word: appearances})
pd.DataFrame(word_appearances.items())
This works fine, but the problem is that I have a rather long list of words (around 40,000), around 30,000 rows to check and strings that are not as short as the ones I used in the example. When I try my approach with my real data, it takes forever to run. Is there a way to do this in a more efficient way?
Try a list comprehension with str.contains and sum:
df_out = pd.DataFrame([[word, sum(df.str.contains(word))] for word in words],
columns=['word', 'word_count'])
Out[58]:
word word_count
0 hi 3
1 bye 3
2 foo 3
3 bar 2
word_appearances = {}
for word in words:
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances[word] = appearances
pd.DataFrame.from_dict(word_appearances, columns=['Frequency'], orient='index')
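Both answers still run one regex scan per word, which is what makes 40,000 words slow. If matching whole whitespace-separated tokens is acceptable (unlike str.contains, this will not find "hi" inside "this"), tokenizing every row once and counting distinct (row, word) pairs avoids the per-word scans entirely; a sketch:

```python
import pandas as pd

words = ['hi', 'bye', 'foo', 'bar']
df = pd.Series(["hi hi hi bye foo",
                "bye bye bye bye",
                "bar foo hi bar",
                "hi bye foo bar"])

# One row per (row index, token); drop_duplicates keeps each
# (row, word) pair once, so value_counts counts rows, not occurrences.
pairs = df.str.split().explode().reset_index().drop_duplicates()
counts = pairs[0].value_counts().reindex(words, fill_value=0)
print(counts.to_dict())  # {'hi': 3, 'bye': 3, 'foo': 3, 'bar': 2}
```

This scans the text once regardless of how many words are in the list, instead of once per word.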

Faster way to flatten list in Pandas Dataframe

I have a dataframe below:
import pandas
df = pandas.DataFrame({"terms" : [[['the', 'boy', 'and', 'the goat'],['a', 'girl', 'and', 'the cat']], [['fish', 'boy', 'with', 'the dog'],['when', 'girl', 'find', 'the mouse'], ['if', 'dog', 'see', 'the cat']]]})
My desired outcome is as follows:
df2 = pandas.DataFrame({"terms" : ['the boy and the goat','a girl and the cat', 'fish boy with the dog','when girl find the mouse', 'if dog see the cat']})
Is there a simple way to accomplish this without having to use a for loop to iterate through each row for each element and substring:
result = pandas.DataFrame()
for i in range(len(df.terms.tolist())):
    x = df.terms.tolist()[i]
    for y in x:
        z = str(y).replace(",", '').replace("'", '').replace('[', '').replace(']', '')
        flattened = pandas.DataFrame({'flattened_term': [z]})
        result = result.append(flattened)
print(result)
Thank you.
There is certainly no way to avoid loops here, at least not implicitly. Pandas is not built to handle list objects as elements; it deals magnificently with numeric data, and pretty well with strings. In any case, your fundamental problem is that you are calling pd.DataFrame.append in a loop, which is a quadratic-time algorithm (the entire data frame is re-created on each iteration). You can probably just get away with the following, and it should be significantly faster:
>>> df
terms
0 [[the, boy, and, the goat], [a, girl, and, the...
1 [[fish, boy, with, the dog], [when, girl, find...
>>> pandas.DataFrame([' '.join(term) for row in df.itertuples() for term in row.terms])
0
0 the boy and the goat
1 a girl and the cat
2 fish boy with the dog
3 when girl find the mouse
4 if dog see the cat
>>>
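On pandas 0.25+, Series.explode offers an alternative that reads the same way: explode lifts each inner list to its own row, and .str.join collapses each word list into a sentence. A sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame({"terms": [[['the', 'boy', 'and', 'the goat'],
                              ['a', 'girl', 'and', 'the cat']],
                             [['fish', 'boy', 'with', 'the dog'],
                              ['when', 'girl', 'find', 'the mouse'],
                              ['if', 'dog', 'see', 'the cat']]]})

# explode: one row per inner list; str.join: list of words -> sentence.
df2 = (df["terms"].explode()
                  .str.join(" ")
                  .reset_index(drop=True)
                  .to_frame())
print(df2["terms"].tolist())
```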
