I have a Pandas DataFrame (or a Series, given that I'm just using one column) that contains strings. I also have a list of words. For each word in this list, I want to check how many different rows it appears in at least once. For example:
words = ['hi', 'bye', 'foo', 'bar']
df = pd.Series(["hi hi hi bye foo",
                "bye bye bye bye",
                "bar foo hi bar",
                "hi bye foo bar"])
In this case, the output should be
0 hi 3
1 bye 3
2 foo 3
3 bar 2
Because "hi" appears in three different rows (1st, 3rd and 4th), "bar" appears in two (3rd and 4th), and so on.
I came up with the following way to do this:
word_appearances = {}
for word in words:
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances.update({word: appearances})

pd.DataFrame(word_appearances.items())
This works fine, but the problem is that I have a rather long list of words (around 40,000), around 30,000 rows to check and strings that are not as short as the ones I used in the example. When I try my approach with my real data, it takes forever to run. Is there a way to do this in a more efficient way?
Try a list comprehension with str.contains and sum:
df_out = pd.DataFrame([[word, sum(df.str.contains(word))] for word in words],
                      columns=['word', 'word_count'])
Out[58]:
word word_count
0 hi 3
1 bye 3
2 foo 3
3 bar 2
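Note that str.contains treats each word as a regular expression and also matches substrings (e.g. 'hi' inside 'ship'). If that matters, a small variant of the above (a sketch, assuming the words are plain text) escapes the pattern and adds word boundaries; summing the boolean mask counts the matching rows directly:

import re
df_out = pd.DataFrame(
    [[word, df.str.contains(rf"\b{re.escape(word)}\b").sum()] for word in words],
    columns=['word', 'word_count'])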
Alternatively, keeping the original loop but building the output with from_dict:

word_appearances = {}
for word in words:
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances[word] = appearances

pd.DataFrame.from_dict(word_appearances, columns=['Frequency'], orient='index')
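With ~40,000 words and ~30,000 rows, scanning the whole Series once per word is the real bottleneck in all of the versions above. A minimal sketch of a single-pass alternative (assuming whole-word matches and simple \w+ tokenization are acceptable) tokenizes each row once, drops duplicate words within a row, and counts rows per word:

import pandas as pd

tokens = df.str.findall(r"\w+").explode().rename("word")   # one token per output row, index = original row
row_word = tokens.reset_index().drop_duplicates()          # keep each (row, word) pair only once
counts = row_word["word"].value_counts()                   # number of rows containing each word
result = pd.DataFrame({"word": words,
                       "word_count": [counts.get(w, 0) for w in words]})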
Related
I have a pandas DataFrame with multiple rows and columns, a sample of which I have shared below -
DocName  Content
Doc1     Hi how you are doing ? Hope you are well. I hear the food is great!
Doc2     The food is great. James loves his food. You not so much right ?
Doc3     Yeah he is alright.
I also have a list of 100 words as follows -
word_list = ["food", "you", ...]
Now, I need to extract the top N rows with the most frequent occurrences of each word from the list in the "Content" column. For the given sample of data,
"food" occurs twice in Doc2 and once in Doc1.
"you" occurs twice in Doc1 and once in Doc2.
Hence, desired output is :
[food:[doc2, doc1], you:[doc1, doc2], .....]
where N = 2 (top 2 rows having the most frequent occurrence of each word)
I have tried something as follows but am unsure how to move further -
word_list = ["food", "you", ...]
result = []
for word in word_list:
    result.append(df.Content.apply(lambda row: sum([row.count(word)])))
How can I implement an efficient solution to the above requirement in Python ?
Second attempt (initially I misunderstood your requirements): with df as your DataFrame, you could try something like:
words = ["food", "you"]
n = 2 # Number of top docs
res = (
df
.assign(Content=df["Content"].str.casefold().str.findall(r"\w+"))
.explode("Content")
.loc[lambda df: df["Content"].isin(set(words))]
.groupby("DocName").value_counts().rename("Counts")
.sort_values(ascending=False).reset_index(level=0)
.assign(DocName=lambda df: df["DocName"] + "_" + df["Counts"].astype("str"))
.groupby(level=0).agg({"DocName": list})
.assign(DocName=lambda df: df["DocName"].str[:n])
.to_dict()["DocName"]
)
The first 3 lines in the pipeline extract the relevant words, one per row. For the sample that looks like:
DocName Content
0 Doc1 you
0 Doc1 you
0 Doc1 food
1 Doc2 food
1 Doc2 food
1 Doc2 you
The next lines count the words per doc (.groupby and .value_counts), sort the result by the counts in descending order (.sort_values), and append each count to its doc name (.assign). For the sample:
DocName Counts
Content
you Doc1_2 2
food Doc2_2 2
food Doc1_1 1
you Doc2_1 1
Then .groupby the words (the index), put the respective docs in a list via .agg, and restrict each list to the first n items (.str[:n]). For the sample:
DocName
Content
food [Doc2_2, Doc1_1]
you [Doc1_2, Doc2_1]
Finally dumping the result in a dictionary.
Result for the sample dataframe
DocName Content
0 Doc1 Hi how you are doing ? Hope you are well. I hear the food is great!
1 Doc2 The food is great. James loves his food. You not so much right ?
2 Doc3 Yeah he is alright.
is
{'food': ['Doc2_2', 'Doc1_1'], 'you': ['Doc1_2', 'Doc2_1']}
It seems like this problem can be broken down into two sub-problems:
Get the frequency of words per "Content" cell
For each word in the list, extract the top N rows
Luckily, the first sub-problem has many neat approaches, as shown here. TL;DR: use collections.Counter to do a frequency count; or, if you aren't allowed to import libraries, call .split() and count in a loop. But again, there are many potential solutions.
The second sub-problem is a bit trickier. From our first solution, what we have now is a dictionary of frequency counts, per row. To get to our desired answer, the naive method would be to "query" every dictionary for the word in question.
E.g. run
doc1.dict["food"]
doc2.dict["food"]
...
and compare the results in order.
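A minimal sketch of that idea, using collections.Counter for the per-row counts and pandas only for the final ranking (the freqs and top_docs names are just for illustration):

from collections import Counter

words = ["food", "you"]
n = 2

# sub-problem 1: a frequency count per row of the "Content" column
freqs = df["Content"].str.lower().str.findall(r"\w+").apply(Counter)

# sub-problem 2: for each word, rank the rows by that word's count and keep the top n
top_docs = {}
for w in words:
    per_row = freqs.apply(lambda c: c[w])            # count of w in each row (0 if absent)
    best = per_row[per_row > 0].nlargest(n).index    # row labels of the top n rows
    top_docs[w] = df.loc[best, "DocName"].tolist()

print(top_docs)   # for the sample: {'food': ['Doc2', 'Doc1'], 'you': ['Doc1', 'Doc2']}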
That should be enough to get going, and there is also room to find more streamlined/elegant solutions. Best of luck!
I have a large DataFrame df with 40k+ rows:
words
0 coffee in the morning
1 good morning
2 hello my name is
3 hello world
...
I have a list of English stopwords that I want to remove from the column, which I do as follows:
df["noStopwords"] = df["words"].apply(lambda x: [i for i in x if i not in stopwords])
noStopwords
0 coffee morning
1 good morning
2 hello name
3 hello world
...
This works but takes too long. Is there an efficient way of doing this?
EDIT:
print(stopwords)
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across', ...]
Not sure if it's any faster, but you could try str.replace. Since each cell in words is actually a list, we also have to join and split.
import re
pattern = rf"\b(?:{'|'.join(map(re.escape, stopwords))})\b"
df['noStopwords'] = (df.words
                       .str.join(' ')
                       .str.replace(pattern, '', regex=True)
                       .str.split(r'\s+')
                     )
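An alternative sketch, assuming each cell in words really is a list of tokens: convert stopwords to a set so each membership test is O(1) and filter the lists directly. Both this and the regex version above are worth timing on the real data.

stop_set = set(stopwords)   # constant-time membership tests instead of scanning a list
df['noStopwords'] = [[w for w in row if w not in stop_set] for row in df['words']]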
I have a data frame (let's call it 'littletext') that has a column with sentences within each row. I also have another table (let's call it 'littledict') that I would like to use as a reference by which to find and replace words and/or phrases within each row of 'littletext'.
Here are my two data frames. I am hard-coding them in this example but will load them as csv files in "real life":
raw_text = {
    "text": ["Hello, world!", "Hello, how are you?", "This world is funny!"],
    "col2": [0, 1, 1]}
littletext = pd.DataFrame(raw_text, index=pd.Index(['A', 'B', 'C'], name='letter'),
                          columns=pd.Index(['text', 'col2'], name='attributes'))

raw_dict = {
    "key": ["Hello", "This", "funny"],
    "replacewith": ["Hi", "That", "hilarious"]}
littledict = pd.DataFrame(raw_dict, index=pd.Index(['a', 'b', 'c'], name='letter'),
                          columns=pd.Index(['key', 'replacewith'], name='attributes'))
print(littletext) # ignore 'col2' since it is irrelevant in this discussion
text col2
A Hello, world! 0
B Hello, how are you? 1
C This world is funny! 1
print(littledict)
key replacewith
a Hello Hi
b This That
c funny hilarious
I would like to have 'littletext' modified as per below, wherein Python will look at more than one word within each sentence of my 'littletext' table (dataframe) and replace multiple words, acting on all rows. The final product should be that 'Hello' has been replaced by 'Hi' in lines A and B, while 'This' has been replaced with 'That' and 'funny' with 'hilarious', both within line C:
text col2
A Hi, world! 0
B Hi, how are you? 1
C That world is hilarious! 1
Here are two attempts that I have tried, but neither of them works. They are not generating errors, they are just not modifying 'littletext' as I described above. Attempt #1 'technically' works, but it is inefficient and therefore useless for large-scale jobs because I would have to anticipate and program every possible sentence I would need to replace. Attempt #2 simply does not change anything at all.
My two Attempts that do NOT work are:
Attempt #1: this is not helpful because to use it, I would have to program entire sentences to replace other sentences, which is pointless:
littletext['text'].replace({'Hello, world!': 'Hi there, world.', 'This world is funny!': 'That world is hilarious'})
Attempt #1 returns:
Out[125]:
0 Hi there, world.
1 Hello, how are you?
2 That world is hilarious
Name: text, dtype: object
Attempt #2: this attempt is closer to the mark but returns no changes whatsoever:
for key in littledict:
    littletext = littletext.replace(key, littledict[key])
Attempt #2 returns:
text col2
0 Hello, world! 0
1 Hello, how are you? 1
2 This world is funny! 1
I have scoured the internet, including YouTube, Udemy, etc., but to no avail. Numerous 'tutorial' sites only cover individual text examples, not entire columns of sentences like the example I am showing, and are therefore useless for scaling up to industry-size projects. I am hoping someone can graciously shed light on this, since this kind of text manipulation is commonplace in many industry settings.
My humble thanks and appreciation to anyone who can help!!
Convert littledict to a dict so you can generate a regex, and use the regex in .str.replace() to replace the words you need, as follows:
s = dict(zip(littledict.key, littledict.replacewith))
littletext['text'].str.replace('|'.join(s), lambda x: s[x.group()], regex=True)
0 Hi, world!
1 Hi, how are you?
2 That world is hilarious!
Name: text, dtype: object
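If any of the keys could contain regex metacharacters, a safer variant of the same idea (just a sketch; the keys in this sample are plain words, so it is optional here) escapes them before joining:

import re
pattern = '|'.join(map(re.escape, s))
littletext['text'].str.replace(pattern, lambda x: s[x.group()], regex=True)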
You were pretty close with the first attempt. You can create the dictionary from littledict with 'key' as the index and use regex=True.
print(littletext['text']
      .replace(littledict.set_index('key')['replacewith'].to_dict(),
               regex=True))
0 Hi, world!
1 Hi, how are you?
2 That world is hilarious!
Name: text, dtype: object
I am trying to remove stopwords in df['Sentences'] as I would need to plot it.
My sample is
... 13
London 12
holiday 11
photo 7
. 7
..
walk 1
dogs 1
I have built my own dictionary (my_dict) and I would like to use it to remove the stopwords from that list.
What I have done is as follows:
import matplotlib.pyplot as plt
df['Sentences'] = df['Sentences'].apply(lambda x: ' '.join([item for item in x.split() if item not in my_dict]))
w_freq=df.Sentences.str.split(expand=True).stack().value_counts()
Although it does not give me any error, the stopwords and punctuation are still there. Also, I would like not to alter the column, but just look at the results for a short analysis (for example, by creating a copy of the original column).
How could I remove them?
Let's assume you have this dataframe with this really interesting conversation.
df = pd.DataFrame({'Sentences': ['Hello, how are you?',
                                 'Hello, I am fine. Have you watched the news',
                                 'Not really the news ...']})
print (df)
Sentences
0 Hello, how are you?
1 Hello, I am fine. Have you watched the news
2 Not really the news ...
Now, to remove the punctuation and the stopwords in my_dict, you can do it like this:
my_dict = ['a','i','the','you', 'am', 'are', 'have']
s = (df['Sentences'].str.lower()                   # to prevent any case problem
       .str.replace(r'[^\w\s]+', '', regex=True)   # remove the punctuation
       .str.split(' ')                             # create a list of words
       .explode()                                  # create a row per word of the lists
       .value_counts()                             # get occurrences
     )
s = s[~s.index.isin(my_dict)]   # remove the stopwords
print (s) #you can see you don't have punctuation nor stopwords
news 2
hello 2
watched 1
fine 1
not 1
really 1
how 1
1
Name: Sentences, dtype: int64
This might not be the fastest way, though.
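Since the question also asks to keep the original column untouched, one hedged variant (the Sentences_clean column name is just illustrative) writes the cleaned text to a new column and builds the word frequencies from that:

df['Sentences_clean'] = (df['Sentences'].str.lower()
                                        .str.replace(r'[^\w\s]+', '', regex=True))
w_freq = df['Sentences_clean'].str.split().explode().value_counts()
w_freq = w_freq[~w_freq.index.isin(my_dict)]   # drop the stopwords from the counts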
I have a very large data frame with two columns called sentence1 and sentence2.
I am trying to make a new column with the words that differ between two sentences, for example:
sentence1=c("This is sentence one", "This is sentence two", "This is sentence three")
sentence2=c("This is the sentence four", "This is the sentence five", "This is the sentence six")
df = as.data.frame(cbind(sentence1,sentence2))
My data frame has the following structure:
ID sentence1 sentence2
1 This is sentence one This is the sentence four
2 This is sentence two This is the sentence five
3 This is sentence three This is the sentence six
And my expected result is:
ID sentence1 sentence2 Expected_Result
1 This is ... This is ... one the four
2 This is ... This is ... two the five
3 This is ... This is ... three the six
In R I was trying to split the sentences and then get the elements which differ between the lists, something like:
df$split_Sentence1<-strsplit(df$sentence1, split=" ")
df$split_Sentence2<-strsplit(df$sentence2, split=" ")
df$Dif<-setdiff(df$split_Sentence1, df$split_Sentence2)
But this approach does not work when applying setdiff...
In Python I was trying to use NLTK, getting the tokens first and then extracting the difference between the two lists, something like:
from nltk.tokenize import word_tokenize
df['tokensS1'] = df.sentence1.apply(lambda x: word_tokenize(x))
df['tokensS2'] = df.sentence2.apply(lambda x: word_tokenize(x))
And at this point I cannot find a function which gives me the result I need.
I hope you can help me. Thanks
Here's an R solution.
I've created an exclusiveWords function that finds the unique words between the two sets, and returns a 'sentence' made up of those words. I've wrapped it in Vectorize() so that it works on all rows of the data.frame at once.
df = as.data.frame(cbind(sentence1,sentence2), stringsAsFactors = F)
exclusiveWords <- function(x, y){
  x <- strsplit(x, " ")[[1]]
  y <- strsplit(y, " ")[[1]]
  u <- union(x, y)
  u <- union(setdiff(u, x), setdiff(u, y))
  return(paste0(u, collapse = " "))
}
exclusiveWords <- Vectorize(exclusiveWords)
df$result <- exclusiveWords(df$sentence1, df$sentence2)
df
# sentence1 sentence2 result
# 1 This is sentence one This is the sentence four the four one
# 2 This is sentence two This is the sentence five the five two
# 3 This is sentence three This is the sentence six the six three
Essentially the same as the answer from @SymbolixAU, but as an apply function.
df$Dif <- apply(df, 1, function(r) {
  paste(setdiff(union(unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']])),
                intersect(unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']]))),
        collapse = " ")
})
In Python, you can build a function that treats words in a sentence as a set and calculates the set theoretical exclusive 'or' (a set of words that are in one sentence but not in the other):
df.apply(lambda x: set(word_tokenize(x['sentence1']))
                   ^ set(word_tokenize(x['sentence2'])), axis=1)
The result is a Series of sets.
#0 {one, the, four}
#1 {the, two, five}
#2 {the, three, six}
#dtype: object
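If the Expected_Result column should hold space-separated strings as in the question rather than sets, a small follow-up (note that word order within each cell is not guaranteed, since sets are unordered) is:

df['Expected_Result'] = df.apply(
    lambda x: ' '.join(set(word_tokenize(x['sentence1']))
                       ^ set(word_tokenize(x['sentence2']))),
    axis=1)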