I have searched a lot here and I couldn't find an answer for this.
I have a dataframe with a column "Description" which contains long strings,
and I'm trying to count the number of occurrences of a specific word, "restaurant":
df['has_restaurants'] = 0
for index,text in enumerate(df['Description']):
    text = text.split()
    df['has_restaurants'][index] = (sum(map(lambda count : 1 if 'restaurant' in count else 0, text)))
I did the above and it works, but it doesn't look like a good way to do it, and it also generates this warning:
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['has_restaurants'][index] = (sum(map(lambda count : 1 if 'restaurant' in count else 0, text)))
You might simplify that by using the .str.count method; consider the following simple example:
import pandas as pd
df = pd.DataFrame({"description":["ABC DEF GHI","ABC ABC ABC","XYZ XYZ XYZ"]})
df['ABC_count'] = df.description.str.count("ABC")
print(df)
Output:
description ABC_count
0 ABC DEF GHI 1
1 ABC ABC ABC 3
2 XYZ XYZ XYZ 0
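Applied to the question's data, that would be something along these lines (a sketch, assuming the column is named Description as in the loop above and a plain substring match is acceptable):
df['has_restaurants'] = df['Description'].str.count('restaurant')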
You could use Python's native .count() method:
df['has_restaurants'] = 0
for index, text in enumerate(df['Description']):
    # .loc avoids the chained-assignment warning raised by df[col][index] = ...
    df.loc[index, 'has_restaurants'] = text.count('restaurant')
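The loop (and the row-by-row assignment) can also be avoided entirely with apply; a sketch, again assuming the same Description column:
df['has_restaurants'] = df['Description'].apply(lambda text: text.count('restaurant'))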
Related
I want to improve the performance of a loop that counts word occurrences in text; it currently takes around 5 minutes for just 5 records.
DataFrame
No Text
1 I love you forever...*500 other words
2 No , i know that you know xxx *100 words
My word list
wordlist =['i','love','David','Mary',......]
My code to count the words:
for i in wordlist:
    df[i] = df['Text'].str.count(i)
Result :
No Text I love other_words
1 I love you ... 1 1 4
2 No, i know ... 1 0 5
You can do this by making a Counter from the words in each Text value, then converting that into columns (using pd.Series), summing the columns that don't exist in wordlist into other_words and then dropping those columns:
import re
import pandas as pd
from collections import Counter
# normalise the wordlist so matching is case-insensitive
wordlist = list(map(str.lower, wordlist))
# build one Counter of lower-cased words per Text value
counters = df['Text'].apply(lambda t: Counter(re.findall(r'\b[a-z]+\b', t.lower())))
# expand the Counters into one column per word
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
# roll every column that is not in the wordlist up into other_words, then drop them
other_words = list(set(df.columns) - set(wordlist) - {'No', 'Text'})
df['other_words'] = df[other_words].sum(axis=1)
df = df.drop(other_words, axis=1)
Output (for the sample data in your question):
No Text i love other_words
0 1 I love you forever... other words 1 1 4
1 2 No , i know that you know xxx words 1 0 7
Note:
I've converted all the words to lower-case so you're not counting I and i separately.
I've used re.findall rather than the more obvious split() so that forever... gets counted as the word forever rather than forever...
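For reference, the snippets in this answer assume that df and wordlist look roughly like the following (a minimal reconstruction from the sample data in the question; the names in wordlist are illustrative):
import pandas as pd

df = pd.DataFrame({'No': [1, 2],
                   'Text': ['I love you forever... other words',
                            'No , i know that you know xxx words']})
wordlist = ['i', 'love', 'David', 'Mary']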
If you only want to count the words in wordlist (and don't want an other_words count), you can simplify this to:
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t:Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
Output:
No Text i love
0 1 I love you forever... other words 1 1
1 2 No , i know that you know xxx words 1 0
Another way of generating the other_words value is to build two sets of Counters: one of all the words, and one of only the words in wordlist. These can then be subtracted from each other to find the count of words in the text which are not in the wordlist:
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t:Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
c2 = df['Text'].apply(lambda t:Counter(re.findall(r'\b[a-z]+\b', t.lower())))
df['other_words'] = (c2 - counters).apply(lambda d:sum(d.values()))
The output of this is the same as for the first code sample. Note that in Python 3.10 and later, you should be able to use the new Counter.total() method:
(c2 - counters).apply(Counter.total)
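A tiny illustration of the equivalence, for anyone on Python 3.10+:
from collections import Counter

c = Counter('i love you i'.split())
print(sum(c.values()))  # 4
print(c.total())        # 4, Python 3.10+ only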
As an alternative, you could try this:
counts = (df['Text'].str.lower().str.findall(r'\b[a-z]+\b')
.apply(lambda x: pd.Series(x).value_counts())
.filter(map(str.lower, wordlist)).fillna(0))
df[counts.columns] = counts
print(df)
Output:
№ Text i love
0 1 I love you forever... other words 1.0 1.0
1 2 No , i know that you know xxx words 1.0 0.0
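Note the counts come out as floats because of the NaN fill; if integer counts are preferred, chaining .astype(int) after the fillna(0) should do it (untested sketch):
counts = (df['Text'].str.lower().str.findall(r'\b[a-z]+\b')
          .apply(lambda x: pd.Series(x).value_counts())
          .filter(map(str.lower, wordlist)).fillna(0).astype(int))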
I'm trying to change the string "SLL" under the Competitions column to "League", but when I tried this:
messi_dataset.replace("SLL", "League",regex = True)
It only changed the first "SLL" to "League", but then other strings that were "SLL" became "UCL". I have no idea why. I also tried changing regex = True to inplace = True, but no luck.
https://drive.google.com/file/d/1ldq6o70j-FsjX832GbYq24jzeR0IwlEs/view?usp=sharing
https://drive.google.com/file/d/1OeCSutkfdHdroCmTEG9KqnYypso3bwDm/view?usp=sharing
Suppose you have a dataframe as below:
import pandas as pd
import re
df = pd.DataFrame({'Competitions': ['SLL', 'sll','apple', 'banana', 'aabbSLL', 'ccddSLL']})
# write a regex pattern that replaces 'SLL'
# I assumed case-irrelevant
regex_pat = re.compile(r'SLL', flags=re.IGNORECASE)
df['Competitions'].str.replace(regex_pat, 'league', regex=True)
# Input DataFrame
Competitions
0 SLL
1 sll
2 apple
3 banana
4 aabbSLL
5 ccddSLL
Output:
0 league
1 league
2 apple
3 banana
4 aabbleague
5 ccddleague
Name: Competitions, dtype: object
Hope it clarifies.
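One detail worth noting: .str.replace returns a new Series rather than modifying the column in place, so to keep the change you would assign it back, roughly:
df['Competitions'] = df['Competitions'].str.replace(regex_pat, 'league', regex=True)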
Based on this answer, test this code:
messi_dataset['competitions'] = messi_dataset['competitions'].replace("SLL", "League")
Also, there are many different ways to do this, like this one that I tested:
messi_dataset.replace({'competitions': 'SLL'}, "League")
For those cases where 'SLL' is part of another word:
messi_dataset.replace({'competitions': 'SLL'}, "League", regex=True)
I have a simple DataFrame that looks like:
Names
0 Alexi Laiho
1 Jari Maenpaa
2 Kirk Hammett
3 Antti Kokko
4 Yngwie Malmsteen
5 Petri Lindroos
I want to retrieve only the records which have more than 5 vowels in their names.
For this I made a function:
def vowcount(sentence=[]):
    count = 0
    vow = 'aeiouAEIOU'
    for i in sentence:
        for j in i:
            if j in vow:
                count += 1
    return count
How can I use this function to extract records from the DataFrame?
Please help me understand how to use df.apply() / map() on this pandas Series, and how to get the same result using a list comprehension if possible.
We can use a simple regex together with str.lower, str.count and .query:
m = df['Names'].str.lower().str.count(r'[aeiou]')
df = df.query('@m > 5')
Or we can use re.I to ignore case:
import re
m = df['Names'].str.count(r'[aeiou]', flags = re.I)
df = df.query('@m > 5')
Output
Names
0 Alexi Laiho
1 Jari Maenpaa
Alternatively with findall:
import re
df[df.Names.str.findall('[aeiou]',flags=re.I).str.len().gt(5)]
Names
0 Alexi Laiho
1 Jari Maenpaa
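As a side note, the vowcount function from the question can also be used directly, either with apply or with the list comprehension the question asked about; a quick sketch:
# Boolean mask built with apply and the question's vowcount function:
df[df['Names'].apply(vowcount) > 5]

# The same filter written as a list comprehension:
df[[vowcount(name) > 5 for name in df['Names']]]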
I have a dataframe as follows:
Name Rating
0 ABC Good
1 XYZ Good #
2 GEH Good
3 ABH *
4 FEW Normal
Here I want to replace values in the Rating column: if a value contains #, it should be replaced by "Can be improve"; if it contains *, then "Very Poor". I have tried the following, but it replaces the whole string; I want to replace only the special character when it is present. (It does work for the case where the value is only the special character.)
import pandas as pd
df = pd.DataFrame() # Load with data
df['Rating'] = df['Rating'].str.replace('.*#+.*', 'Can be improve', regex=True)
which returns:
Name Rating
0 ABC Good
1 XYZ Can be improve
2 GEH Good
3 ABH Very Poor
4 FEW Normal
Can anybody help me out with this?
import pandas as pd
df = pd.DataFrame({"Rating": ["Good", "Good #", "*"]})
df["Rating"] = df["Rating"].str.replace("#", "Can be improve")
df["Rating"] = df["Rating"].str.replace("*", "Very Poor")
print(df)
Output:
0 Good
1 Good Can be improve
2 Very Poor
You replace the whole string because .* matches any character zero or more times.
If your special values are always at the end of the string you might use:
.str.replace(r'#$', "Can be improve")
.str.replace(r'\*$', "Very Poor")
I have this pandas dataframe:
df =
GROUP MARK
ABC 1
ABC 0
ABC 1
DEF 1
DEF 1
DEF 1
DEF 1
XXX 0
I need to create a pie chart (using Python or R). The size of each pie should correspond to the proportional count (i.e. the percentage) of rows with a particular GROUP. Moreover, each pie should be divided into two sub-parts corresponding to the percentage of rows with MARK==1 and MARK==0 within the given GROUP.
I was googling for this type of pie chart and found this one, but that example seems overcomplicated for my case. Another good example is done in JavaScript, which doesn't work for me because of the language.
Can somebody tell me the name of this type of pie chart, and where I can find some example code in Python or R?
Here is a solution in R that uses base R only. Not sure how you want to arrange your pies, but I used par(mfrow=...).
df <- read.table(text=" GROUP MARK
ABC 1
ABC 0
ABC 1
DEF 1
DEF 1
DEF 1
DEF 1
XXX 0", header=TRUE)
plot_pie <- function(x, multiplier=1, label){
    pie(table(x), radius=multiplier * length(x), main=label)
}
par(mfrow=c(1,3), mar=c(0,0,2,0))
invisible(lapply(split(df, df$GROUP), function(x){
    plot_pie(x$MARK, label=unique(x$GROUP), multiplier=0.2)
}))
This is the result: three pies, one per GROUP, with radii reflecting the group sizes and slices showing the MARK split.
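If you would rather stay in Python, a minimal matplotlib sketch of the same idea (the DataFrame is reconstructed from the question; each pie's radius is scaled by the group's share of all rows) might look like this:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'GROUP': ['ABC'] * 3 + ['DEF'] * 4 + ['XXX'],
                   'MARK':  [1, 0, 1, 1, 1, 1, 1, 0]})

groups = df.groupby('GROUP')
fig, axes = plt.subplots(1, len(groups))
for ax, (name, sub) in zip(axes, groups):
    counts = sub['MARK'].value_counts()      # MARK==0 vs MARK==1 split
    ax.pie(counts, labels=counts.index,
           radius=len(sub) / len(df))        # pie size ~ group's share of rows
    ax.set_title(name)
plt.show()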