I've got a data frame that has been binned into age groups (an 'AgeGroups' column), then filtered to those below the poverty level (100). I'm wondering if there is a simple way to calculate the count of those below poverty divided by the total number of people, i.e. the poverty rate. This works but doesn't seem very pythonic.
The "PWGTP" column is the weight used to sum in this scenario.
pov_rate = df[df['POV'] <= 100].groupby('AgeGroups').sum()['PWGTP'] /df.groupby('AgeGroups').sum()['PWGTP']
Thank you
It's not clear from your description why you need a groupby. The data is already binned. Why not simply create a poverty rate column?
df['pov_rate'] = (df['POV'] <= 100) * df['PWGTP'] / df['PWGTP'].sum()
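If per-AgeGroup rates are what you're after, a variant of the same idea (a sketch using a groupby transform; column names as in the question) divides each row by its group's total weight instead:
group_total = df.groupby('AgeGroups')['PWGTP'].transform('sum')
df['pov_rate'] = (df['POV'] <= 100) * df['PWGTP'] / group_total
# summing within an AgeGroup now recovers that group's weighted poverty rate:
pov_rate = df.groupby('AgeGroups')['pov_rate'].sum()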
Some alternative solutions:
Filter only the PWGTP column for the aggregate sum; this matters if there are more numeric columns:
pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups')['PWGTP'].sum() /
            df.groupby('AgeGroups')['PWGTP'].sum())
print (pov_rate)
Only one groupby, using a helper column filt (where keeps PWGTP only for rows with POV <= 100, sum skips the resulting NaNs, and eval divides the two aggregated columns):
pov_rate = (df.assign(filt = df['PWGTP'].where(df['POV'] <= 100))
              .groupby('AgeGroups')[['filt','PWGTP']].sum()
              .eval('filt / PWGTP'))
print (pov_rate)
Performance depends on the number of groups, the number of matched rows, the number of numeric columns, and the length of the DataFrame, so results on real data may differ.
import numpy as np
import pandas as pd

np.random.seed(2020)
N = 1000000
df = pd.DataFrame({'AgeGroups':np.random.randint(10000,size=N),
                   'POV': np.random.randint(50, 500, size=N),
                   'PWGTP':np.random.randint(100,size=N),
                   'a':np.random.randint(100,size=N),
                   'b':np.random.randint(100,size=N),
                   'c':np.random.randint(100,size=N)})
# print (df)
In [13]: %%timeit
...: pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups').sum()['PWGTP'] /
...: df.groupby('AgeGroups').sum()['PWGTP'])
...:
209 ms ± 7.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: %%timeit
...: pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups')['PWGTP'].sum() /
...: df.groupby('AgeGroups')['PWGTP'].sum())
...:
85.8 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %%timeit
...: pov_rate = (df.assign(filt = df['PWGTP'].where(df['POV'] <= 100))
...: .groupby('AgeGroups')[['filt','PWGTP']].sum()
...: .eval('filt / PWGTP'))
...:
122 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
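For reference, a quick sanity check (a sketch on the simulated frame above) that selecting the single column before aggregating changes only the work done, not the result:
r1 = (df[df['POV'] <= 100].groupby('AgeGroups').sum()['PWGTP'] /
      df.groupby('AgeGroups').sum()['PWGTP'])
r2 = (df[df['POV'] <= 100].groupby('AgeGroups')['PWGTP'].sum() /
      df.groupby('AgeGroups')['PWGTP'].sum())
print(r1.equals(r2))
# True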
Related
Basically, I am performing a simple operation and updating 100 columns of my dataframe of size 550 rows x 2700 columns.
I am updating 100 columns like this:
df["col1"] = df["static"]-df["col1"])/df["col1"]*100
df["col2"] = (df["static"]-df["col2"])/df["col2"]*100
df["col3"] = (df["static"]-df["col3"])/df["col3"]*100
....
....
df["col100"] = (df["static"]-df["col100"])/df["col100"]*100
This operation takes 170 ms on my original dataframe, and I want to speed it up. I am doing some real-time processing, so time is important.
You can select all the columns at once: subtract from the right side with DataFrame.rsub, divide with DataFrame.div, and scale with DataFrame.mul, restricted to the columns in the list cols:
cols = [f'col{c}' for c in range(1, 101)]
df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0).mul(100)
Performance:
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame(np.random.randint(1001, size=(550,2700))).add_prefix('col')
df = df.rename(columns={'col0':'static'})
In [58]: %%timeit
...: for i in range(1, 101):
...: df[f"col{i}"] = (df["static"]-df[f"col{i}"])/df[f"col{i}"]*100
...:
59.9 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [59]: %%timeit
...: cols = [f'col{c}' for c in range(1, 101)]
...: df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0)
...:
11.9 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
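A quick equivalence check (a sketch; df_loop and df_vec are hypothetical copies of the simulated frame, and .mul(100) restores the question's scaling):
df_loop, df_vec = df.copy(), df.copy()

for i in range(1, 101):
    df_loop[f"col{i}"] = (df_loop["static"] - df_loop[f"col{i}"]) / df_loop[f"col{i}"] * 100

cols = [f'col{c}' for c in range(1, 101)]
df_vec[cols] = df_vec[cols].rsub(df_vec['static'], axis=0).div(df_vec[cols], axis=0).mul(100)

# equal_nan covers the 0/0 cells the random integer data can produce
print(np.allclose(df_loop[cols], df_vec[cols], equal_nan=True))
# True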
I have a dataset that I need to iterate over with a condition:
import numpy as np
import pandas as pd

data = [[-10, 10, 'Hawaii', 'Honolulu'], [-22, 63], [32, -14]]
df = pd.DataFrame(data, columns = ['lat', 'long', 'state', 'capital'])
for x in range(len(df)):
    if pd.isna(df.loc[x, 'state']) and pd.isna(df.loc[x, 'capital']):
        df.loc[x, 'state'] = 'Investigate state'
        df.loc[x, 'capital'] = 'Investigate capital'
My expected output is that if the state and capital fields are both empty, then both empty fields are filled in respectively. The actual data I use and the function within the loop are more complex than this example, but what I want to focus on is the iterative/looping portion with the condition.
My Googling found iterrows, and I read tutorials that just say to go ahead and use a for loop. Stack Overflow answers denounced both options and advocated using vectorization instead. My actual dataset will have around ~20,000 rows. What is the most efficient implementation, and how do I implement it?
You can test each column separately and chain masks by & for bitwise AND:
m = df['state'].isna() & df['capital'].isna()
df.loc[m, ['capital', 'state']] = ['Investigate capital','Investigate state']
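With the sample frame from the question (rows 1 and 2 have NaN in both state and capital), only those rows are filled:
print(df)
#    lat  long              state              capital
# 0  -10    10             Hawaii             Honolulu
# 1  -22    63  Investigate state  Investigate capital
# 2   32   -14  Investigate state  Investigate capital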
In the sample data (30k rows, with 66% of rows matching), the fastest approach is to also set the columns separately:
m = df['state'].isna() & df['capital'].isna()
df['state']= np.where(m, 'Investigate state', df['state'])
df['capital']= np.where(m, 'Investigate capital', df['capital'])
Similar:
m = df['state'].isna() & df['capital'].isna()
df.loc[m, 'state']='Investigate state'
df.loc[m, 'capital']='Investigate capital'
#30k rows
df = pd.concat([df] * 10000, ignore_index=True)
%%timeit
...: m = df['state'].isna() & df['capital'].isna()
...: df['state']= np.where(m, 'Investigate state', df['state'])
...: df['capital']= np.where(m, 'Investigate capital', df['capital'])
...:
3.45 ms ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
m = df['state'].isna() & df['capital'].isna()
df.loc[m,'state']='Investigate state'
df.loc[m,'capital']='Investigate capital'
3.58 ms ± 11 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
m = df['state'].isna() & df['capital'].isna()
df.loc[m,['capital', 'state']] = ['Investigate capital','Investigate state']
4.5 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Other solutions:
%%timeit
m = df[['state','capital']].isna().all(axis=1)
df.loc[m]=df.loc[m].fillna({'state':'Investigate state','capital':'Investigate capital'})
6.68 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
m = df[['state','capital']].isna().all(axis=1)
df.loc[m,'state']='Investigate state'
df.loc[m,'capital']='Investigate capital'
4.72 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Suppose I have data for 50K shoppers and the products they bought. I want to count the number of times each user purchased product "a". value_counts seems to be the fastest way to calculate these types of numbers for a grouped pandas data frame. However, I was surprised at how much slower it was to calculate the purchase frequency for just one specific product (e.g., "a") using agg or apply. I could select a specific column from a data frame created using value_counts but that could be rather inefficient on very large data sets with lots of products.
Below is a simulated example where each customer purchases 10 times from a set of three products. At this size you can already notice the speed difference between apply/agg and value_counts. Is there a better/faster way to extract information like this from a grouped pandas data frame?
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "col1": [f'c{j}' for i in range(10) for j in range(50000)],
    "col2": np.random.choice(["a", "b", "c"], size=500000, replace=True)
})
dfg = df.groupby("col1")
# value_counts is fast
dfg["col2"].value_counts().unstack()
# apply and agg are (much) slower
dfg["col2"].apply(lambda x: (x == "a").sum())
dfg["col2"].agg(lambda x: (x == "a").sum())
# much faster to do
dfg["col2"].value_counts().unstack()["a"]
EDIT:
Two great responses to this question. Given the starting point of an already grouped data frame, it seems there may not be a better/faster way to count the number of occurrences of a single level in a categorical variable than using (1) apply or agg with a lambda function or (2) using value_counts to get the counts for all levels and then selecting the one you need.
The groupby/size approach is an excellent alternative to value_counts. With a minor edit to Cainã Max Couto-Silva's answer, this would give:
dfg = df.groupby(['col1', 'col2'])
dfg.size().unstack(fill_value=0)["a"]
I assume there would be a trade-off at some point where if you have many levels apply/agg or value_counts on an already grouped data frame may be faster than the groupby/size approach which requires creating a newly grouped data frame. I'll post back when I have some time to look into that.
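One way to probe that trade-off (a sketch; n_levels and the v{k} level names are illustrative, not from the original data) is to vary the number of levels and time both approaches:
import timeit

import numpy as np
import pandas as pd

for n_levels in (3, 30, 300):
    sim = pd.DataFrame({
        "col1": [f'c{j}' for i in range(10) for j in range(50000)],
        "col2": np.random.choice([f'v{k}' for k in range(n_levels)], size=500000)
    })
    t_vc = timeit.timeit(lambda: sim.groupby("col1")["col2"].value_counts().unstack(fill_value=0)["v0"], number=3)
    t_size = timeit.timeit(lambda: sim.groupby(["col1", "col2"]).size().unstack(fill_value=0)["v0"], number=3)
    print(n_levels, t_vc, t_size)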
Thanks for the comments and answers!
This is still faster:
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
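Note that size().unstack() leaves NaN for (col1, col2) combinations that never occur and upcasts the counts to float; passing fill_value=0 (as in the edit above) keeps integer counts:
dfg.size().unstack(fill_value=0)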
Tests:
%%timeit
pd.crosstab(df.col1, df.col2)
# > 712 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby("col1")
dfg["col2"].value_counts().unstack()
# > 165 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
# > 131 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If we expand the dataframe to 5 million rows:
df = pd.concat([df for _ in range(10)])
print(f'df.shape = {df.shape}')
# > df.shape = (5000000, 2)
print(f'{df.shape[0]:,} rows.')
# > 5,000,000 rows.
%%timeit
pd.crosstab(df.col1, df.col2)
# > 1.58 s ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby("col1")
dfg["col2"].value_counts().unstack()
# > 1.27 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
# > 847 ms ± 53.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Filter before value_counts:
df.loc[df.col2=='a','col1'].value_counts()['c0']
Also, I think crosstab is 'faster' than groupby + value_counts:
pd.crosstab(df.col1, df.col2)
I am a newbie and I need some insight. Say I have a pandas dataframe as follows:
import numpy as np
import pandas as pd

temp = pd.DataFrame()
temp['A'] = np.random.rand(100)
temp['B'] = np.random.rand(100)
temp['C'] = np.random.rand(100)
I need to write a function where I replace every value in column "C" with 0's if the value of "A" is bigger than 0.5 in the corresponding row. Otherwise I need to multiply A and B in the same row element-wise and write down the output at the corresponding row on column "C".
What I did so far, is:
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
It works just as I want it to; however, I am not sure there isn't a faster way to implement this. I am especially skeptical of the slicing, since it feels redundant to use that many slices, but I couldn't find any other solution because I have to write 0's to the C rows where A is bigger than 0.5.
Or, is there a way to slice only the part that is needed, perform the calculations, and then somehow remember the indices so the required values can be put back into the original dataframe on the corresponding rows?
One way using numpy.where:
temp["C"] = np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
Benchmark (about 4x faster on this sample, and the advantage grows with size):
# With given sample of 100 rows
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 819 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 174 µs ± 455 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Benchmark on larger data (about 7x faster):
temp = pd.DataFrame()
temp['A'] = np.random.rand(1000000)
temp['B'] = np.random.rand(1000000)
temp['C'] = np.random.rand(1000000)
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 35.2 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 5.16 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
np.array_equal(temp["C"], np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0))
# True
Suppose we have a table of customers and their spending.
import pandas as pd
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Charles"],
    "Spend": [3, 5, 7, 9]
})
LIMIT = 6
For each customer, we may compute the fraction of their purchases that exceed LIMIT, using the apply method:
df.groupby("Name").apply(
lambda grp: len(grp[grp["Spend"] > LIMIT]) / len(grp)
)
Name
Alice 0.0
Bob 0.5
Charles 1.0
However, the apply method is just a loop, which is slow if there are many customers.
Question: Is there a faster way, which presumably uses vectorization?
As of pandas 0.23.4, SeriesGroupBy does not support comparison operators:
(df.groupby("Name")["Spend"] > LIMIT).mean()
TypeError: '>' not supported between instances of 'SeriesGroupBy' and 'int'
The code below results in a null value for Alice:
df[df["Spend"] > LIMIT].groupby("Name").size() / df.groupby("Name").size()
Name
Alice NaN
Bob 0.5
Charles 1.0
The code below gives the correct result, but it requires us to either modify the table, or make a copy to avoid modifying the original.
df["Dummy"] = 1 * (df["Spend"] > LIMIT)
df.groupby("Name") ["Dummy"] .sum() / df.groupby("Name").size()
Groupby does not use vectorization, but it has aggregate functions that are optimized with Cython.
You can take the mean (the mean of a boolean mask is the fraction of True values):
(df["Spend"] > LIMIT).groupby(df["Name"]).mean()
df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
Or use div to replace NaN with 0:
df[df["Spend"] > LIMIT].groupby("Name").size() \
.div(df.groupby("Name").size(), fill_value = 0)
df["Spend"].gt(LIMIT).groupby(df["Name"]).sum() \
.div(df.groupby("Name").size(), fill_value = 0)
Each of the above will yield
Name
Alice 0.0
Bob 0.5
Charles 1.0
dtype: float64
Performance
Performance depends on the number of rows and on how many rows the condition filters, so it's best to test on real data.
import numpy as np

np.random.seed(123)
N = 100000
df = pd.DataFrame({
    "Name": np.random.randint(1000, size = N),
    "Spend": np.random.randint(10, size = N)
})
LIMIT = 6
In [10]: %timeit df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
6.16 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df[df["Spend"] > LIMIT].groupby("Name").size().div(df.groupby("Name").size(), fill_value = 0)
6.35 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit df["Spend"].gt(LIMIT).groupby(df["Name"]).sum().div(df.groupby("Name").size(), fill_value = 0)
9.66 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# RafaelC comment solution
In [13]: %timeit df.groupby("Name")["Spend"].apply(lambda s: (s > LIMIT).sum() / s.size)
400 ms ± 27.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: %timeit df.groupby("Name")["Spend"].apply(lambda s: (s > LIMIT).mean())
328 ms ± 6.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This NumPy solution is vectorized, but a bit complicated:
In [15]: %%timeit
...: i, r = pd.factorize(df["Name"])
...: a = pd.Series(np.bincount(i), index = r)
...:
...: i1, r1 = pd.factorize(df["Name"].values[df["Spend"].values > LIMIT])
...: b = pd.Series(np.bincount(i1), index = r1)
...:
...: df1 = b.div(a, fill_value = 0)
...:
5.05 ms ± 82.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
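A quick check (a sketch on the simulated data above) that the factorize/bincount result matches the simpler boolean-mean version:
i, r = pd.factorize(df["Name"])
a = pd.Series(np.bincount(i), index = r)

i1, r1 = pd.factorize(df["Name"].values[df["Spend"].values > LIMIT])
b = pd.Series(np.bincount(i1), index = r1)

df1 = b.div(a, fill_value = 0)
expected = df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
print(np.allclose(df1.sort_index(), expected.sort_index()))
# True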