Pandas groupby apply is taking too much time - python

I have the following code:
import pandas as pd

xactions = pd.DataFrame({'user_wid': {0: 3305613, 1: 57, 2: 80, 3: 31, 4: 38, 5: 12, 6: 35, 7: 25, 8: 42, 9: 16}, 'user_name': {0: 'Ter', 1: 'Am', 2: 'Wi', 3: 'Ma', 4: 'St', 5: 'Ju', 6: 'De', 7: 'Ri', 8: 'Ab', 9: 'Ti'}, 'user_age': {0: 41, 1: 34, 2: 45, 3: 47, 4: 70, 5: 64, 6: 64, 7: 63, 8: 32, 9: 24}, 'user_gender': {0: 'Male', 1: 'Female', 2: 'Male', 3: 'Male', 4: 'Male', 5: 'Female', 6: 'Female', 7: 'Female', 8: 'Female', 9: 'Female'}, 'sale_date': {0: '2018-05-15', 1: '2020-02-28', 2: '2020-04-02', 3: '2020-05-09', 4: '2020-11-29', 5: '2020-12-14', 6: '2020-04-21', 7: '2020-06-15', 8: '2020-07-03', 9: '2020-08-10'}, 'days_since_first_visit': {0: 426, 1: 0, 2: 0, 3: 8, 4: 126, 5: 283, 6: 0, 7: 189, 8: 158, 9: 270}, 'visit': {0: 4, 1: 1, 2: 1, 3: 2, 4: 4, 5: 3, 6: 1, 7: 2, 8: 4, 9: 2}, 'num_user_visits': {0: 4, 1: 2, 2: 1, 3: 2, 4: 10, 5: 7, 6: 1, 7: 4, 8: 4, 9: 2}, 'product': {0: 13, 1: 2, 2: 2, 3: 2, 4: 5, 5: 5, 6: 1, 7: 8, 8: 5, 9: 4}, 'sale_price': {0: 10.0, 1: 0.0, 2: 41.3, 3: 41.3, 4: 49.95, 5: 74.95, 6: 49.95, 7: 5.0, 8: 0.0, 9: 0.0}, 'whether_member': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}})
def f(x):
    d = {}
    d['user_name'] = x['user_name'].max()
    d['user_age'] = x['user_age'].max()
    d['user_gender'] = x['user_gender'].max()
    d['last_visit_date'] = x['sale_date'].max()
    d['days_since_first_visit'] = x['days_since_first_visit'].max()
    d['num_visits_window'] = x['visit'].max()
    d['num_visits_total'] = x['num_user_visits'].max()
    d['products_used'] = x['product'].max()
    d['user_total_sales'] = (x['sale_price'].sum()).round(2)
    d['avg_spend_visit'] = (x['sale_price'].sum() / x['visit'].max()).round(2)
    d['membership'] = x['whether_member'].max()
    return pd.Series(d)
users = xactions.groupby('user_wid').apply(f).reset_index()
It is taking too much time to execute, and I want to optimize the above function.
Any suggestions would be appreciated.
Thanks in advance.

Try:
users2 = xactions.groupby("user_wid", as_index=False).agg(
    user_name=("user_name", "max"),
    user_age=("user_age", "max"),
    user_gender=("user_gender", "max"),
    last_visit_date=("sale_date", "max"),
    days_since_first_visit=("days_since_first_visit", "max"),
    num_visits_window=("visit", "max"),
    num_visits_total=("num_user_visits", "max"),
    products_used=("product", "max"),
    user_total_sales=("sale_price", "sum"),
    membership=("whether_member", "max"),
)
users2["avg_spend_visit"] = (
    users2["user_total_sales"] / users2["num_visits_window"]
).round(2)
print(users2)
Prints:
user_wid user_name user_age user_gender last_visit_date days_since_first_visit num_visits_window num_visits_total products_used user_total_sales membership avg_spend_visit
0 12 Ju 64 Female 2020-12-14 283 3 7 5 74.95 0 24.98
1 16 Ti 24 Female 2020-08-10 270 2 2 4 0.00 0 0.00
2 25 Ri 63 Female 2020-06-15 189 2 4 8 5.00 0 2.50
3 31 Ma 47 Male 2020-05-09 8 2 2 2 41.30 0 20.65
4 35 De 64 Female 2020-04-21 0 1 1 1 49.95 0 49.95
5 38 St 70 Male 2020-11-29 126 4 10 5 49.95 0 12.49
6 42 Ab 32 Female 2020-07-03 158 4 4 5 0.00 0 0.00
7 57 Am 34 Female 2020-02-28 0 1 2 2 0.00 0 0.00
8 80 Wi 45 Male 2020-04-02 0 1 1 2 41.30 0 41.30
9 3305613 Ter 41 Male 2018-05-15 426 4 4 13 10.00 0 2.50
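For context on why this helps: groupby.apply calls the Python function f once per group and builds a dict and a Series on every call, whereas named aggregation with built-in reducers ("max", "sum") generally stays on pandas' optimised per-column code paths. A rough, self-contained sketch to see the difference on synthetic data (illustrative only, not your schema; timings will vary):
import time
import numpy as np
import pandas as pd

n = 200_000
demo = pd.DataFrame({
    "user_wid": np.random.randint(0, 20_000, n),
    "sale_price": np.random.rand(n) * 50,
    "visit": np.random.randint(1, 10, n),
})

def f_demo(x):
    # per-group Python function, mimicking the structure of f above
    return pd.Series({
        "user_total_sales": x["sale_price"].sum().round(2),
        "num_visits_window": x["visit"].max(),
    })

t0 = time.perf_counter()
demo.groupby("user_wid").apply(f_demo)
t1 = time.perf_counter()
demo.groupby("user_wid").agg(
    user_total_sales=("sale_price", "sum"),
    num_visits_window=("visit", "max"),
)
t2 = time.perf_counter()
print(f"apply: {t1 - t0:.2f}s  agg: {t2 - t1:.2f}s")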

Related

how to compare two dataframes by multiple columns and only append new entries in pandas?

I want to append new files to a historical table (both are in CSV format; they are not in a database). Before that, I need to check the new file against the historical table by comparing two columns in particular: the state column and the date column yyyy_mm. First I pick, per state, the rows with the maximum yyyy_mm; then I check whether those entries are already in the historical table. If they are not in the historical table, I append them, otherwise I do nothing.
So far I am able to pick the rows with the maximum yyyy_mm per state, but when I try to compare those picked rows with the historical table I do not get the expected output. I tried pandas.merge and pandas.concat, but the output is not the same as my expected output. Can anyone point out how to do this in pandas? Any thoughts?
Input data:
>>> src_df.to_dict()
{'yyyy_mm': {0: 202001,
1: 202002,
2: 202003,
3: 202002,
4: 202107,
5: 202108,
6: 202109},
'state': {0: 'CA', 1: 'NJ', 2: 'NY', 3: 'NY', 4: 'PA', 5: 'PA', 6: 'PA'},
'col1': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3},
'col2': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4},
'col3': {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7}}
>>> hist_df.to_dict()
{'yyyy_mm': {0: 202101,
1: 202002,
2: 202001,
3: 201901,
4: 201907,
5: 201908,
6: 201901,
7: 201907,
8: 201908},
'state': {0: 'CA',
1: 'NJ',
2: 'NY',
3: 'NY',
4: 'NY',
5: 'NY',
6: 'PA',
7: 'PA',
8: 'PA'},
'col1': {0: 1, 1: 3, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4},
'col2': {0: 1, 1: 3, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
'col3': {0: 1, 1: 7, 2: 8, 3: 8, 4: 8, 5: 8, 6: 8, 7: 8, 8: 8}}
My current attempt:
picked_rows = src_df.loc[src_df.groupby('state')['yyyy_mm'].idxmax()]
>>> picked_rows.to_dict()
{'yyyy_mm': {0: 202001, 1: 202002, 2: 202003, 6: 202109},
'state': {0: 'CA', 1: 'NJ', 2: 'NY', 6: 'PA'},
'col1': {0: 3, 1: 3, 2: 3, 6: 3},
'col2': {0: 3, 1: 3, 2: 3, 6: 4},
'col3': {0: 7, 1: 7, 2: 7, 6: 7}}
Then I tried the following, but the output is not the same as my expected output:
output_df = pd.concat(picked_rows, hist_df, keys=['state', 'yyyy_mm'], axis=1) # first attempt
output_df = pd.merge(picked_rows, hist_df, how='outer') # second attempt
Neither of those attempts gives me my expected output. How do I get my desired output by comparing the two dataframes, so that rows from picked_rows are appended to hist_df conditionally on the per-state maximum yyyy_mm? How should this be done in pandas?
Objective
I want to check picked_rows against hist_df by the state and yyyy_mm columns, and only add the entries from picked_rows that are the most recent for their state. I created the desired output below. I tried an inner join and pandas.concat, but neither gives me the correct output. Does anyone have any ideas on this?
Here is my desired output that I want to get:
yyyy_mm state col1 col2 col3
0 202101 CA 1 1 1
1 202002 NJ 3 3 7
2 202001 NY 4 5 8
3 201901 NY 4 5 8
4 201907 NY 4 5 8
5 201908 NY 4 5 8
6 201901 PA 4 5 8
7 201907 PA 4 5 8
8 201908 PA 4 5 8
9 202003 NY 3 3 7
10 202109 PA 3 4 7
You should change your picked_rows DataFrame to only include dates that are greater than the hist_df dates:
#keep only rows that are newer than in hist_df
new_data = src_df[src_df["yyyy_mm"].gt(src_df["state"].map(hist_df.groupby("state")["yyyy_mm"].max()))]
#of the new rows, keep the latest updated values
picked_rows = new_data.loc[new_data.groupby("state")["yyyy_mm"].idxmax()]
#concat to hist_df
output_df = pd.concat([hist_df, picked_rows], ignore_index=True)
>>> output_df
yyyy_mm state col1 col2 col3
0 202101 CA 1 1 1
1 202002 NJ 3 3 7
2 202001 NY 4 5 8
3 201901 NY 4 5 8
4 201907 NY 4 5 8
5 201908 NY 4 5 8
6 201901 PA 4 5 8
7 201907 PA 4 5 8
8 201908 PA 4 5 8
9 202003 NY 3 3 7
10 202109 PA 3 4 7
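One caveat about the map-based filter (an assumption about the intended behaviour, not part of the original answer): states that appear in src_df but not in hist_df map to NaN, and .gt(NaN) is False, so rows for brand-new states would be dropped. If such rows should also be appended, a small variant fills the lookup with a sentinel:
# states missing from hist_df map to NaN; fill with -1 so their rows are kept
hist_max = hist_df.groupby("state")["yyyy_mm"].max()
new_data = src_df[src_df["yyyy_mm"].gt(src_df["state"].map(hist_max).fillna(-1))]
picked_rows = new_data.loc[new_data.groupby("state")["yyyy_mm"].idxmax()]
output_df = pd.concat([hist_df, picked_rows], ignore_index=True)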

How to convert multiple columns of a df from hexadecimal to decimal

There are multiple columns in the df, of which only selected columns have to be converted from hexadecimal to decimal.
The selected column names are stored in a list: A = ["Type 2", "Type 4"]
{'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'CC',
3: '55',
4: '88',
5: '96',
6: 'FF',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
Say you have the string "AA" in hex.
You can convert hex to decimal like this:
str(int("AA", 16))
Similarly, for a dataframe column that holds hexadecimal values, you can use a lambda function (note that the column name contains a space):
df['Type 2'] = df['Type 2'].apply(lambda x: str(int(str(x), 16)))
assuming df is the name of the imported dataframe.
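If you would rather end up with numeric values instead of strings, int's base keyword can be passed straight through apply (a small variant, not part of the original answer):
# parse the hex strings in one column into integers (dtype int64)
df["Type 2"] = df["Type 2"].apply(int, base=16)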
You can use pandas.DataFrame.applymap to cast element-wise:
>>> df[["Type 2", "Type 4"]].applymap(lambda n: int(n, 16))
Type 2 Type 4
0 170 35
1 187 65278
2 204 43981
3 85 56797
4 136 3501
5 150 3326
6 255 53058
7 16777215 801
8 65262 0
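To convert only the columns listed in A and write the result back in place, index with the list and assign (a sketch; on pandas 2.1+ applymap is deprecated and DataFrame.map is the equivalent spelling):
A = ["Type 2", "Type 4"]
df[A] = df[A].applymap(lambda n: int(n, 16))
# pandas >= 2.1: df[A] = df[A].map(lambda n: int(n, 16))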

Stack related columns

I have the following dataset, which I extracted from a pandas dataframe:
{'Batch': {0: 'Nos705', 1: 'Nos706', 2: 'Nos707', 3: 'Nos708', 4: 'Nos709', 5: 'Nos710', 6: 'Nos711', 7: 'Nos713', 8: 'Nos714', 9: 'Nos715'},
'Message': {0: 'ACBB', 1: 'ACBL', 2: 'ACBL', 3: 'ACBC', 4: 'ACBC', 5: 'ACBC', 6: 'ACBL', 7: 'ACBL', 8: 'ACBL', 9: 'ACBL'},
'DCC': {0: 284, 1: 21, 2: 43, 3: 19, 4: 0, 5: 0, 6: 19, 7: 27, 8: 27, 9: 19},
'DCB': {0: 299, 1: 22, 2: 24, 3: 28, 4: 167, 5: 167, 6: 20, 7: 27, 8: 27, 9: 28},
'ISC': {0: 'Car010030', 1: 'Car010054', 2: 'Car010047', 3: 'Car010182', 4: 'Car010004', 5: 'Car010004', 6: 'Car010182', 7: 'Car010182', 8: 'Car010182', 9: 'Car010182'},
'ISB': {0: 'Car010010', 1: None, 2: None, 3: None, 4: None, 5: None, 6: None, 7: None, 8: None, 9: None},
'VSC': {0: 25, 1: 25, 2: 25, 3: 25, 4: 25, 5: 25, 6: 25, 7: 25, 8: 25, 9: 25},
'VSB': {0: 27.0, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan},
'PGC': {0: 2.78, 1: 2.79, 2: 2.08, 3: 2.08, 4: 2.08, 5: 2.08, 6: 2.71, 7: 1.73, 8: 1.73, 9: 1.73},
'PGB': {0: 2.95, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan},
'PHB': {0: 2.96, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan},
'PHC': {0: 2.94, 1: 2.94, 2: 1.63, 3: 1.63, 4: 1.63, 5: 1.63, 6: 2.06, 7: 1.75, 8: 1.75, 9: 1.75},
'BPC': {0: 3.17, 1: 3.17, 2: 3.17, 3: 3.17, 4: 3.17, 5: 3.17, 6: 3.17, 7: 3.17, 8: 3.17, 9: 3.17},
'BPB': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None, 6: None, 7: None, 8: None, 9: None}}
I want to create a dataframe in which related columns are stacked.
E.g. all values of DCC & DCB should appear in one column, one below another. Similarly for ISC & ISB, VSC & VSB, PGC & PGB, PHC & PHB, BPC & BPB.
Batch remains the primary key here. How do I do it in Python?
First set the columns that should repeat (Batch and Message) as the index:
df1 = df.set_index(['Batch','Message'])
Then create a MultiIndex in the columns from each column name without its last character and from the last character itself, using MultiIndex.from_arrays; reshape with DataFrame.stack, and add DataFrame.sort_values for the correct order:
df1.columns = pd.MultiIndex.from_arrays([df1.columns.str[:-1], df1.columns.str[-1]],
                                        names=[None, 'types'])
df1 = (df1.stack(dropna=False)
          .reset_index()
          .sort_values(['Batch', 'Message', 'types'],
                       ascending=[True, True, False],
                       ignore_index=True))
print (df1)
Batch Message types BP DC IS PG PH VS
0 Nos705 ACBB C 3.17 284 Car010030 2.78 2.94 25.0
1 Nos705 ACBB B None 299 Car010010 2.95 2.96 27.0
2 Nos706 ACBL C 3.17 21 Car010054 2.79 2.94 25.0
3 Nos706 ACBL B None 22 None NaN NaN NaN
4 Nos707 ACBL C 3.17 43 Car010047 2.08 1.63 25.0
5 Nos707 ACBL B None 24 None NaN NaN NaN
6 Nos708 ACBC C 3.17 19 Car010182 2.08 1.63 25.0
7 Nos708 ACBC B None 28 None NaN NaN NaN
8 Nos709 ACBC C 3.17 0 Car010004 2.08 1.63 25.0
9 Nos709 ACBC B None 167 None NaN NaN NaN
10 Nos710 ACBC C 3.17 0 Car010004 2.08 1.63 25.0
11 Nos710 ACBC B None 167 None NaN NaN NaN
12 Nos711 ACBL C 3.17 19 Car010182 2.71 2.06 25.0
13 Nos711 ACBL B None 20 None NaN NaN NaN
14 Nos713 ACBL C 3.17 27 Car010182 1.73 1.75 25.0
15 Nos713 ACBL B None 27 None NaN NaN NaN
16 Nos714 ACBL C 3.17 27 Car010182 1.73 1.75 25.0
17 Nos714 ACBL B None 27 None NaN NaN NaN
18 Nos715 ACBL C 3.17 19 Car010182 1.73 1.75 25.0
19 Nos715 ACBL B None 28 None NaN NaN NaN
Finally, if necessary, remove the types column:
df1 = df1.drop('types', axis=1)
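An alternative worth mentioning (a sketch, assuming every value column follows the two-letter prefix plus C/B naming shown in the data): pd.wide_to_long can produce the same long layout directly, with the prefix as the stub name and the trailing letter as the suffix:
df1 = (pd.wide_to_long(df, stubnames=['DC', 'IS', 'VS', 'PG', 'PH', 'BP'],
                       i=['Batch', 'Message'], j='types', suffix='[CB]')
         .reset_index()
         .sort_values(['Batch', 'Message', 'types'],
                      ascending=[True, True, False], ignore_index=True))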

pandas stack/unstack reshaping df with swaplevel

I have df:
pd.DataFrame({'period': {0: pd.Timestamp('2016-05-01 00:00:00'),
1: pd.Timestamp('2017-05-01 00:00:00'),
2: pd.Timestamp('2018-03-01 00:00:00'),
3: pd.Timestamp('2018-04-01 00:00:00'),
4: pd.Timestamp('2016-05-01 00:00:00'),
5: pd.Timestamp('2017-05-01 00:00:00'),
6: pd.Timestamp('2016-03-01 00:00:00'),
7: pd.Timestamp('2016-04-01 00:00:00')},
'cost2': {0: 15,
1: 144,
2: 44,
3: 34,
4: 13,
5: 11,
6: 12,
7: 13},
'rev2': {0: 154,
1: 13,
2: 33,
3: 37,
4: 15,
5: 11,
6: 12,
7: 13},
'cost1': {0: 19,
1: 39,
2: 53,
3: 16,
4: 19,
5: 11,
6: 12,
7: 13},
'rev1': {0: 34,
1: 34,
2: 74,
3: 22,
4: 34,
5: 11,
6: 12,
7: 13},
'destination': {0: 'YYZ',
1: 'YYZ',
2: 'YYZ',
3: 'YYZ',
4: 'DFW',
5: 'DFW',
6: 'DFW',
7: 'DFW'},
'source': {0: 'SFO',
1: 'SFO',
2: 'SFO',
3: 'SFO',
4: 'MIA',
5: 'MIA',
6: 'MIA',
7: 'MIA'}})
df = df[['source','destination','period','rev1','rev2','cost1','cost2']]
which gives one row per source/destination/period with the rev and cost columns.
I want the final df to have the following columns:
2017-05-01 2016-05-01
source, destination, rev1, rev2, cost1, cost2, rev1, rev2, cost1, cost2...
So essentially, for every source/destination pair, I want revenue and cost numbers for each date in a single row.
I've been tinkering with stack and unstack but haven't been able to achieve my objective.
You can use set_index + unstack to reshape from long to wide, then use swaplevel to get the column index format you need:
df.set_index(['destination','source','period']).unstack().swaplevel(0,1,axis=1).sort_index(level=0,axis=1)
An alternative to .set_index + .unstack is .pivot_table:
df.pivot_table(
    index=['source', 'destination'],
    columns=['period'],
    values=['rev1', 'rev2', 'cost1', 'cost2']
).swaplevel(axis=1).sort_index(axis=1, level=0)
# period 2016-03-01 2016-04-01 ...
# cost1 cost2 rev1 rev2 cost1 cost2 rev1 rev2
# source destination
# MIA DFW 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0
# SFO YYZ NaN NaN NaN NaN NaN NaN NaN NaN
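If a flat, single-level header is needed downstream, the (period, metric) column MultiIndex can be collapsed into plain strings (a sketch with a hypothetical date_metric naming scheme):
wide = (df.pivot_table(index=['source', 'destination'],
                       columns='period',
                       values=['rev1', 'rev2', 'cost1', 'cost2'])
          .swaplevel(axis=1)
          .sort_index(axis=1, level=0))
# collapse the two column levels into labels like "2016-05-01_rev1"
wide.columns = [f"{period:%Y-%m-%d}_{metric}" for period, metric in wide.columns]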

replicate iferror and vlookup in a pandas join

I want to join two dataframes:
df1 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: '12345'},
'Country ': {0: 'de', 1: 'it', 2: 'de'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20}})
df2 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3', 3: 'banner4', 4: 'banner5'},
'Campaign': {0: 'campaign1',1: 'campaign2', 2: 'none',3: 'campaign4',4: 'campaign5'},
'Country ': {0: 'de', 1: 'it', 2: 'de', 3: 'en', 4: 'en'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016', 3: '3/1/2016', 4: '4/1/2016'},
'Value_2': {0: 5, 1: 10, 2: 15, 3: 20, 4: 25},
'id_campaign': {0: 'none', 1: 'none', 2: '12345', 3: 'none', 4: 'none'}})
Edit:
Let's also consider the variant where df1 has no Country column:
df1 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: '12345'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20}})
I have to join df2 and df1 on the keys:
Date
Campaign
Banner
The issue here is that when the match under the key "Campaign" is not found, the key should be switched to field "id_campaign".
I would like to obtain this dataframe:
df_joined = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3', 3: 'banner4', 4: 'banner5'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: 'none', 3: 'campaign4', 4: 'campaign5'},
'Country ': {0: 'de', 1: 'it', 2: 'de', 3: 'en', 4: 'en'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016', 3: '3/1/2016', 4: '4/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20, 3: 0, 4: 0},
'Value_2': {0: 5, 1: 10, 2: 15, 3: 20, 4: 25},
'id_campaign': {0: 'none', 1: 'none', 2: '12345', 3: 'none', 4: 'none'}})
Any help is really appreciated.
You can merge twice, first on 3 keys and then on 2 keys, and then fill the unmatched values using combine_first with the Value_1 column of df4:
# note: the Country column in df1 is named 'Country ' with a trailing space
df3 = pd.merge(df2, df1.drop('Country ', axis=1), on=['Date','Campaign','Banner'], how='left')
df4 = pd.merge(df2, df1, on=['Date','Banner'], how='left')
print (df3)
Banner Campaign Country Date Value_2 id_campaign Value_1
0 banner1 campaign1 de 1/1/2016 5 none 10.0
1 banner2 campaign2 it 2/1/2016 10 none 5.0
2 banner3 none de 1/1/2016 15 12345 NaN
3 banner4 campaign4 en 3/1/2016 20 none NaN
4 banner5 campaign5 en 4/1/2016 25 none NaN
print (df4['Value_1'])
0 10.0
1 5.0
2 20.0
3 NaN
4 NaN
Name: Value_1, dtype: float64
df3['Value_1'] = df3['Value_1'].combine_first(df4['Value_1']).fillna(0).astype(int)
print (df3)
Banner Campaign Country Date Value_2 id_campaign Value_1
0 banner1 campaign1 de 1/1/2016 5 none 10
1 banner2 campaign2 it 2/1/2016 10 none 5
2 banner3 none de 1/1/2016 15 12345 20
3 banner4 campaign4 en 3/1/2016 20 none 0
4 banner5 campaign5 en 4/1/2016 25 none 0
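A single-merge alternative is also possible (a sketch using the column names from the question; the match_key helper column is my own): build a lookup key on df2 that falls back to id_campaign whenever Campaign has no match in df1, then merge once:
# fall back to id_campaign when Campaign is not a key present in df1
tmp = df2.assign(
    match_key=df2['Campaign'].where(df2['Campaign'].isin(df1['Campaign']), df2['id_campaign'])
)
df_joined = tmp.merge(
    df1[['Date', 'Campaign', 'Banner', 'Value_1']].rename(columns={'Campaign': 'match_key'}),
    on=['Date', 'match_key', 'Banner'],
    how='left',
).drop(columns='match_key')
df_joined['Value_1'] = df_joined['Value_1'].fillna(0).astype(int)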
