I have the following code:
import pandas as pd

xactions = pd.DataFrame({'user_wid': {0: 3305613, 1: 57, 2: 80, 3: 31, 4: 38, 5: 12, 6: 35, 7: 25, 8: 42, 9: 16}, 'user_name': {0: 'Ter', 1: 'Am', 2: 'Wi', 3: 'Ma', 4: 'St', 5: 'Ju', 6: 'De', 7: 'Ri', 8: 'Ab', 9: 'Ti'}, 'user_age': {0: 41, 1: 34, 2: 45, 3: 47, 4: 70, 5: 64, 6: 64, 7: 63, 8: 32, 9: 24}, 'user_gender': {0: 'Male', 1: 'Female', 2: 'Male', 3: 'Male', 4: 'Male', 5: 'Female', 6: 'Female', 7: 'Female', 8: 'Female', 9: 'Female'}, 'sale_date': {0: '2018-05-15', 1: '2020-02-28', 2: '2020-04-02', 3: '2020-05-09', 4: '2020-11-29', 5: '2020-12-14', 6: '2020-04-21', 7: '2020-06-15', 8: '2020-07-03', 9: '2020-08-10'}, 'days_since_first_visit': {0: 426, 1: 0, 2: 0, 3: 8, 4: 126, 5: 283, 6: 0, 7: 189, 8: 158, 9: 270}, 'visit': {0: 4, 1: 1, 2: 1, 3: 2, 4: 4, 5: 3, 6: 1, 7: 2, 8: 4, 9: 2}, 'num_user_visits': {0: 4, 1: 2, 2: 1, 3: 2, 4: 10, 5: 7, 6: 1, 7: 4, 8: 4, 9: 2}, 'product': {0: 13, 1: 2, 2: 2, 3: 2, 4: 5, 5: 5, 6: 1, 7: 8, 8: 5, 9: 4}, 'sale_price': {0: 10.0, 1: 0.0, 2: 41.3, 3: 41.3, 4: 49.95, 5: 74.95, 6: 49.95, 7: 5.0, 8: 0.0, 9: 0.0}, 'whether_member': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}})
def f(x):
    d = {}
    d['user_name'] = x['user_name'].max()
    d['user_age'] = x['user_age'].max()
    d['user_gender'] = x['user_gender'].max()
    d['last_visit_date'] = x['sale_date'].max()
    d['days_since_first_visit'] = x['days_since_first_visit'].max()
    d['num_visits_window'] = x['visit'].max()
    d['num_visits_total'] = x['num_user_visits'].max()
    d['products_used'] = x['product'].max()
    d['user_total_sales'] = (x['sale_price'].sum()).round(2)
    d['avg_spend_visit'] = (x['sale_price'].sum() / x['visit'].max()).round(2)
    d['membership'] = x['whether_member'].max()
    return pd.Series(d)
users = xactions.groupby('user_wid').apply(f).reset_index()
It is taking too much time to execute, and I want to optimize the function above.
Any suggestions would be appreciated.
Thanks in advance.
Try:
users2 = xactions.groupby("user_wid", as_index=False).agg(
    user_name=("user_name", "max"),
    user_age=("user_age", "max"),
    user_gender=("user_gender", "max"),
    last_visit_date=("sale_date", "max"),
    days_since_first_visit=("days_since_first_visit", "max"),
    num_visits_window=("visit", "max"),
    num_visits_total=("num_user_visits", "max"),
    products_used=("product", "max"),
    user_total_sales=("sale_price", "sum"),
    membership=("whether_member", "max"),
)
users2["avg_spend_visit"] = (
    users2["user_total_sales"] / users2["num_visits_window"]
).round(2)
print(users2)
Prints:
user_wid user_name user_age user_gender last_visit_date days_since_first_visit num_visits_window num_visits_total products_used user_total_sales membership avg_spend_visit
0 12 Ju 64 Female 2020-12-14 283 3 7 5 74.95 0 24.98
1 16 Ti 24 Female 2020-08-10 270 2 2 4 0.00 0 0.00
2 25 Ri 63 Female 2020-06-15 189 2 4 8 5.00 0 2.50
3 31 Ma 47 Male 2020-05-09 8 2 2 2 41.30 0 20.65
4 35 De 64 Female 2020-04-21 0 1 1 1 49.95 0 49.95
5 38 St 70 Male 2020-11-29 126 4 10 5 49.95 0 12.49
6 42 Ab 32 Female 2020-07-03 158 4 4 5 0.00 0 0.00
7 57 Am 34 Female 2020-02-28 0 1 2 2 0.00 0 0.00
8 80 Wi 45 Male 2020-04-02 0 1 1 2 41.30 0 41.30
9 3305613 Ter 41 Male 2018-05-15 426 4 4 13 10.00 0 2.50
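The speedup comes from swapping the per-group Python function for built-in aggregations, which run in optimized (Cython) code instead of calling f once per group. If you want to confirm the two approaches agree on your data, a quick check might look like this (a sketch, assuming xactions holds the frame from the question):

# Original apply-based result, for comparison
users = xactions.groupby("user_wid").apply(f).reset_index()

# Match the original's rounding of the summed column before comparing;
# check_like ignores column order, check_dtype tolerates object-vs-numeric dtypes
users2["user_total_sales"] = users2["user_total_sales"].round(2)
pd.testing.assert_frame_equal(users, users2, check_like=True, check_dtype=False)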
I want to add new files to a historical table (both are in CSV format, not in a DB). Before that, I need to check the new file against the historical table by comparing two columns in particular: state and the date column, yyyy_mm. First, I need to find max(state, yyyy_mm) in the new file, then look those entries up in the historical table; if they are not in the historical table, append them, otherwise do nothing.
So far I am able to pick the rows with max(state, yyyy_mm), but when I try to compare those picked rows with the historical table, I am not getting the expected output. I tried pandas.merge and pandas.concat, but the output does not match what I expect. Can anyone point out how to do this in pandas? Any thoughts?
Input data:
>>> src_df.to_dict()
{'yyyy_mm': {0: 202001,
1: 202002,
2: 202003,
3: 202002,
4: 202107,
5: 202108,
6: 202109},
'state': {0: 'CA', 1: 'NJ', 2: 'NY', 3: 'NY', 4: 'PA', 5: 'PA', 6: 'PA'},
'col1': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3},
'col2': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4},
'col3': {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7}}
>>> hist_df.to_dict()
{'yyyy_mm': {0: 202101,
1: 202002,
2: 202001,
3: 201901,
4: 201907,
5: 201908,
6: 201901,
7: 201907,
8: 201908},
'state': {0: 'CA',
1: 'NJ',
2: 'NY',
3: 'NY',
4: 'NY',
5: 'NY',
6: 'PA',
7: 'PA',
8: 'PA'},
'col1': {0: 1, 1: 3, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4},
'col2': {0: 1, 1: 3, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
'col3': {0: 1, 1: 7, 2: 8, 3: 8, 4: 8, 5: 8, 6: 8, 7: 8, 8: 8}}
My current attempt:
picked_rows = src_df.loc[src_df.groupby('state')['yyyy_mm'].idxmax()]
>>> picked_rows.to_dict()
{'yyyy_mm': {0: 202001, 1: 202002, 2: 202003, 6: 202109},
'state': {0: 'CA', 1: 'NJ', 2: 'NY', 6: 'PA'},
'col1': {0: 3, 1: 3, 2: 3, 6: 3},
'col2': {0: 3, 1: 3, 2: 3, 6: 4},
'col3': {0: 7, 1: 7, 2: 7, 6: 7}}
Then I tried the following, but the output is not the same as my expected output:
output_df = pd.concat(picked_rows, hist_df, keys=['state', 'yyyy_mm'], axis=1) # first attempt
output_df = pd.merge(picked_rows, hist_df, how='outer') # second attempt
Neither attempt gives me my expected output. How do I compare the two dataframes so that picked_rows is conditionally appended to hist_df based on max(state, yyyy_mm)? How should this be done in pandas?
Objective
I want to check picked_rows against hist_df by the state and yyyy_mm columns, and only add the entries from picked_rows whose state has the max (most recent) yyyy_mm. I created the desired output below. I tried an inner join and pandas.concat, but neither gives the correct output. Does anyone have any ideas on this?
Here is my desired output that I want to get:
yyyy_mm state col1 col2 col3
0 202101 CA 1 1 1
1 202002 NJ 3 3 7
2 202001 NY 4 5 8
3 201901 NY 4 5 8
4 201907 NY 4 5 8
5 201908 NY 4 5 8
6 201901 PA 4 5 8
7 201907 PA 4 5 8
8 201908 PA 4 5 8
9 202003 NY 3 3 7
10 202109 PA 3 4 7
You should change your picked_rows DataFrame to only include dates that are greater than the hist_df dates:
#keep only rows that are newer than in hist_df
new_data = src_df[src_df["yyyy_mm"].gt(src_df["state"].map(hist_df.groupby("state")["yyyy_mm"].max()))]
#of the new rows, keep the latest updated values
picked_rows = new_data.loc[new_data.groupby("state")["yyyy_mm"].idxmax()]
#concat to hist_df
output_df = pd.concat([hist_df, picked_rows], ignore_index=True)
>>> output_df
yyyy_mm state col1 col2 col3
0 202101 CA 1 1 1
1 202002 NJ 3 3 7
2 202001 NY 4 5 8
3 201901 NY 4 5 8
4 201907 NY 4 5 8
5 201908 NY 4 5 8
6 201901 PA 4 5 8
7 201907 PA 4 5 8
8 201908 PA 4 5 8
9 202003 NY 3 3 7
10 202109 PA 3 4 7
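The map call does the heavy lifting here: it looks up, for each row of src_df, the latest yyyy_mm that hist_df already holds for that row's state, so the .gt() comparison keeps only genuinely newer rows. Broken out for clarity (same frames as above, with a hypothetical intermediate name):

# latest period already present in hist_df, per state
latest_in_hist = hist_df.groupby("state")["yyyy_mm"].max()
print(latest_in_hist)
# state
# CA    202101
# NJ    202002
# NY    202001
# PA    201908
# Name: yyyy_mm, dtype: int64

# aligned to src_df's rows via map; True marks rows newer than anything in hist_df
is_newer = src_df["yyyy_mm"].gt(src_df["state"].map(latest_in_hist))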
There are multiple columns in the df, out of which only selected columns have to be converted from hexadecimal to decimal.
The selected column names are stored in a list A = ["Type 2", "Type 4"]
df = pd.DataFrame({'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'CC',
3: '55',
4: '88',
5: '96',
6: 'FF',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}})
Say you have the string "AA" in hex.
You can convert hex to decimal like this:
str(int("AA", 16))
Similarly, for a dataframe column that has hexadecimal values, you can use a lambda function.
df['Type 2'] = df['Type 2'].apply(lambda x: str(int(str(x), 16)))
This assumes df is the name of the DataFrame. Note that the outer str() leaves the result as a string; drop it if you want numeric values.
You can use pandas.DataFrame.applymap to cast element-wise:
>>> df[["Type 2", "Type 4"]].applymap(lambda n: int(n, 16))
Type 2 Type 4
0 170 35
1 187 65278
2 204 43981
3 85 56797
4 136 3501
5 150 3326
6 255 53058
7 16777215 801
8 65262 0
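To convert only the columns named in the list A from the question and write the results back, you can assign to the same column subset (a sketch, assuming df is built from the dict above). Note that pandas 2.1 renamed DataFrame.applymap to DataFrame.map, so on newer versions applymap still works but emits a deprecation warning:

A = ["Type 2", "Type 4"]

# convert just the selected hex-string columns to integers, element-wise
df[A] = df[A].applymap(lambda n: int(n, 16))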
I have the following dataset, which I have extracted from a pandas DataFrame:
from numpy import nan

df = pd.DataFrame({'Batch': {0: 'Nos705', 1: 'Nos706', 2: 'Nos707', 3: 'Nos708', 4: 'Nos709', 5: 'Nos710', 6: 'Nos711', 7: 'Nos713', 8: 'Nos714', 9: 'Nos715'},
'Message': {0: 'ACBB', 1: 'ACBL', 2: 'ACBL', 3: 'ACBC', 4: 'ACBC', 5: 'ACBC', 6: 'ACBL', 7: 'ACBL', 8: 'ACBL', 9: 'ACBL'},
'DCC': {0: 284, 1: 21, 2: 43, 3: 19, 4: 0, 5: 0, 6: 19, 7: 27, 8: 27, 9: 19},
'DCB': {0: 299, 1: 22, 2: 24, 3: 28, 4: 167, 5: 167, 6: 20, 7: 27, 8: 27, 9: 28},
'ISC': {0: 'Car010030', 1: 'Car010054', 2: 'Car010047', 3: 'Car010182', 4: 'Car010004', 5: 'Car010004', 6: 'Car010182', 7: 'Car010182', 8: 'Car010182', 9: 'Car010182'},
'ISB': {0: 'Car010010', 1: None, 2: None, 3: None, 4: None, 5: None, 6: None, 7: None, 8: None, 9: None},
'VSC': {0: 25, 1: 25, 2: 25, 3: 25, 4: 25, 5: 25, 6: 25, 7: 25, 8: 25, 9: 25},
'VSB': {0: 27.0, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan},
'PGC': {0: 2.78, 1: 2.79, 2: 2.08, 3: 2.08, 4: 2.08, 5: 2.08, 6: 2.71, 7: 1.73, 8: 1.73, 9: 1.73},
'PGB': {0: 2.95, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan},
'PHB': {0: 2.96, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan},
'PHC': {0: 2.94, 1: 2.94, 2: 1.63, 3: 1.63, 4: 1.63, 5: 1.63, 6: 2.06, 7: 1.75, 8: 1.75, 9: 1.75},
'BPC': {0: 3.17, 1: 3.17, 2: 3.17, 3: 3.17, 4: 3.17, 5: 3.17, 6: 3.17, 7: 3.17, 8: 3.17, 9: 3.17},
'BPB': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None, 6: None, 7: None, 8: None, 9: None}})
I want to create a dataframe in which related columns are stacked,
e.g. all values of DCC & DCB should appear in one column, one below another. Similarly for ISC & ISB, VSC & VSB, PGC & PGB, PHC & PHB, BPC & BPB.
Batch remains the primary key here. How do I do this in Python?
First move the identifier columns Batch and Message to the index:
df1 = df.set_index(['Batch','Message'])
Then build a MultiIndex in the columns from each column name's prefix (all characters except the last) and its last character, using MultiIndex.from_arrays; reshape with DataFrame.stack; and, for the correct order, add DataFrame.sort_values:
df1.columns = pd.MultiIndex.from_arrays([df1.columns.str[:-1], df1.columns.str[-1]],
names=[None, 'types'])
df1 = (df1.stack(dropna=False)
.reset_index()
.sort_values(['Batch','Message','types'],
ascending=[True, True, False],
ignore_index=True))
print (df1)
Batch Message types BP DC IS PG PH VS
0 Nos705 ACBB C 3.17 284 Car010030 2.78 2.94 25.0
1 Nos705 ACBB B None 299 Car010010 2.95 2.96 27.0
2 Nos706 ACBL C 3.17 21 Car010054 2.79 2.94 25.0
3 Nos706 ACBL B None 22 None NaN NaN NaN
4 Nos707 ACBL C 3.17 43 Car010047 2.08 1.63 25.0
5 Nos707 ACBL B None 24 None NaN NaN NaN
6 Nos708 ACBC C 3.17 19 Car010182 2.08 1.63 25.0
7 Nos708 ACBC B None 28 None NaN NaN NaN
8 Nos709 ACBC C 3.17 0 Car010004 2.08 1.63 25.0
9 Nos709 ACBC B None 167 None NaN NaN NaN
10 Nos710 ACBC C 3.17 0 Car010004 2.08 1.63 25.0
11 Nos710 ACBC B None 167 None NaN NaN NaN
12 Nos711 ACBL C 3.17 19 Car010182 2.71 2.06 25.0
13 Nos711 ACBL B None 20 None NaN NaN NaN
14 Nos713 ACBL C 3.17 27 Car010182 1.73 1.75 25.0
15 Nos713 ACBL B None 27 None NaN NaN NaN
16 Nos714 ACBL C 3.17 27 Car010182 1.73 1.75 25.0
17 Nos714 ACBL B None 27 None NaN NaN NaN
18 Nos715 ACBL C 3.17 19 Car010182 1.73 1.75 25.0
19 Nos715 ACBL B None 28 None NaN NaN NaN
Last, if necessary, remove the types column:
df1 = df1.drop('types', axis=1)
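An alternative for this kind of prefix/suffix reshape is pandas.wide_to_long, which splits the column names in one call (a sketch, assuming df is built from the dict in the question; the column order of the result differs from the stack version):

df1 = pd.wide_to_long(
    df,
    stubnames=["DC", "IS", "VS", "PG", "PH", "BP"],  # shared prefixes of the paired columns
    i=["Batch", "Message"],   # identifier columns
    j="types",                # new column receiving the C/B suffix
    suffix="[CB]",            # suffixes are letters, so override the numeric default
).reset_index()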
I have df:
df = pd.DataFrame({'period': {0: pd.Timestamp('2016-05-01 00:00:00'),
1: pd.Timestamp('2017-05-01 00:00:00'),
2: pd.Timestamp('2018-03-01 00:00:00'),
3: pd.Timestamp('2018-04-01 00:00:00'),
4: pd.Timestamp('2016-05-01 00:00:00'),
5: pd.Timestamp('2017-05-01 00:00:00'),
6: pd.Timestamp('2016-03-01 00:00:00'),
7: pd.Timestamp('2016-04-01 00:00:00')},
'cost2': {0: 15,
1: 144,
2: 44,
3: 34,
4: 13,
5: 11,
6: 12,
7: 13},
'rev2': {0: 154,
1: 13,
2: 33,
3: 37,
4: 15,
5: 11,
6: 12,
7: 13},
'cost1': {0: 19,
1: 39,
2: 53,
3: 16,
4: 19,
5: 11,
6: 12,
7: 13},
'rev1': {0: 34,
1: 34,
2: 74,
3: 22,
4: 34,
5: 11,
6: 12,
7: 13},
'destination': {0: 'YYZ',
1: 'YYZ',
2: 'YYZ',
3: 'YYZ',
4: 'DFW',
5: 'DFW',
6: 'DFW',
7: 'DFW'},
'source': {0: 'SFO',
1: 'SFO',
2: 'SFO',
3: 'SFO',
4: 'MIA',
5: 'MIA',
6: 'MIA',
7: 'MIA'}})
df = df[['source','destination','period','rev1','rev2','cost1','cost2']]
I want the final df to have the following columns:
2017-05-01 2016-05-01
source, destination, rev1, rev2, cost1, cost2, rev1, rev2, cost1, cost2...
So essentially, for every source/destination pair, I want revenue and cost numbers for each date in a single row.
I've been tinkering with stack and unstack but haven't been able to achieve my objective.
You can use set_index + unstack to reshape from long to wide, then swaplevel to put the column MultiIndex in the format you need:
df.set_index(['destination','source','period']).unstack().swaplevel(0,1,axis=1).sort_index(level=0,axis=1)
An alternative to .set_index + .unstack is .pivot_table:
df.pivot_table(
    index=['source', 'destination'],
    columns=['period'],
    values=['rev1', 'rev2', 'cost1', 'cost2']
).swaplevel(axis=1).sort_index(axis=1, level=0)
# period 2016-03-01 2016-04-01 ...
# cost1 cost2 rev1 rev2 cost1 cost2 rev1 rev2
# source destination
# MIA DFW 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0
# SFO YYZ NaN NaN NaN NaN NaN NaN NaN NaN
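If the two-level column index is awkward to work with downstream, one option (a sketch) is to flatten it into single strings after the pivot:

wide = (
    df.pivot_table(
        index=["source", "destination"],
        columns=["period"],
        values=["rev1", "rev2", "cost1", "cost2"],
    )
    .swaplevel(axis=1)
    .sort_index(axis=1, level=0)
)

# collapse each (period, metric) pair into a flat name like "2016-05-01 rev1"
wide.columns = [f"{period.date()} {metric}" for period, metric in wide.columns]
wide = wide.reset_index()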
I want to join two dataframes:
df1 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: '12345'},
'Country ': {0: 'de', 1: 'it', 2: 'de'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20}})
df2 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3', 3: 'banner4', 4: 'banner5'},
'Campaign': {0: 'campaign1',1: 'campaign2', 2: 'none',3: 'campaign4',4: 'campaign5'},
'Country ': {0: 'de', 1: 'it', 2: 'de', 3: 'en', 4: 'en'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016', 3: '3/1/2016', 4: '4/1/2016'},
'Value_2': {0: 5, 1: 10, 2: 15, 3: 20, 4: 25},
'id_campaign': {0: 'none', 1: 'none', 2: '12345', 3: 'none', 4: 'none'}})
Edit:
let's even imagine this variant of df1 (without the Country column):
df1 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: '12345'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20}})
I have to join df2 and df1 on the keys:
Date
Campaign
Banner
The issue here is that when the match under the key "Campaign" is not found, the key should be switched to field "id_campaign".
I would like to obtain this dataframe:
df_joined = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3', 3: 'banner4', 4: 'banner5'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: 'none', 3: 'campaign4', 4: 'campaign5'},
'Country ': {0: 'de', 1: 'it', 2: 'de', 3: 'en', 4: 'en'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016', 3: '3/1/2016', 4: '4/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20, 3: 0, 4: 0},
'Value_2': {0: 5, 1: 10, 2: 15, 3: 20, 4: 25},
'id_campaign': {0: 'none', 1: 'none', 2: '12345', 3: 'none', 4: 'none'}})
Any help is really appreciated.
You can use a double merge, first on 3 keys and then on 2 keys, and then fill the unmatched values via combine_first with the Value_1 column of df4:
df3 = pd.merge(df2, df1.drop('Country ', axis=1), on=['Date','Campaign','Banner'], how='left')  # note the trailing space in df1's 'Country ' column name
df4 = pd.merge(df2, df1, on=['Date','Banner'], how='left')
print (df3)
Banner Campaign Country Date Value_2 id_campaign Value_1
0 banner1 campaign1 de 1/1/2016 5 none 10.0
1 banner2 campaign2 it 2/1/2016 10 none 5.0
2 banner3 none de 1/1/2016 15 12345 NaN
3 banner4 campaign4 en 3/1/2016 20 none NaN
4 banner5 campaign5 en 4/1/2016 25 none NaN
print (df4['Value_1'])
0 10.0
1 5.0
2 20.0
3 NaN
4 NaN
Name: Value_1, dtype: float64
df3['Value_1'] = df3['Value_1'].combine_first(df4['Value_1']).fillna(0).astype(int)
print (df3)
Banner Campaign Country Date Value_2 id_campaign Value_1
0 banner1 campaign1 de 1/1/2016 5 none 10
1 banner2 campaign2 it 2/1/2016 10 none 5
2 banner3 none de 1/1/2016 15 12345 20
3 banner4 campaign4 en 3/1/2016 20 none 0
4 banner5 campaign5 en 4/1/2016 25 none 0
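An alternative sketch that needs only a single merge: build the fallback key on df2 up front, substituting id_campaign wherever Campaign is the 'none' placeholder (this assumes, as in the sample data, that 'none' marks exactly the rows needing the fallback):

# effective join key: Campaign, or id_campaign where Campaign is 'none'
df2_keyed = df2.assign(
    Campaign_key=df2["Campaign"].where(df2["Campaign"] != "none", df2["id_campaign"])
)

# line df1 up with the synthetic key; again note the trailing space in 'Country '
left = df1.drop('Country ', axis=1).rename(columns={"Campaign": "Campaign_key"})

df_joined = df2_keyed.merge(left, on=["Date", "Campaign_key", "Banner"], how="left")
df_joined["Value_1"] = df_joined["Value_1"].fillna(0).astype(int)
df_joined = df_joined.drop("Campaign_key", axis=1)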