Transaction data analysis - python

I have a transactions data frame (15k lines):
customer_id  order_id  order_date  var1  var2  product_id  product_type_id  designer_id  gross_spend  net_spend
         79    822067  1990-10-21     0     0       51818              139          322        0.174      0.174
         79    771456  1990-11-29     0     0      580866              139         2366        1.236      1.236
         79    771456  1990-11-29     0     0      924147              432          919        0.205      0.205
        156    720709  1990-06-08     0     0      167205              474         4792        0.374      0.374
        156    720709  1990-06-08     0     0      132120              164         2243        0.278      0.278
I'd like to group by product_type_id and by time bin of the transaction for each customer. To be clearer: for each customer_id I'd like to know how many times the customer bought from the same category in the last 30, 60, 90, 120, 150, 180 and 360 days (counting back from a reference date, e.g. 1991-01-01).
For each customer I'd also like the total number of purchases made, the number of distinct product_type_id values purchased from, and the total net_spend.
It is not clear to me how to reduce the data to a flat pandas DataFrame with one line per customer_id.
I can get a simplified view with something like:
import datetime as dt

transactions['order_date'] = transactions['order_date'].apply(lambda x: dt.datetime.strptime(x, "%Y-%m-%d"))
NOW = dt.datetime(1991, 1, 1)
Table = transactions.groupby('customer_id').agg({'order_date': lambda x: (NOW - x.max()).days,
                                                 'order_id': lambda x: len(set(x)),
                                                 'net_spend': lambda x: x.sum()})
Table.rename(columns={'order_date': 'Recency', 'order_id': 'Frequency', 'net_spend': 'Monetization'}, inplace=True)

Use:
date = '1991-01-01'
last = [30,60,90]
#get all last datetimes shifted by last
a = [pd.to_datetime(date)- pd.Timedelta(x, unit='d') for x in last]
d1 = {}
#create new columns by conditions with between
for i, x in enumerate(a):
    df['last_' + str(last[i])] = df['order_date'].between(x, date).astype(int)
    #create dictionary for aggregate
    d1['last_' + str(last[i])] = 'sum'
#aggregating dictionary
d = {'customer_id':'size', 'product_type_id':'nunique', 'net_spend':'sum'}
#add d1 to d
d.update(d1)
print (d)
{'product_type_id': 'nunique', 'last_30': 'sum', 'net_spend': 'sum',
'last_60': 'sum', 'customer_id': 'size', 'last_90': 'sum'}
df1 = df.groupby('customer_id').agg(d)
#change order of columns if necessary
cs = df1.columns
m = cs.str.startswith('last')
cols = cs[~m].tolist() + cs[m].tolist()
df1 = df1.reindex(columns=cols)
print (df1)
             product_type_id  net_spend  customer_id  last_30  last_60  last_90
customer_id
79                         2      1.615            3        0        2        3
156                        2      0.652            2        0        0        0
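If the per-category breakdown is also needed (how many times each customer bought from the same product_type_id in each window), the same flag columns can be aggregated per customer and category. A sketch, assuming the last_* columns created in the loop above are still on df (per_category is a hypothetical name):
d2 = {'order_id': 'nunique', 'net_spend': 'sum'}
d2.update(d1)
#one row per (customer, category) with purchase counts per window
per_category = df.groupby(['customer_id', 'product_type_id']).agg(d2)
print(per_category)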

Find how many customers returned after 1 year of inactivity, 2 years of inactivity, etc.

I have a transaction data frame as follows.
customer_id order_date product_id value($)
139 2015-07-08 A 0.174
139 2015-06-08 B 1.236
432 2017-08-09 E 0.205
474 2019-08-27 A 0.374
164 2022-05-08 D 0.278
How do I find how many customers returned after one year (365 days) of inactivity, two years (730 days) of inactivity etc.?
Preparations:
df['order_date'] = pd.to_datetime(df['order_date'])
inact_period = pd.Timedelta('365 days')
Solution:
def f(ser, inact_period):
    ser = ser.sort_values()
    return (ser - ser.shift() > inact_period).any()

n = df.groupby('customer_id')['order_date'].agg(f, inact_period=inact_period).sum()
n is an integer - the result.
If one customer that has multiple inactivity periods should count multiple times, then you can replace .any() with .sum().
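A sketch of that variant, keeping the same signature (f_count_gaps is a hypothetical name):
def f_count_gaps(ser, inact_period):
    ser = ser.sort_values()
    # count every gap longer than inact_period, not just whether one exists
    return (ser - ser.shift() > inact_period).sum()

n = df.groupby('customer_id')['order_date'].agg(f_count_gaps, inact_period=inact_period).sum()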
Example:
df = pd.DataFrame({'customer_id': [139, 139, 139, 139, 164],
                   'order_date': pd.to_datetime(['2015', '2016', '2017', '2019', '2022'])})
customer_id order_date
0 139 2015-01-01
1 139 2016-01-01
2 139 2017-01-01
3 139 2019-01-01
4 164 2022-01-01
With .sum():
If I set inact_period = pd.Timedelta('365 days'), then n == 2 (since 2016 is a leap year there are 366 days between 2016-01-01 and 2017-01-01).
If I set inact_period = pd.Timedelta('366 days'), then n == 1.
With .any():
If I set inact_period = pd.Timedelta('1 d'), then n == 1.
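If the goal is to see which customers returned rather than only how many, the boolean aggregation above can be kept before summing. A sketch (returned is a hypothetical name):
returned = df.groupby('customer_id')['order_date'].agg(f, inact_period=inact_period)
#index of customers with at least one gap longer than inact_period
returned_customers = returned[returned].index.tolist()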

merge dataframes and add price data for each instance of an item ID

I am trying to merge two dataframes so that each instance of an item ID in DF3 displays the pricing data associated with the matching ID from DF1.
DF3 (what I am trying to accomplish)
recipeID | itemID_out | qty_out | buy_price | sell_price | buy_quantity | sell_quantity | id_1_in | qty_id1 | buy_price | sell_price | buy_quantity | sell_quantity | id_2_in | qty_id2 | buy_price | sell_price | buy_quantity | sell_quantity | id_3_in | qty_id3 | buy_price | sell_price | buy_quantity | sell_quantity | id_4_in | qty_id4 | buy_price | sell_price | buy_quantity | sell_quantity | id_5_in | qty_id5 | buy_price | sell_price | buy_quantity | sell_quantity
1 | 1986 | 1 | 129 | 167 | 67267 | 21637 | 123 | 1 | 10 | 15 | 1500 | 3000 | 124 | 1 | 12 | 14 | 550 | 800 | 125 | 1 | 8 | 12 | 124 | 254 | 126 | 1 | 22 | 25 | 1251 | 890 | 127 | 1 | 64 | 72 | 12783 | 1251515
2 | 1987 | 1 | 1521 | 1675 | 654 | 1245 | 123 | 2 | 10 | 15 | 1500 | 3000 | (input slots 2-5 empty)
3 | 1988 | 1 | 128376 | 131429 | 47 | 23 | 123 | 10 | 10 | 15 | 1500 | 3000 | 124 | 3 | 12 | 14 | 550 | 800 | (input slots 3-5 empty)
These are the two dataframes I am trying to merge from.
DF1: Contains 26863 rows; master list of item names, IDs, and price data. Pulled from API, new items can be added and will appear as new rows after an update request from the user.
itemID  name  buy_price  sell_price  buy_quantity  sell_quantity
  1986   XYZ        129         167         67267          21637
   123   ABC         10          15          1500           3000
   124   DEF         12          14           550            800
DF2: Contains 12784 rows; recipes that combine items from the master list. Pulled from API; new recipes can be added and will appear as new rows after an update request from the user.
recipeID  itemID_out  qty_out  id_1_in  qty_id1  id_2_in  qty_id2  id_3_in  qty_id3  id_4_in  qty_id4  id_5_in  qty_id5
       1        1986        1      123        1      124        1      125        1      126        1      127        1
       2        1987        1      123        2
       3        1988        1      123       10      124        3
Recipes can contain a combination of 1 to 5 items (null values occur) that consist of IDs from DF1 and/or the itemID_out column in DF2.
The "id_#_in" columns in DF2 can contain item IDs from the "itemID_out" column, due to that recipe using the item that is being output from another recipe.
I have tried to merge it using:
pd.merge(itemlist_modified, recipelist_modified, left_on='itemID', right_on='itemID_out')
But this only ever results in a single column of IDs receiving the pricing data as intended.
I feel like I'm trying to use the wrong function for this, any help would be very much appreciated!
Thanks in advance!
Not a pretty approach, but it first melts the ingredient table into long form and then merges it with the itemlist table:
import pandas as pd
import numpy as np

itemlist_modified = pd.DataFrame({
    'itemID': [1986, 123, 124],
    'name': ['XYZ', 'ABC', 'DEF'],
    'buy_price': [129, 10, 12],
    'sell_price': [167, 15, 14],
    'buy_quantity': [67267, 1500, 550],
    'sell_quantity': [21637, 3000, 800],
})
recipelist_modified = pd.DataFrame({
    'RecipeID': [1, 2, 3],
    'itemID_out': [1986, 1987, 1988],
    'qty_out': [1, 1, 1],
    'id_1_in': [123, 123, 123],
    'qty_id1': [1, 2, 10],
    'id_2_in': [124.0, np.nan, 124.0],
    'qty_id2': [1.0, np.nan, 3.0],
    'id_3_in': [125.0, np.nan, np.nan],
    'qty_id3': [1.0, np.nan, np.nan],
    'id_4_in': [126.0, np.nan, np.nan],
    'qty_id4': [1.0, np.nan, np.nan],
    'id_5_in': [127.0, np.nan, np.nan],
    'qty_id5': [1.0, np.nan, np.nan],
})

#columns which are not qty or input id cols
id_vars = ['RecipeID', 'itemID_out', 'qty_out']

#prepare dict to map column name to ingredient number
col_renames = {}
col_renames.update({'id_{}_in'.format(i+1): 'ingr_{}'.format(i+1) for i in range(5)})
col_renames.update({'qty_id{}'.format(i+1): 'ingr_{}'.format(i+1) for i in range(5)})

#melt recipelist into long form
long_recipelist = recipelist_modified.melt(
    id_vars=id_vars,
    var_name='ingredient',
).dropna()

#add a new column to specify whether each row is a qty or an id
long_recipelist['kind'] = np.where(long_recipelist['ingredient'].str.contains('qty'), 'qty_in', 'id_in')

#convert ingredient names
long_recipelist['ingredient'] = long_recipelist['ingredient'].map(col_renames)

#pivot on the new ingredient column
reshape_recipe_list = long_recipelist.pivot(
    index=['RecipeID', 'itemID_out', 'qty_out', 'ingredient'],
    columns='kind',
    values='value',
).reset_index()

#merge with the itemlist
priced_ingredients = pd.merge(reshape_recipe_list, itemlist_modified, left_on='id_in', right_on='itemID')

#pivot on the priced ingredients
priced_ingredients = priced_ingredients.pivot(
    index=['RecipeID', 'itemID_out', 'qty_out'],
    columns='ingredient',
)

#flatten the hierarchical columns
priced_ingredients.columns = ["_".join(a[::-1]) for a in priced_ingredients.columns.to_flat_index()]
priced_ingredients.columns.name = ''
priced_ingredients = priced_ingredients.reset_index()
priced_ingredients partial output:
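A simpler, if more repetitive, alternative is to merge the price table once per input-ID column, suffixing the price columns with the slot number. A sketch against the same example frames and column names as above (priced is a hypothetical name):
priced = recipelist_modified.copy()
price_cols = ['itemID', 'buy_price', 'sell_price', 'buy_quantity', 'sell_quantity']
for i in range(1, 6):
    id_col = 'id_{}_in'.format(i)
    priced = priced.merge(
        itemlist_modified[price_cols].add_suffix('_{}'.format(i)),
        left_on=id_col,
        right_on='itemID_{}'.format(i),
        how='left',   # keeps recipes whose slot is empty (NaN)
    ).drop(columns='itemID_{}'.format(i))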

Iterating through a dataframe to pull mins and maxes, as well as other columns based on those values

I'm fairly new to python and very new to pandas so any help would be appreciated!
I have a dataframe where the data is structured like below:
Batch_Name  Tag 1  Tag 2
2019-01         1      3
2019-02         2      3
I want to iterate through the dataframe and pull the following into a new dataframe:
The max value for each tag (there are 5 in my full data frame)
The batch name at that max value
The min value for that tag
The batch name at that min value
The average for that tag
The std for that tag
I have had a lot of trouble trying to mentally structure this, and I run into errors even trying to create the dataframe with the summary statistics. Below is my first attempt at a function for the stats; I wasn't sure how to pull the batch names at all.
def tag_stats(df):
    min_col = {}
    min_col_batch = {}
    max_col = {}
    max_col_batch = {}
    std_col = {}
    avg_col = {}
    for col in range(df.shape[3:]):
        max_col[col] = df[col].max()
        min_col[col] = df[col].min()
        std_col[col] = df[col].std()
        avg_col[col] = df[col].avg()
    result = pd.DataFrame([min_col, max_col, std_col, avg_col], index=['min', 'max', 'std', 'avg'])
    return result
Here is an answer based on your code!
import pandas as pd
import numpy as np

#Slightly modified your function
def tag_stats(df, tag_list):
    df = df.set_index('Batch_Name')
    data = {
        'tag': [],
        'min': [],
        'max': [],
        'min_batch': [],
        'max_batch': [],
        'std': [],
        'mean': [],
    }
    for tag in tag_list:
        values = df[tag]
        data['tag'].append(tag)
        data['min'].append(values.min())
        data['max'].append(values.max())
        data['min_batch'].append(values.idxmin())
        data['max_batch'].append(values.idxmax())
        data['std'].append(values.std())
        data['mean'].append(values.mean())
    result = pd.DataFrame(data)
    return result

#Create a df using some random data
np.random.seed(1)
num_batches = 10
df = pd.DataFrame({
    'Batch_Name': ['batch_{}'.format(i) for i in range(num_batches)],
    'Tag 1': np.random.randint(1, 100, num_batches),
    'Tag 2': np.random.randint(1, 100, num_batches),
    'Tag 3': np.random.randint(1, 100, num_batches),
    'Tag 4': np.random.randint(1, 100, num_batches),
    'Tag 5': np.random.randint(1, 100, num_batches),
})

#Apply your function
cols = ['Tag 1', 'Tag 2', 'Tag 3', 'Tag 4', 'Tag 5']
summary_df = tag_stats(df, cols)
print(summary_df)
Output
tag min max min_batch max_batch std mean
0 Tag 1 2 80 batch_9 batch_6 32.200759 38.0
1 Tag 2 7 85 batch_2 batch_7 28.926919 39.9
2 Tag 3 14 97 batch_9 batch_7 33.297314 63.4
3 Tag 4 1 82 batch_7 batch_9 31.060693 37.1
4 Tag 5 4 89 batch_7 batch_1 31.212711 43.3
The comment from @It_is_Chris is great too; here is an answer based on it:
import pandas as pd
import numpy as np

#Create a df using some random data
np.random.seed(1)
num_batches = 10
df = pd.DataFrame({
    'Batch_Name': ['batch_{}'.format(i) for i in range(num_batches)],
    'Tag 1': np.random.randint(1, 100, num_batches),
    'Tag 2': np.random.randint(1, 100, num_batches),
    'Tag 3': np.random.randint(1, 100, num_batches),
    'Tag 4': np.random.randint(1, 100, num_batches),
    'Tag 5': np.random.randint(1, 100, num_batches),
})

#Convert to a long df indexed by Batch_Name:
# index   | tag   | tag_value
# ---------------------------
# batch_0 | Tag 1 | 38
# batch_1 | Tag 1 | 13
# batch_2 | Tag 1 | 73
long_df = df.melt(
    id_vars='Batch_Name',
    var_name='tag',
    value_name='tag_value',
).set_index('Batch_Name')

#Groupby tag and aggregate to get columns of interest
summary_df = long_df.groupby('tag').agg(
    max_value=('tag_value', 'max'),
    max_batch=('tag_value', 'idxmax'),
    min_value=('tag_value', 'min'),
    min_batch=('tag_value', 'idxmin'),
    mean_value=('tag_value', 'mean'),
    std_value=('tag_value', 'std'),
).reset_index()
summary_df
Output:
tag max_value max_batch min_value min_batch mean_value std_value
0 Tag 1 80 batch_6 2 batch_9 38.0 32.200759
1 Tag 2 85 batch_7 7 batch_2 39.9 28.926919
2 Tag 3 97 batch_7 14 batch_9 63.4 33.297314
3 Tag 4 82 batch_9 1 batch_7 37.1 31.060693
4 Tag 5 89 batch_1 4 batch_7 43.3 31.212711

How to group by and aggregate on multiple columns in pandas

I have the following dataframe in pandas
ID Balance ATM_drawings Value
1 100 50 345
1 150 33 233
2 100 100 333
2 100 100 234
I want the data in this desired format:
ID  Balance_mean  Balance_sum  ATM_drawings_mean  ATM_drawings_sum
1            125          250               41.5                83
2            100          200              100.0               200
I am using the following command to do it in pandas:
df1= df[['Balance','ATM_drawings']].groupby('ID', as_index = False).agg(['mean', 'sum']).reset_index()
But, it does not give what I intended to get.
You can use a dictionary to specify aggregation functions for each series:
d = {'Balance': ['mean', 'sum'], 'ATM_drawings': ['mean', 'sum']}
res = df.groupby('ID').agg(d)
# flatten MultiIndex columns
res.columns = ['_'.join(col) for col in res.columns.values]
print(res)
Balance_mean Balance_sum ATM_drawings_mean ATM_drawings_sum
ID
1 125 250 41.5 83
2 100 200 100.0 200
Or you can define d via dict.fromkeys:
d = dict.fromkeys(('Balance', 'ATM_drawings'), ['mean', 'sum'])
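If flat column names are wanted directly, named aggregation (available in newer pandas versions) is another option; a sketch on the same df:
res = df.groupby('ID').agg(
    Balance_mean=('Balance', 'mean'),
    Balance_sum=('Balance', 'sum'),
    ATM_drawings_mean=('ATM_drawings', 'mean'),
    ATM_drawings_sum=('ATM_drawings', 'sum'),
).reset_index()
print(res)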
Not sure how to achieve this using agg, but you could reuse the `groupby` object to avoid having to do the operation multiple times, and then use transformations:
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Balance": [100, 150, 100, 100],
    "ATM_drawings": [50, 33, 100, 100],
    "Value": [345, 233, 333, 234]
})

gb = df.groupby("ID")
df["Balance_mean"] = gb["Balance"].transform("mean")
df["Balance_sum"] = gb["Balance"].transform("sum")
df["ATM_drawings_mean"] = gb["ATM_drawings"].transform("mean")
df["ATM_drawings_sum"] = gb["ATM_drawings"].transform("sum")
print(df)
Which yields:
ID Balance Balance_mean Balance_sum ATM_drawings ATM_drawings_mean ATM_drawings_sum Value
0 1 100 125 250 50 41.5 83 345
1 1 150 125 250 33 41.5 83 233
2 2 100 100 200 100 100.0 200 333
3 2 100 100 200 100 100.0 200 234
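If one row per ID is wanted from this transform-based approach, the duplicated rows can be collapsed afterwards; a sketch (summary is a hypothetical name):
summary = (df
           .drop(columns=['Balance', 'ATM_drawings', 'Value'])   # keep only the aggregated columns
           .drop_duplicates(subset='ID')
           .reset_index(drop=True))
print(summary)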

pandas: how to run a pivot with a multi-index?

I would like to run a pivot on a pandas DataFrame, with the index being two columns, not one. For example, one field for the year, one for the month, an 'item' field which shows 'item 1' and 'item 2' and a 'value' field with numerical values. I want the index to be year + month.
The only way I managed to get this to work was to combine the two fields into one, then separate them again. is there a better way?
Minimal code copied below. Thanks a lot!
PS Yes, I am aware there are other questions with the keywords 'pivot' and 'multi-index', but I did not understand if/how they can help me with this question.
import pandas as pd
import numpy as np
df= pd.DataFrame()
month = np.arange(1, 13)
values1 = np.random.randint(0, 100, 12)
values2 = np.random.randint(200, 300, 12)
df['month'] = np.hstack((month, month))
df['year'] = 2004
df['value'] = np.hstack((values1, values2))
df['item'] = np.hstack((np.repeat('item 1', 12), np.repeat('item 2', 12)))
# This doesn't work:
# ValueError: Wrong number of items passed 24, placement implies 2
# mypiv = df.pivot(['year', 'month'], 'item', 'value')
# This doesn't work, either:
# df.set_index(['year', 'month'], inplace=True)
# ValueError: cannot label index with a null key
# mypiv = df.pivot(columns='item', values='value')
# This below works but is not ideal:
# I have to first concatenate then separate the fields I need
df['new field'] = df['year'] * 100 + df['month']
mypiv = df.pivot(index='new field', columns='item', values='value').reset_index()
mypiv['year'] = mypiv['new field'] // 100
mypiv['month'] = mypiv['new field'] % 100
You can group and then unstack.
>>> df.groupby(['year', 'month', 'item'])['value'].sum().unstack('item')
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
Or use pivot_table:
>>> df.pivot_table(
values='value',
index=['year', 'month'],
columns='item',
aggfunc=np.sum)
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
I believe if you include item in your MultiIndex, then you can just unstack:
df.set_index(['year', 'month', 'item']).unstack(level=-1)
This yields:
value
item item 1 item 2
year month
2004 1 21 277
2 43 244
3 12 262
4 80 201
5 22 287
6 52 284
7 90 249
8 14 229
9 52 205
10 76 207
11 88 259
12 90 200
It's a bit faster than using pivot_table, and about the same speed or slightly slower than using groupby.
The following worked for me (newer pandas versions accept a list of columns for index):
mypiv = df.pivot(index=['year', 'month'], columns='item')[['value']]
Thanks to gmoutso's comment, you can use this:
def multiindex_pivot(df, index=None, columns=None, values=None):
    if index is None:
        names = list(df.index.names)
        df = df.reset_index()
    else:
        names = index
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    return df
usage:
df.pipe(multiindex_pivot, index=['idx_column1', 'idx_column2'], columns='foo', values='bar')
If you want a simple flat column structure and columns of their intended type, simply add this:
(df
 .infer_objects()              # coerce to the intended column type
 .rename_axis(None, axis=1))   # drop the columns' axis name
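Putting it together on the question's frame, a sketch assuming the multiindex_pivot helper defined above:
mypiv = (df
         .pipe(multiindex_pivot, index=['year', 'month'], columns='item', values='value')
         .infer_objects()
         .rename_axis(None, axis=1))
print(mypiv)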
