Pandas dataframe vectorized bucketing/aggregation? - python

The Task
I have a dataframe that looks like this:
date
money_spent ($)
meals_eaten
weight
2021-01-01 10:00:00
350
5
140
2021-01-02 18:00:00
250
2
170
2021-01-03 12:10:00
200
3
160
2021-01-04 19:40:00
100
1
150
I want to discretize this so that it "cuts" the rows every $X. I want to know some statistics on how much is being done for every $X i spend.
So if I were to use $500 as a threshold, the first two rows would fall in the first cut, and I could aggregate the remaining columns as follows:
first date of the cut
average meals_eaten
minimum weight
maximum weight
So the final table would be two rows like this:
date
cumulative_spent ($)
meals_eaten
min_weight
max_weight
2021-01-01 10:00:00
600
3.5
140
170
2021-01-03 12:10:00
300
2
150
160
My Approach:
My first instinct is to calculate the cumsum() of the money_spent (assume the data is sorted by date), then I use pd.cut() to basically make a new column, we call it spent_bin, that determines each row's bin.
Note: In this toy example, spent_bin would basically be: [0,500] for the first two rows and (500-1000] for the last two.
Then it's fairly simple, I do a groupby spent_bin then aggregate as follows:
.agg({
'date':'first',
'meals_eaten':'mean',
'returns': ['min', 'max']
})
What I've Tried
import pandas as pd
rows = [
{"date":"2021-01-01 10:00:00","money_spent":350, "meals_eaten":5, "weight":140},
{"date":"2021-01-02 18:00:00","money_spent":250, "meals_eaten":2, "weight":170},
{"date":"2021-01-03 12:10:00","money_spent":200, "meals_eaten":3, "weight":160},
{"date":"2021-01-05 22:07:00","money_spent":100, "meals_eaten":1, "weight":150}]
df = pd.DataFrame.from_dict(rows)
df['date'] = pd.to_datetime(df.date)
df['cum_spent'] = df.money_spent.cumsum()
print(df)
print(pd.cut(df.cum_spent, 500))
For some reason, I can't get the cut step to work. Here is my toy code from above. The labels are not cleanly [0-500], (500,1000] for some reason. Honestly I'd settle for [350,500],(500-800] (this is what the actual cum sum values are at the edges of the cuts), but I can't even get that to work even though I'm doing the exact same as the documentation example. Any help with this?
Caveats and Difficulties:
It's pretty easy to write this in a for loop of course, just do a while cum_spent < 500:. The problem is I have millions of rows in my actual dataset, it currently takes me 20 minutes to process a single df this way.
There's also a minor issue that sometimes rows will break the interval. When that happens, I want that last row included. This problem is in the toy example where row #2 actually ends at $600 not $500. But it is the first row that ends at or surpasses $500, so I'm including it in the first bin.

The customized function to achieve the cumsum with reset limitation
df['new'] = cumli(df['money_spent ($)'].values,500)
out = df.groupby(df.new.iloc[::-1].cumsum()).agg(
date = ('date','first'),
meals_eaten = ('meals_eaten','mean'),
min_weight = ('weight','min'),
max_weight = ('weight','max')).sort_index(ascending=False)
Out[81]:
date meals_eaten min_weight max_weight
new
1 2021-01-01 3.5 140 170
0 2021-01-03 2.0 150 160
from numba import njit
#njit
def cumli(x, lim):
total = 0
result = []
for i, y in enumerate(x):
check = 0
total += y
if total >= lim:
total = 0
check = 1
result.append(check)
return result

Related

Faster alternative for implementing this pandas solution

I have a dataframe which contains sales information of products, what i need to do is to create a function which based on the product id, product type and date, calculates the average sales for a time period which is less than the given date in the function.
This is how I have implemented it, but this approach takes a lot of time and I was wondering if there was a faster way to do this.
Dataframe:
product_type = ['A','B']
df = pd.DataFrame({'prod_id':np.repeat(np.arange(start=2,stop=5,step=1),235),'prod_type': np.random.choice(np.array(product_type), 705),'sales_time': pd.date_range(start ='1-1-2018',
end ='3-30-2018', freq ='3H'),'sale_amt':np.random.randint(4,100,size = 705)})
Current code:
def cal_avg(product,ptype,pdate):
temp_df = df[(df['prod_id']==product) & (df['prod_type']==ptype) & (df['sales_time']<= pdate)]
return temp_df['sale_amt'].mean()
Calling the function:
cal_avg(2,'A','2018-02-12 15:00:00')
53.983
If you are running the calc_avg function "rarely" then I suggest ignoring my answer. Otherwise, it might be beneficial to you to simply calculate the expanding window average for each product/product type. It might be slow depending on your dataset size (in which case maybe just run it on specific product types?), but you'll only need to run it once. First sort by the column you want to perform the 'expanding' on (expanding is missing the 'on' parameter) to ensure the proper row order. Then 'groupby' and transform each group (to keep the indices of the original dataframe) with your expanding window aggregation of choice (in this case 'mean').
df = df.sort_values('sales_time')
df['exp_mean_sales'] = df.groupby(['prod_id', 'prod_type'])['sale_amt'].transform(lambda gr: gr.expanding().mean())
With the result being:
df.head()
prod_id prod_type sales_time sale_amt exp_mean_sales
0 2 B 2018-01-01 00:00:00 8 8.000000
1 2 B 2018-01-01 03:00:00 72 40.000000
2 2 B 2018-01-01 06:00:00 33 37.666667
3 2 A 2018-01-01 09:00:00 81 81.000000
4 2 B 2018-01-01 12:00:00 83 49.000000
Check Below code, with %%timeit comparison (Google Colab)
import pandas as pd
product_type = ['A','B']
df = pd.DataFrame({'prod_id':np.repeat(np.arange(start=2,stop=5,step=1),235),'prod_type': np.random.choice(np.array(product_type), 705),'sales_time': pd.date_range(start ='1-1-2018',
end ='3-30-2018', freq ='3H'),'sale_amt':np.random.randint(4,100,size = 705)})
## OP's function
def cal_avg(product,ptype,pdate):
temp_df = df[(df['prod_id']==product) & (df['prod_type']==ptype) & (df['sales_time']<= pdate)]
return temp_df['sale_amt'].mean()
## Numpy data prep
prod_id_array = np.array(df.values[:,:1])
prod_type_array = np.array(df.values[:,1:2])
sales_time_array = np.array(df.values[:,2:3], dtype=np.datetime64)
values = np.array(df.values[:,3:])
OP's function -
%%timeit
cal_avg(2,'A','2018-02-12 15:00:00')
Output:
Numpy version
%%timeit -n 1000
cal_vals = [2,'A','2018-02-12 15:00:00']
mask = np.logical_and(prod_id_array == cal_vals[0], prod_type_array == cal_vals[1], sales_time_array <= np.datetime64(cal_vals[2]) )
np.mean(values[mask])
Output:

How to get the sum of change of one column while a condition on another column is true?

I need to evaluate the change of one Pandas column that occurs while another column fulfills a certain condition.
Assuming a DataFrame df with a DateTimeIndex and two columns:
timestamp operating_time sensor_values
2022-03-23 23:57:59.802000+00:00 8.172000e+06 398.15
2022-03-23 23:57:59.818000+00:00 8.172000e+06 397.85
2022-03-23 23:58:59.805000+00:00 8.172000e+06 397.5
2022-03-23 23:58:59.821000+00:00 8.172000e+06 NaN
2022-03-23 23:59:59.793000+00:00 8.172000e+06 397.15
...
Now I would like to know how much operating_time passed while sensor_values < 398 and how much operating_time passed while sensor_values >= 398.
I tried to divide the data into to DataFrames like this:
df_low = df[df['sensor_values'] < 398]
df_high = df[df['sensor_values'] >= 398]
However if I then calculate by how much the operating_time changes for each DataFrame with
df_low['operating_time'].diff().sum()
df_high['operating_time'].diff().sum()
I get basically the same value as it seems Pandas is filling the NaN gaps for diff().
How can I find out how much operating time sensor_values was above and below a certain value?
Plot of sensor_values in blue and operating_time in red:
The expected output would be two numbers representing the operating time that was spend over the threshold and below the threshold. So in the example image the operating time increases from roughly 1e6 minutes to roughly 8e6 minutes. The two numbers should therefore add up to 7e6 minutes.
IIUC, you could use:
out = (df['operating_time']
.diff()
.groupby(np.where(df['sensor_values'].gt(398), '>398', '≤398'))
.sum()
)
output (here with limited example):
>398 0.0
≤398 0.0
Name: operating_time, dtype: float64
or, directly from the timestamps:
out = (pd.to_datetime(df['timestamp'])
.diff()
.groupby(np.where(df['sensor_values'].gt(398), '>398', '≤398'))
.sum()
)
output:
>398 0 days 00:00:00
≤398 0 days 00:01:59.991000
Name: timestamp, dtype: timedelta64[ns]

Cumulative sales data with threshold value forming a new series / column with a boolean value?

I have this type of data, but in real life it has millions of entries. Product id is always product specific, but occurs several times during its lifetime.
date
product id
revenue
estimated lifetime value
2021-04-16
0061M00001AXc5lQAD
970
2000
2021-04-17
0061M00001AXbCiQAL
159
50000
2021-04-18
0061M00001AXb9AQAT
80
3000
2021-04-19
0061M00001AXbIHQA1
1100
8000
2021-04-20
0061M00001AXbY8QAL
90
4000
2021-04-21
0061M00001AXbQ1QAL
29
30000
2021-04-21
0061M00001AXc5lQAD
30
2000
2021-05-02
0061M00001AXc5lQAD
50
2000
2021-05-05
0061M00001AXc5lQAD
50
2000
I'm looking to create a new column in pandas that indicates when a certain product id has generated more revenue than a specific threshold e.g. 100$, 1000$, marking it as a Win (1). A win may occur only once during the lifecycle of a product. In addition I would want to create another column that would indicate the row where a specific product sales exceeds e.g. 10% of the estimated lifetime value.
What would be the most intuitive approach to achieve this in Python / Pandas?
edit:
dw1k_thresh: if the cumulative sales of a specific product id >= 1000, the column takes a boolean value of 1, otherwise zero. However 1 can occur only once and after that the again always zero. Basically it's just an indicator of the date and transaction when a product sales exceeds the critical value of 1000.
dw10perc: if the cumulative sales of one product id >= 10% of estimated lifetime value, the column takes value of 1, otherwise 0. However 1 can occur only once and after that the again always zero. Basically it's just an indicator of the date and transaction when a product sales exceeds the critical value of 10% of the estimated lifetime value.
The threshold value is common for all product id's (I'll just replicate the process with different thresholds at a later stage to determine which is the optimal threshold to predict future revenue).
I'm trying to achieve this:
The code I've written so far is trying to establish the cum_rev and dw1k_thresh columns, but unfortunately it doesn't work.
df_final["dw1k_thresh"] = 0
df_final["cum_rev"]= 0
opp_list =set()
for row in df_final["product id"].iteritems():
opp_list.add(row)
opp_list=list(opp_list)
opp_list=pd.Series(opp_list)
for i in opp_list:
if i == df_final["product id"].any():
df_final.cum_rev = df_final.revenue.cumsum()
for x in df_final.cum_rev:
if x >= 1000 & df_final.dw1k_thresh.sum() == 0:
df_final.dw1k_thresh = 1
else:
df_final.dw1k_thresh = 0
df_final.head(30)
Cumulative Revenue: Can be calculated fairly simply with groupby and cumsum.
dwk1k_thresh: We are first checking whether cum_rev is greater than 1000 and then apply the function that helps us keep 1 only once, and after that the again always zero.
dw10_perc: Same approach as dw1k_thresh.
As a first step you would need to remove $ and make sure your columns are of numeric type to perform the comparisons you outlined.
# Imports
import pandas as pd
import numpy as np
# Remove $ sign and convert to numeric
cols = ['revenue','estimated lifetime value']
df[cols] = df[cols].replace({'\$': '', ',': ''}, regex=True).astype(float)
# Cumulative Revenue
df['cum_rev'] = df.groupby('product id')['revenue'].cumsum()
# Function to be applied on both
def f(df,thresh_col):
return (df[df[thresh_col]==1].sort_values(['date','product id'], ascending=False)
.groupby('product id', as_index=False,group_keys=False)
.apply(lambda x: x.tail(1))
).index.tolist()
# dw1k_thresh
df['dw1k_thresh'] = np.where(df['cum_rev'].ge(1000),1,0)
df['dw1k_thresh'] = np.where(df.index.isin(f(df,'dw1k_thresh')),1,0)
# dw10perc
df['dw10_perc'] = np.where(df['cum_rev'] > 0.10 * df.groupby('product id',observed=True)['estimated lifetime value'].transform('sum'),1,0)
df['dw10_perc'] = np.where(df.index.isin(f(df,'dw10_perc')),1,0)
Prints:
>>> df
date product id revenue ... cum_rev dw1k_thresh dw10_perc
0 2021-04-16 0061M00001AXc5lQAD 970 ... 970 0 1
1 2021-04-17 0061M00001AXbCiQAL 159 ... 159 0 0
2 2021-04-18 0061M00001AXb9AQAT 80 ... 80 0 0
3 2021-04-19 0061M00001AXbIHQA1 1100 ... 1100 1 1
4 2021-04-20 0061M00001AXbY8QAL 90 ... 90 0 0
5 2021-04-21 0061M00001AXbQ1QAL 29 ... 29 0 0
6 2021-04-21 0061M00001AXc5lQAD 30 ... 1000 1 0
7 2021-05-02 0061M00001AXc5lQAD 50 ... 1050 0 0
8 2021-05-05 0061M00001AXc5lQAD 50 ... 1100 0 0

Pandas: Group by, Cumsum + Shift with a "where clause"

I am attempting to learn some Pandas that I otherwise would be doing in SQL window functions.
Assume I have the following dataframe which shows different players previous matches played and how many kills they got in each match.
date player kills
2019-01-01 a 15
2019-01-02 b 20
2019-01-03 a 10
2019-03-04 a 20
Throughout the below code I managed to create a groupby where I only show previous summed values of kills (the sum of the players kills excluding the kills he got in the game of the current row).
df['sum_kills'] = df.groupby('player')['kills'].transform(lambda x: x.cumsum().shift())
This creates the following values:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 25
However what I ideally want is the option to include a filter/where clause in the grouped values. So let's say I only wanted to get the summed values from the previous 30 days (1 month). Then my new dataframe should instead look like this:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 NaN
The last row would provide zero summed_kills because no games from player a had been played over the last month. Is this possible somehow?
I think you are a bit in a pinch using groupby and transform. As explained here, transform operates on a single series, so you can't access data of other columns.
groupby and apply does not seem the correct way too, because the custom function is expected to return an aggregated result for the group passed by groupby, but you want a different result for each row.
So the best solution I can propose is to use apply without groupy, and perform all the selection by yourself inside the custom function:
def killcount(x, data, timewin):
"""count the player's kills in a time window before the time of current row.
x: dataframe row
data: full dataframe
timewin: a pandas.Timedelta
"""
return data.loc[(data['date'] < x['date']) #select dates preceding current row
& (data['date'] >= x['date']-timewin) #select dates in the timewin
& (data['player'] == x['player'])]['kills'].sum() #select rows with same player
df['sum_kills'] = df.apply(lambda r : killcount(r, df, pd.Timedelta(30, 'D')), axis=1)
This returns:
date player kills sum_kills
0 2019-01-01 a 15 0
1 2019-01-02 b 20 0
2 2019-01-03 a 10 15
3 2019-03-04 a 20 0
In case you haven't done yet, remember do parse 'date' column to datetime type using pandas.to_datetime otherwise you cannot perform date comparison.

How to convert this to a for-loop with an output to CSV

I'm trying to put together a generic piece of code that would:
Take a time series for some price data and divide it into deciles, e.g. take the past 18m of gold prices and divide it into deciles [DONE, see below]
date 4. close decile
2017-01-03 1158.2 0
2017-01-04 1166.5 1
2017-01-05 1181.4 2
2017-01-06 1175.7 1
... ...
2018-04-23 1326.0 7
2018-04-24 1333.2 8
2018-04-25 1327.2 7
[374 rows x 2 columns]
Pull out the dates for a particular decile, then create a secondary datelist with an added 30 days
#So far only for a single decile at a time
firstdecile = gold.loc[gold['decile'] == 1]
datelist = list(pd.to_datetime(firstdecile.index))
datelist2 = list(pd.to_datetime(firstdecile.index) + pd.DateOffset(months=1))
Take an average of those 30-day price returns for each decile
level1 = gold.ix[datelist]
level2 = gold.ix[datelist2]
level2.index = level2.index - pd.DateOffset(months=1)
result = pd.merge(level1,level2, how='inner', left_index=True, right_index=True)
def ret(one, two):
return (two - one)/one
pricereturns = result.apply(lambda x :ret(x['4. close_x'], x['4. close_y']), axis=1)
mean = pricereturns.mean()
Return the list of all 10 averages in a single CSV file
So far I've been able to put together something functional that does steps 1-3 but only for a single decile, but I'm struggling to expand this to a looped-code for all 10 deciles at once with a clean CSV output
First append the close price at t + 1 month as a new column on the whole dataframe.
gold2_close = gold.loc[gold.index + pd.DateOffset(months=1), 'close']
gold2_close.index = gold.index
gold['close+1m'] = gold2_close
However practically relevant should be the number of trading days, i.e. you won't have prices for the weekend or holidays. So I'd suggest you shift by number of rows, not by daterange, i.e. the next 20 trading days
gold['close+20'] = gold['close'].shift(periods=-20)
Now calculate the expected return for each row
gold['ret'] = (gold['close+20'] - gold['close']) / gold['close']
You can also combine steps 1. and 2. directly so you don't need the additional column (only if you shift by number of rows, not by fixed daterange due to reindexing)
gold['ret'] = (gold['close'].shift(periods=-20) - gold['close']) / gold['close']
Since you already have your deciles, you just need to groupby the deciles and aggregate the returns with mean()
gold_grouped = gold.groupby(by="decile").mean()
Putting in some random data you get something like the dataframe below. close and ret are the averages for each decile. You can create a csv from a dataframe via pandas.DataFrame.to_csv
close ret
decile
0 1238.343597 -0.018290
1 1245.663315 0.023657
2 1254.073343 -0.025934
3 1195.941312 0.009938
4 1212.394511 0.002616
5 1245.961831 -0.047414
6 1200.676333 0.049512
7 1181.179956 0.059099
8 1214.438133 0.039242
9 1203.060985 0.029938

Categories

Resources