Multiply pandas dataframe by vlookup - python

I have a very large dataframe with multiple years of sales data and tens of thousands of skew_ids, e.g.:
         date  skew_id  units_sold
0  2001-01-01      123           1
1  2001-01-02      123           2
2  2001-01-03      123           3
3  2001-01-01      456           4
4  2001-01-02      456           5
...
I have another dataframe that maps skew_id to skew_price, e.g.:
   skew_id  skew_price
0      123      100.00
1      456       10.00
...
My first dataframe is so large that I cannot merge without hitting my memory limit.
I'd like to calculate the daily revenues, e.g.:
         date  revenue
0  2001-01-01      140
1  2001-01-02      250
2  2001-01-03      300
...

I think it depends on the number of rows, the number of unique skew_id values, and the amount of available RAM.
One possible solution with map:
# map each skew_id to its price and multiply by units sold
df1['revenue'] = df1['skew_id'].map(df2.set_index('skew_id')['skew_price']) * df1['units_sold']
# aggregate revenue per day
daily_revenue = df1.groupby('date', as_index=False)['revenue'].sum()
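On the sample frames above, daily_revenue should reproduce the totals asked for in the question, roughly:
         date  revenue
0  2001-01-01    140.0
1  2001-01-02    250.0
2  2001-01-03    300.0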

You could achieve this with a groupby:
price = df2.set_index('skew_id')['skew_price']
df.groupby('date').apply(lambda gr: (gr['skew_id'].map(price) * gr['units_sold']).sum())
Or if you run into memory problems you could loop over all dates yourself. This is slower, but might need less memory.
price = df2.set_index('skew_id')['skew_price']
revenue = []
for d in df.date.unique():
    day = df.loc[df.date == d]
    revenue.append({'date': d, 'revenue': (day['skew_id'].map(price) * day['units_sold']).sum()})
pd.DataFrame(revenue)
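If even the map-based approach exhausts memory, a further option (a sketch, not from the answers above; the file name 'sales.csv' and the chunk size are hypothetical) is to stream the large frame in chunks and aggregate as you go:
import pandas as pd

price = df2.set_index('skew_id')['skew_price']   # small lookup table stays in memory
parts = []
for chunk in pd.read_csv('sales.csv', parse_dates=['date'], chunksize=1_000_000):
    chunk['revenue'] = chunk['skew_id'].map(price) * chunk['units_sold']
    parts.append(chunk.groupby('date')['revenue'].sum())
# combine the per-chunk daily totals into one daily series
daily_revenue = pd.concat(parts).groupby(level=0).sum().reset_index()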

Sum two columns only if the value of one column is greater than 0

I've got the following dataframe:
lst=[['01012021','',100],['01012021','','50'],['01022021',140,5],['01022021',160,12],['01032021','',20],['01032021',200,25]]
df1=pd.DataFrame(lst,columns=['Date','AuM','NNA'])
I am looking for code that sums the columns AuM and NNA only if column AuM contains a value. The desired result is shown below:
lst=[['01012021','',100,''],['01012021','','50',''],['01022021',140,5,145],['01022021',160,12,172],['01032021','',20,'']]
df2=pd.DataFrame(lst,columns=['Date','AuM','NNA','Sum'])
It is not a good practice to use '' in place of NaN when you have numeric data.
That said, a generic solution to your issue would be to use sum with the skipna=False option:
df1['Sum'] = (df1[['AuM', 'NNA']]                     # you can use as many columns as you want
              .apply(pd.to_numeric, errors='coerce')  # convert to numeric
              .sum(axis=1, skipna=False)              # sum if all are non-NaN
              .fillna('')                             # fill NaN with empty string (bad practice)
             )
output:
Date AuM NNA Sum
0 01012021 100
1 01012021 50
2 01022021 140 5 145.0
3 01022021 160 12 172.0
4 01032021 20
5 01032021 200 25 225.0
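If keeping NaN instead of '' is acceptable (the cleaner practice recommended above), a minimal variant might be:
cols = ['AuM', 'NNA']                                   # any numeric columns you want to add
nums = df1[cols].apply(pd.to_numeric, errors='coerce')  # '' becomes NaN
df1['Sum'] = nums.sum(axis=1, skipna=False)             # stays NaN when any input is missing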
I assume you mean to include the last row too:
df2 = (df1.assign(Sum=df1.loc[df1.AuM.ne(""), ["AuM", "NNA"]].sum(axis=1))
.fillna(""))
print(df2)
Result:
Date AuM NNA Sum
0 01012021 100
1 01012021 50
2 01022021 140 5 145.0
3 01022021 160 12 172.0
4 01032021 20
5 01032021 200 25 225.0

Cumulative sales data with threshold value forming a new series / column with a boolean value?

I have this type of data, but in real life it has millions of entries. Product id is always product specific, but occurs several times during its lifetime.
date        product id          revenue  estimated lifetime value
2021-04-16  0061M00001AXc5lQAD      970                      2000
2021-04-17  0061M00001AXbCiQAL      159                     50000
2021-04-18  0061M00001AXb9AQAT       80                      3000
2021-04-19  0061M00001AXbIHQA1     1100                      8000
2021-04-20  0061M00001AXbY8QAL       90                      4000
2021-04-21  0061M00001AXbQ1QAL       29                     30000
2021-04-21  0061M00001AXc5lQAD       30                      2000
2021-05-02  0061M00001AXc5lQAD       50                      2000
2021-05-05  0061M00001AXc5lQAD       50                      2000
I'm looking to create a new column in pandas that indicates when a certain product id has generated more revenue than a specific threshold, e.g. $100 or $1000, marking it as a Win (1). A win may occur only once during the lifecycle of a product. In addition, I would like another column that indicates the row where a specific product's sales exceed e.g. 10% of the estimated lifetime value.
What would be the most intuitive approach to achieve this in Python / Pandas?
edit:
dw1k_thresh: if the cumulative sales of a specific product id >= 1000, the column takes the value 1, otherwise 0. However, the 1 can occur only once; after that the column is always 0 again. Basically it's just an indicator of the date and transaction at which a product's sales exceed the critical value of 1000.
dw10perc: if the cumulative sales of one product id >= 10% of the estimated lifetime value, the column takes the value 1, otherwise 0. However, the 1 can occur only once; after that the column is always 0 again. Basically it's just an indicator of the date and transaction at which a product's sales exceed 10% of the estimated lifetime value.
The threshold value is common for all product id's (I'll just replicate the process with different thresholds at a later stage to determine which is the optimal threshold to predict future revenue).
I'm trying to achieve this:
The code I've written so far is trying to establish the cum_rev and dw1k_thresh columns, but unfortunately it doesn't work.
df_final["dw1k_thresh"] = 0
df_final["cum_rev"]= 0
opp_list =set()
for row in df_final["product id"].iteritems():
opp_list.add(row)
opp_list=list(opp_list)
opp_list=pd.Series(opp_list)
for i in opp_list:
if i == df_final["product id"].any():
df_final.cum_rev = df_final.revenue.cumsum()
for x in df_final.cum_rev:
if x >= 1000 & df_final.dw1k_thresh.sum() == 0:
df_final.dw1k_thresh = 1
else:
df_final.dw1k_thresh = 0
df_final.head(30)
Cumulative Revenue: Can be calculated fairly simply with groupby and cumsum.
dw1k_thresh: We first check whether cum_rev is at least 1000 and then apply a function that keeps the 1 only once; after that the column is always zero.
dw10_perc: Same approach as dw1k_thresh.
As a first step you would need to remove $ and make sure your columns are of numeric type to perform the comparisons you outlined.
# Imports
import pandas as pd
import numpy as np
# Remove $ sign and convert to numeric
cols = ['revenue','estimated lifetime value']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True).astype(float)
# Cumulative Revenue
df['cum_rev'] = df.groupby('product id')['revenue'].cumsum()
# Function to be applied on both
def f(df, thresh_col):
    return (df[df[thresh_col] == 1]
            .sort_values(['date', 'product id'], ascending=False)
            .groupby('product id', as_index=False, group_keys=False)
            .apply(lambda x: x.tail(1))
            ).index.tolist()
# dw1k_thresh
df['dw1k_thresh'] = np.where(df['cum_rev'].ge(1000),1,0)
df['dw1k_thresh'] = np.where(df.index.isin(f(df,'dw1k_thresh')),1,0)
# dw10perc
df['dw10_perc'] = np.where(df['cum_rev'] > 0.10 * df.groupby('product id',observed=True)['estimated lifetime value'].transform('sum'),1,0)
df['dw10_perc'] = np.where(df.index.isin(f(df,'dw10_perc')),1,0)
Prints:
>>> df
date product id revenue ... cum_rev dw1k_thresh dw10_perc
0 2021-04-16 0061M00001AXc5lQAD 970 ... 970 0 1
1 2021-04-17 0061M00001AXbCiQAL 159 ... 159 0 0
2 2021-04-18 0061M00001AXb9AQAT 80 ... 80 0 0
3 2021-04-19 0061M00001AXbIHQA1 1100 ... 1100 1 1
4 2021-04-20 0061M00001AXbY8QAL 90 ... 90 0 0
5 2021-04-21 0061M00001AXbQ1QAL 29 ... 29 0 0
6 2021-04-21 0061M00001AXc5lQAD 30 ... 1000 1 0
7 2021-05-02 0061M00001AXc5lQAD 50 ... 1050 0 0
8 2021-05-05 0061M00001AXc5lQAD 50 ... 1100 0 0
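A possible alternative for the "only once" logic (a sketch, assuming the frame is sorted by date within each product and cum_rev has already been computed as above; the column name dw1k_thresh_alt is just illustrative): flag only the first row per product where the cumulative revenue crosses the threshold.
crossed = df['cum_rev'].ge(1000)
# running count of crossings per product; the first crossing is the row where the count hits 1
first_cross = crossed & crossed.astype(int).groupby(df['product id']).cumsum().eq(1)
df['dw1k_thresh_alt'] = first_cross.astype(int)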

Calculate aggregate value of column row by row

My apologies for the vague title; it's complicated to express what I want in writing.
I'm trying to build a filled line chart with the date on the x axis and the total transactions over time on the y axis.
My data
The object is a pandas dataframe.
date | symbol | type | qty | total
----------------------------------------------
2020-09-10 ABC Buy 5 10
2020-10-18 ABC Buy 2 20
2020-09-19 ABC Sell 3 15
2020-11-05 XYZ Buy 10 8
2020-12-03 XYZ Buy 10 9
2020-12-05 ABC Buy 2 5
What I want
date | symbol | type | qty | total | aggregate_total
------------------------------------------------------------
2020-09-10 ABC Buy 5 10 10
2020-10-18 ABC Buy 2 20 10+20 = 30
2020-09-19 ABC Sell 3 15 10+20-15 = 15
2020-11-05 XYZ Buy 10 8 8
2020-12-03 XYZ Buy 10 9 8+9 = 17
2020-12-05 ABC Buy 2 5 10+20-15+5 = 20
Where I am now
I'm working with 2 nested for loops: one iterating over the symbols, one iterating over each row. I store the temporary results in lists. I'm still unsure how I will add the results to the final dataframe. I could reorder the dataframe by symbol and date, then append the temp lists together and finally assign that combined list to a new column.
The code below is just the inner loop over the rows.
af = df.loc[df['symbol'] == 'ABC']

for i in range(0, af.shape[0]):
    # print(af.iloc[0:i, [2, 4]])
    # if type is a buy, we add the last operation to the aggregate
    if af.iloc[i, 2] == "BUY":
        temp_agg_total.append(temp_agg_total[i] + af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] + af.iloc[i, 3])
    else:
        temp_agg_total.append(temp_agg_total[i] - af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] - af.iloc[i, 3])

# Remove first element of list (0)
temp_agg_total.pop(0)
temp_agg_qty.pop(0)

af = af.assign(agg_total=temp_agg_total,
               agg_qty=temp_agg_qty)
My question
Is there a better way to do this in pandas or numpy? It feels really heavy for something relatively simple.
The presence of the Buy/Sell type of operation complicates things.
Regards
# negate the total of Sell rows
df.loc[df['type'] == 'Sell', 'total'] *= -1
# cumulative sum of total per symbol
df['aggregate_total'] = df.groupby('symbol')['total'].cumsum()
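If you'd rather not overwrite the total column in place, a non-destructive variant of the same idea might look like this (a sketch; signed is just a temporary name):
# build the signed total on the fly instead of mutating df['total']
signed = df['total'].where(df['type'].ne('Sell'), -df['total'])
df['aggregate_total'] = signed.groupby(df['symbol']).cumsum()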
Is this what you're looking for?
df['Agg'] = 1
df.loc[df['type'] == 'Sell', 'Agg'] = -1
df['Agg'] = df['Agg'] * df['total']
df['aggregate_total'] = df.groupby('symbol')['Agg'].cumsum()
df["Type_num"] = df["type"].map({"Buy":1,"Sell":-1})
df["Num"] = df.Type_num*df.total
df.groupby(["symbol"],as_index=False)["Num"].cumsum()
pd.concat([df,df.groupby(["symbol"],as_index=False)["Num"].cumsum()],axis=1)
date symbol type qty total Type_num Num CumNum
0 2020-09-10 ABC Buy 5 10 1 10 10
1 2020-10-18 ABC Buy 2 20 1 20 30
2 2020-09-19 ABC Sell 3 15 -1 -15 15
3 2020-11-05 XYZ Buy 10 8 1 8 8
4 2020-12-03 XYZ Buy 10 9 1 9 17
5 2020-12-05 ABC Buy 2 5 1 5 20
The most important thing here is the cumulative sum. The grouping makes sure that the cumulative sum is performed separately for each symbol. The renaming and dropping of columns should be easy for you.
The trick is that I mapped {Buy, Sell} to {1, -1}.
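For completeness, the renaming and column cleanup the answer leaves to the reader could look roughly like this (out and the CumNum label are just illustrative):
cum = df.groupby('symbol')['Num'].cumsum().rename('CumNum')           # running total per symbol
out = pd.concat([df, cum], axis=1).drop(columns=['Type_num', 'Num'])  # tidy up helper columns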

pandas groupby by customized year, e.g. a school year

In a pandas data frame I would like to find the mean values of a column, grouped by a 'customized' year.
An example would be to compute the mean values of school marks for a school year (e.g. Sep/YYYY to Aug/YYYY+1).
The pandas docs give some information on offsets, business years, etc., but I can't really make any sense of that to get a working example.
Here is a minimal example where mean values of school marks are computed per year (Jan-Dec), which is what I do not want.
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.randint(low=1, high=5, size=36),
                  index=pd.date_range('2001-09-01', freq='M', periods=36),
                  columns=['marks'])
df_yearly = df.groupby(pd.Grouper(freq="A")).mean()
This could yield e.g.:
print(df):
marks
2001-09-30 1
2001-10-31 4
2001-11-30 2
2001-12-31 1
2002-01-31 4
2002-02-28 1
2002-03-31 2
2002-04-30 1
2002-05-31 3
2002-06-30 3
2002-07-31 3
2002-08-31 3
2002-09-30 4
2002-10-31 1
...
2003-11-30 4
2003-12-31 2
2004-01-31 1
2004-02-29 2
2004-03-31 1
2004-04-30 3
2004-05-31 4
2004-06-30 2
2004-07-31 2
2004-08-31 4
print(df_yearly):
marks
2001-12-31 2.000000
2002-12-31 2.583333
2003-12-31 2.666667
2004-12-31 2.375000
My desired output would correspond to something like:
2001-09/2002-08 mean_value
2002-09/2003-08 mean_value
2003-09/2004-08 mean_value
Many thanks!
We can manually compute the school years:
# if month>=9 we move it to the next year
school_years = df.index.year + (df.index.month>8).astype(int)
Another option is to use fiscal year starting from September:
school_years = df.index.to_period('Q-AUG').qyear
And we can groupby:
df.groupby(school_years).mean()
Output:
marks
2002 2.333333
2003 2.500000
2004 2.500000
One more approach
a = (df.index.month == 9).cumsum()
val = df.groupby(a, sort=False)['marks'].mean().reset_index()
dates = df.index.to_series().groupby(a, sort=False).agg(['first', 'last']).reset_index()
dates.merge(val, on='index')
Output
index first last marks
0 1 2001-09-30 2002-08-31 2.750000
1 2 2002-09-30 2003-08-31 2.333333
2 3 2003-09-30 2004-08-31 2.083333
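If you want labels in the 2001-09/2002-08 form from the question, a small sketch building on the fiscal-year approach (the label format itself is just illustrative):
school_years = df.index.to_period('Q-AUG').qyear       # school year labelled by its end year
out = df.groupby(school_years)['marks'].mean()
out.index = [f'{y - 1}-09/{y}-08' for y in out.index]   # e.g. 2002 -> "2001-09/2002-08"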

Applying function to Pandas Groupby

I'm currently working with panel data in Python and I'm trying to compute the rolling average for each time series observation within a given group (ID).
Given the size of my data set (thousands of groups with multiple time periods), the .groupby and .apply() functions are taking way too long to compute (has been running over an hour and still nothing -- entire data set only contains around 300k observations).
I'm ultimately wanting to iterate over multiple columns, doing the following:
1. Compute a rolling average for each time step in a given column, per group ID
2. Create a new column containing the difference between the original value and the moving average [x_t - (x_{t-1} + x_t)/2]
3. Store column in a new DataFrame, which would be identical to original data set, except that it has the residual from #2 instead of the original value.
4. Repeat and append new residuals to df_resid (as seen below)
df_resid
date id rev_resid exp_resid
2005-09-01 1 NaN NaN
2005-12-01 1 -10000 -5500
2006-03-01 1 -352584 -262058.5
2006-06-01 1 240000 190049.5
2006-09-01 1 82648.75 37724.25
2005-09-01 2 NaN NaN
2005-12-01 2 4206.5 24353
2006-03-01 2 -302574 -331951
2006-06-01 2 103179 117405.5
2006-09-01 2 -52650 -72296.5
Here's small sample of the original data.
df
date id rev exp
2005-09-01 1 745168.0 545168.0
2005-12-01 1 725168.0 534168.0
2006-03-01 1 20000.0 10051.0
2006-06-01 1 500000.0 390150.0
2006-09-01 1 665297.5 465598.5
2005-09-01 2 956884.0 736987.0
2005-12-01 2 965297.0 785693.0
2006-03-01 2 360149.0 121791.0
2006-06-01 2 566507.0 356602.0
2006-09-01 2 461207.0 212009.0
And the (very slow) code:
df['rev_resid'] = df.groupby('id')['rev'].apply(lambda x:x.rolling(center=False,window=2).mean())
I'm hoping there is a much more computationally efficient way to do this (primarily with respect to #1), and could be extended to multiple columns.
Any help would be truly appreciated.
To speed up the calculation: if the dataframe is already sorted on 'id' you don't have to do the rolling inside a groupby (if it isn't sorted, sort it first). Since the window is only length 2, we can mask the result by checking where id == id.shift(); this works because the frame is sorted.
d1 = df[['rev', 'exp']]
df.join(
    # value minus its 2-period rolling mean, suffixed, masked where the id changes
    d1.rolling(2).mean().rsub(d1).add_suffix('_resid')[df.id.eq(df.id.shift())]
)
date id rev exp rev_resid exp_resid
0 2005-09-01 1 745168.0 545168.0 NaN NaN
1 2005-12-01 1 725168.0 534168.0 -10000.00 -5500.00
2 2006-03-01 1 20000.0 10051.0 -352584.00 -262058.50
3 2006-06-01 1 500000.0 390150.0 240000.00 190049.50
4 2006-09-01 1 665297.5 465598.5 82648.75 37724.25
5 2005-09-01 2 956884.0 736987.0 NaN NaN
6 2005-12-01 2 965297.0 785693.0 4206.50 24353.00
7 2006-03-01 2 360149.0 121791.0 -302574.00 -331951.00
8 2006-06-01 2 566507.0 356602.0 103179.00 117405.50
9 2006-09-01 2 461207.0 212009.0 -52650.00 -72296.50
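To end up with a frame shaped like df_resid from the question (only date, id and the residual columns), a short follow-up sketch could be:
resid = d1.rolling(2).mean().rsub(d1).add_suffix('_resid')[df.id.eq(df.id.shift())]
df_resid = df[['date', 'id']].join(resid)   # same index, so the join aligns row by row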
