Here is a sample dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'VipNo': np.repeat(range(3), 2),
    'Quantity': np.random.randint(200, size=6),
    'OrderDate': np.random.choice(pd.date_range('1/1/2020', periods=365, freq='D'), 6, replace=False)})
print(df)
I have a couple of steps to do. I want to create a new column named qtywithin1mon/totalqty. First, I want to group by VipNo (each number represents an individual), because a person may have made multiple purchases. Then I want to check whether the OrderDate falls within a certain range (say, 2020/03/01 to 2020/03/31). If so, I want to divide the quantity ordered on that day by the total quantity this customer purchased. My dataset is big, so a customer may have ordered twice within the date range; in that case I want the sum of the two orders divided by the total quantity. How can I achieve this? I really have no idea where to start.
Thank you so much!
You can create a new column masking quantity within the given date range, then groupby:
start, end = pd.to_datetime(['2020/03/01','2020/03/31'])
(df.assign(QuantitySub=df['OrderDate'].between(start, end) * df.Quantity)
   .groupby('VipNo')[['Quantity', 'QuantitySub']]
   .sum()
   .assign(output=lambda x: x['QuantitySub'] / x['Quantity'])
   .drop('QuantitySub', axis=1)
)
With a data frame:
VipNo Quantity OrderDate
0 0 105 2020-01-07
1 0 56 2020-03-04
2 1 167 2020-09-05
3 1 18 2020-05-08
4 2 151 2020-11-01
5 2 14 2020-03-17
The output is:
Quantity output
VipNo
0 161 0.347826
1 185 0.000000
2 165 0.084848
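If the ratio is needed as a column on the original rows rather than as one row per customer, a `transform`-based variant is one option. This is only a sketch built on the example frame above; the column name `qty_ratio` is an assumption chosen for illustration.

```python
import pandas as pd

# Example frame copied from the answer above
df = pd.DataFrame({
    "VipNo": [0, 0, 1, 1, 2, 2],
    "Quantity": [105, 56, 167, 18, 151, 14],
    "OrderDate": pd.to_datetime([
        "2020-01-07", "2020-03-04", "2020-09-05",
        "2020-05-08", "2020-11-01", "2020-03-17",
    ]),
})

start, end = pd.to_datetime(["2020-03-01", "2020-03-31"])

# Quantity counts only when the order falls inside the date range
in_range_qty = df["Quantity"].where(df["OrderDate"].between(start, end), 0)

# transform("sum") broadcasts each customer's total back onto every row
df["qty_ratio"] = (
    in_range_qty.groupby(df["VipNo"]).transform("sum")
    / df.groupby("VipNo")["Quantity"].transform("sum")
)
print(df)
```

Every row of a customer carries the same ratio, so the frame keeps its original shape.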
Related
The Task
I have a dataframe that looks like this:
date                 money_spent ($)  meals_eaten  weight
2021-01-01 10:00:00  350              5            140
2021-01-02 18:00:00  250              2            170
2021-01-03 12:10:00  200              3            160
2021-01-04 19:40:00  100              1            150
I want to discretize this so that it "cuts" the rows every $X. I want to know some statistics on how much is being done for every $X I spend.
So if I were to use $500 as a threshold, the first two rows would fall in the first cut, and I could aggregate the remaining columns as follows:
first date of the cut
average meals_eaten
minimum weight
maximum weight
So the final table would be two rows like this:
date                 cumulative_spent ($)  meals_eaten  min_weight  max_weight
2021-01-01 10:00:00  600                   3.5          140         170
2021-01-03 12:10:00  300                   2            150         160
My Approach:
My first instinct is to calculate the cumsum() of money_spent (assume the data is sorted by date), then use pd.cut() to make a new column, call it spent_bin, that determines each row's bin.
Note: In this toy example, spent_bin would basically be [0, 500] for the first two rows and (500, 1000] for the last two.
Then it's fairly simple, I do a groupby spent_bin then aggregate as follows:
.agg({
    'date': 'first',
    'meals_eaten': 'mean',
    'weight': ['min', 'max']
})
What I've Tried
import pandas as pd
rows = [
    {"date":"2021-01-01 10:00:00","money_spent":350, "meals_eaten":5, "weight":140},
    {"date":"2021-01-02 18:00:00","money_spent":250, "meals_eaten":2, "weight":170},
    {"date":"2021-01-03 12:10:00","money_spent":200, "meals_eaten":3, "weight":160},
    {"date":"2021-01-05 22:07:00","money_spent":100, "meals_eaten":1, "weight":150}]
df = pd.DataFrame.from_dict(rows)
df['date'] = pd.to_datetime(df.date)
df['cum_spent'] = df.money_spent.cumsum()
print(df)
print(pd.cut(df.cum_spent, 500))
For some reason, I can't get the cut step to work. My toy code is above. The labels are not cleanly [0, 500], (500, 1000]. Honestly, I'd settle for [350, 500], (500, 800] (these are the actual cumulative sum values at the edges of the cuts), but I can't even get that to work, even though I'm doing exactly the same as the documentation example. Any help with this?
Caveats and Difficulties:
It's pretty easy to write this in a for loop of course, just do a while cum_spent < 500:. The problem is I have millions of rows in my actual dataset, it currently takes me 20 minutes to process a single df this way.
There's also a minor issue that sometimes rows will break the interval. When that happens, I want that last row included. This problem is in the toy example where row #2 actually ends at $600 not $500. But it is the first row that ends at or surpasses $500, so I'm including it in the first bin.
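On the `pd.cut` step specifically: an integer second argument is read as the number of equal-width bins, not as a bin width, so `pd.cut(df.cum_spent, 500)` asks for 500 bins between the smallest and largest value. Passing explicit edges gives the clean labels. The sketch below assumes a bin width of 500 and uses the toy cumulative sums; note that it does not implement the "include the row that breaks the interval" rule, because the $600 row lands in the second bin.

```python
import numpy as np
import pandas as pd

# Cumulative spend from the toy example: 350, 600, 800, 900
cum_spent = pd.Series([350, 250, 200, 100]).cumsum()

width = 500  # assumed bin width in dollars
# Edges 0, 500, 1000: enough to cover the largest cumulative value
edges = np.arange(0, cum_spent.max() + width, width)

spent_bin = pd.cut(cum_spent, bins=edges)
print(spent_bin)
```

Because plain binning assigns the overshooting row to the next interval, a running total that resets at the threshold is a better fit for the rule described above.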
A custom function can compute the cumulative sum with a reset once the limit is reached:
df['new'] = cumli(df['money_spent ($)'].values, 500)
out = df.groupby(df.new.iloc[::-1].cumsum()).agg(
    date = ('date', 'first'),
    meals_eaten = ('meals_eaten', 'mean'),
    min_weight = ('weight', 'min'),
    max_weight = ('weight', 'max')).sort_index(ascending=False)
Out[81]:
date meals_eaten min_weight max_weight
new
1 2021-01-01 3.5 140 170
0 2021-01-03 2.0 150 160
from numba import njit
@njit
def cumli(x, lim):
    # Flag (1) each row where the running total reaches lim, then reset the total
    total = 0
    result = []
    for i, y in enumerate(x):
        check = 0
        total += y
        if total >= lim:
            total = 0
            check = 1
        result.append(check)
    return result
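For readers without numba, the same idea can be sketched in plain Python, here also producing the cumulative_spent column from the desired table. The helper name `reset_groups` and the column name `group` are assumptions made for this illustration; on millions of rows the compiled version above will be much faster.

```python
import pandas as pd

def reset_groups(values, limit):
    """Assign a group id per row; a group closes on the row whose
    running total reaches `limit`, so that row stays in the group."""
    groups, total, gid = [], 0, 0
    for v in values:
        groups.append(gid)
        total += v
        if total >= limit:
            total, gid = 0, gid + 1
    return groups

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01 10:00:00", "2021-01-02 18:00:00",
                            "2021-01-03 12:10:00", "2021-01-04 19:40:00"]),
    "money_spent": [350, 250, 200, 100],
    "meals_eaten": [5, 2, 3, 1],
    "weight": [140, 170, 160, 150],
})

df["group"] = reset_groups(df["money_spent"].tolist(), 500)

out = df.groupby("group").agg(
    date=("date", "first"),
    cumulative_spent=("money_spent", "sum"),
    meals_eaten=("meals_eaten", "mean"),
    min_weight=("weight", "min"),
    max_weight=("weight", "max"),
)
print(out)
```

Numbering the groups forward avoids the reversed cumsum and the final sort, at the cost of a slower loop.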
I have this type of data, but in real life it has millions of entries. Product id is always product specific, but occurs several times during its lifetime.
date        product id          revenue  estimated lifetime value
2021-04-16  0061M00001AXc5lQAD  970      2000
2021-04-17  0061M00001AXbCiQAL  159      50000
2021-04-18  0061M00001AXb9AQAT  80       3000
2021-04-19  0061M00001AXbIHQA1  1100     8000
2021-04-20  0061M00001AXbY8QAL  90       4000
2021-04-21  0061M00001AXbQ1QAL  29       30000
2021-04-21  0061M00001AXc5lQAD  30       2000
2021-05-02  0061M00001AXc5lQAD  50       2000
2021-05-05  0061M00001AXc5lQAD  50       2000
I'm looking to create a new column in pandas that indicates when a certain product id has generated more revenue than a specific threshold (e.g. $100 or $1000), marking it as a Win (1). A win may occur only once during the lifecycle of a product. In addition, I want to create another column that indicates the row where the sales of a specific product exceed e.g. 10% of the estimated lifetime value.
What would be the most intuitive approach to achieve this in Python / Pandas?
edit:
dw1k_thresh: if the cumulative sales of a specific product id >= 1000, the column takes the value 1, otherwise 0. However, 1 can occur only once per product; after that the value is always 0 again. Basically it's just an indicator of the date and transaction at which a product's sales reach the critical value of 1000.
dw10perc: if the cumulative sales of one product id >= 10% of the estimated lifetime value, the column takes the value 1, otherwise 0. However, 1 can occur only once per product; after that the value is always 0 again. Basically it's just an indicator of the date and transaction at which a product's sales reach the critical value of 10% of the estimated lifetime value.
The threshold value is common for all product id's (I'll just replicate the process with different thresholds at a later stage to determine which is the optimal threshold to predict future revenue).
I'm trying to achieve this:
The code I've written so far is trying to establish the cum_rev and dw1k_thresh columns, but unfortunately it doesn't work.
df_final["dw1k_thresh"] = 0
df_final["cum_rev"] = 0
opp_list = set()
for row in df_final["product id"].iteritems():
    opp_list.add(row)
opp_list = list(opp_list)
opp_list = pd.Series(opp_list)
for i in opp_list:
    if i == df_final["product id"].any():
        df_final.cum_rev = df_final.revenue.cumsum()
        for x in df_final.cum_rev:
            if x >= 1000 & df_final.dw1k_thresh.sum() == 0:
                df_final.dw1k_thresh = 1
            else:
                df_final.dw1k_thresh = 0
df_final.head(30)
Cumulative revenue: can be calculated fairly simply with groupby and cumsum.
dw1k_thresh: first check whether cum_rev is at least 1000, then apply a function that keeps the 1 only on the first such row for each product and sets every later row back to zero.
dw10_perc: same approach as dw1k_thresh.
As a first step, you need to remove the $ sign and make sure the columns are numeric so that the comparisons you outlined can be performed.
# Imports
import pandas as pd
import numpy as np
# Remove $ sign and thousands separators, then convert to numeric
cols = ['revenue', 'estimated lifetime value']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True).astype(float)
# Cumulative revenue per product
df['cum_rev'] = df.groupby('product id')['revenue'].cumsum()
# Return the index of the earliest flagged row for each product
def f(df, thresh_col):
    return (df[df[thresh_col] == 1].sort_values(['date', 'product id'], ascending=False)
            .groupby('product id', as_index=False, group_keys=False)
            .apply(lambda x: x.tail(1))
            ).index.tolist()
# dw1k_thresh: flag every row at or above the threshold, then keep only the first per product
df['dw1k_thresh'] = np.where(df['cum_rev'].ge(1000), 1, 0)
df['dw1k_thresh'] = np.where(df.index.isin(f(df, 'dw1k_thresh')), 1, 0)
# dw10perc: compare against 10% of the row's own estimated lifetime value
# (the value is constant per product, so it must not be summed over the product's rows)
df['dw10_perc'] = np.where(df['cum_rev'].ge(0.10 * df['estimated lifetime value']), 1, 0)
df['dw10_perc'] = np.where(df.index.isin(f(df, 'dw10_perc')), 1, 0)
Prints:
>>> df
date product id revenue ... cum_rev dw1k_thresh dw10_perc
0 2021-04-16 0061M00001AXc5lQAD 970 ... 970 0 1
1 2021-04-17 0061M00001AXbCiQAL 159 ... 159 0 0
2 2021-04-18 0061M00001AXb9AQAT 80 ... 80 0 0
3 2021-04-19 0061M00001AXbIHQA1 1100 ... 1100 1 1
4 2021-04-20 0061M00001AXbY8QAL 90 ... 90 0 0
5 2021-04-21 0061M00001AXbQ1QAL 29 ... 29 0 0
6 2021-04-21 0061M00001AXc5lQAD 30 ... 1000 1 0
7 2021-05-02 0061M00001AXc5lQAD 50 ... 1050 0 0
8 2021-05-05 0061M00001AXc5lQAD 50 ... 1100 0 0
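The "1 only once per product" rule can also be written without sorting or `apply`, by counting how many rows of a product have already met the condition. The following is a sketch on a shortened, made-up frame; the names `product_id`, `lifetime_value` and `first_crossing` are hypothetical and not part of the answer above.

```python
import pandas as pd

# Hypothetical miniature of the data: product "A" appears three times
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-04-16", "2021-04-17", "2021-04-21", "2021-05-02"]),
    "product_id": ["A", "B", "A", "A"],
    "revenue": [970, 159, 30, 50],
    "lifetime_value": [2000, 50000, 2000, 2000],
})

df["cum_rev"] = df.groupby("product_id")["revenue"].cumsum()

def first_crossing(reached, keys):
    """Return 1 on the first row per key where `reached` is True, else 0."""
    # Running count of True rows within each key; it equals 1 only on the first hit
    hits_so_far = reached.astype(int).groupby(keys).cumsum()
    return (reached & (hits_so_far == 1)).astype(int)

df["dw1k_thresh"] = first_crossing(df["cum_rev"] >= 1000, df["product_id"])
df["dw10perc"] = first_crossing(
    df["cum_rev"] >= 0.10 * df["lifetime_value"], df["product_id"]
)
print(df)
```

This assumes the rows are already in chronological order within each product, since "first" here means first in row order.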
I am new to Python. I can see at least five similar questions, and this one is very close, but none of them work for me.
I have a dataframe with non-unique customers.
customer_id amount male age income days reward difficulty duration
0 id_1 16.06 1 45 62000.0 608 2.0 10.0 10.0
1 id_1 18.00 1 45 62000.0 608 2.0 10.0 10.0
I am trying to group them by customer_id, sum the amount, keep all other columns, and add one column, total, that counts the transactions.
Desired output
customer_id amount male age income days reward difficulty duration total
0 id_1 34.06 1 45 62000.0 608 2.0 10.0 10.0 2
My best personal attempt so far does not preserve all columns
df.groupby('customer_id')['amount'].agg(total_sum = 'sum', total = 'count')
You could do it this way, include all other columns in your groupby then reset_index after aggregating:
df.groupby(df.columns.difference(['amount']).tolist())['amount']\
.agg(total_sum='sum',total='count').reset_index()
Output:
age customer_id days difficulty duration income male reward total_sum total
0 45 id_1 608 10.0 10.0 62000.0 1 2.0 34.06 2
You could do:
grouper = df.groupby('customer_id')
first_dict = {col: 'first' for col in df.columns.difference(['customer_id', 'amount'])}
o = grouper.agg({
    'amount': 'sum',   # sum the amounts; the transaction count is added below
    **first_dict,
})
o['total'] = grouper.size().values
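The two answers above can be combined into a single call with named aggregation, which keeps the original column order and the original column names. This is a sketch that assumes the non-key columns are constant within each customer, as in the sample, so taking the first value loses nothing.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["id_1", "id_1"],
    "amount": [16.06, 18.00],
    "male": [1, 1],
    "age": [45, 45],
    "income": [62000.0, 62000.0],
    "days": [608, 608],
    "reward": [2.0, 2.0],
    "difficulty": [10.0, 10.0],
    "duration": [10.0, 10.0],
})

# Every column except the key and the summed column is carried over unchanged
other_cols = [c for c in df.columns if c not in ("customer_id", "amount")]

out = df.groupby("customer_id", as_index=False).agg(
    amount=("amount", "sum"),
    **{c: (c, "first") for c in other_cols},
    total=("amount", "count"),  # number of transactions per customer
)
print(out)
```

If the other columns can differ between a customer's rows, "first" would have to be replaced by an aggregation that fits each column, such as "mean".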
Based on @Scott Boston's answer, I found a solution myself. I acknowledge that it is not elegant (maybe someone can help clean it up), but it extends to the case where the rows are not identical (for instance, when each customer_id has five different transactions).
starbucks_grouped = df.groupby('customer_id').agg({'amount':['sum'], 'reward_':['sum'], 'difficulty':['mean'],
                                                   'duration':['mean'], 'male':['mean'],
                                                   'income':['mean'], 'days':['mean'], 'age':['mean'],
                                                   'customer_id':['count']}).reset_index()
df_grouped = starbucks_grouped.droplevel(1, axis = 1)
My output is
Here is a sample dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'VipNo': np.repeat(range(3), 2),
    'Quantity': np.random.randint(200, size=6),
    'OrderDate': np.random.choice(pd.date_range('3/31/2018', periods=365, freq='D'), 6, replace=False)})
print(df)
VipNo Quantity OrderDate
0 0 118 2019-02-16
1 0 49 2019-03-25
2 1 113 2018-05-11
3 1 127 2019-02-18
4 2 124 2018-12-27
5 2 71 2018-05-14
I want to create a new column that shows each customer's total quantity purchased between 2018-10-01 and 2019-03-31 as a fraction of the quantity purchased between 2018-03-31 and 2019-03-31. First I want to group by VipNo (each number represents an individual), because a person may have made multiple purchases. My dataset is big, so a customer may have ordered multiple times within both time ranges, and I want to use the sum of the orders.
(df.assign(Quantity6=df['OrderDate'].between("2018-10-01", "2019-03-31") * df.Quantity)
   .assign(Quantity12=df['OrderDate'].between("2018-03-31", "2019-03-31") * df.Quantity)
   .groupby('VipNo')[['Quantity6', 'Quantity12']]
   .sum()
   .assign(output=lambda x: x['Quantity6'] / x['Quantity12'])
)
Quantity6 Quantity12 output
VipNo
0 167 167 1.000000
1 127 240 0.529167
2 124 195 0.635897
This code achieves the goal, and I know I can drop Quantity6 and Quantity12 afterwards. But all I need is the single column "output", which I want to add to a dataframe I created earlier, and I want to keep the code short. How can I create this output column without creating the other, unnecessary columns?
Thank you in advance~
Just a few modifications in your code:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'VipNo': np.repeat(range(3), 2),
    'Quantity': np.random.randint(200, size=6),
    'OrderDate': np.random.choice(pd.date_range('3/31/2018', periods=365, freq='D'), 6, replace=False)}
).set_index("VipNo")
(df.assign(Quantity6=df['OrderDate'].between("2018-10-01", "2019-03-31") * df.Quantity)
   .assign(Quantity12=df['OrderDate'].between("2018-03-31", "2019-03-31") * df.Quantity)
   .groupby('VipNo')[['Quantity6', 'Quantity12']]
   .sum()
   .assign(output=lambda x: x['Quantity6'] / x['Quantity12'])
)["output"].to_frame().join(df)
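Another way to avoid the helper columns entirely is to mask the quantities first and divide the two grouped sums directly. The sketch below uses the sample frame printed in the question; the intermediate names `q6` and `q12` are assumptions for illustration.

```python
import pandas as pd

# Sample frame as printed in the question
df = pd.DataFrame({
    "VipNo": [0, 0, 1, 1, 2, 2],
    "Quantity": [118, 49, 113, 127, 124, 71],
    "OrderDate": pd.to_datetime([
        "2019-02-16", "2019-03-25", "2018-05-11",
        "2019-02-18", "2018-12-27", "2018-05-14",
    ]),
})

in6 = df["OrderDate"].between("2018-10-01", "2019-03-31")
in12 = df["OrderDate"].between("2018-03-31", "2019-03-31")

# Zero out quantities outside each window, then total them per customer
q6 = df["Quantity"].where(in6, 0).groupby(df["VipNo"]).sum()
q12 = df["Quantity"].where(in12, 0).groupby(df["VipNo"]).sum()

# One Series indexed by VipNo, ready to join onto another frame
output = (q6 / q12).rename("output")
print(output)
```

The result is a single Series, so nothing has to be dropped afterwards.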
I need help with some big pandas issue.
Since many people asked for the real input and the real desired output in order to answer the question, here they are:
So I have the following dataframe
Date        user  cumulative_num_exercises  total_exercises  %_exercises  %_exercises_accum
2017-01-01  1     2                         7                28,57        28,57
2017-01-01  2     1                         7                14,28        42,85
2017-01-01  4     3                         7                42,85        85,7
2017-01-01  10    1                         7                14,28        100
2017-02-02  1     2                         14               14,28        14,28
2017-02-02  2     3                         14               21,42        35,7
2017-02-02  4     4                         14               28,57        64,27
2017-02-02  10    5                         14               35,71        100
2017-03-03  1     3                         17               17,64        17,64
2017-03-03  2     3                         17               17,64        35,28
2017-03-03  4     5                         17               29,41        64,69
2017-03-03  10    6                         17               35,29        100
- The column %_exercises is the value of (cumulative_num_exercises/total_exercises)*100.
- The column %_exercises_accum is the running sum of %_exercises within each month. (Note that at the end of each month, it reaches the value 100.)
- With this data, I need to calculate the percentage of users who contributed to 50%, 80% and 90% of the total exercises during each month.
- To do so, I thought of creating a new column, called category, which will later be used to count how many users contributed to each of the 3 percentages (50%, 80% and 90%). The category column takes the following values:
0 if the user did a %_exercises_accum = 0.
1 if the user did a %_exercises_accum < 50 and > 0.
50 if the user did a %_exercises_accum = 50.
80 if the user did a %_exercises_accum = 80.
90 if the user did a %_exercises_accum = 90.
And so on, because there are many cases in order to determine who contributes to which percentage of the total number of exercises on each month.
I have already determined all the cases and all the values that must be taken.
Basically, I traverse the dataframe using a for loop, and with two main ifs:
if (df.iloc[i][date] == df.iloc[i][date].shift()):
    calculations to determine the percentage or percentages to which the user from the second to the last row of the same month group contributes
    (because the same user can contribute to all the percentages, or to more than one)
else:
    calculations to determine to which percentage of exercises the first member of each month group contributes.
The calculations involve:
Looking at the value of the category column in the previous row using shift().
Doing while loops inside the for loop, because when a user suddenly reaches a big percentage, we need to go back over the earlier users in the same month and change their category column value to 50, as they have contributed to the 50% but didn't reach it. For instance, in this situation:
Date %_exercises_accum
2017-01-01 1,24
2017-01-01 3,53
2017-01-01 20,25
2017-01-01 55,5
The desired output for the given dataframe at the beginning of the question would include the same columns as before (date, user, cumulative_num_exercises, total_exercises, %_exercises and %_exercises_accum) plus the category column, which is the following:
category
50
50
508090
90
50
50
5080
8090
50
50
5080
8090
Note that rows with the values 508090 or 8090 mean that the user contributes to several percentages:
508090: 50%, 80% and 90% of total exercises in a month.
8090: both 80% and 90% of exercises in a month.
Does anyone know how can I simplify this for loop by traversing the groups of a group by object?
Thank you very much!
Given no sense of what calculations you wish to accomplish, this is my best guess at what you're looking for. However, I'd re-iterate Datanovice's point that the best way to get answers is to provide a sample output.
You can slice to each unique date using the following code:
import pandas as pd
dates = ['2017-01-01', '2017-01-01','2017-01-01','2017-01-01','2017-02-02','2017-02-02','2017-02-02','2017-02-02','2017-03-03','2017-03-03','2017-03-03','2017-03-03']
df = pd.DataFrame(
    {'date': pd.to_datetime(dates),
     'user': [1,2,4,10,1,2,4,10,1,2,4,10],
     'cumulative_num_exercises': [2,1,3,1,2,3,4,5,3,3,5,6],
     'total_exercises': [7,7,7,7,14,14,14,14,17,17,17,17]}
)
df = df.set_index('date')
for idx in df.index.unique():
    # hold contains all rows that share the current date
    hold = df.loc[idx]
    ### YOUR CODE GOES HERE ###
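If the goal is to avoid the row loop altogether, the category column from the question's sample output can be reproduced with grouped operations. The rule used here is an assumption inferred from that sample, not something the question states: a row is tagged with a threshold when its span of accumulated percentage (from the previous row's total to its own) overlaps the band between the previous threshold and this one. The column names `pct`, `pct_accum` and `category` are likewise chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01"] * 4 + ["2017-02-02"] * 4 + ["2017-03-03"] * 4),
    "user": [1, 2, 4, 10] * 3,
    "cumulative_num_exercises": [2, 1, 3, 1, 2, 3, 4, 5, 3, 3, 5, 6],
    "total_exercises": [7] * 4 + [14] * 4 + [17] * 4,
})

# Per-row share and its running total within each date
df["pct"] = df["cumulative_num_exercises"] / df["total_exercises"] * 100
df["pct_accum"] = df.groupby("date")["pct"].cumsum()

# Running total before the current row (0 for the first row of each date)
prev_accum = df.groupby("date")["pct_accum"].shift(fill_value=0.0)

thresholds = [50, 80, 90]
lower_bounds = [0] + thresholds[:-1]

# Assumed rule: tag a row with `high` when (prev_accum, pct_accum] overlaps (low, high)
category = pd.Series("", index=df.index)
for low, high in zip(lower_bounds, thresholds):
    overlaps = (prev_accum < high) & (df["pct_accum"] > low)
    category = category + overlaps.map({True: str(high), False: ""})
df["category"] = category
print(df[["date", "user", "pct_accum", "category"]])
```

On the sample data this yields the same twelve category values as the desired output, but the rule should be checked against the remaining cases mentioned in the question before relying on it.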