I would like to calculate standard deviations for non-rolling intervals.
I have a df like this:
value  std  year
3      nan  2001
2      nan  2001
4      nan  2001
19     nan  2002
23     nan  2002
34     nan  2002
and so on. I would like to calculate the standard deviation for every year and save it in the "std" cell of every row belonging to that year. I have the same amount of data for every year, so the length of the intervals never changes.
I already tried:
df["std"] = df.groupby("year").std()
but since the right-hand side returns a new dataframe with the std of every column grouped by year, this obviously does not work.
Thank you all very much for your support!
IIUC:
Try via the transform() method:
df['std'] = df.groupby("year")['value'].transform('std')
OR
If you want to find the standard deviation of multiple columns then:
df[['std1','std2']] = df.groupby("year")[['column1','column2']].transform('std')
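For the sample frame from the question, a quick self-contained check (std here is pandas' default sample standard deviation, ddof=1):
import pandas as pd

df = pd.DataFrame({'value': [3, 2, 4, 19, 23, 34],
                   'year': [2001, 2001, 2001, 2002, 2002, 2002]})
df['std'] = df.groupby('year')['value'].transform('std')
print(df)
#    value  year       std
# 0      3  2001  1.000000
# 1      2  2001  1.000000
# 2      4  2001  1.000000
# 3     19  2002  7.767453
# 4     23  2002  7.767453
# 5     34  2002  7.767453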
I have this type of data, but in real life it has millions of entries. Product id is always product-specific, but occurs several times during the product's lifetime.
date        product id          revenue  estimated lifetime value
2021-04-16  0061M00001AXc5lQAD  970      2000
2021-04-17  0061M00001AXbCiQAL  159      50000
2021-04-18  0061M00001AXb9AQAT  80       3000
2021-04-19  0061M00001AXbIHQA1  1100     8000
2021-04-20  0061M00001AXbY8QAL  90       4000
2021-04-21  0061M00001AXbQ1QAL  29       30000
2021-04-21  0061M00001AXc5lQAD  30       2000
2021-05-02  0061M00001AXc5lQAD  50       2000
2021-05-05  0061M00001AXc5lQAD  50       2000
I'm looking to create a new column in pandas that indicates when a certain product id has generated more revenue than a specific threshold, e.g. $100 or $1000, marking it as a Win (1). A win may occur only once during the lifecycle of a product. In addition, I want to create another column that indicates the row where a specific product's sales exceed e.g. 10% of the estimated lifetime value.
What would be the most intuitive approach to achieve this in Python / Pandas?
edit:
dw1k_thresh: if the cumulative sales of a specific product id >= 1000, the column takes a value of 1, otherwise zero. However, 1 can occur only once; after that the value is always zero again. Basically it's just an indicator of the date and transaction when a product's sales exceed the critical value of 1000.
dw10perc: if the cumulative sales of one product id >= 10% of the estimated lifetime value, the column takes a value of 1, otherwise 0. However, 1 can occur only once; after that the value is always zero again. Basically it's just an indicator of the date and transaction when a product's sales exceed the critical value of 10% of the estimated lifetime value.
The threshold value is common for all product ids (I'll just replicate the process with different thresholds at a later stage to determine which is the optimal threshold to predict future revenue).
I'm trying to achieve this:
The code I've written so far is trying to establish the cum_rev and dw1k_thresh columns, but unfortunately it doesn't work.
df_final["dw1k_thresh"] = 0
df_final["cum_rev"]= 0
opp_list =set()
for row in df_final["product id"].iteritems():
opp_list.add(row)
opp_list=list(opp_list)
opp_list=pd.Series(opp_list)
for i in opp_list:
if i == df_final["product id"].any():
df_final.cum_rev = df_final.revenue.cumsum()
for x in df_final.cum_rev:
if x >= 1000 & df_final.dw1k_thresh.sum() == 0:
df_final.dw1k_thresh = 1
else:
df_final.dw1k_thresh = 0
df_final.head(30)
Cumulative Revenue: Can be calculated fairly simply with groupby and cumsum.
dw1k_thresh: We first check whether cum_rev has reached 1000 and then apply a function that keeps the 1 only once; after that the value is always zero again.
dw10_perc: Same approach as dw1k_thresh.
As a first step you would need to remove $ and make sure your columns are of numeric type to perform the comparisons you outlined.
# Imports
import pandas as pd
import numpy as np
# Remove $ sign and convert to numeric
cols = ['revenue','estimated lifetime value']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True).astype(float)
# Cumulative Revenue
df['cum_rev'] = df.groupby('product id')['revenue'].cumsum()
# Function applied to both threshold columns: for each product, return the
# index of the earliest row where the column is 1 (sort descending by date,
# then keep the last row of each group)
def f(df, thresh_col):
    return (df[df[thresh_col]==1].sort_values(['date','product id'], ascending=False)
            .groupby('product id', as_index=False, group_keys=False)
            .apply(lambda x: x.tail(1))
            ).index.tolist()
# dw1k_thresh
df['dw1k_thresh'] = np.where(df['cum_rev'].ge(1000),1,0)
df['dw1k_thresh'] = np.where(df.index.isin(f(df,'dw1k_thresh')),1,0)
# dw10perc
df['dw10_perc'] = np.where(df['cum_rev'] > 0.10 * df.groupby('product id', observed=True)['estimated lifetime value'].transform('first'), 1, 0)
df['dw10_perc'] = np.where(df.index.isin(f(df,'dw10_perc')),1,0)
Prints:
>>> df
date product id revenue ... cum_rev dw1k_thresh dw10_perc
0 2021-04-16 0061M00001AXc5lQAD 970 ... 970 0 1
1 2021-04-17 0061M00001AXbCiQAL 159 ... 159 0 0
2 2021-04-18 0061M00001AXb9AQAT 80 ... 80 0 0
3 2021-04-19 0061M00001AXbIHQA1 1100 ... 1100 1 1
4 2021-04-20 0061M00001AXbY8QAL 90 ... 90 0 0
5 2021-04-21 0061M00001AXbQ1QAL 29 ... 29 0 0
6 2021-04-21 0061M00001AXc5lQAD 30 ... 1000 1 0
7 2021-05-02 0061M00001AXc5lQAD 50 ... 1050 0 0
8 2021-05-05 0061M00001AXc5lQAD 50 ... 1100 0 0
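An alternative sketch for the same "only once" logic: locate the first row per product where cum_rev crosses the threshold using idxmax on the boolean mask (this reproduces the dw1k_thresh column above):
# First crossing per product via idxmax on the boolean mask
crossed = df['cum_rev'].ge(1000)
first_cross = crossed.groupby(df['product id']).idxmax()
# idxmax returns the first index even when a product never crosses,
# so keep only products where the mask is True at least once
first_cross = first_cross[crossed.groupby(df['product id']).any()]
df['dw1k_thresh'] = 0
df.loc[first_cross, 'dw1k_thresh'] = 1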
I am trying to parse and organize a trading history file. I want to aggregate every 3 or 4 rows that have the same Type (BUY or SELL), but only if they come directly after each other; if they don't, I want to keep just the single row.
As the example below shows, those multi-buy trades should be aggregated into one row, followed by one row for the sell trade.
The result should be a new df with aggregated trade prices and amounts.
link for csv: https://drive.google.com/file/d/1GoDRdI7G8uJzuLoFrm5InbDg23mAwW6o/view?usp=sharing
You can use this to get the results you are looking for. I am using a cumulative sum that increments whenever the previous value is not equal to the current value:
dictionary = { "BUY": 1, "SELL": 0}
df['id1'] = df['Type'].map(dictionary)
df['grp'] = (df['id1']!=df['id1'].shift()).cumsum()
Now you can aggregate the values using a simple groupby like below. This will sum the Amount over each consecutive run of buys or sells:
df.groupby(['grp'])['Amount'].sum()
This is the output of the grp column:
Type grp
0 BUY 1
1 BUY 1
2 BUY 1
3 BUY 1
4 SELL 2
5 SELL 2
6 SELL 2
7 SELL 2
8 BUY 3
9 SELL 4
10 BUY 5
11 SELL 6
12 BUY 7
13 SELL 8
14 BUY 9
15 BUY 9
16 SELL 10
17 SELL 10
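If you also want the aggregated prices mentioned in the question, the same grp key works with a multi-column agg. A sketch, assuming the linked CSV has a 'Price' column (adjust the column name and the aggregation to your file):
# Aggregate each consecutive run of BUYs or SELLs into one row
trades = (df.groupby('grp')
            .agg(Type=('Type', 'first'),
                 Amount=('Amount', 'sum'),
                 Price=('Price', 'mean'))
            .reset_index(drop=True))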
I need help with a big pandas issue.
Since a lot of people asked for the real input and the real desired output in order to answer the question, here it goes.
I have the following dataframe:
Date        user  cumulative_num_exercises  total_exercises  %_exercises  %_exercises_accum
2017-01-01  1     2                         7                28.57        28.57
2017-01-01  2     1                         7                14.28        42.85
2017-01-01  4     3                         7                42.85        85.7
2017-01-01  10    1                         7                14.28        100
2017-02-02  1     2                         14               14.28        14.28
2017-02-02  2     3                         14               21.42        35.7
2017-02-02  4     4                         14               28.57        64.27
2017-02-02  10    5                         14               35.71        100
2017-03-03  1     3                         17               17.64        17.64
2017-03-03  2     3                         17               17.64        35.28
2017-03-03  4     5                         17               29.41        64.69
2017-03-03  10    6                         17               35.29        100
-The column %_exercises is (cumulative_num_exercises / total_exercises) * 100.
-The column %_exercises_accum is the running sum of %_exercises within each month. (Note that at the end of each month it reaches the value 100.)
-With this data, I need to calculate the % of users that contributed to 50%, 80% and 90% of the total exercises during each month.
-In order to do so, I have thought to create a new column, called category, which will later be used to count how many users contributed to each of the 3 percentages (50%, 80% and 90%). The category column takes the following values:
0 if the user did a %_exercises_accum = 0.
1 if the user did a %_exercises_accum < 50 and > 0.
50 if the user did a %_exercises_accum = 50.
80 if the user did a %_exercises_accum = 80.
90 if the user did a %_exercises_accum = 90.
And so on, because there are many cases in order to determine who contributes to which percentage of the total number of exercises in each month.
I have already determined all the cases and all the values that must be taken.
Basically, I traverse the dataframe using a for loop with two main ifs:
if (df.iloc[i][date] == df.iloc[i][date].shift()):
    # calculations to determine the percentage or percentages to which the
    # user (from the second to the last row of the same month group)
    # contributes, because the same user can contribute to all the
    # percentages, or to more than one
else:
    # calculations to determine to which percentage of exercises the first
    # member of each month group contributes
The calculations involve:
Looking at the value of the category column in the previous row using shift().
Running while loops inside the for loop, because when a user suddenly reaches a big percentage we need to go back over the users in the same month and change their category column value to 50, as they contributed to the 50% but did not reach it themselves. For instance, in this situation:
Date        %_exercises_accum
2017-01-01  1.24
2017-01-01  3.53
2017-01-01  20.25
2017-01-01  55.50
The desired output for the dataframe given at the beginning of the question would include the same columns as before (date, user, cumulative_num_exercises, total_exercises, %_exercises and %_exercises_accum) plus the category column, which is the following:
category
50
50
508090
90
50
50
5080
8090
50
50
5080
8090
Note that rows with the values 508090 or 8090 mean that the user contributes to:
508090: the 50%, 80% and 90% of total exercises in a month.
8090: both the 80% and 90% of exercises in a month.
Does anyone know how I can simplify this for loop by traversing the groups of a groupby object?
Thank you very much!
Since there is no sense of what calculations you wish to accomplish, this is my best guess at what you're looking for. However, I'd reiterate Datanovice's point that the best way to get answers is to provide a sample output.
You can slice to each unique date using the following code:
dates = ['2017-01-01', '2017-01-01','2017-01-01','2017-01-01','2017-02-02','2017-02-02','2017-02-02','2017-02-02','2017-03-03','2017-03-03','2017-03-03','2017-03-03']
df = pd.DataFrame(
{'date':pd.to_datetime(dates),
'user': [1,2,4,10,1,2,4,10,1,2,4,10],
'cumulative_num_exercises':[2,1,3,1,2,3,4,5,3,3,5,6],
'total_exercises':[7,7,7,7,14,14,14,14,17,17,17,17]}
)
df = df.set_index('date')
for idx in df.index.unique():
    hold = df.loc[idx]
    ### YOUR CODE GOES HERE ###
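If part of the goal is just the two derived columns from the question, they can be computed without an explicit loop. A sketch building on the df above (whose index is the date):
# Per-row share of the month's total exercises
df['%_exercises'] = df['cumulative_num_exercises'] / df['total_exercises'] * 100
# Running sum within each month group; reaches 100 at the end of each month
df['%_exercises_accum'] = df.groupby(level=0)['%_exercises'].cumsum()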
I am having trouble illustrating my problem in the form the data is in without complicating things, so bear with me: the following screenshot is for explaining the problem only (i.e. the data is not in this form):
I would like to identify the past 14 days with a number > 0 across all bins (i.e. the total row has a value greater than 0). This would include all days except days 5 and 12 (highlighted in red). I would then like to sum across bins horizontally for those 14 days (i.e. sum all days except 5 and 12, by bin), with the goal of ultimately calculating a 14-day average by Bin number.
Note the example above is for one "Lane", and my data has more than 10,000 of them. The example also only illustrates today being day 16, but I would like to apply this logic to every day in the data set, i.e. on day 20 (along with any other date) it would look at the last 14 days with a value across all bins, then use that date range to aggregate across Bin. This is a screenshot sample of how the data looks:
A simple example using the data as it is structured, with only 3 Bins, 1 Lane, and a 3-date look-back:
Lane Date Bin KG
AMS-ORD 2018-08-26 3 10
AMS-ORD 2018-08-29 1 25
AMS-ORD 2018-08-30 2 30
AMS-ORD 2018-09-03 2 20
AMS-ORD 2018-09-04 1 40
Note KG here is a sum. Again, this is for one day (i.e. today), but I would like every date in my data set to follow the same logic. The output would look like the following:
Lane Date Bin KG Average
AMS-ORD 2018-09-04 1 40 13.33
AMS-ORD 2018-09-04 2 50 16.67
AMS-ORD 2018-09-04 3 0 -
I have messed around with .rolling(14).mean(), .tail(), and some others. The problem I have is specifying the correct date range for the correct Bin aggregation.
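A minimal sketch of one way to express the look-back logic on the 3-date example above, assuming a date counts toward the look-back when the lane's total KG on that date is > 0:
import pandas as pd

df = pd.DataFrame({
    'Lane': ['AMS-ORD'] * 5,
    'Date': pd.to_datetime(['2018-08-26', '2018-08-29', '2018-08-30',
                            '2018-09-03', '2018-09-04']),
    'Bin': [3, 1, 2, 2, 1],
    'KG': [10, 25, 30, 20, 40],
})

LOOKBACK = 3  # would be 14 in the real data

def lane_lookback_average(g):
    # Most recent LOOKBACK dates on which the lane had any activity
    active = (g.groupby('Date')['KG'].sum()
               .loc[lambda s: s > 0]
               .sort_index(ascending=False)
               .head(LOOKBACK)
               .index)
    window = g[g['Date'].isin(active)]
    # Sum KG per bin over the window; bins absent from the window get 0
    kg = (window.groupby('Bin')['KG'].sum()
                .reindex(sorted(g['Bin'].unique()), fill_value=0))
    out = kg.to_frame('KG')
    out['Average'] = out['KG'] / LOOKBACK
    return out

print(df.groupby('Lane').apply(lane_lookback_average))
This computes the window ending at the most recent date; to apply the same logic on every day, you would repeat the same selection with the data truncated to each as-of date.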