Cumulative sum that resets when the condition is no longer met - python

I have a dataframe with a column that consists of datetime values, one that consists of speed values, and one that consists of timedelta values between the rows.
I want to get the cumulative sum of the timedeltas whenever the speed is below 2 knots. When the speed rises above 2 knots, I would like this cumulative sum to reset to 0, and then to start summing again at the next instance of speed observations below 2 knots.
I have started by flagging all observations with speed < 2. I only manage to get the cumulative sum over all of the observations with speed < 2, but not a separate cumulative sum for each instance.
The dataframe looks like this, and cum_sum is the desired output:
datetime speed timedelta cum_sum flag
1-1-2019 19:30:00 0.5 0 0 1
1-1-2019 19:32:00 0.7 2 2 1
1-1-2019 19:34:00 0.1 2 4 1
1-1-2019 19:36:00 5.0 2 0 0
1-1-2019 19:38:00 25.0 2 0 0
1-1-2019 19:42:00 0.1 4 4 1
1-1-2019 19:49:00 0.1 7 11 1

You can use the method from "How to groupby consecutive values in pandas DataFrame" to get the groups where flag is either 1 or 0, and then you will just need to apply the cumsum on the timedelta column, and set those values where flag == 0 to 0:
gb = df.groupby((df['flag'] != df['flag'].shift()).cumsum())
df['cum_sum'] = gb['timedelta'].cumsum()
df.loc[df['flag'] == 0, 'cum_sum'] = 0
print(df)
will give
datetime speed timedelta flag cum_sum
0 1-1-2019 19:30:00 0.5 0 1 0
1 1-1-2019 19:32:00 0.7 2 1 2
2 1-1-2019 19:34:00 0.1 2 1 4
3 1-1-2019 19:36:00 5.0 2 0 0
4 1-1-2019 19:38:00 25.0 2 0 0
5 1-1-2019 19:42:00 0.1 4 1 4
6 1-1-2019 19:49:00 0.1 7 1 11
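If the flag column does not exist yet, a minimal self-contained sketch of the same idea could look like this (column names and sample values taken from the question; the where call zeroes out the rows at or above 2 knots):
import pandas as pd

df = pd.DataFrame({'speed': [0.5, 0.7, 0.1, 5.0, 25.0, 0.1, 0.1],
                   'timedelta': [0, 2, 2, 2, 2, 4, 7]})

# flag rows below 2 knots, then number each run of equal flag values
flag = (df['speed'] < 2).astype(int)
groups = (flag != flag.shift()).cumsum()

# cumulative sum per run, zeroed where the flag is 0
df['cum_sum'] = df.groupby(groups)['timedelta'].cumsum().where(flag == 1, 0)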

Note: this approach uses a global variable.
import pandas as pd

c = 0
def fun(x):
    global c
    if x['speed'] > 2.0:
        c = 0
    else:
        c = x['timedelta'] + c
    return c

df = pd.DataFrame({'datetime': ['1-1-2019 19:30:00']*7,
                   'speed': [0.5, 0.7, 0.1, 5.0, 25.0, 0.1, 0.1],
                   'timedelta': [0, 2, 2, 2, 2, 4, 7]})
df['cum_sum'] = df.apply(fun, axis=1)
datetime speed timedelta cum_sum
0 1-1-2019 19:30:00 0.5 0 0
1 1-1-2019 19:30:00 0.7 2 2
2 1-1-2019 19:30:00 0.1 2 4
3 1-1-2019 19:30:00 5.0 2 0
4 1-1-2019 19:30:00 25.0 2 0
5 1-1-2019 19:30:00 0.1 4 4
6 1-1-2019 19:30:00 0.1 7 11

Related

Pandas DataFrame Change Values Based on Values in Different Rows

I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday; b is the code for Easter, and there are other codes for other holidays.
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you; if not, please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The StateHolidayNew column should have the info you need to start analyzing your data.
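As a quick usage example (assuming the columns above), you could then compare sales across the new labels; note that with this approach the days just before a holiday carry the same code as the holiday itself:
# average sales per label; "0" covers ordinary days
print(df.groupby("StateHolidayNew")["Sales"].mean())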
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the values between the different letters that represent the holidays, and then use groupby to find the sales for each group. An improvement would be to backfill the holiday codes onto the groups before them, e.g. groups = 0.0 would become b_0, which would make it easier to see which holiday each group precedes, but I am not sure how to do that (a rough sketch of this idea follows the final DataFrame below).
import numpy as np

# flag holiday rows (letters) and number the stretches of non-holiday rows between them
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
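The backfilling mentioned above could be sketched roughly like this (a hypothetical, untested refinement built on the final DataFrame shown; rows after the last holiday have nothing to backfill, so they keep their numeric group):
is_holiday = df['StateHoliday'].astype(str).str.isalpha()
# code of the next holiday row, backfilled onto the rows before it
next_code = df['StateHoliday'].where(is_holiday).bfill()

# only relabel non-holiday rows that actually precede a holiday
mask = ~is_holiday & next_code.notna()
df.loc[mask, 'groups'] = (next_code[mask] + '_'
                          + df.loc[mask, 'groups'].astype(float).astype(int).astype(str))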

Conditional backward fill of a column in Python

I have a data set (sample) like below
Date Value
2019-05-01 0
2019-05-02 0
2019-05-03 0
2019-05-04 0
2019-05-05 0
2019-05-06 0
2019-05-07 0
2019-05-08 1
2019-05-09 0
I want to transform it such that, if I encounter Value = 1, I take the 3 values from 2 days before and set them to 1, and also set the current value to 0.
In other words, the transformed data set should look like this
Date Value
2019-05-01 0
2019-05-02 0
2019-05-03 1
2019-05-04 1
2019-05-05 1
2019-05-06 0
2019-05-07 0
2019-05-08 0
2019-05-09 0
Notice that in the example above, 2019-05-08 was set to 0 after the transformation, and 2019-05-03 to 2019-05-05 were set to 1 (the last value set to 1 is 2 days before 2019-05-08, and the 3 days ending on 2019-05-05 are also set to 1).
If two consecutive values show up as 1, we start the date calculation from the last value that shows up as 1.
I think I can do this via for loops, but was looking to see if any inbuilt functions can help me with this.
Thanks!
There could be more precise ways of solving this problem. However, I could only think of solving it using the index values (say i) where Value == 1, then grabbing the index values at the preceding locations (2 dates before means i-3, and the two values above that mean i-4 and i-5) and assigning Value to 1. Finally, set Value back to 0 for the index location(s) that were originally found for Value == 1.
In [53]: df = pd.DataFrame({'Date': ['2019-05-01', '2019-05-02', '2019-05-03', '2019-05-04', '2019-05-05',
    ...:                             '2019-05-06', '2019-05-07', '2019-05-08', '2019-05-09'],
    ...:                    'Value': [0, 0, 0, 0, 0, 0, 0, 1, 0]})
In [54]: val_1_index = df.loc[df.Value == 1].index.tolist()
In [55]: val_1_index_decr = [(i-3, i-4, i-5) for i in val_1_index]
In [56]: df.loc[df['Value'].index.isin([i for i in val_1_index_decr[0]]), 'Value'] = 1
In [57]: df.loc[df['Value'].index.isin(val_1_index), 'Value'] = 0
In [58]: df
Out[58]:
Date Value
0 2019-05-01 0
1 2019-05-02 0
2 2019-05-03 1
3 2019-05-04 1
4 2019-05-05 1
5 2019-05-06 0
6 2019-05-07 0
7 2019-05-08 0
8 2019-05-09 0
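Note that the line using val_1_index_decr[0] only handles the first occurrence of Value == 1. If several 1s can occur, a small generalization along the same lines (a sketch reusing the variables above) is to flatten the list of index tuples first:
# flatten all (i-3, i-4, i-5) tuples so every occurrence of Value == 1 is handled
all_prev_idx = [i for tup in val_1_index_decr for i in tup]
df.loc[df.index.isin(all_prev_idx), 'Value'] = 1
df.loc[df.index.isin(val_1_index), 'Value'] = 0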
A one line solution, assuming that df is your original dataframe:
df['Value'] = pd.Series([1 if 1 in df.iloc[i+3:i+6].values else 0 for i in df.index])
Here I work on index rather than dates, so I assume that you have one day per row and days are consecutive as shown in your example.
To fit also for this request:
If two consecutive values show up as 1, we start the date calculation from the last value that shows up as 1.
I can propose a two line solution:
validones = [True if df.iloc[i]['Value'] == 1 and df.iloc[i+1]['Value'] == 0 else False for i in df.index]
df['Value'] = pd.Series([1 if any(validones[i+3:i+6]) else 0 for i in range(len(validones))])
Basically, I first build a list of booleans to check that a 1 in df['Value'] is not followed by another 1, and then use this boolean list to perform the substitutions.
Not sure about the efficiency of this solution, because one needs to create three new columns, but this also works:
df['shiftedValues'] = \
df['Value'].shift(-3, fill_value=0) + \
df['Value'].shift(-4, fill_value=0) + \
df['Value'].shift(-5, fill_value=0)
Note that the shift is done by row and not by day.
To shift by actual days, I would first index by dates:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df['shiftedValues'] = \
df['Value'].shift(-3, freq='1D', fill_value=0).asof(df.index) + \
df['Value'].shift(-4, freq='1D', fill_value=0).asof(df.index) + \
df['Value'].shift(-5, freq='1D', fill_value=0).asof(df.index)
# Out:
# Value shiftedValues
# Date
# 2019-05-01 0 0.0
# 2019-05-02 0 0.0
# 2019-05-03 0 1.0
# 2019-05-04 0 1.0
# 2019-05-05 0 1.0
# 2019-05-06 0 0.0
# 2019-05-07 0 0.0
# 2019-05-08 1 0.0
# 2019-05-09 0 0.0
Now this works correctly for dates, for instance if df is (note the missing and repeated days)
Date Value
0 2019-05-01 0
1 2019-05-02 0
2 2019-05-03 0
3 2019-05-04 0
4 2019-05-05 0
5 2019-05-05 0
6 2019-05-07 0
7 2019-05-08 1
8 2019-05-09 0
then you get
Value shiftedValues
Date
2019-05-01 0 0.0
2019-05-02 0 0.0
2019-05-03 0 1.0
2019-05-04 0 1.0
2019-05-05 0 1.0
2019-05-05 0 1.0
2019-05-07 0 0.0
2019-05-08 1 0.0
2019-05-09 0 0.0
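If the goal is a single Value column like the desired output, one possibility (assuming the shiftedValues column computed above) is to threshold it and drop the helper; in this example the original 1 falls outside the shifted window, so it ends up as 0, as required:
df['Value'] = (df['shiftedValues'] > 0).astype(int)
df = df.drop(columns='shiftedValues')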

How do I aggregate rows with an upper bound on column value?

I have a pd.DataFrame I'd like to transform:
id values days time value_per_day
0 1 15 15 1 1
1 1 20 5 2 4
2 1 12 12 3 1
I'd like to aggregate these into equal buckets of 10 days. Since days at time 1 is larger than 10, it should spill into the next row, so that the value/day of the 2nd row is an average of the 1st and the 2nd.
Here is the resulting output, where (values, 0) = 15*(10/15) = 10 and (values, 1) = (5+20)/2:
id values days value_per_day
0 1 10 10 1.0
1 1 25 10 2.5
2 1 10 10 1.0
3 1 2 2 1.0
I've tried pd.Grouper:
df.set_index('days').groupby([pd.Grouper(freq='10D', label='right'), 'id']).agg({'values': 'mean'})
Out[146]:
values
days id
5 days 1 16
15 days 1 10
But I'm clearly using it incorrectly.
csv for convenience:
id,values,days,time
1,10,15,1
1,20,5,2
1,12,12,3
Note: this is a time-costly solution, since the frame is expanded to one row per day.
import numpy as np

newdf = df.reindex(df.index.repeat(df.days))   # one row per covered day
v = np.arange(sum(df.days)) // 10              # 10-day bucket for each day
dd = pd.DataFrame({'value_per_day': newdf.groupby(v).value_per_day.mean(),
                   'days': np.bincount(v)})
dd
Out[102]:
days value_per_day
0 10 1.0
1 10 2.5
2 10 1.0
3 2 1.0
dd.assign(value=dd.days*dd.value_per_day)
Out[103]:
days value_per_day value
0 10 1.0 10.0
1 10 2.5 25.0
2 10 1.0 10.0
3 2 1.0 2.0
I did not include a groupby on id here; if you need that for your real data, you can loop over df.groupby('id') and apply the steps above within the loop, for example:
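A rough per-id sketch (hypothetical; it assumes the id and value_per_day columns from the table above):
import numpy as np
import pandas as pd

def bucket_10_days(g):
    # repeat each row once per day it covers, then cut the days into buckets of 10
    repeated = g.reindex(g.index.repeat(g.days))
    v = np.arange(len(repeated)) // 10
    out = pd.DataFrame({'value_per_day': repeated.groupby(v).value_per_day.mean(),
                        'days': np.bincount(v)})
    out['values'] = out.days * out.value_per_day
    return out

pieces = []
for i, g in df.groupby('id'):
    out = bucket_10_days(g)
    out['id'] = i
    pieces.append(out)
result = pd.concat(pieces, ignore_index=True)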

Cumulative sum on time series split by consecutive negative or positive values

I have a time series data that looks like this:
date values
2017-05-01 1
2017-05-02 0.5
2017-05-03 -2
2017-05-04 -1
2017-05-05 -1.25
2017-05-06 0.5
2017-05-07 0.5
I would like to add a field that computes the cumulative sum of my time series by trend: sum of consecutive positive values, sum of consecutive negative values.
Something that looks like this:
date values newfield
2017-05-01 1 1 |
2017-05-02 0.5 1.5 |
2017-05-03 -2 -2 |
2017-05-04 -1 -3 |
2017-05-05 -1.25 -4.25 |
2017-05-06 0.5 0.5 |
2017-05-07 0.5 1 |
At the moment I'm trying to use shift and then conditions, but this is really not efficient and I'm realizing it is not a good approach.
def pn(x, y):
    if x < 0 and y < 0:
        return 1
    if x > 0 and y > 0:
        return 1
    else:
        return 0

def consum(x, y, z):
    if z == 0:
        return x
    if y == 1:
        return x + y
test = pd.read_csv("./test.csv", sep=";")
test['temp'] = test.Value.shift(1)
test['temp2'] = test.apply(lambda row: pn(row['Value'], row['temp']), axis=1)
test['temp3'] = test.apply(lambda row: consum(row['Value'], row['temp'], row['temp2']), axis=1)
Date Value temp temp2 temp3
2017-05-01 1 nan 0 1
2017-05-02 0.5 1 1 1.5
2017-05-03 -2 0 0 -2
2017-05-04 -1 -2 1 nan
2017-05-05 -1.25 -1 1 nan
2017-05-06 0.5 -1.25 0 0.5
2017-05-07 0.5 0.5 1 nan
After that I'm lost. I could continue to shift my values and have lots of if statements but there must be a better way.
Putting 0 in with the positives, you can use the shift-compare-cumsum pattern:
In [33]: sign = df["values"] >= 0
In [34]: df["vsum"] = df["values"].groupby((sign != sign.shift()).cumsum()).cumsum()
In [35]: df
Out[35]:
date values vsum
0 2017-05-01 1.00 1.00
1 2017-05-02 0.50 1.50
2 2017-05-03 -2.00 -2.00
3 2017-05-04 -1.00 -3.00
4 2017-05-05 -1.25 -4.25
5 2017-05-06 0.50 0.50
6 2017-05-07 0.50 1.00
which works because (sign != sign.shift()).cumsum() gives us a new number for each contiguous group:
In [36]: sign != sign.shift()
Out[36]:
0 True
1 False
2 True
3 False
4 False
5 True
6 False
Name: values, dtype: bool
In [37]: (sign != sign.shift()).cumsum()
Out[37]:
0 1
1 1
2 2
3 2
4 2
5 3
6 3
Name: values, dtype: int64
Create the groups:
g = np.sign(df['values']).diff().ne(0).cumsum()
g
Output:
0 1
1 1
2 2
3 2
4 2
5 3
6 3
Name: values, dtype: int64
Now, use g as the grouper for a groupby with cumsum:
df.groupby(g).cumsum()
Output:
values
0 1.00
1 1.50
2 -2.00
3 -3.00
4 -4.25
5 0.50
6 1.00
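For example, to attach the result as a new column in one go (the same operation, just assigned back):
import numpy as np

df['newfield'] = df.groupby(np.sign(df['values']).diff().ne(0).cumsum())['values'].cumsum()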

Computing 7-day retention numbers in pandas

I have a dataframe with two columns--date and id. I'd like to calculate for each date the number of id's on that date which reappear on a later date within 7 days. If I were doing this in postgres, it would look something like:
SELECT df1.date, COUNT(DISTINCT df1.id)
FROM df df1 INNER JOIN df df2
ON df1.id = df2.id AND
df2.date BETWEEN df1.date + 1 AND df1.date + 7
GROUP BY df1.date;
What is problematic for me is how to translate this statement into pandas in a way that is fast and idiomatic.
I've already tried for one-day retention by simply creating a lagged column and merging the original with the lagged dataframe. This certainly works. However, for seven-day retention I would need to create 7 dataframes and merge them together. That's not reasonable, as far as I'm concerned. (Especially because I'd also like to know 30-day numbers.)
(I should also point out that my research led me to https://github.com/pydata/pandas/issues/2996, which indicates a merge behavior that does not work on my install (pandas 0.14.0); it fails with the error message TypeError: Argument 'values' has incorrect type (expected numpy.ndarray, got Series). So there appears to be some sort of advanced merge/join behavior which I clearly don't know how to activate.)
If I understand you correctly, I think you can do it with a groupby/apply. It's a bit tricky. So I think you have data like the following:
>>> df
date id y
0 2012-01-01 1 0.1
1 2012-01-03 1 0.3
2 2012-01-09 1 0.4
3 2012-01-12 1 0.0
4 2012-01-14 1 0.2
5 2012-01-16 1 0.4
6 2012-01-01 2 0.2
7 2012-01-02 2 0.1
8 2012-01-03 2 0.4
9 2012-01-04 2 0.6
10 2012-01-09 2 0.7
11 2012-01-10 2 0.4
I'm going to create a forward-looking rolling count, within each 'id' group, of the number of times that id shows up in the next 7 days including the current day:
def count_forward7(g):
    # Add a column to the dataframe so I can set date as the index
    g['foo'] = 1
    # New dataframe with daily frequency, so 7 rows = 7 days
    # If there are no gaps in the dates you don't need to do this
    x = g.set_index('date').resample('D')
    # Do Andy Hayden's method for a forward-looking rolling window:
    # reverse the series and then reverse the answer back
    fsum = pd.rolling_sum(x[::-1], window=7, min_periods=0)[::-1]
    return pd.DataFrame(fsum[fsum.index.isin(g.date)].values, index=g.index)
>>> df['f7'] = df.groupby('id')[['date']].apply(count_forward7)
>>> df
date id y f7
0 2012-01-01 1 0.1 2
1 2012-01-03 1 0.3 2
2 2012-01-09 1 0.4 3
3 2012-01-12 1 0.0 3
4 2012-01-14 1 0.2 2
5 2012-01-16 1 0.4 1
6 2012-01-01 2 0.2 4
7 2012-01-02 2 0.1 3
8 2012-01-03 2 0.4 3
9 2012-01-04 2 0.6 3
10 2012-01-09 2 0.7 2
11 2012-01-10 2 0.4 1
Now, if you want to "calculate for each date the number of id's on that date which reappear on a later date within 7 days", just count for each date where f7 > 1:
>>> df['bool_f77'] = df['f7'] > 1
>>> df.groupby('date')['bool_f77'].sum()
2012-01-01 2
2012-01-02 1
2012-01-03 2
2012-01-04 1
2012-01-09 2
2012-01-10 0
2012-01-12 1
2012-01-14 1
2012-01-16 0
Or something like the following:
>>> df.query('f7 > 1').groupby('date')['date'].count()
date
2012-01-01 2
2012-01-02 1
2012-01-03 2
2012-01-04 1
2012-01-09 2
2012-01-12 1
2012-01-14 1
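On recent pandas versions (where pd.rolling_sum no longer exists), a direct, if memory-hungry, translation of the SQL self-join could look roughly like this (a sketch, assuming df['date'] is already datetime64):
import pandas as pd

# self-join on id, keep pairs where the second visit falls 1-7 days after the first,
# then count distinct ids per starting date -- mirrors the SQL in the question
m = df.merge(df, on='id', suffixes=('', '_later'))
within_7 = (m['date_later'] > m['date']) & (m['date_later'] <= m['date'] + pd.Timedelta(days=7))
print(m.loc[within_7].groupby('date')['id'].nunique())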
