Taking first and last value in a rolling window - python

Initial problem statement
Using pandas, I would like to apply a function that is available for resample() but not for rolling().
This works:
df1 = df.resample(to_freq,
                  closed='left',
                  kind='period',
                  ).agg(OrderedDict([('Open', 'first'),
                                     ('Close', 'last'),
                                     ]))
This doesn't:
df2 = df.rolling(my_indexer).agg(
    OrderedDict([('Open', 'first'),
                 ('Close', 'last')]))
>>> AttributeError: 'first' is not a valid function for 'Rolling' object

df3 = df.rolling(my_indexer).agg(
    OrderedDict([('Close', 'last')]))
>>> AttributeError: 'last' is not a valid function for 'Rolling' object
What would you advise to keep the first and last values of a rolling window in two different columns?
EDIT 1 - with usable input data
import pandas as pd
from random import seed
from random import randint
from collections import OrderedDict
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0,10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
# First & last work with resample
resampled_first = df.resample('3H',
                              closed='left',
                              kind='period',
                              ).agg(OrderedDict([('Values', 'first')]))
resampled_last = df.resample('3H',
                             closed='left',
                             kind='period',
                             ).agg(OrderedDict([('Values', 'last')]))
# They don't with rolling
rolling_first = df.rolling(3).agg(OrderedDict([('Values', 'first')]))
rolling_last = df.rolling(3).agg(OrderedDict([('Values', 'last')]))
Thanks for your help!
Bests,

You can use your own function to get the first or last element in a rolling window:
rolling_first = df.rolling(3).agg(lambda rows: rows.iloc[0])
rolling_last = df.rolling(3).agg(lambda rows: rows.iloc[-1])
Example
import pandas as pd
from random import seed, randint
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
df['first'] = df['Values'].rolling(3).agg(lambda rows: rows.iloc[0])
df['last'] = df['Values'].rolling(3).agg(lambda rows: rows.iloc[-1])
print(df)
Result
Values first last
2020-01-01 00:00:00+00:00 2 NaN NaN
2020-01-01 01:00:00+00:00 9 NaN NaN
2020-01-01 02:00:00+00:00 1 2.0 1.0
2020-01-01 03:00:00+00:00 4 9.0 4.0
2020-01-01 04:00:00+00:00 1 1.0 1.0
2020-01-01 05:00:00+00:00 7 4.0 7.0
2020-01-01 06:00:00+00:00 7 1.0 7.0
2020-01-01 07:00:00+00:00 7 7.0 7.0
2020-01-01 08:00:00+00:00 10 7.0 10.0
2020-01-01 09:00:00+00:00 6 7.0 6.0
2020-01-01 10:00:00+00:00 3 10.0 3.0
2020-01-01 11:00:00+00:00 1 6.0 1.0
2020-01-01 12:00:00+00:00 7 3.0 7.0
2020-01-01 13:00:00+00:00 0 1.0 0.0
2020-01-01 14:00:00+00:00 6 7.0 6.0
2020-01-01 15:00:00+00:00 6 0.0 6.0
2020-01-01 16:00:00+00:00 9 6.0 9.0
2020-01-01 17:00:00+00:00 0 6.0 0.0
2020-01-01 18:00:00+00:00 7 9.0 7.0
2020-01-01 19:00:00+00:00 4 0.0 4.0
2020-01-01 20:00:00+00:00 3 7.0 3.0
2020-01-01 21:00:00+00:00 9 4.0 9.0
2020-01-01 22:00:00+00:00 1 3.0 1.0
2020-01-01 23:00:00+00:00 5 9.0 5.0
2020-01-02 00:00:00+00:00 0 1.0 0.0
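A side note on speed (an aside, not part of the original answer): rolling().apply() with raw=True hands the callback a plain numpy array instead of a Series, which skips building a Series per window and lets plain positional indexing work:

# Same first/last extraction, but each window arrives as a raw numpy array.
df['first'] = df['Values'].rolling(3).apply(lambda a: a[0], raw=True)
df['last'] = df['Values'].rolling(3).apply(lambda a: a[-1], raw=True)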
EDIT:
When using a dictionary, you have to pass the lambda itself, not a string:
result = df['Values'].rolling(3).agg({'first': lambda rows: rows.iloc[0], 'last': lambda rows: rows.iloc[-1]})
print(result)
The same goes for your own functions: pass the function object itself, not a string with its name:
def first(rows):
    return rows.iloc[0]

def last(rows):
    return rows.iloc[-1]

result = df['Values'].rolling(3).agg({'first': first, 'last': last})
print(result)
Example
import pandas as pd
from random import seed, randint
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
result = df['Values'].rolling(3).agg({'first': lambda rows: rows.iloc[0], 'last': lambda rows: rows.iloc[-1]})
print(result)
def first(rows):
    return rows.iloc[0]

def last(rows):
    return rows.iloc[-1]

result = df['Values'].rolling(3).agg({'first': first, 'last': last})
print(result)

In case anyone else needs the difference between the first and last value in a rolling window: I used this on stock market data, where I wanted the price difference from the beginning to the end of the window. I created a new column that takes the current row's 'close' value minus the 'open' value from 60 rows above, fetched with .shift(60).
df[windowColumn] = df["close"] - (df["open"].shift(60))
I think it's a very quick method for large datasets.
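A self-contained sketch of the idea (the 'open'/'close' names and the 60-row offset come from the answer; a 3-row offset and made-up prices are used here to keep it small, and it assumes one row per period with no gaps):

import pandas as pd

n = 3  # rows per window; the answer used 60
df = pd.DataFrame({'open': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   'close': [1.5, 2.5, 3.5, 4.5, 5.5, 6.5]})
# Difference from the window's first open to the current close.
df['window_change'] = df['close'] - df['open'].shift(n)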

Related

how to use pandas fillna NaN with the negative of the next row value

I have a table of daily (time series) rainfall for several cities. How can I use pandas fillna to replace NaN with the negative of the next day's rainfall for the same city? Thank you.
import pandas as pd
import numpy as np

# 'Date' is not defined in the original question; assume a list of five dates.
Date = list(pd.date_range('2020-01-01', periods=5))
rain_before = pd.DataFrame({'date': Date*2, 'city': list('aaaaabbbbb'),
                            'rain': [6, np.nan, 1, np.nan, np.nan, 4, np.nan, np.nan, 8, np.nan]})
# After fillna, the table should look like this:
rain_after_fillna = pd.DataFrame({'date': Date*2, 'city': list('aaaaabbbbb'),
                                  'rain': [6, -1, 1, np.nan, np.nan, 4, np.nan, -8, 8, np.nan]})
You can use shift and fillna:
rain_before['rain'].fillna(rain_before.groupby('city')['rain']
                                      .transform(lambda x: -x.shift(-1)))
0 6.0
1 -1.0
2 1.0
3 NaN
4 NaN
5 4.0
6 NaN
7 -8.0
8 8.0
9 NaN
Name: rain, dtype: float64
Using a series built with shift(-1)*-1. There was no sample dataset, so I've synthesized one and not included city. The same approach can be used per city; the sort order needs to be considered.
import datetime as dt
import random
import numpy as np
import pandas as pd

df = pd.DataFrame({"Date": pd.date_range(dt.date(2021, 1, 1), dt.date(2021, 1, 10)),
                   "rainfall": [i*random.randint(0, 1) for i in range(10)]}).replace({0: np.nan})
df["rainfall_nan"] = df["rainfall"].fillna(df["rainfall"].shift(-1)*-1)
output
Date rainfall rainfall_nan
2021-01-01 NaN -1.0
2021-01-02 1.0 1.0
2021-01-03 2.0 2.0
2021-01-04 3.0 3.0
2021-01-05 NaN -5.0
2021-01-06 5.0 5.0
2021-01-07 6.0 6.0
2021-01-08 7.0 7.0
2021-01-09 NaN -9.0
2021-01-10 9.0 9.0

Calculate delta between two columns and two following rows for different groups

Are there any vector operations for improving runtime?
I found no other way besides for loops.
Sample DataFrame:
df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan', '06-Jan', '10-Jan'],
                   'start_value': [12, 15, 1, 3, 2, 6],
                   'end_value': [20, 17, 6, 19, 13.5, 9]})
ID start_date start_value end_value
0 1 01-Jan 12 20.0
1 1 05-Jan 15 17.0
2 1 08-Jan 1 6.0
3 2 05-Jan 3 19.0
4 2 06-Jan 2 13.5
5 2 10-Jan 6 9.0
I've tried:
import pandas as pd

df_original  # contains the data
data_frame_diff = pd.DataFrame()
for ID in df_original['ID'].unique():
    tmp_frame = df_original.loc[df_original['ID'] == ID]
    tmp_start_value = 0
    for label, row in tmp_frame.iterrows():
        last_delta = tmp_start_value - row['start_value']
        tmp_start_value = row['end_value']
        row['last_delta'] = last_delta
        data_frame_diff = data_frame_diff.append(row, True)
Expected Result:
df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan', '06-Jan', '10-Jan'],
                   'last_delta': [0, 5, 16, 0, 17, 7.5]})
ID start_date last_delta
0 1 01-Jan 0.0
1 1 05-Jan 5.0
2 1 08-Jan 16.0
3 2 05-Jan 0.0
4 2 06-Jan 17.0
5 2 10-Jan 7.5
I want to calculate, for each user ID, the delta between the end_value of one timestamp and the start_value of the following timestamp.
Is there a way to improve runtime of this code?
Use DataFrame.groupby on ID and shift the column end_value, then use Series.sub to subtract it from start_value; finally, use Series.fillna and assign this new column s to the dataframe using DataFrame.assign:
s = df.groupby('ID')['end_value'].shift().sub(df['start_value']).fillna(0)
df1 = df[['ID', 'start_date']].assign(last_delta=s)
Result:
print(df1)
ID start_date last_delta
0 1 01-Jan 0.0
1 1 05-Jan 5.0
2 1 08-Jan 16.0
3 2 05-Jan 0.0
4 2 06-Jan 17.0
5 2 10-Jan 7.5
It's a bit difficult to follow from your description what you need, but you might find this helpful:
import pandas as pd
df = (pd.DataFrame({'t1': pd.date_range(start="2020-01-01", end="2020-01-02", freq="H")})
      .reset_index()
      .rename(columns={'index': 'ID'})
      )
df['t2'] = df['t1']+pd.Timedelta(value=10, unit="H")
df['delta_t1_t2'] = df['t2']-df['t1']
df['delta_to_previous_t1'] = df['t1'] - df['t1'].shift()
print(df)
It results in
ID t1 t2 delta_t1_t2 delta_to_previous_t1
0 0 2020-01-01 00:00:00 2020-01-01 10:00:00 10:00:00 NaT
1 1 2020-01-01 01:00:00 2020-01-01 11:00:00 10:00:00 01:00:00
2 2 2020-01-01 02:00:00 2020-01-01 12:00:00 10:00:00 01:00:00
3 3 2020-01-01 03:00:00 2020-01-01 13:00:00 10:00:00 01:00:00

Duplicating n rows in a DataFrame using 2nd level index?

I have a pandas DataFrame that, for instance, looks like this.
df
Values
Timestamp
2020-02-01 A
2020-02-02 B
2020-02-03 C
I would like (to ease processing afterward) to keep a window of n rows, duplicate it for each timestamp, and create a 2nd-level index with a local integer index.
With n=2, this would give:
df_new
Values
Timestamp 2nd_level_index
2020-02-01 0 NaN
1 A
2020-02-02 0 A
1 B
2020-02-03 0 B
1 C
Is there any kind of pandas built-in function that would help me do that?
A rolling window with fixed size (n) seems to be the start, but then how do I duplicate the window and store it for each row using a 2nd level index?
Thanks in advance for any help!
Bests,
EDIT 04/05
Taking the proposed code and changing the output format a bit, I adapted it for a 2-column DataFrame. I ended up with the following code.
import pandas as pd
import numpy as np
from random import seed, randint

def transpose_n_rows(df: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    array = np.concatenate((np.full((len(df.columns), n_rows-1), np.nan),
                            df.transpose()), axis=1)
    shape = array.shape[:-1] + (array.shape[-1] - n_rows + 1, n_rows)
    strides = array.strides + (array.strides[-1],)
    array = np.lib.stride_tricks.as_strided(array, shape=shape, strides=strides)
    midx = pd.MultiIndex.from_product([df.columns, range(n_rows)],
                                      names=['Data', 'Position'])
    transposed = pd.DataFrame(np.concatenate(array, axis=1), index=df.index, columns=midx)
    return transposed

n = 4
start = '2020-01-01 00:00+00:00'
end = '2020-01-01 12:00+00:00'
pr2h = pd.period_range(start=start, end=end, freq='2h')
seed(1)
values1 = [randint(0, 10) for ts in pr2h]
values2 = [randint(20, 30) for ts in pr2h]
df2h = pd.DataFrame({'Values1': values1, 'Values2': values2}, index=pr2h)
df2h_new = transpose_n_rows(df2h, n)
Which gives:
In [29]:df2h
Out[29]:
Values1 Values2
2020-01-01 00:00 2 27
2020-01-01 02:00 9 30
2020-01-01 04:00 1 26
2020-01-01 06:00 4 23
2020-01-01 08:00 1 21
2020-01-01 10:00 7 27
2020-01-01 12:00 7 20
In [30]:df2h_new
Out[30]:
Data Values1 Values2
Position 0 1 2 3 0 1 2 3
2020-01-01 00:00 NaN NaN NaN 2.0 NaN NaN NaN 27.0
2020-01-01 02:00 NaN NaN 2.0 9.0 NaN NaN 27.0 30.0
2020-01-01 04:00 NaN 2.0 9.0 1.0 NaN 27.0 30.0 26.0
2020-01-01 06:00 2.0 9.0 1.0 4.0 27.0 30.0 26.0 23.0
2020-01-01 08:00 9.0 1.0 4.0 1.0 30.0 26.0 23.0 21.0
2020-01-01 10:00 1.0 4.0 1.0 7.0 26.0 23.0 21.0 27.0
2020-01-01 12:00 4.0 1.0 7.0 7.0 23.0 21.0 27.0 20.0
However, I am calling this transpose_n_rows function in a for loop over a significant number of DataFrames, which makes me a bit worried about performance.
I have read that one should avoid multiple calls to np.concatenate or pd.concat, and here I have two of them for a use that can maybe be bypassed.
Is there any advice on how to get rid of them, if that is possible?
I thank you in advance for any help! Bests,
I think there is no built-in method in pandas for this.
A possible solution with strides to generate a rolling 2d array:
n = 2
# add Nones for the first values of the 2d array
x = np.concatenate([[None] * (n-1), df['Values']])

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = rolling_window(x, n)
print(a)
[[None 'A']
['A' 'B']
['B' 'C']]
Then create a MultiIndex with MultiIndex.from_product and flatten the array's values with numpy.ravel:
mux = pd.MultiIndex.from_product([df.index, range(n)], names=('times','level1'))
df = pd.DataFrame({'Values': np.ravel(a)}, index=mux)
print (df)
Values
times level1
2020-02-01 0 None
1 A
2020-02-02 0 A
1 B
2020-02-03 0 B
1 C
If the values are numbers, use NaN for the missing values:
print (df)
Values
Timestamp
2020-02-01 1
2020-02-02 2
2020-02-03 3
n = 2
x = np.concatenate([[np.nan] * (n-1), df['Values']])

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = rolling_window(x, n)
print(a)
[[nan 1.]
[ 1. 2.]
[ 2. 3.]]
mux = pd.MultiIndex.from_product([df.index, range(n)], names=('times','level1'))
df = pd.DataFrame({'Values': np.ravel(a)}, index=mux)
print (df)
Values
times level1
2020-02-01 0 NaN
1 1.0
2020-02-02 0 1.0
1 2.0
2020-02-03 0 2.0
1 3.0
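An aside on the performance concern in the edit above: on NumPy ≥ 1.20, np.lib.stride_tricks.sliding_window_view builds the same windows without hand-rolled stride arithmetic, and only the front padding still needs one concatenation. A sketch under those assumptions (the function name is mine, and it assumes numeric columns):

import numpy as np
import pandas as pd

def transpose_n_rows_swv(df: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    # Pad n_rows-1 rows of NaN on top so the early windows are left-filled.
    padded = np.vstack([np.full((n_rows - 1, df.shape[1]), np.nan),
                        df.to_numpy(dtype=float)])
    # One strided view of shape (len(df), n_cols, n_rows); no per-column concatenate.
    windows = np.lib.stride_tricks.sliding_window_view(padded, n_rows, axis=0)
    midx = pd.MultiIndex.from_product([df.columns, range(n_rows)],
                                      names=['Data', 'Position'])
    return pd.DataFrame(windows.reshape(len(df), -1), index=df.index, columns=midx)

Called as transpose_n_rows_swv(df2h, 4), this should produce the same frame as transpose_n_rows(df2h, 4) above.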

Date subtraction not working with date range function

I am trying to subtract two datetimes: when there are valid values for both T1 and T2, obtain the difference. The difference is calculated by counting only the weekdays between the dates, not counting Saturday and Sunday.
The code works for only some rows. How can this be fixed?
T1 T2 Diff
0 2017-12-04 05:48:15 2018-01-05 12:15:22 NaN
1 2017-07-10 08:23:11 2018-01-05 15:28:22 NaN
2 2017-09-11 05:10:37 2018-01-29 15:02:07 NaN
3 2017-12-21 04:51:12 2018-01-29 16:06:43 NaN
4 2017-10-13 10:11:00 2018-02-22 16:19:04 NaN
5 2017-09-28 21:44:31 2018-01-29 12:42:02 NaN
6 2018-01-23 20:00:58 2018-01-29 14:40:33 NaN
7 2017-11-28 15:39:38 2018-01-31 11:57:04 NaN
8 2017-12-21 12:44:00 2018-01-31 13:12:37 30.0
9 2017-11-09 05:52:29 2018-01-22 11:42:01 53.0
10 2018-02-12 04:21:08 NaT NaN
df[['T1','T2','diff']].dtypes
T1 datetime64[ns]
T2 datetime64[ns]
diff float64
df['T1'] = pd.to_datetime(df['T1'])
df['T2'] = pd.to_datetime(df['T2'])
def fun(row):
    if row.isnull().any():
        return np.nan
    ts = pd.DataFrame(pd.date_range(row["T1"], row["T2"]), columns=["date"])
    ts["dow"] = ts["date"].dt.weekday
    return (ts["dow"] < 5).sum()

df["diff"] = df.apply(lambda x: fun(x), axis=1)
Instead of trying to check for a null value in the row, use a try/except to capture the error when the calculation runs into a null value.
This worked for me in, I think, the manner you want.
import pandas as pd
import numpy as np

df = pd.read_csv("/home/rightmire/Downloads/test.csv", sep=",")
# df = df[["m1","m2"]]
print(df)
# print(df[['m1','m2']].dtypes)
df['m1'] = pd.to_datetime(df['m1'])
df['m2'] = pd.to_datetime(df['m2'])
print(df[['m1','m2']].dtypes)

# for index, row in df.iterrows():
def fun(row):
    try:
        ts = pd.DataFrame(pd.date_range(row["m1"], row["m2"]), columns=["date"])
        # print(ts)
        ts["dow"] = ts["date"].dt.weekday
        result = (ts["dow"] < 5).sum()
        # print("Result = ", result)
        return result
    except Exception as e:
        # print("ERROR:{}".format(str(e)))
        result = np.nan
        # print("Result = ", result)
        return result

df["diff"] = df.apply(lambda x: fun(x), axis=1)
print(df["diff"])
OUTPUT OF INTEREST:
dtype: object
0 275.0
1 147.0
2 58.0
3 28.0
4 95.0
5 87.0
6 4.0
7 46.0
8 30.0
9 96.0
10 NaN
11 27.0
12 170.0
13 158.0
14 79.0
Name: diff, dtype: float64
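As an alternative to a row-wise apply, numpy's busday_count (which counts Mon-Fri by default) can do this vectorized over the whole frame. A sketch reusing the question's T1/T2 column names, with NaT rows masked out since busday_count cannot take them:

import numpy as np
import pandas as pd

# Keep only rows where both ends are present.
valid = df['T1'].notna() & df['T2'].notna()
start = df.loc[valid, 'T1'].values.astype('datetime64[D]')
# busday_count uses a half-open [start, end) range, so push the end one day
# out to match the inclusive date_range count used above.
end = (df.loc[valid, 'T2'] + pd.Timedelta(days=1)).values.astype('datetime64[D]')
df.loc[valid, 'diff'] = np.busday_count(start, end)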

How to calculate inverse cumsum in pandas

I am trying to find a way to calculate an inverse cumsum for pandas, meaning applying cumsum from bottom to top. The problem I'm facing: I'm trying to find the number of workable days in each month for Spain, both from top to bottom (1st workable day = 1, 2nd = 2, 3rd = 3, etc.) and from bottom to top (last workable day = 1, day before last = 2, etc.).
So far I've managed to get the top-to-bottom order to work, but not the inverse order; I've searched a lot and couldn't find a way to perform an inverse cumulative sum:
import pandas as pd
from datetime import date
from workalendar.europe import Spain
import numpy as np
cal = Spain()
#print(cal.holidays(2019))
rng = pd.date_range('2019-01-01', periods=365, freq='D')
df = pd.DataFrame({ 'Date': rng})
df['flag_workable'] = df['Date'].apply(lambda x: cal.is_working_day(x))
df_workable = df[df['flag_workable'] == True]
df_workable['month'] = df_workable['Date'].dt.month
df_workable['workable_day'] = df_workable.groupby('month')['flag_workable'].cumsum()
print(df)
print(df_workable.head(30))
Output for January:
Date flag_workable month workable_day
1 2019-01-02 True 1 1.0
2 2019-01-03 True 1 2.0
3 2019-01-04 True 1 3.0
6 2019-01-07 True 1 4.0
7 2019-01-08 True 1 5.0
Example for last days of January:
Date flag_workable month workable_day
24 2019-01-25 True 1 18.0
27 2019-01-28 True 1 19.0
28 2019-01-29 True 1 20.0
29 2019-01-30 True 1 21.0
30 2019-01-31 True 1 22.0
This would be the expected output after applying the inverse cumulative sum:
Date flag_workable month workable_day inv_workable_day
1 2019-01-02 True 1 1.0 22.0
2 2019-01-03 True 1 2.0 21.0
3 2019-01-04 True 1 3.0 20.0
6 2019-01-07 True 1 4.0 19.0
7 2019-01-08 True 1 5.0 18.0
Last days of January:
Date flag_workable month workable_day inv_workable_day
24 2019-01-25 True 1 18.0 5.0
27 2019-01-28 True 1 19.0 4.0
28 2019-01-29 True 1 20.0 3.0
29 2019-01-30 True 1 21.0 2.0
30 2019-01-31 True 1 22.0 1.0
Invert the row order of the DataFrame prior to grouping so that the cumsum is calculated in reverse order within each month.
df['inv_workable_day'] = df[::-1].groupby('month')['flag_workable'].cumsum()
df['workable_day'] = df.groupby('month')['flag_workable'].cumsum()
# Date flag_workable month inv_workable_day workable_day
#1 2019-01-02 True 1 5.0 1.0
#2 2019-01-03 True 1 4.0 2.0
#3 2019-01-04 True 1 3.0 3.0
#6 2019-01-07 True 1 2.0 4.0
#7 2019-01-08 True 1 1.0 5.0
#8 2019-02-01 True 2 1.0 1.0
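An aside: since every row of df_workable has flag_workable == True, the same numbering can also be derived from positions alone with GroupBy.cumcount, avoiding the boolean cumsum entirely. A sketch under that assumption:

df_workable['workable_day'] = df_workable.groupby('month').cumcount() + 1
df_workable['inv_workable_day'] = df_workable.groupby('month').cumcount(ascending=False) + 1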
Solution
Whichever column you want to apply cumsum to, you have two options:
1. Sort a copy of that column in descending index order, apply cumsum, then sort back to ascending index order and assign the result to the DataFrame column.
2. Use numpy:
import numpy as np

array = df.column_data.to_numpy()
array = np.flip(array)   # flip to reverse order
array = np.cumsum(array)
array = np.flip(array)   # flip back to the original order
df['column_data_cumsum'] = array
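A quick check of the flip/cumsum/flip pattern on a small, made-up column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4]})
# Bottom-to-top cumulative sum: flip, accumulate, flip back.
df['x_revcumsum'] = np.flip(np.cumsum(np.flip(df['x'].to_numpy())))
print(df['x_revcumsum'].tolist())  # [10, 9, 7, 4]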
