Pandas - Rolling slope calculation - python

How can I calculate the slope of each column's rolling(window=60) values, stepped by 5?
I'd like a value for every 5 minutes, and I don't need a result for every record.
Here's sample dataframe and results:
df
Time A ... N
2016-01-01 00:00 1.2 ... 4.2
2016-01-01 00:01 1.2 ... 4.0
2016-01-01 00:02 1.2 ... 4.5
2016-01-01 00:03 1.5 ... 4.2
2016-01-01 00:04 1.1 ... 4.6
2016-01-01 00:05 1.6 ... 4.1
2016-01-01 00:06 1.7 ... 4.3
2016-01-01 00:07 1.8 ... 4.5
2016-01-01 00:08 1.1 ... 4.1
2016-01-01 00:09 1.5 ... 4.1
2016-01-01 00:10 1.6 ... 4.1
....
result
Time A ... N
2016-01-01 00:04 xxx ... xxx
2016-01-01 00:09 xxx ... xxx
2016-01-01 00:14 xxx ... xxx
...
Can the df.rolling function be applied to this problem?
It's fine if NaN is in the window, meaning the subset may contain fewer than 60 values.

It seems that what you want is rolling with a specific step size.
However, according to the pandas documentation, a step size is currently not supported in rolling.
If the data size is not too large, just perform the rolling on all the data and select the results you need by indexing.
Here's a sample dataset. For simplicity, the time column is represented using integers.
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(500, 1) * 10, columns=['a'])
a
0 8.714074
1 0.985467
2 9.101299
3 4.598044
4 4.193559
.. ...
495 9.736984
496 2.447377
497 5.209420
498 2.698441
499 3.438271
Then, roll and calculate slopes,
def calc_slope(x):
    slope = np.polyfit(range(len(x)), x, 1)[0]
    return slope

# set min_periods=2 to allow subsets with fewer than 60 values.
# use [4::5] to select only the results you need.
result = data.rolling(60, min_periods=2).apply(calc_slope)[4::5]
The result will be,
a
4 -0.542845
9 0.084953
14 0.155297
19 -0.048813
24 -0.011947
.. ...
479 -0.004792
484 -0.003714
489 0.022448
494 0.037301
499 0.027189
Or, you can refer to this post. The first answer provides a numpy way to achieve this:
step size in pandas.DataFrame.rolling
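For illustration, here is a minimal numpy sketch of that strided approach (an assumption on my part: it requires numpy >= 1.20 for sliding_window_view, and unlike the min_periods=2 version above it only produces full 60-element windows):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

values = np.random.rand(500) * 10
# each row is one 60-element window; [::5] keeps every 5th window
windows = sliding_window_view(values, 60)[::5]
x = np.arange(60)
# polyfit accepts a 2D y whose columns are independent datasets,
# so all slopes come out of a single call
slopes = np.polyfit(x, windows.T, 1)[0]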

Try this:
windows = df.groupby("Time")["A"].rolling(60)
df["out"] = windows.apply(lambda x: np.polyfit(range(60), x, 1)[0], raw=True).values

You could use pandas resample. Note that to use this, you need an index with time values.
df.index = pd.to_datetime(df.Time)
print(df)
result = df.resample('5Min').bfill()
print(result)
Time A N
Time
2016-01-01 00:00:00 2016-01-01 00:00 1.2 4.2
2016-01-01 00:01:00 2016-01-01 00:01 1.2 4.0
2016-01-01 00:02:00 2016-01-01 00:02 1.2 4.5
2016-01-01 00:03:00 2016-01-01 00:03 1.5 4.2
2016-01-01 00:04:00 2016-01-01 00:04 1.1 4.6
2016-01-01 00:05:00 2016-01-01 00:05 1.6 4.1
2016-01-01 00:06:00 2016-01-01 00:06 1.7 4.3
2016-01-01 00:07:00 2016-01-01 00:07 1.8 4.5
2016-01-01 00:08:00 2016-01-01 00:08 1.1 4.1
2016-01-01 00:09:00 2016-01-01 00:09 1.5 4.1
2016-01-01 00:10:00 2016-01-01 00:10 1.6 4.1
2016-01-01 00:15:00 2016-01-01 00:15 1.6 4.1
Output:
Time A N
Time
2016-01-01 00:00:00 2016-01-01 00:00 1.2 4.2
2016-01-01 00:05:00 2016-01-01 00:05 1.6 4.1
2016-01-01 00:10:00 2016-01-01 00:10 1.6 4.1
2016-01-01 00:15:00 2016-01-01 00:15 1.6 4.1
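If you want slopes rather than the raw values at each 5-minute mark, a hedged way to combine this with the rolling-slope idea from the first answer (reusing its calc_slope function, and assuming the DatetimeIndex set above) would be:
# time-based 60-minute windows, then keep the last slope in each 5-minute bin
slopes = df[['A', 'N']].rolling('60T', min_periods=2).apply(calc_slope, raw=True)
every_5min = slopes.resample('5T').last()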

I use:
df['slope_I'] = df['I'].rolling('600s').apply(lambda x: (x[-1]-x[0])/600)
where the slope has units of 1/seconds.
The first 600s of the result will probably be empty; you should fill them with zeros, or with the mean.
Each number in the slope column is the slope of the line that goes from the first row inside the window to the last, and so on as the window rolls.
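One caveat worth hedging: in recent pandas versions, integer-position indexing like x[-1] on the window Series is deprecated, so it is safer to pass raw=True, which hands the window over as a plain numpy array where x[-1] is unambiguous:
df['slope_I'] = df['I'].rolling('600s').apply(
    lambda x: (x[-1] - x[0]) / 600,  # first-to-last slope over the window
    raw=True)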
Best regards.

For other answer seekers, here is another solution, one where the time intervals do not need to be of equal length.
df.A.diff(60)/df.Time.diff(60).dt.total_seconds()
This line of code takes the difference of the current row with the row sixty rows back, and divides it by the difference in time between the same rows.
When you only want every fifth record, the next line should work (note the parentheses, so that the slicing applies to the quotient rather than to the denominator).
(df.A.diff(60)/df.Time.diff(60).dt.total_seconds())[4::5]
Note: every row is calculated and only the 5-stepped series is returned.
doc pandas diff: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html

Related

How to apply a condition to Pandas dataframe rows, but only apply the condition to rows of the same day?

I have a dataframe that's indexed by datetime and has one column of integers, and another column where I want to put a string if a condition on the integers is met. The condition needs to compare the integer in row X against the integer in row X-1, but only if both rows are on the same day.
I am currently using the condition:
df.loc[(df['IntCol'] > df['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
This successfully applies my condition; however, if the shifted row is on a different day, the condition will still use it, and I want it to ignore any rows that are on a different day. I've tried various iterations of groupby(df.index.date) but can't seem to figure out whether that will work or not.
Not sure if this is the best way to do it, but it gets you the answer:
df['out'] = np.where(df['IntCol'] > df.groupby(df.index.date)['IntCol'].shift(1), 'Success', 'Failure')
(Grouping by df.index.date rather than df.index is what restricts the shift to rows of the same day.)
I think this is what you want. You were probably closer to the answer than you thought...
There are two dataframes here, used to show that your logic works whether the data is random or a sorted range of integers.
You will need to import random to reproduce the data.
import random
import pandas as pd

dates = list(pd.date_range(start='2021/1/1', periods=16, freq='4H'))

def compare(x):
    x.loc[(x['IntCol'] > x['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
    return x

#### Will show Success in all rows except where the date changes, because it's a range in numerical order
df = pd.DataFrame({'IntCol': range(10, 26)}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 10 NaN
2021-01-01 04:00:00 11 Success
2021-01-01 08:00:00 12 Success
2021-01-01 12:00:00 13 Success
2021-01-01 16:00:00 14 Success
2021-01-01 20:00:00 15 Success
2021-01-02 00:00:00 16 NaN
2021-01-02 04:00:00 17 Success
2021-01-02 08:00:00 18 Success
2021-01-02 12:00:00 19 Success
2021-01-02 16:00:00 20 Success
2021-01-02 20:00:00 21 Success
2021-01-03 00:00:00 22 NaN
2021-01-03 04:00:00 23 Success
2021-01-03 08:00:00 24 Success
2021-01-03 12:00:00 25 Success
### random numbers to show that it works here too
df = pd.DataFrame({'IntCol': [random.randint(3, 500) for x in range(0,16)]}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 386 NaN
2021-01-01 04:00:00 276 NaN
2021-01-01 08:00:00 143 NaN
2021-01-01 12:00:00 144 Success
2021-01-01 16:00:00 10 NaN
2021-01-01 20:00:00 343 Success
2021-01-02 00:00:00 424 NaN
2021-01-02 04:00:00 362 NaN
2021-01-02 08:00:00 269 NaN
2021-01-02 12:00:00 35 NaN
2021-01-02 16:00:00 278 Success
2021-01-02 20:00:00 268 NaN
2021-01-03 00:00:00 58 NaN
2021-01-03 04:00:00 169 Success
2021-01-03 08:00:00 85 NaN
2021-01-03 12:00:00 491 Success
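As a hedged aside, the same per-day comparison can be done without apply by shifting within the groups directly, which is usually faster on large frames:
# shift IntCol within each calendar day; the first row of each day
# compares against NaN and is therefore left unmarked
prev = df.groupby(df.index.date)['IntCol'].shift(1)
df.loc[df['IntCol'] > prev, 'StringCol'] = 'Success'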

Transform random time intervals to a structured 30-minute interval

I have this DataFrame of the time periods in which some tasks happened:
Date Start Time End Time
0 2016-01-01 0:00:00 2016-01-01 0:10:00 2016-01-01 0:25:00
1 2016-01-01 0:00:00 2016-01-01 1:17:00 2016-01-01 1:31:00
2 2016-01-02 0:00:00 2016-01-02 0:30:00 2016-01-02 0:32:00
... ... ... ...
I want to convert this df to 30-minute intervals.
Expected outcome:
Date Hours
1 2016-01-01 0:30:00 0:15
2 2016-01-01 1:00:00 0:00
3 2016-01-01 1:30:00 0:13
4 2016-01-01 2:00:00 0:01
5 2016-01-01 2:30:00 0:00
6 2016-01-01 3:00:00 0:00
... ...
47 2016-01-01 23:30:00 0:00
48 2016-01-02 23:59:59 0:00
49 2016-01-02 00:30:00 0:00
50 2016-01-02 01:00:00 0:02
... ...
I was trying to do this with a for loop, which was getting tedious. Is there a simple way to do it in pandas?
IIUC you can discard the Date column, get the time difference between start and end, group by 30 minutes with pd.Grouper, and agg on first (assuming you always have only one entry per 30-minute slot):
print (df.assign(Diff=df["End Time"]-df["Start Time"])
         .groupby(pd.Grouper(key="Start Time", freq="30T"))
         .agg({"Diff": "first"})
         .fillna(pd.Timedelta(seconds=0)))
Diff
Start Time
2016-01-01 00:00:00 0 days 00:15:00
2016-01-01 00:30:00 0 days 00:00:00
2016-01-01 01:00:00 0 days 00:14:00
2016-01-01 01:30:00 0 days 00:00:00
2016-01-01 02:00:00 0 days 00:00:00
2016-01-01 02:30:00 0 days 00:00:00
...
2016-01-02 00:30:00 0 days 00:02:00
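For reference, a minimal setup to reproduce the snippet above (my reconstruction of the question's frame, using the sample timestamps):
import pandas as pd

df = pd.DataFrame({
    "Start Time": pd.to_datetime(["2016-01-01 00:10:00",
                                  "2016-01-01 01:17:00",
                                  "2016-01-02 00:30:00"]),
    "End Time": pd.to_datetime(["2016-01-01 00:25:00",
                                "2016-01-01 01:31:00",
                                "2016-01-02 00:32:00"]),
})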
The idea here is to create a series of 0s with a minute-frequency DatetimeIndex running from the min Start Time to the max End Time. Then add 1 at each Start Time and subtract 1 at each End Time. You can then use cumsum to count the active minutes between Start and End, resample.sum per 30 minutes, and reset_index. The last line of code is just to get the proper format in the Hours column.
# create a series of 0s with a datetime index
res = pd.Series(data=0,
                index=pd.DatetimeIndex(pd.date_range(df['Start Time'].min(),
                                                     df['End Time'].max(),
                                                     freq='T'),
                                       name='Dates'),
                name='Hours')
# add 1 at each start time and subtract 1 at each end time
res[df['Start Time']] += 1
res[df['End Time']] -= 1
# cumsum to get the right value for each minute, then resample per 30 minutes
res = (res.cumsum()
          .resample('30T', label='right').sum()
          .reset_index('Dates')
      )
# change the format of the Hours column, honestly not necessary
res['Hours'] = pd.to_datetime(res['Hours'], format='%M').dt.strftime('%H:%M')  # or .dt.time
print(res)
Dates Hours
0 2016-01-01 00:30:00 00:15
1 2016-01-01 01:00:00 00:00
2 2016-01-01 01:30:00 00:13
3 2016-01-01 02:00:00 00:01
4 2016-01-01 02:30:00 00:00
5 2016-01-01 03:00:00 00:00
...
48 2016-01-02 00:30:00 00:00
49 2016-01-02 01:00:00 00:02
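A tiny worked check of the +1/-1 cumsum trick, under the same assumptions as above: a task from 00:10 to 00:25 marks +1 at 00:10 and -1 at 00:25, so after cumsum the minutes 00:10 through 00:24 hold 1, and the first right-labelled 30-minute bin sums to 15.
import pandas as pd

idx = pd.date_range('2016-01-01 00:00', '2016-01-01 00:59', freq='T')
s = pd.Series(0, index=idx)
s[pd.Timestamp('2016-01-01 00:10')] += 1
s[pd.Timestamp('2016-01-01 00:25')] -= 1
print(s.cumsum().resample('30T', label='right').sum())
# 2016-01-01 00:30:00    15
# 2016-01-01 01:00:00     0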

Making matching algorithm between two data frames more efficient

I have two data frames eg.
Shorter time frame ( 4 hourly )
Time Data_4h
1/1/01 00:00 1.1
1/1/01 06:00 1.2
1/1/01 12:00 1.3
1/1/01 18:00 1.1
2/1/01 00:00 1.1
2/1/01 06:00 1.2
2/1/01 12:00 1.3
2/1/01 18:00 1.1
3/1/01 00:00 1.1
3/1/01 06:00 1.2
3/1/01 12:00 1.3
3/1/01 18:00 1.1
Longer time frame ( 1 day )
Time Data_1d
1/1/01 00:00 1.1
2/1/01 00:00 1.6
3/1/01 00:00 1.0
I want to label the shorter time frame data with the data from the longer time frame data but n-1 days, leaving NaN where the n-1 day doesn't exist.
For example,
Final merged data combining 4h and 1d
Time Data_4h Data_1d
1/1/01 00:00 1.1 NaN
1/1/01 06:00 1.2 NaN
1/1/01 12:00 1.3 NaN
1/1/01 18:00 1.1 NaN
2/1/01 00:00 1.1 1.1
2/1/01 06:00 1.2 1.1
2/1/01 12:00 1.3 1.1
2/1/01 18:00 1.1 1.1
3/1/01 00:00 1.1 1.6
3/1/01 06:00 1.2 1.6
3/1/01 12:00 1.3 1.6
3/1/01 18:00 1.1 1.6
So for 1/1 it tried to find 31/12 but couldn't, so it was labelled NaN. For 2/1, it searched for 1/1 and labelled those entries with 1.1, the value for 1/1. For 3/1, it searched for 2/1 and labelled those entries with 1.6, the value for 2/1.
It is important to note that the time frame data may have large gaps, so I can't access the rows in the larger time frame directly.
What is the best way to do this?
Currently I am iterating through all the rows of the smaller timeframe and then searching for the larger time frame date using a filter like:
large_tf_data[(large_tf_data.index <= target_timestamp)][0]
Where target_timestamp is calculated on each row in the smaller time frame data frame.
This is extremely slow! Any suggestions on how to speed it up?
First, take care of the dates:
dayfirstme = lambda d: pd.to_datetime(d.Time, dayfirst=True)
df = df.assign(Time=dayfirstme)
df2 = df2.assign(Time=dayfirstme)
Then convert df2 to something useful:
d2 = df2.assign(Time=lambda d: d.Time + pd.Timedelta(1, 'D')).set_index('Time').Data_1d
Apply the magic:
df.join(df.Time.dt.date.map(d2).rename(d2.name))
Time Data_4h Data_1d
0 2001-01-01 00:00:00 1.1 NaN
1 2001-01-01 06:00:00 1.2 NaN
2 2001-01-01 12:00:00 1.3 NaN
3 2001-01-01 18:00:00 1.1 NaN
4 2001-01-02 00:00:00 1.1 1.1
5 2001-01-02 06:00:00 1.2 1.1
6 2001-01-02 12:00:00 1.3 1.1
7 2001-01-02 18:00:00 1.1 1.1
8 2001-01-03 00:00:00 1.1 1.6
9 2001-01-03 06:00:00 1.2 1.6
10 2001-01-03 12:00:00 1.3 1.6
11 2001-01-03 18:00:00 1.1 1.6
I'm sure there are other ways but I didn't want to think about this anymore.
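A hedged alternative sketch using pd.merge_asof (in pandas since 0.19), which avoids per-row filtering entirely: shift the daily timestamps forward one day, then as-of match each 4-hourly row backward to the most recent shifted daily row. The tolerance (anything between the largest intraday offset and 24 hours; 23H here) makes rows fall back to NaN when the previous day is missing, which also handles the large gaps.
import pandas as pd

short = df.sort_values('Time')
daily = (df2.assign(Time=df2.Time + pd.Timedelta(1, 'D'))
            .sort_values('Time'))
merged = pd.merge_asof(short, daily, on='Time',
                       tolerance=pd.Timedelta('23H'))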

Python pandas DataFrame loc selection for a range of rows and columns

Here is a head() of my DataFrame df:
Temperature DewPoint Pressure
Date
2010-01-01 00:00:00 46.2 37.5 1.0
2010-01-01 01:00:00 44.6 37.1 1.0
2010-01-01 02:00:00 44.1 36.9 1.0
2010-01-01 03:00:00 43.8 36.9 1.0
2010-01-01 04:00:00 43.5 36.8 1.0
I want to select from August 1 to August 15 2010 and display only the Temperature column.
What I am trying to do is:
df.loc[['2010-08-01','2010-08-15'],'Temperature']
But this is throwing me an error.
Generally speaking, what I want to learn is how, using the loc method, I can take a range from row i to row k and column j to p and show it as a dataframe:
df.loc[[i:k],[j:p]]
Thank you very much in advance!!!
Steve
I think if you want to be able to pass a slice for the index and columns then you can use ix to achieve this:
In [19]:
df.ix['2010-01-01':, 'DewPoint':]
Out[19]:
DewPoint Pressure
Date
2010-01-01 00:00:00 37.5 1.0
2010-01-01 01:00:00 37.1 1.0
2010-01-01 02:00:00 36.9 1.0
2010-01-01 03:00:00 36.9 1.0
2010-01-01 04:00:00 36.8 1.0
The docs detail numerous ways of selecting data
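As a forward-looking note (an addition about later pandas versions): .ix was deprecated in 0.20 and later removed, but the same slice syntax works with .loc, which also answers the original range directly, since a DatetimeIndex supports partial-string slicing:
# rows from August 1 to August 15 2010, Temperature column only
df.loc['2010-08-01':'2010-08-15', 'Temperature']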

Time-based .rolling() fails with group by

Here's a code snippet from Pandas Issue #13966
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})
Fails:
df.groupby('A').rolling('4s', on='B').C.mean()
ValueError: B must be monotonic
Per the issue linked above, this seems to be a bug. Does anyone have a good workaround?
Set B as the index first so as to use Groupby.resample method on it.
df.set_index('B', inplace=True)
Group by A and resample at seconds frequency. As resample cannot be used directly with rolling, use ffill (forward fill, with the NaN limit set to 0).
Now use the rolling function with a window size of 4 (because of the 4s frequency) and take its mean along the C column, as shown:
for _, grp in df.groupby('A'):
    print(grp.resample('s').ffill(limit=0).rolling(4)['C'].mean().head(10))  # remove head()
Resulting output obtained:
B
2016-01-01 09:30:00 NaN
2016-01-01 09:30:01 NaN
2016-01-01 09:30:02 NaN
2016-01-01 09:30:03 1.5
2016-01-01 09:30:04 2.5
2016-01-01 09:30:05 3.5
2016-01-01 09:30:06 4.5
2016-01-01 09:30:07 5.5
2016-01-01 09:30:08 6.5
2016-01-01 09:30:09 7.5
Freq: S, Name: C, dtype: float64
B
2016-01-01 09:30:00 NaN
2016-01-01 09:30:01 NaN
2016-01-01 09:30:02 NaN
2016-01-01 09:30:03 21.5
2016-01-01 09:30:04 22.5
2016-01-01 09:30:05 23.5
2016-01-01 09:30:06 24.5
2016-01-01 09:30:07 25.5
2016-01-01 09:30:08 26.5
2016-01-01 09:30:09 27.5
Freq: S, Name: C, dtype: float64
B
2016-01-01 09:30:12 NaN
2016-01-01 09:30:13 NaN
2016-01-01 09:30:14 NaN
2016-01-01 09:30:15 33.5
2016-01-01 09:30:16 34.5
2016-01-01 09:30:17 35.5
2016-01-01 09:30:18 36.5
2016-01-01 09:30:19 37.5
Freq: S, Name: C, dtype: float64
TL;DR
Use groupby.apply as a workaround instead after setting the index appropriately:
# tested in version - 0.19.1
df.groupby('A').apply(lambda grp: grp.resample('s').ffill(limit=0).rolling(4)['C'].mean())
(Or)
# Tested in OP's version - 0.19.0
df.groupby('A').apply(lambda grp: grp.resample('s').ffill().rolling(4)['C'].mean())
Both work.
Alternatively, sort by B first so that it is monotonic before grouping:
>>> df.sort_values('B').set_index('B').groupby('A').rolling('4s').C.mean()
