Generating list of 5 minute interval between two times - python

I have the following strings:
start = "07:00:00"
end = "17:00:00"
How can I generate a list of 5 minute interval between those times, ie
["07:00:00","07:05:00",...,"16:55:00","17:00:00"]

This works for me, I'm sure you can figure out how to put the results in the list instead of printing them out:
>>> import datetime
>>> start = "07:00:00"
>>> end = "17:00:00"
>>> delta = datetime.timedelta(minutes=5)
>>> start = datetime.datetime.strptime( start, '%H:%M:%S' )
>>> end = datetime.datetime.strptime( end, '%H:%M:%S' )
>>> t = start
>>> while t <= end :
... print datetime.datetime.strftime( t, '%H:%M:%S')
... t += delta
...
07:00:00
07:05:00
07:10:00
07:15:00
07:20:00
07:25:00
07:30:00
07:35:00
07:40:00
07:45:00
07:50:00
07:55:00
08:00:00
08:05:00
08:10:00
08:15:00
08:20:00
08:25:00
08:30:00
08:35:00
08:40:00
08:45:00
08:50:00
08:55:00
09:00:00
09:05:00
09:10:00
09:15:00
09:20:00
09:25:00
09:30:00
09:35:00
09:40:00
09:45:00
09:50:00
09:55:00
10:00:00
10:05:00
10:10:00
10:15:00
10:20:00
10:25:00
10:30:00
10:35:00
10:40:00
10:45:00
10:50:00
10:55:00
11:00:00
11:05:00
11:10:00
11:15:00
11:20:00
11:25:00
11:30:00
11:35:00
11:40:00
11:45:00
11:50:00
11:55:00
12:00:00
12:05:00
12:10:00
12:15:00
12:20:00
12:25:00
12:30:00
12:35:00
12:40:00
12:45:00
12:50:00
12:55:00
13:00:00
13:05:00
13:10:00
13:15:00
13:20:00
13:25:00
13:30:00
13:35:00
13:40:00
13:45:00
13:50:00
13:55:00
14:00:00
14:05:00
14:10:00
14:15:00
14:20:00
14:25:00
14:30:00
14:35:00
14:40:00
14:45:00
14:50:00
14:55:00
15:00:00
15:05:00
15:10:00
15:15:00
15:20:00
15:25:00
15:30:00
15:35:00
15:40:00
15:45:00
15:50:00
15:55:00
16:00:00
16:05:00
16:10:00
16:15:00
16:20:00
16:25:00
16:30:00
16:35:00
16:40:00
16:45:00
16:50:00
16:55:00
17:00:00
>>>

Try:
# import modules
from datetime import datetime, timedelta
# Create starting and end datetime object from string
start = datetime.strptime("07:00:00", "%H:%M:%S")
end = datetime.strptime("17:00:00", "%H:%M:%S")
# min_gap
min_gap = 5
# compute datetime interval
arr = [(start + timedelta(hours=min_gap*i/60)).strftime("%H:%M:%S")
for i in range(int((end-start).total_seconds() / 60.0 / min_gap))]
print(arr)
# ['07:00:00', '07:05:00', '07:10:00', '07:15:00', '07:20:00', '07:25:00', '07:30:00', ..., '16:55:00']
Explanations:
First, you need to convert string date to datetime object. The strptime does it!
Then, we will find the number of minutes between the starting date and the ending datetime. This discussion solved it! We can do it like this :
(end-start).total_seconds() / 60.0
However, in our case, we only want to iterate every n minutes. So, in our loop, we need to divide it by n.
Also, as we will iterate over this number of minutes, we need to convertit to int for the for loop. That results in:
int((end-start).total_seconds() / 60.0 / min_gap)
Then, on each element of our loop, we will add the number of minutes to the initial datetime. The tiemdelta function has been designed for. As parameter, we specify the number of hours we want to add : min_gap*i/60.
Finally, we convert this datetime object back to a string object using the strftime.

Related

How to reset a date time column to be in increments of one minute in python

I have a dataframe that has a date time column called start time and it is set to a default of 12:00:00 AM. I would like to reset this column so that the first row is 00:01:00 and the second row is 00:02:00, that is one minute interval.
This is the original table.
ID State Time End Time
A001 12:00:00 12:00:00
A002 12:00:00 12:00:00
A003 12:00:00 12:00:00
A004 12:00:00 12:00:00
A005 12:00:00 12:00:00
A006 12:00:00 12:00:00
A007 12:00:00 12:00:00
I want to reset the start time column so that my output is this:
ID State Time End Time
A001 0:00:00 12:00:00
A002 0:00:01 12:00:00
A003 0:00:02 12:00:00
A004 0:00:03 12:00:00
A005 0:00:04 12:00:00
A006 0:00:05 12:00:00
A007 0:00:06 12:00:00
How do I go about this?
you could use pd.date_range:
df['Start Time'] = pd.date_range('00:00', periods=df['Start Time'].shape[0], freq='1min')
gives you
df
Out[23]:
Start Time
0 2019-09-30 00:00:00
1 2019-09-30 00:01:00
2 2019-09-30 00:02:00
3 2019-09-30 00:03:00
4 2019-09-30 00:04:00
5 2019-09-30 00:05:00
6 2019-09-30 00:06:00
7 2019-09-30 00:07:00
8 2019-09-30 00:08:00
9 2019-09-30 00:09:00
supply a full date/time string to get another starting date.
First we convert your State Time column to datetime type. Then we use pd.date_range and use the first time as starting point with a frequency of 1 minute.
df['State Time'] = pd.to_datetime(df['State Time'])
df['State Time'] = pd.date_range(start=df['State Time'].min(),
periods=len(df),
freq='min').time
Output
ID State Time End Time
0 A001 12:00:00 12:00:00
1 A002 12:01:00 12:00:00
2 A003 12:02:00 12:00:00
3 A004 12:03:00 12:00:00
4 A005 12:04:00 12:00:00
5 A006 12:05:00 12:00:00
6 A007 12:06:00 12:00:00

getting tzinfo from a column rather than a single string

Usually you can localize a whole column with tz_localize. If you specify the single value timezone you want the column localized to formatted. How would you do this when there is a column of timezones?
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.tz_localize.html#pandas.Series.dt.tz_localize
start_datetime timezone
1 2016-08-25 10:30:00 US/Pacific
2 2006-08-26 14:00:00 US/Pacific
3 2016-08-27 10:15:00 US/Eastern
4 2016-08-28 10:30:00 US/Central
5 2016-08-09 17:45:00 US/Central
Is there a way to do this without lambdas or apply? (best option).
We can group by timezone and apply .dt.tz_localize(group_timezone) to each group:
In [393]: df['new'] = \
df.groupby('timezone')['start_datetime'] \
.transform(lambda x: x.dt.tz_localize(x.name))
In [394]: df
Out[394]:
start_datetime timezone new
0 2016-08-25 10:30:00 US/Pacific 2016-08-25 17:30:00
1 2006-08-26 14:00:00 US/Pacific 2006-08-26 21:00:00
2 2016-08-27 10:15:00 US/Eastern 2016-08-27 14:15:00
3 2016-08-28 10:30:00 US/Central 2016-08-28 15:30:00
4 2016-08-09 17:45:00 US/Central 2016-08-09 22:45:00

Grouping dates by 5 minute periods irrespective of day

I have a DataFrame with data similar to the following
import pandas as pd; import numpy as np; import datetime; from datetime import timedelta;
df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5 minute period over the course of a 24 hour day - without considering what day those values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
aggregates.append(
(
start_time,
df.between_time(start_time.time(),
(start_time + timedelta(minutes=5)).time(),
include_end=False).value.mean()
)
)
start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index to '5 min' intervals and then group on the time attribute and aggregate the mean

How to resample a df with datetime index to exactly n equally sized periods?

I've got a large dataframe with a datetime index and need to resample data to exactly 10 equally sized periods.
So far, I've tried finding the first and last dates to determine the total number of days in the data, divide that by 10 to determine the size of each period, then resample using that number of days. eg:
first = df.reset_index().timesubmit.min()
last = df.reset_index().timesubmit.max()
periodsize = str((last-first).days/10) + 'D'
df.resample(periodsize,how='sum')
This doesn't guarantee exactly 10 periods in the df after resampling since the periodsize is a rounded down int. Using a float doesn't work in the resampling. Seems that either there's something simple that I'm missing here, or I'm attacking the problem all wrong.
import numpy as np
import pandas as pd
n = 10
nrows = 33
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
print(df)
# 0
# 2000-01-01 1
# 2000-01-02 1
# ...
# 2000-02-01 1
# 2000-02-02 1
first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
secs = int((last-first).total_seconds()//n)
periodsize = '{:d}S'.format(secs)
result = df.resample(periodsize, how='sum')
print('\n{}'.format(result))
assert len(result) == n
yields
0
2000-01-01 00:00:00 4
2000-01-04 07:12:00 3
2000-01-07 14:24:00 3
2000-01-10 21:36:00 4
2000-01-14 04:48:00 3
2000-01-17 12:00:00 3
2000-01-20 19:12:00 4
2000-01-24 02:24:00 3
2000-01-27 09:36:00 3
2000-01-30 16:48:00 3
The values in the 0-column indicate the number of rows that were aggregated, since the original DataFrame was filled with values of 1. The pattern of 4's and 3's is about as even as you can get since 33 rows can not be evenly grouped into 10 groups.
Explanation: Consider this simpler DataFrame:
n = 2
nrows = 5
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
# 0
# 2000-01-01 1
# 2000-01-02 1
# 2000-01-03 1
# 2000-01-04 1
# 2000-01-05 1
Using df.resample('2D', how='sum') gives the wrong number of groups
In [366]: df.resample('2D', how='sum')
Out[366]:
0
2000-01-01 2
2000-01-03 2
2000-01-05 1
Using df.resample('3D', how='sum') gives the right number of groups, but the
second group starts at 2000-01-04 which does not evenly divide the DataFrame
into two equally-spaced groups:
In [367]: df.resample('3D', how='sum')
Out[367]:
0
2000-01-01 3
2000-01-04 2
To do better, we need to work at a finer time resolution than in days. Since Timedeltas have a total_seconds method, let's work in seconds. So for the example above, the desired frequency string would be
In [374]: df.resample('216000S', how='sum')
Out[374]:
0
2000-01-01 00:00:00 3
2000-01-03 12:00:00 2
since there are 216000*2 seconds in 5 days:
In [373]: (pd.Timedelta(days=5) / pd.Timedelta('1S'))/2
Out[373]: 216000.0
Okay, so now all we need is a way to generalize this. We'll want the minimum and maximum dates in the index:
first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
We add an extra day because it makes the difference in days come out right. In
the example above, There are only 4 days between the Timestamps for 2000-01-05
and 2000-01-01,
In [377]: (pd.Timestamp('2000-01-05')-pd.Timestamp('2000-01-01')).days
Out[378]: 4
But as we can see in the worked example, the DataFrame has 5 rows representing 5
days. So it makes sense that we need to add an extra day.
Now we can compute the correct number of seconds in each equally-spaced group with:
secs = int((last-first).total_seconds()//n)
Here is one way to ensure equal-size sub-periods by using np.linspace() on pd.Timedelta and then classifying each obs into different bins using pd.cut.
import pandas as pd
import numpy as np
# generate artificial data
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'], index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='8H'))
Out[87]:
A B
2015-01-01 00:00:00 1.7641 0.4002
2015-01-01 08:00:00 0.9787 2.2409
2015-01-01 16:00:00 1.8676 -0.9773
2015-01-02 00:00:00 0.9501 -0.1514
2015-01-02 08:00:00 -0.1032 0.4106
2015-01-02 16:00:00 0.1440 1.4543
2015-01-03 00:00:00 0.7610 0.1217
2015-01-03 08:00:00 0.4439 0.3337
2015-01-03 16:00:00 1.4941 -0.2052
2015-01-04 00:00:00 0.3131 -0.8541
2015-01-04 08:00:00 -2.5530 0.6536
2015-01-04 16:00:00 0.8644 -0.7422
2015-01-05 00:00:00 2.2698 -1.4544
2015-01-05 08:00:00 0.0458 -0.1872
2015-01-05 16:00:00 1.5328 1.4694
... ... ...
2015-01-29 08:00:00 0.9209 0.3187
2015-01-29 16:00:00 0.8568 -0.6510
2015-01-30 00:00:00 -1.0342 0.6816
2015-01-30 08:00:00 -0.8034 -0.6895
2015-01-30 16:00:00 -0.4555 0.0175
2015-01-31 00:00:00 -0.3540 -1.3750
2015-01-31 08:00:00 -0.6436 -2.2234
2015-01-31 16:00:00 0.6252 -1.6021
2015-02-01 00:00:00 -1.1044 0.0522
2015-02-01 08:00:00 -0.7396 1.5430
2015-02-01 16:00:00 -1.2929 0.2671
2015-02-02 00:00:00 -0.0393 -1.1681
2015-02-02 08:00:00 0.5233 -0.1715
2015-02-02 16:00:00 0.7718 0.8235
2015-02-03 00:00:00 2.1632 1.3365
[100 rows x 2 columns]
# cutoff points, 10 equal-size group requires 11 points
# measured by timedelta 1 hour
time_delta_in_hours = (df.index - df.index[0]) / pd.Timedelta('1h')
n = 10
ts_cutoff = np.linspace(0, time_delta_in_hours[-1], n+1)
# labels, time index
time_index = df.index[0] + np.array([pd.Timedelta(str(time_delta)+'h') for time_delta in ts_cutoff])
# create a categorical reference variables
df['start_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[:-1])
# for clarity, reassign labels using end-period index
df['end_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[1:])
Out[89]:
A B start_time_index end_time_index
2015-01-01 00:00:00 1.7641 0.4002 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 08:00:00 0.9787 2.2409 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 16:00:00 1.8676 -0.9773 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 00:00:00 0.9501 -0.1514 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 08:00:00 -0.1032 0.4106 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 16:00:00 0.1440 1.4543 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 00:00:00 0.7610 0.1217 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 08:00:00 0.4439 0.3337 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 16:00:00 1.4941 -0.2052 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 00:00:00 0.3131 -0.8541 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 08:00:00 -2.5530 0.6536 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-04 16:00:00 0.8644 -0.7422 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 00:00:00 2.2698 -1.4544 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 08:00:00 0.0458 -0.1872 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 16:00:00 1.5328 1.4694 2015-01-04 07:12:00 2015-01-07 14:24:00
... ... ... ... ...
2015-01-29 08:00:00 0.9209 0.3187 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-29 16:00:00 0.8568 -0.6510 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 00:00:00 -1.0342 0.6816 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 08:00:00 -0.8034 -0.6895 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 16:00:00 -0.4555 0.0175 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-31 00:00:00 -0.3540 -1.3750 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 08:00:00 -0.6436 -2.2234 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 16:00:00 0.6252 -1.6021 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 00:00:00 -1.1044 0.0522 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 08:00:00 -0.7396 1.5430 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 16:00:00 -1.2929 0.2671 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 00:00:00 -0.0393 -1.1681 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 08:00:00 0.5233 -0.1715 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 16:00:00 0.7718 0.8235 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-03 00:00:00 2.1632 1.3365 2015-01-30 16:48:00 2015-02-03 00:00:00
[100 rows x 4 columns]
df.groupby('start_time_index').agg('sum')
Out[90]:
A B
start_time_index
2015-01-01 00:00:00 8.6133 2.7734
2015-01-04 07:12:00 1.9220 -0.8069
2015-01-07 14:24:00 -8.1334 0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00 1.1957 7.2285
2015-01-17 12:00:00 3.2485 6.6841
2015-01-20 19:12:00 -0.8903 2.2802
2015-01-24 02:24:00 -2.1025 1.3800
2015-01-27 09:36:00 -1.1017 1.3108
2015-01-30 16:48:00 -0.0902 -2.5178
Another potential shorter way to do this is to specify your sampling freq as the time delta. But the problem, as shown in below, is that it delivers 11 sub-samples instead of 10. I believe the reason is that the resample implements a left-inclusive/right-exclusive (or left-exclusive/right-inclusive) sub-sampling scheme so that the very last obs at '2015-02-03 00:00:00' is considered as a separate group. If we use pd.cut to do it ourself, we can specify include_lowest=True so that it gives us exactly 10 sub-samples rather than 11.
n = 10
time_delta_str = str((df.index[-1] - df.index[0]) / (pd.Timedelta('1s') * n)) + 's'
df.resample(pd.Timedelta(time_delta_str), how='sum')
Out[114]:
A B
2015-01-01 00:00:00 8.6133 2.7734
2015-01-04 07:12:00 1.9220 -0.8069
2015-01-07 14:24:00 -8.1334 0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00 1.1957 7.2285
2015-01-17 12:00:00 3.2485 6.6841
2015-01-20 19:12:00 -0.8903 2.2802
2015-01-24 02:24:00 -2.1025 1.3800
2015-01-27 09:36:00 -1.1017 1.3108
2015-01-30 16:48:00 -2.2534 -3.8543
2015-02-03 00:00:00 2.1632 1.3365

Pandas filtering values in dataframe

I have this dataframe. The columns represent the highs and the lows in daily EURUSD price:
df.low df.high
2013-01-17 16:00:00 1.33394 2013-01-17 20:00:00 1.33874
2013-01-18 18:00:00 1.32805 2013-01-18 09:00:00 1.33983
2013-01-21 00:00:00 1.32962 2013-01-21 09:00:00 1.33321
2013-01-22 11:00:00 1.32667 2013-01-22 09:00:00 1.33715
2013-01-23 17:00:00 1.32645 2013-01-23 14:00:00 1.33545
2013-01-24 10:00:00 1.32860 2013-01-24 18:00:00 1.33926
2013-01-25 04:00:00 1.33497 2013-01-25 17:00:00 1.34783
2013-01-28 10:00:00 1.34246 2013-01-28 16:00:00 1.34771
2013-01-29 13:00:00 1.34143 2013-01-29 21:00:00 1.34972
2013-01-30 08:00:00 1.34820 2013-01-30 21:00:00 1.35873
2013-01-31 13:00:00 1.35411 2013-01-31 17:00:00 1.35944
I summed them up into a third column (df.extremes).
df.extremes
2013-01-17 16:00:00 1.33394
2013-01-17 20:00:00 1.33874
2013-01-18 18:00:00 1.32805
2013-01-18 09:00:00 1.33983
2013-01-21 00:00:00 1.32962
2013-01-21 09:00:00 1.33321
2013-01-22 09:00:00 1.33715
2013-01-22 11:00:00 1.32667
2013-01-23 14:00:00 1.33545
2013-01-23 17:00:00 1.32645
2013-01-24 10:00:00 1.32860
2013-01-24 18:00:00 1.33926
2013-01-25 04:00:00 1.33497
2013-01-25 17:00:00 1.34783
2013-01-28 10:00:00 1.34246
2013-01-28 16:00:00 1.34771
2013-01-29 13:00:00 1.34143
2013-01-29 21:00:00 1.34972
2013-01-30 08:00:00 1.34820
2013-01-30 21:00:00 1.35873
2013-01-31 13:00:00 1.35411
2013-01-31 17:00:00 1.35944
But now i want to filter some values from df.extremes.
To explain what to filter i try with this "pseudocode":
IF following the index we move from: previous df.low --> df.low --> df.high:
IF df.low > previous df.low: delete df.low
IF df.low < previous df.low: delete previous df.low
If i try to work this out with a for loop, it gives me a KeyError: 1.3339399999999999.
day = df.groupby(pd.TimeGrouper('D'))
is_day_min = day.extremes.apply(lambda x: x == x.min())
for i in df.extremes:
if is_day_min[i] == True and is_day_min[i+1] == True:
if df.extremes[i] > df.extremes[i+1]:
del df.extremes[i]
for i in df.extremes:
if is_day_min[i] == True and is_day_min[i+1] == True:
if df.extremes[i] < df.extremes[i+1]:
del df.extremes[i+1]
How to filter/delete the values as i explained in pseudocode?
I am struggling with indexing and bools but i can't solve this. I strongly suspect that i need to use a lambda function, but i don't know how to apply it. So please have mercy it's too long that i'm trying on this. Hope i've been clear enough.
All you're really missing is a way of saying "previous low" in a vectorized fashion. That's spelled df['low'].shift(-1). Once you have that it's just:
prev = df.low.shift(-1)
filtered_df = df[~((df.low > prev) | (df.low < prev))]

Categories

Resources