pandas resample: forcing specific start time of time bars - python

I have some time series data (pandas.DataFrame) and I resample it in '600S' bars:
import numpy as np
data.resample('600S', level='time').aggregate({'abc':np.sum})
I get something like this:
                    abc
time
09:30:01.446000   19836
09:40:01.446000    8577
09:50:01.446000   29746
10:00:01.446000   29340
10:10:01.446000    5197
...
How can I force the time bars to start at 09:30:00.000000 instead of at the time of the 1st row in the data? I.e. output should be something like this:
                    abc
time
09:30:00.000000   *****
09:40:00.000000    ****
09:50:00.000000   *****
10:00:00.000000   *****
10:10:00.000000    ****
...
Thank you for your help!

You can add Series.dt.floor to your code:
df.time = df.time.dt.floor('10 min')
                 time    abc
0 2018-12-05 09:30:00  19836
1 2018-12-05 09:40:00   8577
2 2018-12-05 09:50:00  29746
3 2018-12-05 10:00:00  29340
4 2018-12-05 10:10:00   5197
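If you would rather keep resample itself, newer pandas (1.1+) exposes an origin parameter (and a companion offset) that controls where the bins are anchored. A minimal sketch of that route, assuming the same data and index as in the question:
# anchor the 10-minute bins at midnight of the first day, so they
# start at 09:30:00 rather than at the first observation's timestamp
data.resample('600S', level='time', origin='start_day').agg({'abc': 'sum'})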

.resample is a bit of a wildcard. It behaves rather differently with datetime64[ns] and timedelta64[ns], so I personally find it more reliable to work with groupby when just doing things like .sum or .first.
Sample Data
import pandas as pd
import numpy as np
n = 1000
np.random.seed(123)
df = pd.DataFrame({'time': pd.date_range('2018-01-01 01:13:43', '2018-01-01 23:59:59', periods=n),
                   'abc': np.random.randint(1, 1000, n)})
When the dtype is datetime64[ns] it will resample to "round" bins:
df.dtypes
#time datetime64[ns]
#abc int32
#dtype: object
df.set_index('time').resample('600S').sum()
                      abc
time
2018-01-01 01:10:00  2572
2018-01-01 01:20:00  2257
2018-01-01 01:30:00  2470
2018-01-01 01:40:00  3131
2018-01-01 01:50:00  3402
With timedelta64[ns] it instead begins the bins based on your first observation:
df['time'] = pd.to_timedelta(df.time.dt.time.astype('str'))
df.dtypes
#time timedelta64[ns]
#abc int32
#dtype: object
df.set_index('time').resample('600S').sum()
           abc
time
01:13:43  3432
01:23:43  2447
01:33:43  2588
01:43:43  3202
01:53:43  2547
So in the case of a timedelta64[ns] column, I'd advise you to go with groupby, using .dt.floor to create 10-minute bins that run from [XX:00:00, XX:10:00):
df.groupby(df.time.dt.floor('600S')).sum()
#          abc
#time
#01:10:00  2572
#01:20:00  2257
#01:30:00  2470
#01:40:00  3131
#01:50:00  3402
This is the same result we got in the first case with the datetime64[ns] dtype, which binned to the "round" bins.

If your use case tolerates it and you want to extend the time range before the actual starting time, one solution is to add an empty row at the starting time you want.
E.g. truncate the first time (df.index[0] if the index is sorted, else df.index.min()) to its hour with .floor("h"):
df.loc[df.index.min().floor("h")] = None
df.sort_index(inplace=True) # cleaner, but not even needed
Then resample() will use this time as the starting point (9:00 in OP’s case).
This can also be applied to extend the time range after the actual end of the dataset.
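A minimal end-to-end sketch of this idea, assuming a DataFrame df with a sorted DatetimeIndex and a single abc column (the same trick works with a TimedeltaIndex; the anchor row is all-NaN, so it contributes nothing to the sums):
padded = df.copy()
padded.loc[padded.index.min().floor('h')] = None  # empty anchor row at e.g. 09:00
padded.sort_index(inplace=True)                   # keep the index monotonic
result = padded.resample('600S').sum()            # bins now start from the anchor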

Related

Determining Consecutive Days Using Pandas

I have a Pandas DataFrame that looks something like this:
activity_id start end type ... site heart_rate grp_idx date
user_id ...
72xyy89c74cc57178e02f103187ad579 dcb12345678b5c8e84cf2931b1a553cb 2015-09-12 00:40:33.171000+00:00 2015-09-12 00:53:33.171000+00:00 run ... data\dcb12345678b5c8e84cf2931b1a553cb.json NaN 2015-37 2015-09-12
The DataFrame has already been filtered so that all rows have the same user_id.
The dtypes are:
activity_id object
start datetime64[ns, UTC]
end datetime64[ns, UTC]
type object
distance float64
steps float64
speed float64
pace float64
calories float64
ascent float64
descent float64
site object
heart_rate float64
grp_idx object
date object
I need to determine if there are x (e.g. 4) consecutive days and find the number of times that has occurred. For example:
2015-09-12
2015-09-13
2015-09-14
2015-09-15
2015-09-16
2015-09-17
2015-09-18
2015-09-19
Would count as two.
I have tried using groupby, e.g.:
s = repeat_runner_df.groupby('user_id').start.diff().dt.days.fillna(1).cumsum()
repeat_runner_df.groupby(['user_id', s]).filter(lambda x: len(x) < 3)
print(repeat_runner_df)
but that has gotten me nowhere, and neither have my Google skills. Any help would be greatly appreciated.
Some further sample rows from the data:
c7e962db02da12345f02fe3d8a86c99d 2018-04-08 03:16:06+00:00 2018-04-08 03:41:30+00:00 run ... NaN 2018-14 2018-04-08
c7e962db02da12345f02fe3d8a86c99d 2018-04-09 17:21:37+00:00 2018-04-09 18:27:17+00:00 run ... NaN 2018-14 2018-04-09
c7e962db02da12345f02fe3d8a86c99d 2018-04-10 19:05:39+00:00 2018-04-10 19:38:32+00:00 run ... NaN 2018-15 2018-04-10
Try this:
# normalize() strips the time of day, so the diff between consecutive days is exactly 1 day
count = df.groupby(df.start.dt.normalize().diff().ne(pd.Timedelta(days=1)).cumsum()).size().ge(4).sum()
Output:
>>> count
3
I think this is what you're looking for:
consecutive = 4
(df.groupby(['user_id',
             df.groupby('user_id')['start']
               .transform(lambda x: x.diff().dt.days.ne(1).cumsum())])['user_id']
   .cumcount().add(1).eq(consecutive).sum())
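Note that both snippets above count each run of consecutive days at most once. If, as in the example in the question, an 8-day run should count as two occurrences of 4 days, you can divide each run length by the window size instead. A sketch along the same lines, assuming one row per user per day:
consecutive = 4
# label each run of consecutive days per user
runs = df.groupby('user_id')['start'].transform(
    lambda x: x.dt.normalize().diff().ne(pd.Timedelta(days=1)).cumsum())
# a run of length 8 contributes 8 // 4 == 2 occurrences
count = (df.groupby(['user_id', runs]).size() // consecutive).sum()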

Pandas: how to convert an incomplete date-only index to hourly index

I have an ideal hourly time series that looks like this:
from pandas import Series, date_range, Timestamp
from numpy import random
index = date_range(
    "2020-01-26 14:00:00",
    "2020-11-28 02:00:00",
    freq="H",
    tz="Europe/Madrid",
)
ideal = Series(index=index, data=random.rand(len(index)))
ideal
2020-01-26 14:00:00+01:00 0.186026
2020-01-26 15:00:00+01:00 0.142096
2020-01-26 16:00:00+01:00 0.432625
2020-01-26 17:00:00+01:00 0.373805
2020-01-26 18:00:00+01:00 0.377718
...
2020-11-27 22:00:00+01:00 0.961327
2020-11-27 23:00:00+01:00 0.440274
2020-11-28 00:00:00+01:00 0.996126
2020-11-28 01:00:00+01:00 0.607873
2020-11-28 02:00:00+01:00 0.122993
Freq: H, Length: 7357, dtype: float64
The actual, non-ideal time series is far from perfect:
It is not complete (i.e.: some hourly values are missing)
Only the date is stored, not the hour
Something like this:
actual = ideal.drop([
    Timestamp("2020-01-28 01:00:00", tz="Europe/Madrid"),
    Timestamp("2020-08-02 15:00:00", tz="Europe/Madrid"),
    Timestamp("2020-08-02 16:00:00", tz="Europe/Madrid"),
])
actual.index = actual.index.date
actual
2020-01-26 0.186026
2020-01-26 0.142096
2020-01-26 0.432625
2020-01-26 0.373805
2020-01-26 0.377718
...
2020-11-27 0.961327
2020-11-27 0.440274
2020-11-28 0.996126
2020-11-28 0.607873
2020-11-28 0.122993
Length: 7355, dtype: float64
Now I would like to convert the actual time series into something as close as possible to the ideal. That means:
The resulting series has a full hourly index (i.e.: no missing values)
NaNs are allowed if there is no way to fill an hour (i.e.: it is missing in the actual time series)
Misalignment within a day with respect to the ideal time series is expected only in those days with missing data
Is there an efficient way to do this? I would like to avoid having to iterate since I am guessing that would be very inefficient.
By efficient I mean a fast (CPU wall time) implementation that relies only on Python/Pandas/NumPy (no Cython or Numba allowed).
You can use groupby().cumcount() to represent the hour within each day, and then reindex:
import pandas as pd

s = pd.to_datetime(actual.index).tz_localize("Europe/Madrid").to_series()
actual.index = s + s.groupby(level=0).cumcount() * pd.Timedelta('1H')
new_idx = pd.date_range(actual.index.min(), actual.index.max(), freq='H')
actual = actual.reindex(new_idx)
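As a quick sanity check (reusing the actual and ideal objects from the question), the three dropped hours surface as NaNs; note they land at the end of the affected days, since the intra-day alignment is reconstructed by counting rows:
print(actual.isna().sum())    # 3 missing hours
print(actual[actual.isna()])  # where the gaps ended up after reindexing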

Pandas: How to group by a datetime column, using only the time and discarding the date

I have a dataframe with a datetime column. I want to group by the time component only and aggregate, e.g. by taking the mean.
I know that I can use pd.Grouper to group by date AND time, but it doesn't work on time only.
Say we have the following dataframe:
import numpy as np
import pandas as pd
drange = pd.date_range('2019-08-01 00:00', '2019-08-12 12:00', freq='1T')
time = drange.time
c0 = np.random.rand(len(drange))
c1 = np.random.rand(len(drange))
df = pd.DataFrame(dict(drange=drange, time=time, c0=c0, c1=c1))
print(df.head())
drange time c0 c1
0 2019-08-01 00:00:00 00:00:00 0.031946 0.159739
1 2019-08-01 00:01:00 00:01:00 0.809171 0.681942
2 2019-08-01 00:02:00 00:02:00 0.036720 0.133443
3 2019-08-01 00:03:00 00:03:00 0.650522 0.409797
4 2019-08-01 00:04:00 00:04:00 0.239262 0.814565
In this case, the following throws a TypeError:
grouper = pd.Grouper(key='time', freq='5T')
grouped = df.groupby(grouper).mean()
I could set key=drange to group by date and time and then:
Reset the index
Transform the new column to float
Bin with pd.cut
Cast back to time
Finally group-by and then aggregate
... But I wonder whether there is a cleaner way to achieve the same results.
Series.dt.time/DatetimeIndex.time returns the time as datetime.time. This isn't great because pandas works best with timedelta64, so your 'time' column is cast to object, losing all datetime functionality.
You can subtract off the normalized date to obtain the time as a timedelta so you can continue to use the datetime tools of pandas. You can floor this to group.
s = (df.drange - df.drange.dt.normalize()).dt.floor('5T')
df.groupby(s).mean()
                c0        c1
drange
00:00:00  0.436971  0.530201
00:05:00  0.441387  0.518831
00:10:00  0.465008  0.478130
...            ...       ...
23:45:00  0.523233  0.515991
23:50:00  0.468695  0.434240
23:55:00  0.569989  0.510291
Alternatively, if you feel unsure about floor, this gets identical output up to the index name:
df['time'] = (df.drange - df.drange.dt.normalize()) # timedelta64[ns]
df.groupby(pd.Grouper(key='time', freq='5T')).mean()
When you use DataFrame.groupby you can pass a Series as an argument. Moreover, if your series holds datetimes, you can use the .dt accessor to reach its datetime properties. In your case df['drange'].dt.hour or df['drange'].dt.time should do it.
# df['drange']=pd.to_datetime(df['drange'])
df.groupby(df['drange'].dt.hour).agg(...)
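For example, to get one mean per hour of the day across all days (a hypothetical usage with the sample df from the question):
hourly = df.groupby(df['drange'].dt.hour)[['c0', 'c1']].mean()
print(hourly.head())  # index runs 0..23, one row per hour of day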

pandas asfreq returns NaN if exact date DNE

Let's say I have financial data in a pandas.Series, called fin_series.
Here's a peek at fin_series.
In [565]: fin_series
Out[565]:
Date
2008-05-16 1000.000000
2008-05-19 1001.651747
2008-05-20 1004.137434
...
2014-12-22 1158.085200
2014-12-23 1150.139126
2014-12-24 1148.934665
Name: Close, Length: 1665
I'm interested in looking at the quarterly endpoints of the data. However, not all financial trading days fall exactly on the 'end of the quarter.'
For example:
In [566]: fin_series.asfreq('q')
Out[566]:
2008-06-30 976.169624
2008-09-30 819.518923
2008-12-31 760.429261
...
2009-06-30 795.768956
2009-09-30 870.467121
2009-12-31 886.329978
...
2011-09-30 963.304679
2011-12-31 NaN
2012-03-31 NaN
....
2012-09-30 NaN
2012-12-31 1095.757137
2013-03-31 NaN
2013-06-30 NaN
...
2014-03-31 1138.548881
2014-06-30 1168.248194
2014-09-30 1147.000073
Freq: Q-DEC, Name: Close, dtype: float64
Here's a little function that accomplishes what I'd like, along with the desired end result.
import numpy as np

def bmg_qt_asfreq(series):
    # mark the last row of each quarter: where the quarter changes, plus the final row
    ind = series[1:].index.quarter != series[:-1].index.quarter
    ind = np.append(ind, True)
    return series[ind]
which gives me:
In [15]: bmg_qt_asfreq(fin_series)
Out[15]:
Date
2008-06-30 976.169425
2008-09-30 819.517607
2008-12-31 760.428770
...
2011-09-30 963.252831
2011-12-30 999.742132
2012-03-30 1049.848583
...
2012-09-28 1086.689824
2012-12-31 1093.943357
2013-03-28 1117.111859
Name: Close, dtype: float64
Note that I'm preserving the dates of the "closest prior price," instead of simply using pandas.asfreq(freq='q', method='ffill'), as preserving the dates that exist within the original Series' index is crucial.
This seems like a common problem that must surely be addressed by pandas' time manipulation functionality, but I can't figure out how to do it with resample or asfreq.
Anyone who could show me the builtin pandas functionality to accomplish this would be greatly appreciated.
Assuming the input is a pandas Series, first do
import pandas as pd
fin_series.resample("Q").apply(pd.Series.last_valid_index)
to get a series with the last non-NA index for each quarter. Then
fin_series.resample("Q").last()
for the last non-NA value. You can then join these together. As you suggested in your comment:
fin_series.loc[fin_series.resample("Q").apply(pd.Series.last_valid_index)]
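Putting the two pieces together, a sketch that keeps only the trading dates that actually exist in the index (quarters with no data are dropped rather than producing NaN):
# last observed date within each quarter
qt_dates = fin_series.resample('Q').apply(pd.Series.last_valid_index).dropna()
# select those rows, preserving the original dates
qt_close = fin_series.loc[qt_dates]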
fin_series.asfreq('d').interpolate().asfreq('q')

How to remove the date information in a column, just keep time

I am using a pandas DataFrame. There is a specific column that has time information.
The raw data looks like this:
5:15am
5:28am
6:15am
so I need to convert the raw data into datetime format:
format = '%I:%M%p'
dataset['TimeStamp'] = pd.to_datetime(dataset['TimeStamp'], format=format)
However, I got:
2014-07-04 05:15:00
2014-07-04 05:28:00
2014-07-04 06:15:00
I don't want the year and date information, just want time. How can I remove it. Thanks.
Since version 0.17.0 you can just do
dataset['TimeStamp'].dt.time
For versions older than 0.17.0:
You can just call apply and access the time() method on the datetime objects to create the column, without the need for post-processing:
In [143]:
dataset['TimeStamp'] = pd.to_datetime(dataset['TimeStamp'], format=format).apply(lambda x: x.time())
dataset
Out[143]:
TimeStamp
0 05:15:00
1 05:28:00
2 06:15:00
The following will convert what you have to datetime.time() objects:
dataset['TimeStamp'] = pd.Series([val.time() for val in dataset['TimeStamp']])
Output
TimeStamp
0 05:15:00
1 05:28:00
2 06:15:00
Just use the datetime.time() function
datetime.time()
Return time object with same hour, minute, second and microsecond. tzinfo is None. See also method timetz().
This will return a datetime.time object and you can access the data with the time.hour, time.minute and time.second attributes.
your_date_df.dt.time
Let's say that your column with the date and time is df['arrived_date']:
0 2015-01-06 00:43:00
1 2015-01-06 07:56:00
2 2015-01-06 11:02:00
3 2015-01-06 11:22:00
4 2015-01-06 15:27:00
Name: arrived_date, dtype: datetime64[ns]
With pandas, you just need to do:
df['arrived_time']=df['arrived_date'].dt.time
The new column df['arrived_time'] will look like this:
0 00:43:00
1 07:56:00
2 11:02:00
3 11:22:00
4 15:27:00
Name: arrived_time, dtype: object
Observe that the new column, df['arrived_time'], is no longer of datetime64 type; the column's dtype is just object.
There's a simpler way to do it using pandas, although most, if not all, of the solutions above are correct:
df.TimeStamp = pd.to_datetime(df.TimeStamp).dt.strftime('%H:%M')
dataset['TimeStamp'] = dataset['TimeStamp'].str.slice(11, 19)  # 'YYYY-MM-DD HH:MM:SS' -> 'HH:MM:SS'
