I am trying to resample a datetime index into hourly data. I also want the resampled index to run through the end of the last month.
So given the following df:
data = np.arange(6).reshape(3,2)
rng = ['Jan-2016', 'Feb-2016', 'Mar-2016']
df = pd.DataFrame(data, index=rng)
df.index = pd.to_datetime(df.index)
0 1
2016-01-01 0 1
2016-02-01 2 3
2016-03-01 4 5
I know I can resample this into an hourly index with df = df.resample('H').ffill(). However, the result stops at 2016-03-01. I essentially want the index to run from 1/1/2016 to 3/31/2016 with hourly granularity.
How can I extend this to the end of the month (2016-03-31), given that the last index value is the beginning of the month?
UPDATE:
In [37]: (df.set_index(df.index[:-1].union([df.index[-1] + pd.offsets.MonthEnd(0)]))
....: .resample('H')
....: .ffill()
....: .head()
....: )
Out[37]:
0 1
2016-01-01 00:00:00 0 1
2016-01-01 01:00:00 0 1
2016-01-01 02:00:00 0 1
2016-01-01 03:00:00 0 1
2016-01-01 04:00:00 0 1
In [38]: (df.set_index(df.index[:-1].union([df.index[-1] + pd.offsets.MonthEnd(0)]))
....: .resample('H')
....: .ffill()
....: .tail()
....: )
Out[38]:
0 1
2016-03-30 20:00:00 2 3
2016-03-30 21:00:00 2 3
2016-03-30 22:00:00 2 3
2016-03-30 23:00:00 2 3
2016-03-31 00:00:00 4 5
Explanation:
In [40]: df.index[-1] + pd.offsets.MonthEnd(0)
Out[40]: Timestamp('2016-03-31 00:00:00')
In [41]: df.index[:-1].union([df.index[-1] + pd.offsets.MonthEnd(0)])
Out[41]: DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-31'], dtype='datetime64[ns]', freq=None)
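As a small illustration of the offset used above: MonthEnd(0) rolls a timestamp forward to the end of its month and leaves it alone if it is already there, whereas MonthEnd(1) always moves forward. The variable names below are only for illustration.

```python
import pandas as pd

start_of_month = pd.Timestamp('2016-03-01')
end_of_month = pd.Timestamp('2016-03-31')

# n=0: roll forward to the month end only if not already on it
rolled_0 = start_of_month + pd.offsets.MonthEnd(0)
kept_0 = end_of_month + pd.offsets.MonthEnd(0)

# n=1: from a month end, jump to the next month end
rolled_1 = end_of_month + pd.offsets.MonthEnd(1)

print(rolled_0, kept_0, rolled_1)
```

This is why MonthEnd(0) is the safer choice when the last index value might already be a month end.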
Old incorrect answer:
In [77]: df.resample('M').ffill().resample('H').ffill().tail()
Out[77]:
0 1
2016-03-30 20:00:00 2 3
2016-03-30 21:00:00 2 3
2016-03-30 22:00:00 2 3
2016-03-30 23:00:00 2 3
2016-03-31 00:00:00 4 5
Maybe it's late for this, but I think this way is easier:
import pandas as pd
import numpy as np
data = np.arange(6).reshape(3,2)
rng = ['Jan-2016', 'Feb-2016', 'Mar-2016']
df = pd.DataFrame(data, index=rng)
df.index = pd.to_datetime(df.index)
# Create the desired time range (hourly, through the last hour of March 2016)
t_index = pd.date_range(start='2016-01-01', end='2016-03-31 23:00', freq='h')
# Resample
df_rsmpld = df.reindex(t_index, method='ffill')
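If you would rather not hard-code the end date, here is a minimal sketch that derives it from the data, assuming the goal is to cover every hour up to 23:00 on the last day of the final month. The names last_hour, hourly_index and hourly are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=pd.to_datetime(['2016-01-01', '2016-02-01', '2016-03-01']),
)

# Last hour of the month that contains the final timestamp:
# MonthEnd(0) rolls forward to the month end, then add 23 hours.
last_hour = df.index[-1] + pd.offsets.MonthEnd(0) + pd.Timedelta(hours=23)

# Build the full hourly index and forward-fill the monthly values onto it.
hourly_index = pd.date_range(start=df.index[0], end=last_hour, freq='h')
hourly = df.reindex(hourly_index, method='ffill')

print(hourly.tail())
```

Unlike replacing the last index value with the month end, this keeps the March row at 2016-03-01, so all of March carries the March values.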
Given a dataframe with observations how can rows be returned which are within +-X days of a given list of dates?
I came up with the following function, but is there a simpler more efficient way of achieving the task?
import numpy as np
import pandas as pd
from numpy.random import RandomState

def filterDfByDates(df, dates_of_observations, date_range):
    """
    Extract all rows in the dataframe which fall between any date in the dates_of_observation +- date_range range
    """
    # Build mask: start with all False, then OR in each date window
    mask = np.full(df.shape[0], False)
    for query_date in dates_of_observations:
        min_day = query_date - date_range
        max_day = query_date + date_range
        mask |= ((df.index >= min_day) & (df.index <= max_day))
    return df[mask]

rand = RandomState(17)
dates: np.ndarray = rand.choice(a=np.arange(np.datetime64('2021-01-01'),
                                np.datetime64('2021-01-15'), np.timedelta64(1, 'h')), size=30, replace=True)
dates.sort()
randData = rand.choice([True, False], len(dates), p=[0.1, 0.9])
df = pd.DataFrame({"event": randData},
                  index=dates)
dates_of_obs = df.query("event").index
filterDfByDates(df, dates_of_obs, np.timedelta64(1, 'D'))
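One possible vectorized alternative to the loop is a broadcast comparison between the index and the observation dates. This is only a sketch: the function name filter_by_dates_vectorized and the small example frame are made up for illustration, and it builds an n_rows x n_dates array, so it trades memory for speed.

```python
import numpy as np
import pandas as pd

def filter_by_dates_vectorized(df, dates_of_observations, date_range):
    """Return rows whose index lies within +-date_range of any observation date."""
    tol = pd.Timedelta(date_range).to_timedelta64()
    idx = df.index.values[:, None]                                 # shape (n_rows, 1)
    obs = pd.DatetimeIndex(dates_of_observations).values[None, :]  # shape (1, n_dates)
    # True where a row is close enough to at least one observation date
    mask = (np.abs(idx - obs) <= tol).any(axis=1)
    return df[mask]

example = pd.DataFrame(
    {'event': [False, True, False, False]},
    index=pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-10']),
)
obs_dates = pd.to_datetime(['2021-01-03 12:00'])
result = filter_by_dates_vectorized(example, obs_dates, np.timedelta64(1, 'D'))
print(result)
```

For very long indexes or many dates, the original loop (or an interval-based approach) may be the better fit because of the intermediate array size.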
From your DataFrame :
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
date,event
2012-01-01 12:30:00,event1
2012-01-01 12:30:12,event2
2012-01-01 12:30:12,event3
2012-01-02 12:28:29,event4
2012-02-01 12:30:29,event4
2012-02-01 12:30:38,event5
2012-03-01 12:31:05,event6
2012-03-01 12:31:38,event7
2012-06-01 12:31:44,event8
2012-07-01 10:31:48,event9
2012-07-01 11:32:23,event10"""))
>>> df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%d %H:%M:%S")
>>> df
date event
0 2012-01-01 12:30:00 event1
1 2012-01-01 12:30:12 event2
2 2012-01-01 12:30:12 event3
3 2012-01-02 12:28:29 event4
4 2012-02-01 12:30:29 event4
5 2012-02-01 12:30:38 event5
6 2012-03-01 12:31:05 event6
7 2012-03-01 12:31:38 event7
8 2012-06-01 12:31:44 event8
9 2012-07-01 10:31:48 event9
10 2012-07-01 11:32:23 event10
First, we shift the date column and subtract it from the original date column:
>>> g = df['date'].sub(df['date'].shift(1)).dt.days
>>> g
0 NaN
1 0.0
2 0.0
3 0.0
4 30.0
5 0.0
6 29.0
7 0.0
8 92.0
9 29.0
10 0.0
Name: date, dtype: float64
Then, we apply a cumsum over the flags for all gaps greater than X (here, 1 day) to get the expected result:
>>> X = 1
>>> df.groupby(g.gt(X).cumsum()).apply(print)
date event
0 2012-01-01 12:30:00 event1
1 2012-01-01 12:30:12 event2
2 2012-01-01 12:30:12 event3
3 2012-01-02 12:28:29 event4
date event
4 2012-02-01 12:30:29 event4
5 2012-02-01 12:30:38 event5
date event
6 2012-03-01 12:31:05 event6
7 2012-03-01 12:31:38 event7
date event
8 2012-06-01 12:31:44 event8
date event
9 2012-07-01 10:31:48 event9
10 2012-07-01 11:32:23 event10
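Note that .dt.days truncates to whole days, so g.gt(1) only flags gaps of two days or more. A variant of the same cumsum idea, sketched here on the same dates, compares the full timedelta instead and stores the group id as a column. The column name group is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime([
        '2012-01-01 12:30:00', '2012-01-01 12:30:12', '2012-01-01 12:30:12',
        '2012-01-02 12:28:29', '2012-02-01 12:30:29', '2012-02-01 12:30:38',
        '2012-03-01 12:31:05', '2012-03-01 12:31:38', '2012-06-01 12:31:44',
        '2012-07-01 10:31:48', '2012-07-01 11:32:23',
    ]),
})

# A new group starts wherever the gap to the previous row exceeds the threshold.
threshold = pd.Timedelta(days=1)
new_group = df['date'].diff() > threshold   # the first row's NaT compares as False
df['group'] = new_group.cumsum()

print(df)
```

The resulting column can then be passed to df.groupby('group') in place of the inline expression.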
I am trying to assign a Column to an existing df. Specifically, certain timestamps get sorted but the current export is a separate series. I'd like to append this to the df.
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
'code' : ['A','B','C','A','B','C','A'],
})
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True)
x = df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True).dt.components
x = x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
Output:
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
I've changed the last line to assign the result back to the df:
df['time'] = x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
But this produces
time code
0 08:00:00 A
1 12:00:00 B
2 13:00:00 C
3 16:00:00 A
4 20:00:00 B
5 26:00:00 C
6 27:00:00 A
As you can see, the timestamps aren't aligned with their respective values after sorting.
The intended output is:
time code
0 08:00:00 A
1 12:00:00 B
2 13:00:00 C
3 16:00:00 C
4 20:00:00 A
5 26:00:00 B
6 27:00:00 A
Removing reset_index(drop=True) from your code and sorting afterwards may work for you.
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
'code' : ['A','B','C','A','B','C','A'],
})
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
x = df.time.apply(lambda x: x if x > cutoff else x + day).dt.components
df['time'] = x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
df = df.sort_values('time')
print(df)
Pandas aligns on the index when assigning. reset_index(drop=True) destroyed the original index, so the sorted time column was assigned back sequentially. This is probably why you didn't get what you wanted.
Original time column:
0 08:00:00
1 12:00:00
2 16:00:00
3 20:00:00
4 02:00:00
5 13:00:00
6 03:00:00
After sort_values():
4 02:00:00
6 03:00:00
0 08:00:00
1 12:00:00
5 13:00:00
2 16:00:00
3 20:00:00
After reset_index(drop=True):
0 02:00:00
1 03:00:00
2 08:00:00
3 12:00:00
4 13:00:00
5 16:00:00
6 20:00:00
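The alignment behaviour can be seen in isolation with a tiny example. This is only an illustrative sketch; the column names aligned and positional are made up.

```python
import pandas as pd

df = pd.DataFrame({'code': ['A', 'B', 'C']})
s = pd.Series([30, 10, 20])

sorted_s = s.sort_values()          # values 10, 20, 30 with index 1, 2, 0

# Assignment matches on index labels, so each value returns to its original row.
df['aligned'] = sorted_s

# After reset_index(drop=True) the labels are 0, 1, 2 again, so the sorted
# values are written top to bottom regardless of where they came from.
df['positional'] = sorted_s.reset_index(drop=True)

print(df)
```

So if the rows should move together with the sorted values, sort the whole DataFrame rather than a single column.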
I hope this is what you want:
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
'code' : ['A','B','C','A','B','C','A'],
})
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True)
print(df)
x = df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True).dt.components
df['time'] = x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
print(df)
I am trying to order timestamps in a pandas df. The times begin around 08:00:00 am and finish around 3:00:00 am. I'd like to add 24 hours to the times after midnight, so that the times read 08:00:00 to 27:00:00. The problem is that the times aren't ordered.
Example:
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
})
df = pd.DataFrame(data=d)
If I try to order the times via
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
df = df.sort_values(by='time',ascending=True)
Out:
time
4 02:00:00
6 03:00:00
0 08:00:00
1 12:00:00
5 13:00:00
2 16:00:00
3 20:00:00
Whereas I'm hoping the output is:
time
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
I'm not sure if this can be done though. Specifically, if I can differentiate between 8:00:00 am and the times after midnight (1am-3am).
Add a one-day offset to the times that fall after midnight but before the point where a new "day" is supposed to begin (pick some cutoff after 3 am and before 7 am), and then sort the values:
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True)
# Out:
0 0 days 08:00:00
1 0 days 12:00:00
2 0 days 13:00:00
3 0 days 16:00:00
4 0 days 20:00:00
5 1 days 02:00:00
6 1 days 03:00:00
The last two values are numerically equal to 26 hours & 27 hours, just displayed differently.
If you need them in HH:MM:SS format, use string formatting with the appropriate timedelta components.
Example:
x = df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True).dt.components
x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
#Out:
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
dtype: object
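The same offset-then-format steps can also be sketched without the row-wise lambda, keeping any other columns attached to their times by sorting the whole frame. This assumes the times are already clean HH:MM:SS strings without am/pm suffixes; the code column and the helper fmt are only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_timedelta(['08:00:00', '12:00:00', '16:00:00', '20:00:00',
                             '02:00:00', '13:00:00', '03:00:00']),
    'code': ['A', 'B', 'C', 'A', 'B', 'C', 'A'],
})

cutoff = pd.Timedelta(hours=3.5)
day = pd.Timedelta(days=1)

# Keep times after the cutoff, push earlier ones into the next day.
shifted = df['time'].where(df['time'] > cutoff, df['time'] + day)

# Sort whole rows so each code stays with its time.
df = df.assign(time=shifted).sort_values('time').reset_index(drop=True)

def fmt(td):
    """Format a Timedelta as HH:MM:SS with hours allowed to exceed 23."""
    total = int(td.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f'{hours:02d}:{minutes:02d}:{seconds:02d}'

df['time_str'] = df['time'].map(fmt)
print(df)
```

Keeping the timedelta column alongside the string version means later sorting or arithmetic does not have to parse the strings again.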
Sorry I am new to asking questions on stackoverflow so I don't understand how to format properly.
So I'm given a Pandas dataframe that contains column of datetime which contains the date and the time and an associated column that contains some sort of value. The given dates and times are incremented by the hour. I would like to manipulate the dataframe to have them increment every 15 minutes, but retain the same value. How would I do that? Thanks!
I have tried :
df = df.asfreq('15Min',method='ffill')
But I get an error:
"TypeError: Cannot compare type 'Timestamp' with type 'long'"
current dataframe:
datetime value
00:00:00 1
01:00:00 2
new dataframe:
datetime value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Update:
The accepted answer below works, but so does the initial code I tried above, df = df.asfreq('15Min',method='ffill'). I was messing around with other DataFrames and seemed to be having trouble with some null values, so I took care of that with a fillna statement and everything worked.
You can use a TimedeltaIndex, but you need to extend the range manually past the last value so that reindex also fills the final hour:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
tr = pd.timedelta_range(df.index.min(),
df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
print (df)
value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Another solution uses resample, with the same problem: you need to append a new value one hour past the end so that the last hour is filled, and then drop that helper row again:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
But if values are datetimes:
print (df)
datetime value
0 2018-01-01 00:00:00 1
1 2018-01-01 01:00:00 2
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
tr = pd.date_range(df.index.min(),
df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
Or, with resample instead of reindex (starting again from the original df):
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
2018-01-01 00:00:00 1
2018-01-01 00:15:00 1
2018-01-01 00:30:00 1
2018-01-01 00:45:00 1
2018-01-01 01:00:00 2
2018-01-01 01:15:00 2
2018-01-01 01:30:00 2
2018-01-01 01:45:00 2
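The "extend the end" step can be written in terms of the two step sizes instead of a hard-coded 45 minutes. A minimal sketch, assuming the source data is evenly spaced at source_step (both step variables are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {'value': [1, 2]},
    index=pd.to_datetime(['2018-01-01 00:00:00', '2018-01-01 01:00:00']),
)
df.index.name = 'datetime'

source_step = pd.Timedelta(hours=1)     # spacing of the original rows
target_step = pd.Timedelta(minutes=15)  # desired spacing

# Extend the range so the last original row also gets its full hour of slots.
new_index = pd.date_range(
    df.index.min(),
    df.index.max() + source_step - target_step,
    freq=target_step,
)
out = df.reindex(new_index, method='ffill')
print(out)
```

Changing target_step to, say, 10 minutes then needs no other edits.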
You can use pandas.date_range:
pd.date_range('00:00:00', '01:00:00', freq='15min')
I start with the following pandas dataframe, I wish to group each day, and make a new column called 'label', which labels the group with a sequential number. How do I do this?
df = pd.DataFrame({'val': [10,40,30,10,11,13]}, index=pd.date_range('2016-01-01 00:00:00', periods=6, freq='12H' ) )
# df['label'] = df.groupby(pd.TimeGrouper('D')) # what do i do here???
print(df)
output:
val
2016-01-01 00:00:00 10
2016-01-01 12:00:00 40
2016-01-02 00:00:00 30
2016-01-02 12:00:00 10
2016-01-03 00:00:00 11
2016-01-03 12:00:00 13
desired output:
val label
2016-01-01 00:00:00 10 1
2016-01-01 12:00:00 40 1
2016-01-02 00:00:00 30 2
2016-01-02 12:00:00 10 2
2016-01-03 00:00:00 11 3
2016-01-03 12:00:00 13 3
Try this:
df = pd.DataFrame({'val': [10,40,30,10,11,13]}, index=pd.date_range('2016-01-01 00:00:00', periods=6, freq='12H' ) )
If you just want to group by date:
df['label'] = df.groupby(df.index.date).grouper.group_info[0] + 1
print(df)
To group by time more generally, you can use pd.Grouper with a frequency (older pandas versions called this pd.TimeGrouper, which has since been removed):
df['label'] = df.groupby(pd.Grouper(freq='D')).grouper.group_info[0] + 1
print(df)
Both of the above should give you the following:
val label
2016-01-01 00:00:00 10 1
2016-01-01 12:00:00 40 1
2016-01-02 00:00:00 30 2
2016-01-02 12:00:00 10 2
2016-01-03 00:00:00 11 3
2016-01-03 12:00:00 13 3
I think this is undocumented (or hard to find, at least). Check out:
Get group id back into pandas dataframe
for more discussion.
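Maybe a simpler and more intuitive approach is this (it labels each row with its day of the month, so the labels are sequential only when the data starts on the 1st and stays within a single month):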
df['label'] = df.groupby(df.index.day).keys
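In more recent pandas versions there is also a documented way to get the group number, GroupBy.ngroup(). A minimal sketch, grouping on the index normalized to midnight so that it keeps working across month boundaries:

```python
import pandas as pd

df = pd.DataFrame(
    {'val': [10, 40, 30, 10, 11, 13]},
    index=pd.date_range('2016-01-01 00:00:00', periods=6, freq='12h'),
)

# normalize() drops the time of day, so each calendar day is one group;
# ngroup() numbers the groups from 0 in sorted key order.
df['label'] = df.groupby(df.index.normalize()).ngroup() + 1
print(df)
```

Because the key is the full date rather than the day of the month, 2016-01-31 and 2016-02-01 would get consecutive labels.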