I have two columns, startDate and endDate.
I need to count how many times each hour of the day (0 to 23) occurs between these dates.
For example, the start date is 2000-12-05 10:00:00 and the end date is 2001-01-15 15:00:00.
I need to calculate, in Python, how many times each hour 0 to 23 occurred between these two dates.
I took the difference between the dates and converted it to hours.
My plan was then to step forward from the start hour of startDate, one hour at a time, until the end hour,
iterating through a dictionary to increase the counts - but is there another approach I could use?
df['diff'] = df['endDate'] - df['startDate']
df['hours']= df['diff'] / np.timedelta64(1, 'h')
from datetime import datetime
X = (datetime.strptime('2020-01-05 01:19:49', '%Y-%m-%d %H:%M:%S') -
     datetime.strptime('2020-01-02 06:12:44', '%Y-%m-%d %H:%M:%S'))
print(X)
You can do:
>>> df['diff'] = df['endDate'] - df['startDate']
>>> df['hours'] = df['diff'].dt.components.hours
This assumes the differences are pd.Timedelta objects. Note that .dt.components.hours returns only the hours component (0-23) of each timedelta, not the total number of hours.
>>> ts = pd.date_range('2018-01-01', periods=5, freq='H')
>>> df = pd.DataFrame({'ts': ts, 'ts_2': ts + pd.Timedelta(hours=1)})
>>> df
ts ts_2
0 2018-01-01 00:00:00 2018-01-01 01:00:00
1 2018-01-01 01:00:00 2018-01-01 02:00:00
2 2018-01-01 02:00:00 2018-01-01 03:00:00
3 2018-01-01 03:00:00 2018-01-01 04:00:00
4 2018-01-01 04:00:00 2018-01-01 05:00:00
>>> df['hour'] = (df['ts_2'] - df['ts']).dt.components.hours
>>> df
ts ts_2 hour
0 2018-01-01 00:00:00 2018-01-01 01:00:00 1
1 2018-01-01 01:00:00 2018-01-01 02:00:00 1
2 2018-01-01 02:00:00 2018-01-01 03:00:00 1
3 2018-01-01 03:00:00 2018-01-01 04:00:00 1
4 2018-01-01 04:00:00 2018-01-01 05:00:00 1
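For the original question of counting how often each hour of the day (0-23) occurs between two timestamps, a sketch that avoids the manual dictionary loop (using the timestamps from the question; generating one entry per hour is an assumption about the intended counting):

```python
import pandas as pd

start = pd.Timestamp('2000-12-05 10:00:00')
end = pd.Timestamp('2001-01-15 15:00:00')

# One entry per whole hour in the interval (both endpoints included),
# then count how many times each hour of day 0-23 appears.
hours = pd.date_range(start, end, freq='h')
counts = hours.hour.value_counts().sort_index()
print(counts)
```

`counts` is indexed 0-23, so `counts.loc[10]` gives the number of times hour 10 occurred in the interval.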
Related
I have a time series with breaks (times without recordings) in between. A simplified example would be:
df = pd.DataFrame(
np.random.rand(13), columns=["values"],
index=pd.date_range(start='1/1/2020 11:00:00',end='1/1/2020 23:00:00',freq='H'))
df.iloc[4:7] = np.nan
df.dropna(inplace=True)
df
values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166
Now I would like to split it into intervals separated by gaps larger than a certain time span (e.g. 2h). In the example above this would be:
( values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023,
values
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166)
I was a bit surprised that I didn't find anything on this, since I thought it is a common problem. My current solution to get the start and end index of each interval is:
def intervals(data: pd.DataFrame, delta_t: timedelta = timedelta(hours=2)):
    data = data.sort_values(by=['event_timestamp'], ignore_index=True)
    breaks = (data['event_timestamp'].diff() > delta_t).astype(bool).values
    ranges = []
    start = 0
    end = start
    for i, e in enumerate(breaks):
        if not e:
            end = i
            if i == len(breaks) - 1:
                ranges.append((start, end))
                start = i
                end = start
        elif i != 0:
            ranges.append((start, end))
            start = i
            end = start
    return ranges
Any suggestions on how I could do this in a smarter way? I suspect this should somehow be possible using groupby.
Yes, you can use the very convenient np.split:
dt = pd.Timedelta('2H')
parts = np.split(df, np.where(np.diff(df.index) > dt)[0] + 1)
Which gives, for your example:
>>> parts
[ values
2020-01-01 11:00:00 0.557374
2020-01-01 12:00:00 0.942296
2020-01-01 13:00:00 0.181189
2020-01-01 14:00:00 0.758822,
values
2020-01-01 18:00:00 0.682125
2020-01-01 19:00:00 0.818187
2020-01-01 20:00:00 0.053515
2020-01-01 21:00:00 0.572342
2020-01-01 22:00:00 0.423129
2020-01-01 23:00:00 0.882215]
@Pierre thanks for your input. I have now arrived at a solution that works well for me:
df['diff'] = df.index.to_series().diff()
max_gap = timedelta(hours=2)
df['gapId'] = 0
df.loc[df['diff'] >= max_gap, ['gapId']] = 1
df['gapId'] = df['gapId'].cumsum()
list(df.groupby('gapId'))
gives:
[(0,
values date diff gapId
0 1.0 2020-01-01 11:00:00 NaT 0
1 1.0 2020-01-01 12:00:00 0 days 01:00:00 0
2 1.0 2020-01-01 13:00:00 0 days 01:00:00 0
3 1.0 2020-01-01 14:00:00 0 days 01:00:00 0),
(1,
values date diff gapId
7 1.0 2020-01-01 18:00:00 0 days 04:00:00 1
8 1.0 2020-01-01 19:00:00 0 days 01:00:00 1
9 1.0 2020-01-01 20:00:00 0 days 01:00:00 1
10 1.0 2020-01-01 21:00:00 0 days 01:00:00 1
11 1.0 2020-01-01 22:00:00 0 days 01:00:00 1
12 1.0 2020-01-01 23:00:00 0 days 01:00:00 1)]
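The same grouping can be written more compactly without the intermediate columns; a runnable sketch with made-up values, assuming the same 2-hour gap threshold:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    '2020-01-01 11:00', '2020-01-01 12:00', '2020-01-01 13:00',
    '2020-01-01 14:00', '2020-01-01 18:00', '2020-01-01 19:00',
    '2020-01-01 20:00', '2020-01-01 21:00', '2020-01-01 22:00',
    '2020-01-01 23:00'])
df = pd.DataFrame({'values': np.arange(10.0)}, index=idx)

# A new group starts wherever the index jumps by more than max_gap;
# the cumulative sum of that boolean turns the breaks into group ids.
max_gap = pd.Timedelta(hours=2)
gap_id = df.index.to_series().diff().gt(max_gap).cumsum()
parts = [g for _, g in df.groupby(gap_id)]
```

This yields the same list of sub-frames as the gapId/cumsum approach, without mutating the original dataframe.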
I have imported data from a source where the date has dtype object and the hour is an integer; it looks something like:
Date Hour Val
2019-01-01 1 0
2019-01-01 2 0
2019-01-01 3 0
2019-01-01 4 0
2019-01-01 5 0
2019-01-01 6 0
2019-01-01 7 0
2019-01-01 8 0
I need a single column containing the combined date-time, like this:
DATETIME
2019-01-01 01:00:00
2019-01-01 02:00:00
2019-01-01 03:00:00
2019-01-01 04:00:00
2019-01-01 05:00:00
2019-01-01 06:00:00
2019-01-01 07:00:00
2019-01-01 08:00:00
I tried to convert the Date column to datetime format using
pd.datetime(df.Date)
and then using
df.Date.dt.hour = df.Hour
I get the error
ValueError: modifications to a property of a datetimelike object are not supported. Change values on the original.
Is there an easy way to do this?
Use pandas.to_timedelta and pandas.to_datetime:
# if needed
df['Date'] = pd.to_datetime(df['Date'])
df['Datetime'] = df['Date'] + pd.to_timedelta(df['Hour'], unit='h')
[out]
Date Hour Val Datetime
0 2019-01-01 1 0 2019-01-01 01:00:00
1 2019-01-01 2 0 2019-01-01 02:00:00
2 2019-01-01 3 0 2019-01-01 03:00:00
3 2019-01-01 4 0 2019-01-01 04:00:00
4 2019-01-01 5 0 2019-01-01 05:00:00
5 2019-01-01 6 0 2019-01-01 06:00:00
6 2019-01-01 7 0 2019-01-01 07:00:00
7 2019-01-01 8 0 2019-01-01 08:00:00
Since you asked for a method combining the columns in a single pd.to_datetime call, you could do (note %H, the 24-hour clock; %I is the 12-hour clock and would fail for hours 0 and 13-23):
df['Datetime'] = pd.to_datetime((df['Date'].astype(str) + ' ' +
                                 df['Hour'].astype(str)),
                                format='%Y-%m-%d %H')
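As a quick check, both approaches agree on a tiny frame (a sketch; column names as in the question, %H used for the 24-hour clock):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2019-01-01'] * 3, 'Hour': [1, 2, 3]})
df['Date'] = pd.to_datetime(df['Date'])

# Approach 1: add the hour as a timedelta.
a = df['Date'] + pd.to_timedelta(df['Hour'], unit='h')
# Approach 2: concatenate strings and parse in one to_datetime call.
b = pd.to_datetime(df['Date'].dt.strftime('%Y-%m-%d') + ' ' +
                   df['Hour'].astype(str), format='%Y-%m-%d %H')
```

The timedelta approach also handles an Hour column running 0-24 (hour 24 simply rolls into the next day), which the string-parsing approach cannot.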
My DF looks like this:
date Open
2018-01-01 00:00:00 1.0536
2018-01-01 00:01:00 1.0527
2018-01-01 00:02:00 1.0558
2018-01-01 00:03:00 1.0534
2018-01-01 00:04:00 1.0524
The above DF is minute-based. What I want to do is create a new DF2 that is day-based, by selecting a single time value from each day.
For example, I want to select the 02:00:00 every day then my new DF2 will look like this:
date Open
2018-01-01 02:00:00 1.0332
2018-01-02 02:00:00 1.0423
2018-01-03 02:00:00 1.0252
2018-01-04 02:00:00 1.0135
2018-01-05 02:00:00 1.0628
....
Now DF2 is day-based rather than minute-based.
What did I do?
I tried selecting the hour with this dt method:
df2 = df.groupby(df.date.dt.date,sort=False).Open.dt.hour.between(2, 2)
However, it does not work.
# Sample data.
np.random.seed(0)
df = pd.DataFrame({
'date': pd.date_range('2019-01-01', '2019-01-10', freq='1Min'),
'Open': 1 + np.random.randn(12961) / 100})
>>> df.loc[df['date'].dt.hour.eq(2) & df['date'].dt.minute.eq(0), :]
date Open
120 2019-01-01 02:00:00 1.003764
1560 2019-01-02 02:00:00 1.015878
3000 2019-01-03 02:00:00 1.015933
4440 2019-01-04 02:00:00 0.990582
5880 2019-01-05 02:00:00 0.982440
7320 2019-01-06 02:00:00 1.012546
8760 2019-01-07 02:00:00 0.979695
10200 2019-01-08 02:00:00 1.013195
11640 2019-01-09 02:00:00 0.993046
I have a pandas time series dataframe with a value for each hour of the day over an extended period, like this:
value
datetime
2018-01-01 00:00:00 38
2018-01-01 01:00:00 31
2018-01-01 02:00:00 78
2018-01-01 03:00:00 82
2018-01-01 04:00:00 83
2018-01-01 05:00:00 95
...
I want to create a new dataframe with the minimum value between hours 01:00-04:00 for each day, but I can't figure out how to do this. The closest I can think of is:
df2 = df.groupby([pd.Grouper(freq='d'), df.between_time('01:00', '04:00')]).min()
but that gives me:
ValueError: Grouper for '' not 1-dimensional
Use DataFrame.between_time with DataFrame.resample:
df = df.between_time('01:00', '04:00').resample('d').min()
print (df)
value
datetime
2018-01-01 31
Your solution is very close, only chain functions differently:
df = df.between_time('01:00', '04:00').groupby(pd.Grouper(freq='d')).min()
print (df)
value
datetime
2018-01-01 31
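A runnable sketch of the resample variant on the sample data from the question:

```python
import pandas as pd

idx = pd.date_range('2018-01-01', periods=6, freq='h')
df = pd.DataFrame({'value': [38, 31, 78, 82, 83, 95]}, index=idx)
df.index.name = 'datetime'

# Restrict to the 01:00-04:00 rows (endpoints inclusive),
# then take the minimum per calendar day.
out = df.between_time('01:00', '04:00').resample('D').min()
```

Both chains work because between_time filters rows first, leaving resample (or the Grouper) a plain DatetimeIndex to group on.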
Sorry, I am new to asking questions on Stack Overflow, so I don't understand how to format properly.
I am given a pandas dataframe with a datetime column (containing both the date and the time) and an associated column containing some value. The timestamps are incremented by the hour. I would like the dataframe to increment every 15 minutes instead, while retaining the same value for each original hour. How would I do that? Thanks!
I have tried :
df = df.asfreq('15Min', method='ffill')
But I get a error:
"TypeError: Cannot compare type 'Timestamp' with type 'long'"
current dataframe:
datetime value
00:00:00 1
01:00:00 2
new dataframe:
datetime value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Update:
The approved answer below works, but so does the initial code I tried above, df = df.asfreq('15Min', method='ffill'). I was experimenting with other dataframes and ran into trouble with some null values; after handling those with a fillna call, everything worked.
You can use a TimedeltaIndex, but it is necessary to manually add the last value for a correct reindex:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
tr = pd.timedelta_range(df.index.min(),
df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
print (df)
value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Another solution uses resample, with the same caveat - a new value needs to be appended so the last values are filled correctly:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
But if the values are full datetimes:
print (df)
datetime value
0 2018-01-01 00:00:00 1
1 2018-01-01 01:00:00 2
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
tr = pd.date_range(df.index.min(),
df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
Or with resample:
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
2018-01-01 00:00:00 1
2018-01-01 00:15:00 1
2018-01-01 00:30:00 1
2018-01-01 00:45:00 1
2018-01-01 01:00:00 2
2018-01-01 01:15:00 2
2018-01-01 01:30:00 2
2018-01-01 01:45:00 2
You can use pandas.date_range:
pd.date_range('00:00:00', '01:00:00', freq='15min')
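Putting the pieces together, a minimal end-to-end sketch of the upsampling (a NaN sentinel row is appended one hour past the end so the final hour also expands, as the answer above notes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1.0, 2.0]},
                  index=pd.to_datetime(['2018-01-01 00:00', '2018-01-01 01:00']))

# Sentinel one hour past the end so 01:15-01:45 get forward-filled,
# then upsample to 15 minutes and drop the sentinel row.
df.loc[df.index.max() + pd.Timedelta(hours=1)] = np.nan
out = df.resample('15min').ffill().iloc[:-1]
```

The result has eight rows, 00:00 through 01:45, with each hour's value repeated across its four 15-minute slots.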