Is there a simple way to obtain the hour of the year from a datetime?
dt = datetime(2019, 1, 3, 00, 00, 00) # 03/01/2019 00:00
dt_hour = dt.hour_of_year() # should be something like that
Expected output: dt_hour = 48
It would be nice as well to obtain minutes_of_year and seconds_of_year
One way of implementing this yourself is this:
import datetime

def hour_of_year(dt):
    beginning_of_year = datetime.datetime(dt.year, 1, 1, tzinfo=dt.tzinfo)
    return int((dt - beginning_of_year).total_seconds() // 3600)
This first creates a new datetime object representing the beginning of the year. We then compute the time since the beginning of the year in seconds, divide by 3600 and take the integer part to get the full hours that have passed since the beginning of the year.
Note that using the days attribute of the timedelta object will only return the number of full days since the beginning of the year.
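To make the difference concrete, here is a minimal sketch comparing the two approaches for a datetime that is not at midnight. The sample value (05:30 on 3 January) is chosen only for illustration and assumes a naive datetime.

```python
from datetime import datetime

# Illustrative value: 2019-01-03 05:30, i.e. not at midnight
dt = datetime(2019, 1, 3, 5, 30)
delta = dt - datetime(dt.year, 1, 1)

# Only whole days are counted, so the 5.5 hours of the current day are lost
full_day_hours = delta.days * 24

# All elapsed seconds are counted, then truncated to whole hours
elapsed_hours = int(delta.total_seconds() // 3600)

print(full_day_hours, elapsed_hours)  # 48 53
```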
You can use timedelta:
import datetime
dt = datetime.datetime(2019, 1, 3, 00, 00, 00)
dt2 = datetime.datetime(2019, 1, 1, 00, 00, 00)
print((dt-dt2).days*24)
output:
48
All three functions, reusing their code.
import datetime
def minutes_of_year(dt):
    return seconds_of_year(dt) // 60

def hours_of_year(dt):
    return minutes_of_year(dt) // 60

def seconds_of_year(dt):
    dt0 = datetime.datetime(dt.year, 1, 1, tzinfo=dt.tzinfo)
    delta = dt - dt0
    return int(delta.total_seconds())
Edited to take possible time zone info into account.
Or: subclass datetime, for easier reuse in later projects:
import datetime
class MyDateTime(datetime.datetime):
    def __new__(cls, *args, **kwargs):
        return datetime.datetime.__new__(cls, *args, **kwargs)

    def minutes_of_year(self):
        return self.seconds_of_year() // 60

    def hours_of_year(self):
        return self.minutes_of_year() // 60

    def seconds_of_year(self):
        dt0 = datetime.datetime(self.year, 1, 1, tzinfo=self.tzinfo)
        delta = self - dt0
        return int(delta.total_seconds())
# create and use like a normal datetime object
dt = MyDateTime.now()
# properties and functions of datetime still available, of course.
print(dt.day)
# ... and new methods:
print(dt.hours_of_year())
You can write a custom function:
from datetime import datetime

def get_time_of_year(dt, kind='hours_of_year'):
    # dt is assumed to be a naive datetime
    initial_date = datetime(dt.year, 1, 1)
    duration = dt - initial_date
    # total seconds elapsed since the start of the year
    seconds = int(duration.total_seconds())
    if kind == 'days_of_year':
        return duration.days
    if kind == 'hours_of_year':
        return seconds // 3600
    if kind == 'minutes_of_year':
        return seconds // 60
    if kind == 'seconds_of_year':
        return seconds
    raise ValueError('unknown kind: %r' % kind)
Test the function:
get_time_of_year(dt, 'hours_of_year')
#>>48
I have the dataframe DF that has the column 'Timestamp' with type datetime64[ns].
The column timestamp looks like this:
DF['Timestamp']:
0 2022-01-01 00:00:00
1 2022-01-01 01:00:00
2 2022-01-01 02:00:00
3 2022-01-01 03:00:00
4 2022-01-01 04:00:00
...
8755 2022-12-31 19:00:00
8756 2022-12-31 20:00:00
8757 2022-12-31 21:00:00
8758 2022-12-31 22:00:00
8759 2022-12-31 23:00:00
Name: Timestamp, Length: 8760, dtype: datetime64[ns]
I extract 'Hour of Year' in this way:
DF['Year'] = DF['Timestamp'].astype('M8[Y]')
DF['DayOfYear'] = (DF['Timestamp'] - DF['Year']).astype('timedelta64[D]')
DF['Hour'] = DF['Timestamp'].dt.hour + 1
DF['HourOfYear'] = DF['DayOfYear'] * 24 + DF['Hour']
First it extracts the year from the Timestamp.
Next it creates a time delta from beginning of the year to that Timestamp based on days (in other words, day of year).
Then it extracts the hour from the timestamp.
Finally it calculates the hour of the year with that formula.
And it looks like this in the end:
DF:
Timestamp ... HourOfYear
0 2022-01-01 00:00:00 ... 1.0
1 2022-01-01 01:00:00 ... 2.0
2 2022-01-01 02:00:00 ... 3.0
3 2022-01-01 03:00:00 ... 4.0
4 2022-01-01 04:00:00 ... 5.0
...
8755 2022-12-31 19:00:00 ... 8756.0
8756 2022-12-31 20:00:00 ... 8757.0
8757 2022-12-31 21:00:00 ... 8758.0
8758 2022-12-31 22:00:00 ... 8759.0
8759 2022-12-31 23:00:00 ... 8760.0
[8760 rows x 6 columns]
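As a possible alternative, the same 1-based hour of the year can be sketched with the `dt.dayofyear` and `dt.hour` accessors, which avoids casting to year or day units. This is a different technique from the one above, shown under the assumption that `DF['Timestamp']` is a datetime64 column; the frame below is rebuilt only to keep the example self-contained.

```python
import pandas as pd

# Rebuild an hourly timestamp column like the one in the question
DF = pd.DataFrame({"Timestamp": pd.date_range("2022-01-01", "2022-12-31 23:00", freq="h")})

ts = DF["Timestamp"].dt
# dayofyear is 1-based, so subtract 1 to get full days elapsed;
# the trailing + 1 keeps the 1-based hour numbering used above
DF["HourOfYear"] = (ts.dayofyear - 1) * 24 + ts.hour + 1

print(DF["HourOfYear"].iloc[[0, -1]].tolist())  # [1, 8760]
```

This yields integers rather than floats, because no timedelta conversion is involved.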
Related
In this question (Get year, month or day from numpy datetime64) an example on how to get year, month and day from a numpy datetime64 can be found.
One of the answers uses:
dates = np.arange(np.datetime64('2000-01-01'), np.datetime64('2010-01-01'))
years = dates.astype('datetime64[Y]').astype(int) + 1970
months = dates.astype('datetime64[M]').astype(int) % 12 + 1
days = dates - dates.astype('datetime64[M]') + 1
Also notice that:
To get integers instead of timedelta64[D] in the example for days above, use: (dates - dates.astype('datetime64[M]')).astype(int) + 1
How could the hours, minutes and seconds be extracted?
As stated in the comment to return integers, I would like to get integers too.
Edit:
Jérôme's answer is useful, but I am still struggling to understand how to reach the safe point of having datetime64[s] as input data.
In my actual situation this is what I have once I read the CSV in Pandas:
print(df['date'])
print(type(df['date']))
print(df['date'].dtype)
0 2018-12-31 23:59:00
1 2018-12-31 23:58:00
2 2018-12-31 23:57:00
3 2018-12-31 23:56:00
4 2018-12-31 23:55:00
...
525594 2018-01-01 00:05:00
525595 2018-01-01 00:04:00
525596 2018-01-01 00:03:00
525597 2018-01-01 00:02:00
525598 2018-01-01 00:01:00
Name: date, Length: 525599, dtype: object
<class 'pandas.core.series.Series'>
object
So how could I convert df['date'] into a dates variable which is datetime64[s] and then apply the solution provided?
In your example, the type of the array is np.datetime64[D] so the hours/minutes/seconds are not stored in the items. However, the np.datetime64[s] does this.
Here is how to extract the information from a np.datetime64[s]-typed array:
import numpy as np

# dates = array(['2009-08-29T23:44:31',
#                '2017-12-17T05:47:37'],
#               dtype='datetime64[s]')
dates = np.array([
    np.datetime64(1251589471, 's'),
    np.datetime64(1513489657, 's')
])
Y, M, D, h, m, s = [dates.astype('datetime64[%s]' % kind) for kind in 'YMDhms']
years = Y.astype(int) + 1970
months = M.astype(int) % 12 + 1
days = (D - M).astype(int) + 1
hours = (h - D).astype(int)
minutes = (m - h).astype(int)
seconds = (s - m).astype(int)
# [array([2009, 2017]),
# array([ 8, 12], dtype=int32),
# array([29, 17]),
# array([23, 5]),
# array([44, 47]),
# array([31, 37])])
print([years, months, days, hours, minutes, seconds])
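For the follow-up in the question's edit (a pandas column of dtype object read from a CSV), one possible route is to parse the strings with `pd.to_datetime` and then cast the underlying array. This is only a sketch: the two sample strings stand in for the real `df['date']` column, and it assumes every value is parseable in that format.

```python
import numpy as np
import pandas as pd

# Stand-in for the object-dtype column read from the CSV
df = pd.DataFrame({"date": ["2018-12-31 23:59:00", "2018-01-01 00:01:00"]})

# Parse the strings, then cast the NumPy array to second resolution
parsed = pd.to_datetime(df["date"])
dates = parsed.to_numpy().astype("datetime64[s]")

# With a parsed Series, the .dt accessor also returns integers directly
hours = parsed.dt.hour.to_numpy()
minutes = parsed.dt.minute.to_numpy()

print(dates.dtype, hours, minutes)
```

The `dates` array can then be fed to the extraction code above, while the `.dt` accessor is a shortcut when staying in pandas is acceptable.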
I have some data as CSV in the format id, time, var. I then went on and created a Multiindex DataFrame roughly of the form below
import numpy as np
import pandas as pd
def time(t):
    return pd.Timestamp("2019-01-01T12") + pd.to_timedelta(t, "d")
arrays = [
np.array([1, 1, 2, 2, 3, 3]),
np.array([time(0), time(1), time(396), time(365), time(31), time(365)]),
]
df = pd.DataFrame(np.random.randn(6, 1), index=arrays, columns=["var"])
df.index.names = ["id", "time"]
df
var
id time
1 2019-01-01 12:00:00 -0.505903
2019-01-02 12:00:00 0.626197
2 2020-02-01 12:00:00 0.461155
2020-01-01 12:00:00 0.569891
3 2019-02-01 12:00:00 -1.079466
2020-01-01 12:00:00 0.721466
Given this, I would like to find all id's for which the earliest entry is in January, to then plot the trajectory represented by the id for only trajectories which start in January.
As a note, I think the time is actually sorted, while the id is not. Not sure if that changes anything tho.
i.e.
df.pseudo_filter(start_month="January")
var
id time
1 2019-01-01 12:00:00 -0.505903
2019-01-02 12:00:00 0.626197
2 2020-02-01 12:00:00 0.461155
2020-01-01 12:00:00 0.569891
You can groupby.filter by the month of min time
df.groupby(level=0).filter(lambda x: x.index.get_level_values(1).min().month == 1)
Or
df.groupby(level=0).filter(lambda x: x.index.get_level_values(1).min().month_name() == 'January')
Out:
var
id time
1 2019-01-01 12:00:00 0.410113
2019-01-02 12:00:00 -0.572882
2 2020-02-01 12:00:00 -0.801334
2020-01-01 12:00:00 1.312035
To add the filter as a new function to your dataframe, register it as an accessor:
@pd.api.extensions.register_dataframe_accessor("pseudo")
class Pseudo:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def filter(self, start_month):
        return (self._obj.groupby(level=0)
                .filter(lambda x: x.index.get_level_values(1).min()
                        .month_name() == start_month))
Then you can use
df.pseudo.filter(start_month='January')
Out:
var
id time
1 2019-01-01 12:00:00 -1.314898
2019-01-02 12:00:00 0.810314
2 2020-02-01 12:00:00 -1.214327
2020-01-01 12:00:00 -0.678823
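If the frame has many ids, calling a Python lambda per group can become slow. A possible vectorized variant, sketched here as an alternative rather than as part of the answer above, broadcasts each id's earliest time with `groupby(...).transform("min")` and builds a boolean mask from it. The frame is rebuilt with fixed `var` values so the example is reproducible.

```python
import pandas as pd

start = pd.Timestamp("2019-01-01T12")
index = pd.MultiIndex.from_arrays(
    [[1, 1, 2, 2, 3, 3],
     [start + pd.Timedelta(days=d) for d in (0, 1, 396, 365, 31, 365)]],
    names=["id", "time"],
)
df = pd.DataFrame({"var": range(6)}, index=index)

# Earliest time per id, repeated for every row of that id (row order is kept)
first_time = df.reset_index("time").groupby(level="id")["time"].transform("min")

# Keep the rows whose id starts in January
starts_in_january = (first_time.dt.month == 1).to_numpy()
result = df[starts_in_january]
print(result)
```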
I have a pandas DataFrame with two timestamp columns, start and end:
start end
2014-08-28 17:00:00 | 2014-08-29 22:00:00
2014-08-29 10:45:00 | 2014-09-01 17:00:00
2014-09-01 15:00:00 | 2014-09-01 19:00:00
The intention is to aggregate the number of hours that were logged on a given date. So, in the case of my example, I would create a date range and aggregate the hours over multiple entries:
2014-08-28 -> 7 hrs
2014-08-29 -> 10 hrs + 1 hr 15 min => 11 hrs 15 mins
2014-08-30 -> 24 hrs
2014-08-31 -> 24 hrs
2014-09-01 -> 17 hrs + 4 hrs => 21 hrs
I've tried using timedelta, but it only gives the total number of hours, not a per-day split.
I've also tried to explode the rows (i.e. split each row on a per-day basis), but I could only get it to work at the date level, not at the timestamp level.
Any suggestions are greatly appreciated.
You can use pd.date_range to create a minute-by-minute range for each row, then count the minutes that fall on each day and convert the counts to timedeltas:
start end
0 2014-08-28 17:00:00 2014-08-29 22:00:00
1 2014-08-29 10:45:00 2014-09-01 17:00:00
2 2014-09-01 15:00:00 2014-09-01 19:00:00
# Create the minute-by-minute timestamps from start to end of each row and combine them into one Series of dates
a = pd.Series(sum(df.apply(lambda x: pd.date_range(x['start'],x['end'],freq='min').tolist(),1).tolist(),[])).dt.date
# Count the minutes per date and convert the counts to timedeltas
a.value_counts().apply(lambda x: pd.to_timedelta(x,'m'))
Out:
2014-08-29 1 days 11:16:00
2014-08-30 1 days 00:00:00
2014-08-31 1 days 00:00:00
2014-09-01 0 days 21:02:00
2014-08-28 0 days 07:00:00
dtype: timedelta64[ns]
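Note that the totals above run one minute high for each interval ending on that day (11:16 instead of 11:15, 21:02 instead of 21:00), because pd.date_range includes both end points. One way to avoid this, sketched below as a variation on the same idea, is to drop the last timestamp of each range so that every interval is treated as half-open. The frame is rebuilt here only to keep the example self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(["2014-08-28 17:00:00", "2014-08-29 10:45:00", "2014-09-01 15:00:00"]),
    "end": pd.to_datetime(["2014-08-29 22:00:00", "2014-09-01 17:00:00", "2014-09-01 19:00:00"]),
})

# One timestamp per logged minute; [:-1] drops the end point so that
# each interval covers [start, end) and no minute is counted twice
minutes = pd.concat(
    [pd.Series(pd.date_range(s, e, freq="min")[:-1]) for s, e in zip(df["start"], df["end"])],
    ignore_index=True,
)

# Count the minutes that fall on each calendar day, then convert to timedeltas
per_day = pd.to_timedelta(minutes.dt.normalize().value_counts().sort_index(), unit="min")
print(per_day)
```

With this change, 2014-08-29 comes out as 1 day 11:15:00 and 2014-09-01 as 21:00:00, matching the expected values in the question.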
Hope this is useful; you should be able to adjust it to serve your purpose. The idea is the following: store each day and its accumulated time in a dict. If start and end fall on the same day, just add the difference. Otherwise, add the time until the first midnight, add 24 hours for every full day in between, and add the time from the last midnight until the end. FYI, I think the result for 2014-09-01 should be 21 hrs.
from datetime import datetime, timedelta
from collections import defaultdict
s = [('2014-08-28 17:00:00', '2014-08-29 22:00:00'),
('2014-08-29 10:45:00', '2014-09-01 17:00:00'),
('2014-09-01 15:00:00', '2014-09-01 19:00:00') ]
def aggregate(intervals):
    store = defaultdict(timedelta)
    for start_str, end_str in intervals:
        start = datetime.strptime(start_str, "%Y-%m-%d %H:%M:%S")
        end = datetime.strptime(end_str, "%Y-%m-%d %H:%M:%S")
        start_date = start.date()
        end_date = end.date()
        if start_date == end_date:
            store[start_date] += end - start
        else:
            # time from start until the first midnight; adding a timedelta
            # (instead of start.day + 1) also works on the last day of a month
            midnight = datetime.combine(start_date + timedelta(days=1), datetime.min.time())
            store[start_date] += midnight - start
            # every full day in between counts as 24 hours
            for i in range(1, (end_date - start_date).days):
                store[start_date + timedelta(days=i)] += timedelta(hours=24)
            # time from the last midnight until end
            last_midnight = datetime.combine(end_date, datetime.min.time())
            store[end_date] += end - last_midnight
    return store

r = aggregate(s)
for day in r:
    print(day, r[day])
2014-08-28 7:00:00
2014-08-29 1 day, 11:15:00
2014-08-30 1 day, 0:00:00
2014-08-31 1 day, 0:00:00
2014-09-01 21:00:00
I have a range of timestamps with start time and end time. I would like to generate the number of minutes per hour between the two timestamps:
import pandas as pd
start_time = pd.to_datetime('2013-03-26 21:49:08')
end_time = pd.to_datetime('2013-03-27 05:21:00')
pd.date_range(start_time, end_time, freq='h')
which gives:
DatetimeIndex(['2013-03-26 21:49:08', '2013-03-26 22:49:08',
'2013-03-26 23:49:08', '2013-03-27 00:49:08',
'2013-03-27 01:49:08', '2013-03-27 02:49:08',
'2013-03-27 03:49:08', '2013-03-27 04:49:08'],
dtype='datetime64[ns]', freq='H')
Sample result: I would like to compute the number of minutes bounded by the hour between the start and end times, like below:
2013-03-26 21:00:00 - 10m 52secs
2013-03-26 22:00:00 - 60 m
2013-03-26 23:00:00 - 60 m
2013-03-27 05:00:00 - 21 m
I have looked at pandas resample, but not exactly sure how to achieve this. Any direction is appreciated.
Construct two Series corresponding to the start and end time of each hour. Use clip(lower=...) and clip(upper=...) to restrict them to be within your desired timespan, then subtract. (Older pandas versions offered clip_lower and clip_upper for this; they have since been removed, as has adding a plain integer to a DatetimeIndex.)
# hourly range, floored to the nearest hour
rng = pd.date_range(start_time.floor('h'), end_time.floor('h'), freq='h')
# get the left and right endpoints for each hour
# clipped to be inclusive of [start_time, end_time]
left = pd.Series(rng, index=rng).clip(lower=start_time)
right = pd.Series(rng + pd.Timedelta(hours=1), index=rng).clip(upper=end_time)
# construct a series of the lengths
s = right - left
The resulting output:
2013-03-26 21:00:00 00:10:52
2013-03-26 22:00:00 01:00:00
2013-03-26 23:00:00 01:00:00
2013-03-27 00:00:00 01:00:00
2013-03-27 01:00:00 01:00:00
2013-03-27 02:00:00 01:00:00
2013-03-27 03:00:00 01:00:00
2013-03-27 04:00:00 01:00:00
2013-03-27 05:00:00 00:21:00
Freq: H, dtype: timedelta64[ns]
Utilizing datetime.timedelta() in some sort of for loop seems like it's what you're looking for.
https://docs.python.org/2/library/datetime.html#datetime.timedelta
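A minimal sketch of what such a loop could look like, using only the standard library. The function name `minutes_per_hour` is made up for this illustration; it walks over the clock hours touched by the interval and clips each hour to the start and end times.

```python
from datetime import datetime, timedelta

def minutes_per_hour(start, end):
    """Yield (hour_start, overlap) for each clock hour touched by [start, end]."""
    step = timedelta(hours=1)
    # first clock hour: the start time floored to the full hour
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        # clip the hour [hour, hour + 1h] to the interval [start, end]
        overlap = min(hour + step, end) - max(hour, start)
        yield hour, overlap
        hour += step

start_time = datetime(2013, 3, 26, 21, 49, 8)
end_time = datetime(2013, 3, 27, 5, 21, 0)
result = list(minutes_per_hour(start_time, end_time))
for hour, overlap in result:
    print(hour, overlap)
```

Each `overlap` is a timedelta, so `overlap.total_seconds() / 60` gives the number of minutes if a plain number is preferred.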
It seems like this might be a viable solution:
import pandas as pd
import datetime as dt
def bounded_min(t, range_time):
    """ For a given timestamp t and considered time interval range_time,
    return the desired bounded value in minutes and seconds"""
    # min() takes care of the end of the time interval,
    # max() takes care of the beginning of the interval
    s = (min(t + dt.timedelta(hours=1), range_time.max()) -
         max(t, range_time.min())).total_seconds()
    if s%60:
        return "%dm %dsecs" % (s/60, s%60)
    else:
        return "%dm" % (s/60)
start_time = pd.to_datetime('2013-03-26 21:49:08')
end_time = pd.to_datetime('2013-03-27 05:21:00')
range_time = pd.date_range(start_time, end_time, freq='h')
# Include the end of the time range using the union() trick, as described at:
# https://stackoverflow.com/questions/37890391/how-to-include-end-date-in-pandas-date-range-method
range_time = range_time.union([end_time])
# This is essentially timestamps for beginnings of hours
index_time = pd.Series(range_time).apply(lambda x: dt.datetime(year=x.year,
month=x.month,
day=x.day,
hour=x.hour,
minute=0,
second=0))
bounded_mins = index_time.apply(lambda x: bounded_min(x, range_time))
# Put timestamps and values together
bounded_df = pd.DataFrame(bounded_mins, columns=["Bounded Mins"]).set_index(index_time)
print(bounded_df)
Gotta love the powerful lambdas:). Maybe there is a simpler way to do it though.
Output:
Bounded Mins
2013-03-26 21:00:00 10m 52secs
2013-03-26 22:00:00 60m
2013-03-26 23:00:00 60m
2013-03-27 00:00:00 60m
2013-03-27 01:00:00 60m
2013-03-27 02:00:00 60m
2013-03-27 03:00:00 60m
2013-03-27 04:00:00 60m
2013-03-27 05:00:00 21m
I have this huge dataset which has dates for several days and timestamps. The datetime format is in UNIX format. The datasets are logs of some login.
The code is supposed to group start and end time logs and provide log counts and unique id counts.
I am trying to get some stats like:
total log counts per hour & unique login ids per hour.
log count with choice of hours i.e. 24hrs, 12hrs, 6 hrs, 1 hr, etc and day of the week and such options.
I am able to split the data with start and end hours but I am not able to get the stats of counts of logs and unique ids.
Code:
from datetime import datetime, time

# This splits data from start to end time
start = time(8, 0, 0)
end = time(20, 0, 0)
with open('input', 'r') as infile, open('output', 'w') as outfile:
    for row in infile:
        # fields are comma-separated: UserID, StartTime, StopTime, GPS1, GPS2
        col = row.strip().split(',')
        t1 = datetime.fromtimestamp(float(col[1])).time()
        t2 = datetime.fromtimestamp(float(col[2])).time()
        print(t1 >= start and t2 <= end)
Input data format: The data has no headers but the fields are given below. The number of days is not known in input.
UserID, StartTime, StopTime, GPS1, GPS2
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525
Expected Output: Example Output
StartTime, EndTime, Day, LogCount, UniqueIDCount
00:00:00, 01:00:00, Mon, 349, 30
StartTime and Endtime = Human readable format
Separating the data by a time range already works, but I am trying to round the times off and calculate the counts of logs and unique ids. A solution with Pandas is also welcome.
Edit One: more details
StartTime --> EndTime
1/5/2004, 5:30:01 --> 1/5/2004, 5:30:03
But that falls between 5:00:00 --> 6:00:00. So the count of all the logs in that time range is what I am trying to find. Similarly for other ranges, such as:
5:00:00 --> 6:00:00 Hourly Count
00:00:00 --> 6:00:00 Every 6 hours
00:00:00 --> 12:00:00 Every 12 hours
5 Jan 2004, Mon --> count
6 Jan 2004, Tue --> Count
And so on. I am looking for a generic program where I can change the time/hours range as needed.
Unfortunately I couldn't find any elegant solution.
Here is my attempt:
import numpy as np
import pandas as pd

fn = r'D:\temp\.data\dart_small.csv'
cols = ['UserID','StartTime','StopTime','GPS1','GPS2']
df = pd.read_csv(fn, header=None, names=cols)
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1h' # 1 hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
# pd.Timestamp replaces the removed pd.datetime alias
r['start'] = (r.index - pd.Timestamp(1970, 1, 1)).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # I've slightly simplified the calculations of m and d
    # by getting rid of the division by 2,
    # because it can be done by eliminating common terms
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    # .loc replaces the removed .ix indexer
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
# day_name() replaces the removed weekday_name attribute
r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
print(r[r.LogCount > 0])
PS: the fewer periods you have in the report DF `r`, the faster it will count. So you may want to get rid of rows (times) if you know beforehand that those timeframes won't contain any data (for example during weekends, holidays, etc.)
Result:
start LogCount UniqueIDCount Day StartTime EndTime
2004-01-05 00:00:00 1073260800 24 15 Mon 00:00:00 01:00:00
2004-01-05 01:00:00 1073264400 5 5 Mon 01:00:00 02:00:00
2004-01-05 02:00:00 1073268000 3 3 Mon 02:00:00 03:00:00
2004-01-05 03:00:00 1073271600 3 3 Mon 03:00:00 04:00:00
2004-01-05 04:00:00 1073275200 2 2 Mon 04:00:00 05:00:00
2004-01-06 12:00:00 1073390400 22 12 Tue 12:00:00 13:00:00
2004-01-06 13:00:00 1073394000 3 2 Tue 13:00:00 14:00:00
2004-01-06 14:00:00 1073397600 3 2 Tue 14:00:00 15:00:00
2004-01-06 15:00:00 1073401200 3 2 Tue 15:00:00 16:00:00
2004-01-10 16:00:00 1073750400 20 11 Sat 16:00:00 17:00:00
2004-01-14 23:00:00 1074121200 218 69 Wed 23:00:00 00:00:00
2004-01-15 00:00:00 1074124800 12 11 Thu 00:00:00 01:00:00
2004-01-15 01:00:00 1074128400 1 1 Thu 01:00:00 02:00:00
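If it is acceptable to count each log only in the bucket where it starts (unlike the overlap test above, which counts a log in every hour it spans), a simpler sketch is to floor StartTime to the desired frequency and group on it. The helper name `log_stats` is made up for this illustration, and the five rows are copied from the sample input in the question.

```python
import pandas as pd

cols = ["UserID", "StartTime", "StopTime", "GPS1", "GPS2"]
rows = [
    ("00022d9064bc", 1073260801, 1073260803, 819251, 440006),
    ("00022d9064bc", 1073260803, 1073260810, 819213, 439954),
    ("00904b4557d3", 1073260803, 1073261920, 817526, 439458),
    ("00022de73863", 1073260804, 1073265410, 817558, 439525),
    ("00904b14b494", 1073260804, 1073262625, 817558, 439525),
]
df = pd.DataFrame(rows, columns=cols)

def log_stats(df, freq="1h"):
    """Count logs and unique users per bucket of width freq, keyed by StartTime only."""
    # floor each start time to the beginning of its bucket (e.g. the full hour)
    bucket = pd.to_datetime(df["StartTime"], unit="s").dt.floor(freq)
    stats = df.groupby(bucket)["UserID"].agg(LogCount="count", UniqueIDCount="nunique")
    stats["Day"] = stats.index.day_name().str[:3]
    return stats

hourly = log_stats(df, "1h")
six_hourly = log_stats(df, "6h")
print(hourly)
```

Changing the `freq` argument (for example `"6h"`, `"12h"` or `"1D"`) switches the reporting interval without touching the rest of the code.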