In this question (Get year, month or day from numpy datetime64), an example of how to get the year, month and day from a numpy datetime64 can be found.
One of the answers uses:
dates = np.arange(np.datetime64('2000-01-01'), np.datetime64('2010-01-01'))
years = dates.astype('datetime64[Y]').astype(int) + 1970
months = dates.astype('datetime64[M]').astype(int) % 12 + 1
days = dates - dates.astype('datetime64[M]') + 1
Also notice that:
To get integers instead of timedelta64[D] in the example for days above, use: (dates - dates.astype('datetime64[M]')).astype(int) + 1
How could the hours, minutes and seconds be extracted?
As stated in the comment quoted above about returning integers, I would like to get integers too.
Edit:
Jérôme's answer is useful, but I am still struggling to understand how to reach the safe point of having datetime64[s] as input data.
In my actual situation this is what I have once I read the CSV in Pandas:
print(df['date'])
print(type(df['date']))
print(df['date'].dtype)
0 2018-12-31 23:59:00
1 2018-12-31 23:58:00
2 2018-12-31 23:57:00
3 2018-12-31 23:56:00
4 2018-12-31 23:55:00
...
525594 2018-01-01 00:05:00
525595 2018-01-01 00:04:00
525596 2018-01-01 00:03:00
525597 2018-01-01 00:02:00
525598 2018-01-01 00:01:00
Name: date, Length: 525599, dtype: object
<class 'pandas.core.series.Series'>
object
So how could I convert df['dates'] into a dates variable which is datetime64[s] and then apply the solution provided?
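For what it's worth, a minimal sketch of the direction I am considering, assuming pd.to_datetime can parse these strings:
import pandas as pd

# assuming df['date'] holds strings like '2018-12-31 23:59:00'
dates = pd.to_datetime(df['date']).to_numpy().astype('datetime64[s]')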
In your example, the dtype of the array is datetime64[D], so the hours/minutes/seconds are not stored in the items. However, a datetime64[s]-typed array does store them.
Here is how to extract the information from a np.datetime64[s]-typed array:
import numpy as np

# dates = array(['2009-08-29T23:44:31',
#                '2017-12-17T05:47:37'],
#               dtype='datetime64[s]')
dates = np.array([
    np.datetime64(1251589471, 's'),
    np.datetime64(1513489657, 's')
])
Y, M, D, h, m, s = [dates.astype('datetime64[%s]' % kind) for kind in 'YMDhms']
years = Y.astype(int) + 1970
months = M.astype(int) % 12 + 1
days = (D - M).astype(int) + 1
hours = (h - D).astype(int)
minutes = (m - h).astype(int)
seconds = (s - m).astype(int)
print([years, months, days, hours, minutes, seconds])
# [array([2009, 2017]),
#  array([ 8, 12], dtype=int32),
#  array([29, 17]),
#  array([23,  5]),
#  array([44, 47]),
#  array([31, 37])]
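If the data stays in a pandas Series anyway (as in the edit above), a possible alternative, not part of the original answer, just a sketch, is the .dt accessor, which yields integers directly once the column has been parsed:
d = pd.to_datetime(df['date'])  # df['date'] as in the question's edit
years, months, days = d.dt.year, d.dt.month, d.dt.day
hours, minutes, seconds = d.dt.hour, d.dt.minute, d.dt.second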
Related
I have a dataset in which there is a column named "Time" of the type object.
There are some rows given as 10 for 10:00 and others as 1000.
How do I convert this column to time format?
weather['Time'] = pd.to_datetime(weather['Time'], format='%H:%M').dt.time
This is the code I used. I am getting this error:
ValueError: time data '10' does not match format '%H:%M' (match)
You can convert the column to the required time format first, like this:
weather = pd.DataFrame(['1000', '10:00', '10', '1000'], columns=['Time'])
def convert_time(x):
    if len(x) == 2:
        return f'{x}:00'
    if ':' not in x:
        return x[:2] + ':' + x[2:]
    return x
weather.Time = weather.Time.apply(convert_time)
weather.Time
Out[1]:
0 10:00
1 10:00
2 10:00
3 10:00
To convert it to datetime
weather.Time = pd.to_datetime(weather.Time)
Just the time component:
weather.Time.dt.time
Out[92]:
0 10:00:00
1 10:00:00
2 10:00:00
3 10:00:00
Another possible solution, which is based on the following ideas:
Replace :, when there is one, by the empty string.
Right pad with zeros, so that all entries will have 4 digits.
Use pd.to_datetime to convert to the wanted time format.
weather = pd.DataFrame({'Time': ['20', '1000', '12:30', '0930']})
pd.to_datetime(weather['Time'].str.replace(':', '').str.pad(
4, side='right', fillchar='0'), format='%H%M').dt.time
Output:
0 20:00:00
1 10:00:00
2 12:30:00
3 09:30:00
Name: Time, dtype: object
I've been trying to convert a milliseconds (0, 5000, 10000) column into a new column with the format: 00:00:00 (00:00:05, 00:00:10 etc)
I tried datetime.datetime.fromtimestamp(5000/1000.0) but it didn't give me the format I wanted.
Any help appreciated!
The best is probably to convert to Timedelta (using pandas.to_timedelta).
That way you'll benefit from the timedelta object's properties:
s = pd.Series([0, 5000, 10000])
s2 = pd.to_timedelta(s, unit='ms')
output:
0 0 days 00:00:00
1 0 days 00:00:05
2 0 days 00:00:10
dtype: timedelta64[ns]
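As a small illustration of those properties (both are standard pandas timedelta accessors):
s2.dt.total_seconds()  # 0     0.0
                       # 1     5.0
                       # 2    10.0   (dtype: float64)
s2.dt.components       # a DataFrame of days, hours, minutes, seconds, ...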
If you really want the '00:00:00' format, use instead pandas.to_datetime:
s2 = pd.to_datetime(s, unit='ms').dt.time
output:
0 00:00:00
1 00:00:05
2 00:00:10
dtype: object
optionally with .astype(str) to have strings
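For example, a quick sketch of that option:
s_str = pd.to_datetime(s, unit='ms').dt.time.astype(str)
# 0    00:00:00
# 1    00:00:05
# 2    00:00:10
# dtype: object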
Convert values to timedeltas by to_timedelta:
df['col'] = pd.to_timedelta(df['col'], unit='ms')
print (df)
col
0 0 days 00:00:00
1 0 days 00:00:05
2 0 days 00:00:10
Name    Gun_time    Net_time    Pace
John    28:48:00    28:47:00    4:38:00
George  29:11:00    29:10:00    4:42:00
Mike    29:38:00    29:37:00    4:46:00
Sarah   29:46:00    29:46:00    4:48:00
Roy     30:31:00    30:30:00    4:55:00
Q1. How can I add another column stating difference between Gun_time and Net_time?
Q2. How will I calculate the mean for Gun_time and Net_time? Please help!
I have tried doing the following but it doesn't work
df['Difference'] = df['Gun_time'] - df['Net_time']
For the mean value I tried df['Gun_time'].mean,
but it doesn't work either, please help!
Q3. What if we have times in 28:48 (minutes and seconds) format and not 28:48:00? The function gives out a ValueError:
ValueError: expected hh:mm:ss format
Convert your columns to dtype timedelta, e.g. like
for col in ("Gun_time", "Net_time", "Pace"):
    df[col] = pd.to_timedelta(df[col])
Now you can do calculations like
df['Gun_time'].mean()
# Timedelta('1 days 05:34:48')
or
df['Difference'] = df['Gun_time'] - df['Net_time']
#df['Difference']
# 0 0 days 00:01:00
# 1 0 days 00:01:00
# 2 0 days 00:01:00
# 3 0 days 00:00:00
# 4 0 days 00:01:00
# Name: Difference, dtype: timedelta64[ns]
If you need nicer output to string, you can use
def timedeltaToString(td):
    hours, remainder = divmod(td.total_seconds(), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"
df['diffString'] = df['Difference'].apply(timedeltaToString)
# df['diffString']
# 0 00:01:00
# 1 00:01:00
# 2 00:01:00
# 3 00:00:00
# 4 00:01:00
#Name: diffString, dtype: object
See also Format timedelta to string.
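Regarding Q3: a minimal sketch, assuming values like '28:48' really mean minutes and seconds, is to prepend the missing hours field so that to_timedelta can parse them as hh:mm:ss:
# hypothetical mm:ss column; '0:' + '28:48' gives '0:28:48'
pace = pd.Series(['28:48', '29:10'])
pd.to_timedelta('0:' + pace)
# 0   0 days 00:28:48
# 1   0 days 00:29:10
# dtype: timedelta64[ns]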
I am trying to create a proper bin for a timestamp interval column,
using code such as
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                                         '00:30:00', '00:40:00', '00:50:00',
                                         '00:60:00']))
The Resulting df looks like:
time_interval | bin
00:17:00 (0 days 00:10:00, 0 days 00:20:00]
01:42:00 NaN
00:15:00 (0 days 00:10:00, 0 days 00:20:00]
00:00:00 NaN
00:06:00 (0 days 00:00:00, 0 days 00:10:00]
This is a little off, as the result I want is just the time value, without the days, and I also want the upper limit of the last bin to be 60 min or inf (or more).
Desired Output:
time_interval | bin
00:17:00 (00:10:00,00:20:00]
01:42:00 (00:60:00,inf]
00:15:00 (00:10:00,00:20:00]
00:00:00 (00:00:00,00:10:00]
00:06:00 (00:00:00,00:10:00]
Thanks for looking!
In pandas, inf for timedeltas does not exist, so the maximal value is used instead. Also, the parameter include_lowest=True is used to include the lowest values. If you want the bins filled by timedeltas:
b = pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                     '00:30:00', '00:40:00',
                     '00:50:00', '00:60:00'])
b = b.append(pd.Index([pd.Timedelta.max]))
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b)
print (df)
time_interval Bin
0 00:17:00 (0 days 00:10:00, 0 days 00:20:00]
1 01:42:00 (0 days 01:00:00, 106751 days 23:47:16.854775]
2 00:15:00 (0 days 00:10:00, 0 days 00:20:00]
3 00:00:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
4 00:06:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
If you want strings instead of timedeltas, use zip to create the labels, appending 'inf':
vals = ['00:00:00', '00:10:00', '00:20:00',
        '00:30:00', '00:40:00', '00:50:00', '00:60:00']
b = pd.to_timedelta(vals).append(pd.Index([pd.Timedelta.max]))
vals.append('inf')
labels = ['{}-{}'.format(i, j) for i, j in zip(vals[:-1], vals[1:])]
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b, labels=labels)
print (df)
time_interval Bin
0 00:17:00 00:10:00-00:20:00
1 01:42:00 00:60:00-inf
2 00:15:00 00:10:00-00:20:00
3 00:00:00 00:00:00-00:10:00
4 00:06:00 00:00:00-00:10:00
You could just use labels to solve it -
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                                         '00:30:00', '00:40:00', '00:50:00',
                                         '00:60:00', '24:00:00']),
                   labels=['(00:00:00,00:10:00]', '(00:10:00,00:20:00]',
                           '(00:20:00,00:30:00]', '(00:30:00,00:40:00]',
                           '(00:40:00,00:50:00]', '(00:50:00,00:60:00]',
                           '(00:60:00,inf]'])
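A hedged variant of the same idea, generating the labels from the edges instead of hardcoding them (assuming the same df['interval_length'] column as above):
edges = ['00:00:00', '00:10:00', '00:20:00', '00:30:00',
         '00:40:00', '00:50:00', '00:60:00', 'inf']
labels = [f'({a},{b}]' for a, b in zip(edges[:-1], edges[1:])]
bins = pd.to_timedelta(edges[:-1] + ['24:00:00'])  # '24:00:00' stands in for inf
df['Bin'] = pd.cut(df['interval_length'], bins=bins, labels=labels)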
I have this huge dataset which has dates for several days and timestamps. The datetimes are in UNIX format. The dataset is a log of logins.
The code is supposed to group start and end time logs and provide log counts and unique id counts.
I am trying to get some stats like:
total log counts per hour & unique login ids per hour.
log count with a choice of hours, i.e. 24 hrs, 12 hrs, 6 hrs, 1 hr, etc., plus day of the week and similar options.
I am able to split the data with start and end hours but I am not able to get the stats of counts of logs and unique ids.
Code:
from datetime import datetime, time

# This splits data from start to end time
start = time(8, 0, 0)
end = time(20, 0, 0)

with open('input', 'r') as infile, open('output', 'w') as outfile:
    for row in infile:
        col = row.split()
        t1 = datetime.fromtimestamp(float(col[2])).time()
        t2 = datetime.fromtimestamp(float(col[3])).time()
        print(t1 >= start and t2 <= end)
Input data format: The data has no headers but the fields are given below. The number of days is not known in input.
UserID, StartTime, StopTime, GPS1, GPS2
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525
Expected Output (example):
StartTime, EndTime, Day, LogCount, UniqueIDCount
00:00:00, 01:00:00, Mon, 349, 30
StartTime and Endtime = Human readable format
Separating the data by a time range is already achieved, but I am trying to round off the times and calculate the counts of logs and unique IDs. A solution with Pandas is also welcome.
Edit One: more details
StartTime --> EndTIime
1/5/2004, 5:30:01 --> 1/5/2004, 5:30:03
But that falls between 5:00:00 --> 6:00:00, so the count of all the logs in that time range is what I am trying to find. Similarly for the others, like
5:00:00 --> 6:00:00 Hourly Count
00:00:00 --> 6:00:00 Every 6 hours
00:00:00 --> 12:00:00 Every 12 hours
5 Jan 2004, Mon --> count
6 Jan 2004, Tue --> Count
And so on. I am looking for a generic program where I can change the time/hours range as needed.
Unfortunately I couldn't find any elegant solution.
Here is my attempt:
import numpy as np
import pandas as pd

fn = r'D:\temp\.data\dart_small.csv'
cols = ['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2']
df = pd.read_csv(fn, header=None, names=cols)
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1H' # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.Timestamp(1970, 1, 1)).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # I've slightly simplified the calculations of m and d
    # by getting rid of the division by 2,
    # because it can be done by eliminating common terms
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
print(r[r.LogCount > 0])
PS: the fewer periods you have in the reporting DF r, the faster the counting will be. So you may want to get rid of rows (times) if you know beforehand that those timeframes won't contain any data (for example during weekends, holidays, etc.).
Result:
start LogCount UniqueIDCount Day StartTime EndTime
2004-01-05 00:00:00 1073260800 24 15 Mon 00:00:00 01:00:00
2004-01-05 01:00:00 1073264400 5 5 Mon 01:00:00 02:00:00
2004-01-05 02:00:00 1073268000 3 3 Mon 02:00:00 03:00:00
2004-01-05 03:00:00 1073271600 3 3 Mon 03:00:00 04:00:00
2004-01-05 04:00:00 1073275200 2 2 Mon 04:00:00 05:00:00
2004-01-06 12:00:00 1073390400 22 12 Tue 12:00:00 13:00:00
2004-01-06 13:00:00 1073394000 3 2 Tue 13:00:00 14:00:00
2004-01-06 14:00:00 1073397600 3 2 Tue 14:00:00 15:00:00
2004-01-06 15:00:00 1073401200 3 2 Tue 15:00:00 16:00:00
2004-01-10 16:00:00 1073750400 20 11 Sat 16:00:00 17:00:00
2004-01-14 23:00:00 1074121200 218 69 Wed 23:00:00 00:00:00
2004-01-15 00:00:00 1074124800 12 11 Thu 00:00:00 01:00:00
2004-01-15 01:00:00 1074128400 1 1 Thu 01:00:00 02:00:00
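As a simpler (if less precise) alternative, here is a sketch that bins each log by its StartTime only, ignoring the interval-overlap handling above; the file name and column names are the ones from this question:
import pandas as pd

cols = ['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2']
df = pd.read_csv('input', header=None, names=cols)
df['start_dt'] = pd.to_datetime(df['StartTime'], unit='s')

# count logs and unique IDs per hour; change freq to '6h', '12h', 'D', ...
stats = (df.groupby(pd.Grouper(key='start_dt', freq='1h'))
           .agg(LogCount=('UserID', 'size'),
                UniqueIDCount=('UserID', 'nunique')))
print(stats[stats.LogCount > 0])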