percentage difference of datetime object - python

I want to create a new column that contains the values of the diff(s) column expressed as percentages.
Finish Time diff (s)
0 1900-01-01 00:42:43.500 0 days 00:00:00
1 1900-01-01 00:44:01.200 0 days 00:01:17
2 1900-01-01 00:44:06.500 0 days 00:01:23
3 1900-01-01 00:44:29.500 0 days 00:01:46
4 1900-01-01 00:44:47.500 0 days 00:02:04
To further understand the data:
df["diff(s)"] = df["Finish Time"] - min(df["Finish Time"])
Finish Time datetime64[ns]
diff (s) timedelta64[ns]
dtype: object
df["diff(%)"] = ((df["Finish Time"]/min(df["Finish
Time"]))*100)
This results in the following error:
TypeError: cannot perform __truediv__ with this index type:
DatetimeArray

It depends on how the percentages are defined. If you need to divide by the sum of the timedeltas:
df["diff(s)"] = df["Finish Time"] - df["Finish Time"].min()
df["diff(%)"] = (df["diff(s)"] / df["diff(s)"].sum()) * 100
print (df)
Finish Time diff(s) diff(%)
0 1900-01-01 00:42:43.500 0 days 00:00:00 0.000000
1 1900-01-01 00:44:01.200 0 days 00:01:17.700000 19.887382
2 1900-01-01 00:44:06.500 0 days 00:01:23 21.243921
3 1900-01-01 00:44:29.500 0 days 00:01:46 27.130791
4 1900-01-01 00:44:47.500 0 days 00:02:04 31.737906
Or using Series.pct_change:
df["diff(%)"] = df["diff(s)"].pct_change() * 100

Related

Converting Milliseconds into Clocktime Format

I've been trying to convert a column of milliseconds (0, 5000, 10000) into a new column with the format 00:00:00 (00:00:05, 00:00:10, etc.).
I tried datetime.datetime.fromtimestamp(5000/1000.0) but it didn't give me the format I wanted.
Any help appreciated!
The best option is probably to convert to timedelta (using pandas.to_timedelta). That way you benefit from the timedelta object's properties:
s = pd.Series([0, 5000, 10000])
s2 = pd.to_timedelta(s, unit='ms')
output:
0 0 days 00:00:00
1 0 days 00:00:05
2 0 days 00:00:10
dtype: timedelta64[ns]
If you really want the '00:00:00' format, use pandas.to_datetime instead:
s2 = pd.to_datetime(s, unit='ms').dt.time
output:
0 00:00:00
1 00:00:05
2 00:00:10
dtype: object
Optionally add .astype(str) to get strings.
Convert the values to timedeltas with to_timedelta:
df['col'] = pd.to_timedelta(df['col'], unit='ms')
print (df)
col
0 0 days 00:00:00
1 0 days 00:00:05
2 0 days 00:00:10
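If the millisecond values can exceed 24 hours, note that the to_datetime(...).dt.time route wraps past midnight and the timedelta repr keeps a days component. A minimal sketch of a small formatter that always prints HH:MM:SS, even past 24 hours (the extra 90_000_000 ms row is only for illustration):

import pandas as pd

s = pd.Series([0, 5000, 10000, 90_000_000])   # last row is 25 hours
td = pd.to_timedelta(s, unit='ms')

def fmt_hms(x):
    # Format a Timedelta as HH:MM:SS, letting the hour count exceed 24
    total = int(x.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

print(td.apply(fmt_hms))
# 0    00:00:00
# 1    00:00:05
# 2    00:00:10
# 3    25:00:00
# dtype: object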

Calculates a standard deviation columns for timedelta elements

I have the following dataframe in Python:
ID  country_ID  visit_time
0   ESP         10 days 12:03:00
0   ESP         5 days 02:03:00
0   ENG         5 days 10:02:00
1   ENG         3 days 08:05:03
1   ESP         1 days 03:02:00
1   ENG         2 days 07:01:03
2   ENG         0 days 12:01:02
For each ID I want to calculate the standard deviation of visit_time within each country_ID group, giving std_visit_ESP and std_visit_ENG columns: the standard deviation of visit_time with country_ID = ESP for each ID, and the standard deviation of visit_time with country_ID = ENG for each ID. The expected output is:
ID  std_visit_ESP    std_visit_ENG
0   2 days 17:00:00  0 days 00:00:00
1   0 days 00:00:00  0 days 12:32:00
2   NaT              0 days 00:00:00
With the groupby method for the mean, you can specify the parameter numeric_only = False, but the std method of groupby does not include this option.
My idea is to convert the timedelta to seconds, calculate the standard deviation and then convert it back to timedelta. Here is an example:
from datetime import timedelta
import numpy as np
import pandas as pd

td1 = timedelta(10, 0, 0, 0, 3, 12, 0).total_seconds()   # 10 days 12:03:00
td2 = timedelta(5, 0, 0, 0, 3, 2, 0).total_seconds()     # 5 days 02:03:00
arr = [td1, td2]
var = np.std(arr)                     # population standard deviation (ddof=0)
show_s = pd.to_timedelta(var, unit='s')
print(show_s)
I don't know how to use this with groupby to get the desired result. I am grateful for your help.
Use GroupBy.std and pd.to_timedelta:
total_seconds = \
    pd.to_timedelta(
        df['visit_time'].dt.total_seconds()
          .groupby([df['ID'], df['country_ID']]).std(),
        unit='S').unstack().fillna(pd.Timedelta(days=0))
print(total_seconds)
country_ID ENG ESP
ID
0 0 days 00:00:00 3 days 19:55:25.973595304
1 0 days 17:43:29.315934274 0 days 00:00:00
2 0 days 00:00:00 0 days 00:00:00
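If you want exactly the layout from the question (one row per ID with std_visit_ESP / std_visit_ENG columns), here is a minimal self-contained sketch along the same lines. Note that the question's expected numbers correspond to a population standard deviation (ddof=0, the np.std default), while pandas' std defaults to ddof=1, which is why the figures above differ; ddof=0 below is a choice made only to match the question:

import pandas as pd

df = pd.DataFrame({
    "ID": [0, 0, 0, 1, 1, 1, 2],
    "country_ID": ["ESP", "ESP", "ENG", "ENG", "ESP", "ENG", "ENG"],
    "visit_time": pd.to_timedelta([
        "10 days 12:03:00", "5 days 02:03:00", "5 days 10:02:00",
        "3 days 08:05:03", "1 days 03:02:00", "2 days 07:01:03",
        "0 days 12:01:02",
    ]),
})

# Population std of total seconds per (ID, country_ID) group
std_s = (df["visit_time"].dt.total_seconds()
           .groupby([df["ID"], df["country_ID"]])
           .std(ddof=0))

# Back to timedeltas, one column per country, prefixed as requested
res = pd.to_timedelta(std_s, unit="s").unstack().add_prefix("std_visit_")
print(res)
# country_ID    std_visit_ENG   std_visit_ESP
# ID
# 0           0 days 00:00:00 2 days 17:00:00
# 1           0 days 12:32:00 0 days 00:00:00
# 2           0 days 00:00:00             NaT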
If I understand correctly, this should work for you:
stddevs = (df['visit_time'].dt.total_seconds()
             .groupby([df['country_ID']]).std()
             .apply(lambda x: pd.Timedelta(seconds=x)))
Output:
>>> stddevs
country_ID
ENG 2 days 01:17:43.835702
ESP 4 days 16:40:16.598773
Name: visit_time, dtype: timedelta64[ns]
Formatting:
stddevs = (df['visit_time'].dt.total_seconds()
             .groupby([df['country_ID']]).std()
             .apply(lambda x: pd.Timedelta(seconds=x))
             .to_frame().T
             .add_prefix('std_visit_')
             .reset_index(drop=True)
             .rename_axis(None, axis=1))
Output:
>>> stddevs
std_visit_ENG std_visit_ESP
0 2 days 01:17:43.835702 4 days 16:40:16.598773

Time calculations, mean, median, mode

Name    Gun_time  Net_time  Pace
John    28:48:00  28:47:00  4:38:00
George  29:11:00  29:10:00  4:42:00
Mike    29:38:00  29:37:00  4:46:00
Sarah   29:46:00  29:46:00  4:48:00
Roy     30:31:00  30:30:00  4:55:00
Q1. How can I add another column stating difference between Gun_time and Net_time?
Q2. How can I calculate the mean for Gun_time and Net_time? Please help!
I have tried doing the following but it doesn't work
df['Difference'] = df['Gun_time'] - df['Net_time']
For the mean value I tried df['Gun_time'].mean,
but it doesn't work either. Please help!
Q3. What if we have times in 28:48 (minutes and seconds) format and not 28:48:00? The function gives a ValueError:
ValueError: expected hh:mm:ss format
Convert your columns to timedelta dtype, e.g.:
for col in ("Gun_time", "Net_time", "Pace"):
    df[col] = pd.to_timedelta(df[col])
Now you can do calculations like
df['Gun_time'].mean()
# Timedelta('1 days 05:34:48')
or
df['Difference'] = df['Gun_time'] - df['Net_time']
#df['Difference']
# 0 0 days 00:01:00
# 1 0 days 00:01:00
# 2 0 days 00:01:00
# 3 0 days 00:00:00
# 4 0 days 00:01:00
# Name: Difference, dtype: timedelta64[ns]
If you need nicer output to string, you can use
def timedeltaToString(td):
    hours, remainder = divmod(td.total_seconds(), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"
df['diffString'] = df['Difference'].apply(timedeltaToString)
# df['diffString']
# 0 00:01:00
# 1 00:01:00
# 2 00:01:00
# 3 00:00:00
# 4 00:01:00
#Name: diffString, dtype: object
See also Format timedelta to string.
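For Q3: pd.to_timedelta expects hh:mm:ss, so mm:ss values need to be normalised first. A minimal sketch, assuming the values really are minutes and seconds, is to prepend "00:" before converting (the sample Series here is hypothetical):

import pandas as pd

s = pd.Series(["28:48", "29:11", "29:38"])   # mm:ss strings

# Prepend the missing hours field so the strings match hh:mm:ss
td = pd.to_timedelta("00:" + s)
print(td)
# 0   0 days 00:28:48
# 1   0 days 00:29:11
# 2   0 days 00:29:38
# dtype: timedelta64[ns]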

Creating Bin for timestamp column

I am trying to create proper bins for a timestamp interval column, using code such as:
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                                         '00:30:00', '00:40:00', '00:50:00',
                                         '00:60:00']))
The resulting df looks like:
time_interval | bin
00:17:00 (0 days 00:10:00, 0 days 00:20:00]
01:42:00 NaN
00:15:00 (0 days 00:10:00, 0 days 00:20:00]
00:00:00 NaN
00:06:00 (0 days 00:00:00, 0 days 00:10:00]
This is a little off, as the result I want is just the time value without the days, and I also want the upper limit of the last bin to be 60 mins or inf (or more).
Desired Output:
time_interval | bin
00:17:00 (00:10:00,00:20:00]
01:42:00 (00:60:00,inf]
00:15:00 (00:10:00,00:20:00]
00:00:00 (00:00:00,00:10:00]
00:06:00 (00:00:00,00:10:00]
Thanks for looking!
In pandas, inf does not exist for timedeltas, so the maximal value is used instead. Also, the parameter include_lowest=True is used to include the lowest values. If you want bins filled with timedeltas:
b = pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                     '00:30:00', '00:40:00',
                     '00:50:00', '00:60:00'])
b = b.append(pd.Index([pd.Timedelta.max]))
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b)
print (df)
time_interval Bin
0 00:17:00 (0 days 00:10:00, 0 days 00:20:00]
1 01:42:00 (0 days 01:00:00, 106751 days 23:47:16.854775]
2 00:15:00 (0 days 00:10:00, 0 days 00:20:00]
3 00:00:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
4 00:06:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
If you want strings instead of timedeltas, use zip to create the labels, appending 'inf':
vals = ['00:00:00', '00:10:00', '00:20:00',
        '00:30:00', '00:40:00', '00:50:00', '00:60:00']
b = pd.to_timedelta(vals).append(pd.Index([pd.Timedelta.max]))
vals.append('inf')
labels = ['{}-{}'.format(i, j) for i, j in zip(vals[:-1], vals[1:])]
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b, labels=labels)
print (df)
time_interval Bin
0 00:17:00 00:10:00-00:20:00
1 01:42:00 00:60:00-inf
2 00:15:00 00:10:00-00:20:00
3 00:00:00 00:00:00-00:10:00
4 00:06:00 00:00:00-00:10:00
You could just use labels to solve it:
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                                         '00:30:00', '00:40:00', '00:50:00',
                                         '00:60:00', '24:00:00']),
                   labels=['(00:00:00,00:10:00]', '(00:10:00,00:20:00]',
                           '(00:20:00,00:30:00]', '(00:30:00,00:40:00]',
                           '(00:40:00,00:50:00]', '(00:50:00,00:60:00]',
                           '(00:60:00,inf]'])
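One caveat with the labels-only variant: pd.cut builds left-open intervals by default, so a row that is exactly 00:00:00 would still come out as NaN, as in the question's first attempt. A minimal sketch combining the labels with include_lowest=True (sample data taken from the question, with the column named time_interval):

import pandas as pd

df = pd.DataFrame({'time_interval': pd.to_timedelta(
    ['00:17:00', '01:42:00', '00:15:00', '00:00:00', '00:06:00'])})

edges = pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00', '00:30:00',
                         '00:40:00', '00:50:00', '00:60:00', '24:00:00'])
labels = ['(00:00:00,00:10:00]', '(00:10:00,00:20:00]', '(00:20:00,00:30:00]',
          '(00:30:00,00:40:00]', '(00:40:00,00:50:00]', '(00:50:00,00:60:00]',
          '(00:60:00,inf]']

# include_lowest=True keeps the exact 00:00:00 row in the first bin
df['Bin'] = pd.cut(df['time_interval'], bins=edges, labels=labels,
                   include_lowest=True)
print(df)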

Unable to convert a column to datetime

I have tried many suggestions from here but none of them solved it.
I have two columns with observations like this: 15:08:19
If I write
df.time_entry.describe()
it appears:
count 814262
unique 56765
top 15:03:00
freq 103
Name: time_entry, dtype: object
I've already run this code:
df['time_entry'] = pd.to_datetime(df['time_entry'], format='%H:%M:%S', errors='ignore').dt.time
But rerunning the describe code still returns dtype: object.
What is the purpose of dt.time?
Just remove dt.time and your conversion from object to datetime will work perfectly fine.
df['time_entry'] = pd.to_datetime(df['time_entry'], format='%H:%M:%S')
The problem is that you are using the datetime accessor (.dt) with the property time, and then you are not able to subtract the two columns from each other. So, just leave out .dt.time and it should work.
Here is some data with 2 columns of strings
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['time_entry'] = ['12:01:00', '15:03:00', '16:43:00', '14:11:00']
df['time_entry2'] = ['13:03:00', '14:04:00', '19:23:00', '18:12:00']
print(df)
time_entry time_entry2
0 12:01:00 13:03:00
1 15:03:00 14:04:00
2 16:43:00 19:23:00
3 14:11:00 18:12:00
Convert both columns to datetime dtype
df['time_entry'] = pd.to_datetime(df['time_entry'], format='%H:%M:%S', errors='ignore')
df['time_entry2'] = pd.to_datetime(df['time_entry2'], format='%H:%M:%S', errors='ignore')
print(df)
time_entry time_entry2
0 1900-01-01 12:01:00 1900-01-01 13:03:00
1 1900-01-01 15:03:00 1900-01-01 14:04:00
2 1900-01-01 16:43:00 1900-01-01 19:23:00
3 1900-01-01 14:11:00 1900-01-01 18:12:00
print(df.dtypes)
time_entry datetime64[ns]
time_entry2 datetime64[ns]
dtype: object
(Optional) Specify timezone
df['time_entry'] = df['time_entry'].dt.tz_localize('US/Central')
df['time_entry2'] = df['time_entry2'].dt.tz_localize('US/Central')
Now subtract the two columns and get the time difference in number of days (as a float).
Method 1 gives Diff_days1
Method 2 gives Diff_days2
Method 3 gives Diff_days3
df['Diff_days1'] = (df['time_entry'] - df['time_entry2']).dt.total_seconds()/60/60/24
df['Diff_days2'] = (df['time_entry'] - df['time_entry2']) / np.timedelta64(1, 'D')
df['Diff_days3'] = (df['time_entry'].sub(df['time_entry2'])).dt.total_seconds()/60/60/24
print(df)
time_entry time_entry2 Diff_days1 Diff_days2 Diff_days3
0 1900-01-01 12:01:00 1900-01-01 13:03:00 -0.043056 -0.043056 -0.043056
1 1900-01-01 15:03:00 1900-01-01 14:04:00 0.040972 0.040972 0.040972
2 1900-01-01 16:43:00 1900-01-01 19:23:00 -0.111111 -0.111111 -0.111111
3 1900-01-01 14:11:00 1900-01-01 18:12:00 -0.167361 -0.167361 -0.167361
EDIT
If you're trying to access datetime attributes, then you can do so by using the time_entry column directly (not the time difference column). Here's an example
df['day1'] = df['time_entry'].dt.day
df['time1'] = df['time_entry'].dt.time
df['minute1'] = df['time_entry'].dt.minute
df['dayofweek1'] = df['time_entry'].dt.weekday
df['day2'] = df['time_entry2'].dt.day
df['time2'] = df['time_entry2'].dt.time
df['minute2'] = df['time_entry2'].dt.minute
df['dayofweek2'] = df['time_entry2'].dt.weekday
print(df[['day1', 'time1', 'minute1', 'dayofweek1',
'day2', 'time2', 'minute2', 'dayofweek2']])
day1 time1 minute1 dayofweek1 day2 time2 minute2 dayofweek2
0 1 12:01:00 1 0 1 13:03:00 3 0
1 1 15:03:00 3 0 1 14:04:00 4 0
2 1 16:43:00 43 0 1 19:23:00 23 0
3 1 14:11:00 11 0 1 18:12:00 12 0
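As a side note, since the values are plain 'HH:MM:SS' strings, pd.to_timedelta is a possible alternative to pd.to_datetime: the subtraction then yields timedeltas directly, though the date-based accessors and tz_localize shown above are not available on timedeltas. A minimal sketch of that variant:

import pandas as pd

df = pd.DataFrame({'time_entry':  ['12:01:00', '15:03:00'],
                   'time_entry2': ['13:03:00', '14:04:00']})

# Parse the clock times as durations since midnight
df['time_entry'] = pd.to_timedelta(df['time_entry'])
df['time_entry2'] = pd.to_timedelta(df['time_entry2'])

# Difference is a timedelta; express it in days as a float
df['Diff_days'] = (df['time_entry'] - df['time_entry2']).dt.total_seconds() / 86400
print(df)
#        time_entry     time_entry2  Diff_days
# 0 0 days 12:01:00 0 days 13:03:00  -0.043056
# 1 0 days 15:03:00 0 days 14:04:00   0.040972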
