Creating Bin for timestamp column - python

I am trying to create a proper bin for a timestamp interval column,
using code such as
df['Bin'] = pd.cut(df['interval_length'], bins=pd.to_timedelta(['00:00:00','00:10:00','00:20:00','00:30:00','00:40:00','00:50:00','00:60:00']))
The Resulting df looks like:
time_interval | bin
00:17:00 (0 days 00:10:00, 0 days 00:20:00]
01:42:00 NaN
00:15:00 (0 days 00:10:00, 0 days 00:20:00]
00:00:00 NaN
00:06:00 (0 days 00:00:00, 0 days 00:10:00]
This is a little off: I want the bins to show just the time value, without the days, and I want the last bin to run from 60 minutes to inf (60 minutes or more).
Desired Output:
time_interval | bin
00:17:00 (00:10:00,00:20:00]
01:42:00 (00:60:00,inf]
00:15:00 (00:10:00,00:20:00]
00:00:00 (00:00:00,00:10:00]
00:06:00 (00:00:00,00:10:00]
Thanks for looking!

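pandas has no inf for timedeltas, so the maximal value pd.Timedelta.max is used as the last bin edge. The parameter include_lowest=True makes the first bin include its lowest value (00:00:00), which is why the left edge of the first interval is displayed slightly below zero. If you want bins filled by timedeltas: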
b = pd.to_timedelta(['00:00:00','00:10:00','00:20:00',
'00:30:00','00:40:00',
'00:50:00','00:60:00'])
b = b.append(pd.Index([pd.Timedelta.max]))
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b)
print (df)
time_interval Bin
0 00:17:00 (0 days 00:10:00, 0 days 00:20:00]
1 01:42:00 (0 days 01:00:00, 106751 days 23:47:16.854775]
2 00:15:00 (0 days 00:10:00, 0 days 00:20:00]
3 00:00:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
4 00:06:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
If you want strings instead of timedeltas, use zip to create the labels after appending 'inf' to the list of values:
vals = ['00:00:00','00:10:00','00:20:00',
'00:30:00','00:40:00', '00:50:00','00:60:00']
b = pd.to_timedelta(vals).append(pd.Index([pd.Timedelta.max]))
vals.append('inf')
labels = ['{}-{}'.format(i, j) for i, j in zip(vals[:-1], vals[1:])]
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b, labels=labels)
print (df)
time_interval Bin
0 00:17:00 00:10:00-00:20:00
1 01:42:00 00:60:00-inf
2 00:15:00 00:10:00-00:20:00
3 00:00:00 00:00:00-00:10:00
4 00:06:00 00:00:00-00:10:00

You could just use labels to solve it -
df['Bin'] = pd.cut(df['interval_length'], bins=pd.to_timedelta(['00:00:00','00:10:00','00:20:00','00:30:00','00:40:00','00:50:00','00:60:00', '24:00:00']), labels=['(00:00:00,00:10:00]', '(00:10:00,00:20:00]', '(00:20:00,00:30:00]', '(00:30:00,00:40:00]', '(00:40:00,00:50:00]', '(00:50:00,00:60:00]', '(00:60:00,inf]'])
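Another way to get a truly open-ended last bin, sketched here as an alternative rather than as either answer's method, is to bin on plain minutes, where np.inf is a valid edge, and attach string labels. The sample frame and the helper name fmt are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question (assumed to be timedelta64 dtype).
df = pd.DataFrame({
    "time_interval": pd.to_timedelta(
        ["00:17:00", "01:42:00", "00:15:00", "00:00:00", "00:06:00"]
    )
})

# Bin edges in minutes; floats can use np.inf, unlike timedeltas.
edges = [0, 10, 20, 30, 40, 50, 60, np.inf]


def fmt(minutes):
    """Format a bin edge in minutes as HH:MM:SS, or 'inf' (hypothetical helper)."""
    if np.isinf(minutes):
        return "inf"
    return f"{int(minutes) // 60:02d}:{int(minutes) % 60:02d}:00"


labels = [f"({fmt(a)},{fmt(b)}]" for a, b in zip(edges[:-1], edges[1:])]

minutes = df["time_interval"].dt.total_seconds() / 60
# include_lowest=True puts 00:00:00 into the first bin.
df["Bin"] = pd.cut(minutes, bins=edges, labels=labels, include_lowest=True)
print(df)
```

With this formatting the last label reads (01:00:00,inf] rather than (00:60:00,inf]; adjust fmt if the 00:60:00 spelling is required.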

Related

percentage difference of datetime object

I want to create a new column which contains the values of column diff(s) but in percentage.
Finish Time diff (s)
0 1900-01-01 00:42:43.500 0 days 00:00:00
1 1900-01-01 00:44:01.200 0 days 00:01:17
2 1900-01-01 00:44:06.500 0 days 00:01:23
3 1900-01-01 00:44:29.500 0 days 00:01:46
4 1900-01-01 00:44:47.500 0 days 00:02:04
to further understand the data:
df["diff(s)"] = df["Finish Time"] - min(df["Finish Time"])
Finish Time datetime64[ns]
diff (s) timedelta64[ns]
dtype: object
df["diff(%)"] = ((df["Finish Time"]/min(df["Finish Time"]))*100)
-> results in this error
TypeError: cannot perform __truediv__ with this index type: DatetimeArray
It depends on how the percentages are defined. If you need to divide by the sum of the timedeltas:
df["diff(s)"] = df["Finish Time"] - df["Finish Time"].min()
df["diff(%)"] = (df["diff(s)"] / df["diff(s)"].sum()) * 100
print (df)
Finish Time diff(s) diff(%)
0 1900-01-01 00:42:43.500 0 days 00:00:00 0.000000
1 1900-01-01 00:44:01.200 0 days 00:01:17.700000 19.887382
2 1900-01-01 00:44:06.500 0 days 00:01:23 21.243921
3 1900-01-01 00:44:29.500 0 days 00:01:46 27.130791
4 1900-01-01 00:44:47.500 0 days 00:02:04 31.737906
Or using Series.pct_change:
df["diff(%)"] = df["diff(s)"].pct_change() * 100
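The TypeError in the question comes from dividing datetimes, which is not defined, whereas dividing one timedelta by another yields a plain float. A self-contained sketch of the sum-based definition follows, using the question's sample values. The second column, behind_fastest(%), is only one assumed reading of "percentage": it treats midnight of the same day as the start of the clock, which the question does not state.

```python
import pandas as pd

df = pd.DataFrame({
    "Finish Time": pd.to_datetime([
        "1900-01-01 00:42:43.500",
        "1900-01-01 00:44:01.200",
        "1900-01-01 00:44:06.500",
        "1900-01-01 00:44:29.500",
        "1900-01-01 00:44:47.500",
    ])
})

# Gap to the fastest finisher, as a timedelta.
df["diff(s)"] = df["Finish Time"] - df["Finish Time"].min()

# Timedelta / Timedelta gives a plain float, so percentages are well defined.
df["diff(%)"] = df["diff(s)"] / df["diff(s)"].sum() * 100

# Assumed alternative: gap as a percentage of the fastest elapsed time,
# measured from midnight of the same day.
elapsed = df["Finish Time"] - df["Finish Time"].dt.normalize()
df["behind_fastest(%)"] = df["diff(s)"] / elapsed.min() * 100
print(df)
```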

From unix timestamps to relative date based on a condition from another column in pandas

I have a column of dates as unix timestamps and I need to convert them into relative dates measured from the starting activity.
The final output should be column D, which expresses the time relative to the activity with index = 1; the relative time always has to refer to the first activity (index = 1).
A index timestamp D
activity1 1 1.612946e+09 0
activity2 2 1.614255e+09 80 hours
activity3 1 1.612181e+09 0
activity4 2 1.613045e+09 50 hours
activity5 3 1.637668e+09 430 hours
Any idea?
Use to_datetime with unit='s', then create groups that each start where index equals 1, get the first timestamp of each group, subtract it from every timestamp, and convert the result to hours:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
s = df.groupby(df['index'].eq(1).cumsum())['timestamp'].transform('first')
df['D1'] = df['timestamp'].sub(s).dt.total_seconds().div(3600)
print (df)
A index timestamp D D1
0 activity1 1 2021-02-10 08:33:20 0 0.000000
1 activity2 2 2021-02-25 12:10:00 80 hours 363.611111
2 activity3 1 2021-02-01 12:03:20 0 0.000000
3 activity4 2 2021-02-11 12:03:20 50 hours 240.000000
4 activity5 3 2021-11-23 11:46:40 430 hours 7079.722222
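The grouping key is the part of this answer that is easiest to miss, so here is a small self-contained sketch that prints it. It assumes the rounded timestamps shown in the question, and it subtracts the raw seconds directly instead of converting to datetimes first, which gives the same hours.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["activity1", "activity2", "activity3", "activity4", "activity5"],
    "index": [1, 2, 1, 2, 3],
    # Rounded unix timestamps in seconds, as displayed in the question.
    "timestamp": [1612946000, 1614255000, 1612181000, 1613045000, 1637668000],
})

# A new group starts at every row where index == 1.
group_id = df["index"].eq(1).cumsum()
print(group_id.tolist())

# First timestamp of each group, broadcast back to every row.
start = df.groupby(group_id)["timestamp"].transform("first")

# Raw seconds can be subtracted directly; divide by 3600 for hours.
df["D1"] = (df["timestamp"] - start) / 3600
print(df)
```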

Calculate standard deviation columns for timedelta elements

I have the following dataframe in Python:
ID  country_ID  visit_time
0   ESP         10 days 12:03:00
0   ESP         5 days 02:03:00
0   ENG         5 days 10:02:00
1   ENG         3 days 08:05:03
1   ESP         1 days 03:02:00
1   ENG         2 days 07:01:03
2   ENG         0 days 12:01:02
For each ID I want to calculate the standard deviation of visit_time within each country_ID group, as columns std_visit_ESP and std_visit_ENG:
std_visit_ESP: standard deviation of visit time with country_ID = ESP for each ID.
std_visit_ENG: standard deviation of visit time with country_ID = ENG for each ID.
ID  std_visit_ESP    std_visit_ENG
0   2 days 17:00:00  0 days 00:00:00
1   0 days 00:00:00  0 days 12:32:00
2   NaT              0 days 00:00:00
With the groupby method for the mean, you can specify the parameter numeric_only = False, but the std method of groupby does not include this option.
My idea is to convert the timedelta to seconds, calculate the standard deviation and then convert it back to timedelta. Here is an example:
td1 = timedelta(10,0,0,0,3,12,0).total_seconds()
td2 = timedelta(5,0,0,0,3,2,0).total_seconds()
arr = [td1,td2]
var = np.std(arr)
show_s = pd.to_timedelta(var, unit='s')
print(show_s)
I don't know how to use this with groupby to get the desired result. I am grateful for your help.
Use GroupBy.std and pd.to_timedelta:
total_seconds = pd.to_timedelta(
    df['visit_time'].dt.total_seconds()
    .groupby([df['ID'], df['country_ID']]).std(),
    unit='s').unstack().fillna(pd.Timedelta(days=0))
print(total_seconds)
country_ID ENG ESP
ID
0 0 days 00:00:00 3 days 19:55:25.973595304
1 0 days 17:43:29.315934274 0 days 00:00:00
2 0 days 00:00:00 0 days 00:00:00
If I understand correctly, this should work for you:
stddevs = df['visit_time'].dt.total_seconds().groupby([df['country_ID']]).std().apply(lambda x: pd.Timedelta(seconds=x))
Output:
>>> stddevs
country_ID
ENG 2 days 01:17:43.835702
ESP 4 days 16:40:16.598773
Name: visit_time, dtype: timedelta64[ns]
Formatting:
stddevs = df['visit_time'].dt.total_seconds().groupby([df['country_ID']]).std().apply(lambda x: pd.Timedelta(seconds=x)).to_frame().T.add_prefix('std_visit_').reset_index(drop=True).rename_axis(None, axis=1)
Output:
>>> stddevs
std_visit_ENG std_visit_ESP
0 2 days 01:17:43.835702 4 days 16:40:16.598773
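The desired table in the question appears to match a population standard deviation (ddof=0, which is also the np.std default used in the question's own example), whereas GroupBy.std defaults to the sample standard deviation (ddof=1). That would explain why the outputs above differ from it. A minimal sketch under that assumption, using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [0, 0, 0, 1, 1, 1, 2],
    "country_ID": ["ESP", "ESP", "ENG", "ENG", "ESP", "ENG", "ENG"],
    "visit_time": pd.to_timedelta([
        "10 days 12:03:00", "5 days 02:03:00", "5 days 10:02:00",
        "3 days 08:05:03", "1 days 03:02:00", "2 days 07:01:03",
        "0 days 12:01:02",
    ]),
})

# Work in float seconds so the standard deviation is computed on numbers.
secs = df["visit_time"].dt.total_seconds()

# ddof=0 is the population standard deviation (np.std's default).
std_secs = secs.groupby([df["ID"], df["country_ID"]]).std(ddof=0).unstack()

# Convert each column back to timedelta; missing groups become NaT.
result = std_secs.apply(pd.to_timedelta, unit="s").add_prefix("std_visit_")
print(result)
```

Groups with a single visit get a standard deviation of zero here, and an ID with no visits for a country (ID 2, ESP) stays NaT, as in the desired table.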

Time calculations, mean , median, mode

Name    Gun_time  Net_time  Pace
John    28:48:00  28:47:00  4:38:00
George  29:11:00  29:10:00  4:42:00
Mike    29:38:00  29:37:00  4:46:00
Sarah   29:46:00  29:46:00  4:48:00
Roy     30:31:00  30:30:00  4:55:00
Q1. How can I add another column stating difference between Gun_time and Net_time?
Q2. How will I calculate the mean for Gun_time and Net_time. Please help!
I have tried doing the following but it doesn't work
df['Difference'] = df['Gun_time'] - df['Net_time']
for mean value I tried df['Gun_time'].mean
but it doesn't work either, please help!
Q.3 What if we have times in 28:48 (minutes and seconds) format and not 28:48:00 the function gives out a value error.
ValueError: expected hh:mm:ss format
Convert your columns to dtype timedelta, e.g. like
for col in ("Gun_time", "Net_time", "Pace"):
    df[col] = pd.to_timedelta(df[col])
Now you can do calculations like
df['Gun_time'].mean()
# Timedelta('1 days 05:34:48')
or
df['Difference'] = df['Gun_time'] - df['Net_time']
#df['Difference']
# 0 0 days 00:01:00
# 1 0 days 00:01:00
# 2 0 days 00:01:00
# 3 0 days 00:00:00
# 4 0 days 00:01:00
# Name: Difference, dtype: timedelta64[ns]
If you need nicer output to string, you can use
def timedeltaToString(td):
    hours, remainder = divmod(td.total_seconds(), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"
df['diffString'] = df['Difference'].apply(timedeltaToString)
# df['diffString']
# 0 00:01:00
# 1 00:01:00
# 2 00:01:00
# 3 00:00:00
# 4 00:01:00
#Name: diffString, dtype: object
See also Format timedelta to string.
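For Q3, pd.to_timedelta rejects bare mm:ss strings such as 28:48, which is where the "expected hh:mm:ss format" error comes from. One possible workaround, sketched here with a hypothetical helper name and the assumption that a single colon always means mm:ss, is to prepend zero hours before parsing:

```python
import pandas as pd


def to_timedelta_flexible(strings):
    """Parse 'mm:ss' and 'hh:mm:ss' strings to timedeltas (hypothetical helper)."""
    s = pd.Series(strings, dtype="object").astype(str)
    # Strings with a single colon are assumed to be mm:ss; give them zero hours.
    is_mm_ss = s.str.count(":") == 1
    return pd.to_timedelta(s.where(~is_mm_ss, "00:" + s))


times = to_timedelta_flexible(["28:48", "29:11", "00:30:31"])
print(times)
print(times.mean())
```

The same idea applies to a DataFrame column, for example to_timedelta_flexible(df["Gun_time"]).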

generate a plot based on time-delta value

I have a data frame which captures the data sent from a server. The server sends data at least once every 5 minutes. If the server doesn't send data for more than 5 minutes, the time until data is sent again is considered a blackout. I want to visualize these blackouts in a graph. The data frame looks like
timestamp temperature
2019-06-03 14:16:31.149132 27.17
2019-06-03 14:21:34.732911 27.13
2019-06-03 14:37:20.437143 27.16
2019-06-03 14:42:15.516416 27.13
2019-06-03 14:51:26.167553 27.19
2019-06-03 14:56:31.244862 27.02
2019-06-03 15:07:30.519727 27.1
2019-06-03 15:12:57.319953 27.12
2019-06-03 15:17:56.256638 27.12
I have calculated the time difference between consecutive timestamps, marked the blackouts, and calculated the blackout time.
code:
df['TimeDelta'] = df['timestamp'] - df['timestamp'].shift()
df['blackout'] = np.where(df['TimeDelta'] > datetime.timedelta(minutes = 5) , 1 , 0)
df['blackoutTime'] = np.where(df['blackout'] > 0, df['TimeDelta'] - datetime.timedelta(minutes = 5), 0)
df['blackoutMins'] = df['blackoutTime'] / np.timedelta64(1,'m')
which gives 4 additional columns
TimeDelta blackout blackoutTime blackoutMins
0 days 00:04:57.310512000 0 0 days 00:00:00.000000000 0.0
0 days 00:05:03.583779000 1 0 days 00:00:03.583779000 0.05972965
0 days 00:15:45.704232000 1 0 days 00:10:45.704232000 10.7617372
0 days 00:04:55.079273000 0 0 days 00:00:00.000000000 0.0
0 days 00:09:10.651137000 1 0 days 00:04:10.651137000 4.17751895
0 days 00:05:05.077309000 1 0 days 00:00:05.077309000 0.08462181666666667
0 days 00:10:59.274865000 1 0 days 00:05:59.274865000 5.9879144166666665
0 days 00:05:26.800226000 1 0 days 00:00:26.800226000 0.44667043333333334
0 days 00:04:58.936685000 0 0 days 00:00:00.000000000 0.0
0 days 00:05:16.684317000 1 0 days 00:00:16.684317000 0.27807195
0 days 00:05:02.304786000 1 0 days 00:00:02.304786000 0.0384131
I want to visualize the blackouts in a plot, with the x-axis being the time axis and the y-axis showing the time for which there is a blackout. Can someone help with how to do this visualization?
You want plt.step against the original timestamp:
df['blackout'] = df.timestamp.diff().gt('5min').astype(int)
plt.step(df.timestamp, df.blackout, c='red')
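The answer above plots a 0/1 flag. If the y-axis should instead show how long each blackout lasted, as the question describes, a variation along these lines may help. It is a sketch, assuming the sample timestamps from the question and that matplotlib is installed; the axis labels are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-06-03 14:16:31.149132", "2019-06-03 14:21:34.732911",
        "2019-06-03 14:37:20.437143", "2019-06-03 14:42:15.516416",
        "2019-06-03 14:51:26.167553", "2019-06-03 14:56:31.244862",
        "2019-06-03 15:07:30.519727", "2019-06-03 15:12:57.319953",
        "2019-06-03 15:17:56.256638",
    ])
})

limit = pd.Timedelta(minutes=5)
gap = df["timestamp"].diff()

# Minutes beyond the 5-minute limit; negative values and the first NaN become 0.
over = (gap - limit) / pd.Timedelta(minutes=1)
df["blackoutMins"] = over.clip(lower=0).fillna(0)

fig, ax = plt.subplots()
# where="pre" draws each value over the gap that ends at its timestamp.
ax.step(df["timestamp"], df["blackoutMins"], where="pre", color="red")
ax.set_xlabel("time")
ax.set_ylabel("blackout (minutes)")
fig.autofmt_xdate()
```

Call plt.show() or fig.savefig(...) afterwards, depending on the environment.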
