Pandas duration groupby - Start group-range with defined value - python

I am trying to group a data set of travel duration with 5 minutes interval, starting from 0 to inf. How may I do that?
My sample dataFrame looks like:
Duration
0 00:01:37
1 00:18:19
2 00:22:03
3 00:41:07
4 00:11:54
5 00:21:34
I have used this code: df.groupby([pd.Grouper(key='Duration', freq='5T')]).size()
And I have found following result:
Duration
00:01:37 1
00:06:37 0
00:11:37 1
00:16:37 2
00:21:37 1
00:26:37 0
00:31:37 0
00:36:37 1
00:41:37 0
Freq: 5T, dtype: int64
My expected result is:
Duration Counts
00:00:00 0
00:05:00 1
00:10:00 0
00:15:00 1
00:20:00 1
........ ...
My expectation is the index will start from 00:00:00 instead of 00:01:37.
Or, showing bins will also work for me, I mean:
Duration Counts
0-5 1
5-10 0
10-15 1
15-20 1
20-25 2
........ ...
I need your help please. Thank you.

First, you need to roud off your time to lower 5th minute. Then simply count it.
I suppose this is what you are looking for -
def round_to_5min(t):
""" This function rounds a timedelta timestamp to the nearest 5-min mark"""
t = datetime.datetime(1991,2,13, t.hour, t.minute - t.minute%5, 0)
return t
data['new_col'] = data.Duration.map(round_to_5min).dt.time

Related

Given a dataframe with event details, return a count of events that occured on any given date, based on the start and end dates of the events

What I have and am trying to do:
A dataframe, with headers: event_id, location_id, start_date, end_date.
An event can only have one location, start and end.
A location can have multiple events, starts and ends, and they can overlap.
The goal here is to be able to say, given any time T, for location X, how many events were there?
E.g.
Given three events, all for location 2:
Event.
Start.
End.
Event 1
2022-05-01
2022-05-07
Event 2
2022-05-04
2022-05-10
Event 3
2022-05-02
2022-05-05
Time T.
Count of Events
2022-05-01
1
2022-05-02
2
2022-05-03
2
2022-05-04
3
2022-05-05
3
2022-05-06
2
**What I have tried so far, but got stuck on: **
((I did look at THIS possible solution for a similar problem, and I went pretty far with it, but I got lost in the itterows and how to have that apply here.))
Try to get an array or dataframe that has a 365 day date range for each location ID.
E.g.
[1,2022-01-01],[1,2022-01-02]........[98,2022-01-01][98,2022-01-02]
Then convert that array to a dataframe, and merge it with the original dataframe like:
index
location
time
event
location2
start
end
0
1
2022-01-01
1
10
2022-11-07
2022-11-12
1
1
2022-01-01
2
4
2022-02-16
2022-03-05
2
1
2022-01-01
3
99
2022-06-10
2022-06-15
3
1
2022-01-01
4
2
2021-12-31
2022-01-05
4
1
2022-01-01
5
5
2022-05-08
2022-05-22
Then perform some kind of reduction that returns the count:
location
Time
Count
1
2022-01-01
10
1
2022-01-02
3
1
2022-01-03
13
..
...
...
99
2022-01-01
4
99
2022-01-02
0
99
2022-01-03
7
99
2022-01-04
12
I've done something similar with tying events to other events where their dates overalapped, using the .loc(...) but I don't think that would work here, and I'm kind of just stumped.
Where I got stuck was creating an array that combines the location ID and date range, because they're different lengths, and I could figure out how to repeat the location ID for every date in the range.
Anyway, I am 99% positive that there is a much more efficient way of doing this, and really any help at all is greatly appreciated!!
Thank you :)
Update per comment
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the location of dates that are between start and end
new_df = pd.DataFrame({'Date': date_range,
'Location': [df[df['Start.'].le(date) & df['End.'].ge(date)]['Event.'].tolist()
for date in date_range]})
# get the length of each list, which is the count
new_df['Count'] = new_df['Location'].str.len()
Date Location Count
0 2022-05-01 [Event 1] 1
1 2022-05-02 [Event 1, Event 3] 2
2 2022-05-03 [Event 1, Event 3] 2
3 2022-05-04 [Event 1, Event 2, Event 3] 3
4 2022-05-05 [Event 1, Event 2, Event 3] 3
5 2022-05-06 [Event 1, Event 2] 2
6 2022-05-07 [Event 1, Event 2] 2
7 2022-05-08 [Event 2] 1
8 2022-05-09 [Event 2] 1
9 2022-05-10 [Event 2] 1
IIUC you can try something like
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
# df.le is less than or equal to
# df.ge is greater than or equal to
new_df = pd.DataFrame({'Date': date_range,
'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
6 2022-05-07 2
7 2022-05-08 1
8 2022-05-09 1
9 2022-05-10 1
Depending on how large your date range is we may need to take a different approach as things may get slow if you have a range of two years instead of 10 days in the example.
You can also use a custom date range if you do not want to use the min and max values from the whole frame
min_date = '2022-05-01'
max_date = '2022-05-06'
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
new_df = pd.DataFrame({'Date': date_range,
'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
Note - I wanted to leave the original question up as is, and I was out of space, so I am answering my own question here, but #It_is_Chris is the real MVP.
Update! - with the enormous help from #It_is_Chris and some additional messing around, I was able to use the following code to generate the output I wanted:
# get the min and max dates
min_date, max_date = original_df[['start', 'end']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# create location range
loc_range = original_df['location'].unique()
# create a new list that combines every date with every location
combined_list = []
for item in date_range:
for location in loc_range:
combined_list.append(
{
'Date':item,
'location':location
}
)
# convert the list to a dataframe
combined_df = pd.DataFrame(combined_list)
# use merge to put original data together with the new dataframe
merged_df = pd.merge(combined_df,original_df, how="left", on="location")
# use loc to directly connect each event to a specific location and time
merged_df = merged_df.loc[(pd.to_datetime(merged_df['Date'])>=pd.to_datetime(merged_df['start'])) & (pd.to_datetime(merged_df['Date'])<=pd.to_datetime(merged_df['end']))]
# use groupby to push out a table as sought Date - Location - Count
output_merged_df = merged_df.groupby(['Date','fleet_id']).size()
The result looked like this:
Note - the sorting was not as I have it here, I believe I would need to add some additional sorting to the dataframe before outputting as a CSV.
Date
location
count
2022-01-01
1
1
2022-01-01
2
4
2022-01-01
3
1
2022-01-01
4
10
2022-01-01
5
3
2022-01-01
6
1

From unix timestamps to relative date based on a condition from another column in pandas

I have a column of dates in unix timestamps and i need to convert them into relative dates from the starting activity.
The final output should be the column D, which expresses the relative time from the activity which has index = 1, in particular the relative time has always to refer to the first activity (index=1).
A index timestamp D
activity1 1 1.612946e+09 0
activity2 2 1.614255e+09 80 hours
activity3 1 1.612181e+09 0
activity4 2 1.613045e+09 50 hours
activity5 3 1.637668e+09 430 hours
Any idea?
Use to_datetime with unit='s' and then create groups starting by index equal 1 and get first value, last subtract and convert to hours:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
s = df.groupby(df['index'].eq(1).cumsum())['timestamp'].transform('first')
df['D1'] = df['timestamp'].sub(s).dt.total_seconds().div(3600)
print (df)
A index timestamp D D1
0 activity1 1 2021-02-10 08:33:20 0 0.000000
1 activity2 2 2021-02-25 12:10:00 80 hours 363.611111
2 activity3 1 2021-02-01 12:03:20 0 0.000000
3 activity4 2 2021-02-11 12:03:20 50 hours 240.000000
4 activity5 3 2021-11-23 11:46:40 430 hours 7079.722222

How to convert X min, Y sec string to timestamp

I have a dataframe with a duration column of strings in a format like:
index
duration
0
26 s
1
24 s
2
4 min, 37 s
3
7 s
4
1 min, 1 s
Is there a pandas or strftime() / strptime() way to convert the duration column to a min/sec timestamp.
I've attempted this way to convert strings, but I'll run into multiple scenarios after replacing strings:
for row in df['index']:
if "min, " in df['duration'][row]:
df['duration'][row] = df['duration'][row].replace(' min, ', ':').replace(' s', '')
else:
pass
Thanks in advance
Try:
pd.to_timedelta(df['duration'])
Output:
0 0 days 00:00:26
1 0 days 00:00:24
2 0 days 00:04:37
3 0 days 00:00:07
4 0 days 00:01:01
Name: duration, dtype: timedelta64[ns]

Time calculations, mean , median, mode

(
Name
Gun_time
Net_time
Pace
John
28:48:00
28:47:00
4:38:00
George
29:11:00
29:10:00
4:42:00
Mike
29:38:00
29:37:00
4:46:00
Sarah
29:46:00
29:46:00
4:48:00
Roy
30:31:00
30:30:00
4:55:00
Q1. How can I add another column stating difference between Gun_time and Net_time?
Q2. How will I calculate the mean for Gun_time and Net_time. Please help!
I have tried doing the following but it doesn't work
df['Difference'] = df['Gun_time'] - df['Net_time']
for mean value I tried df['Gun_time'].mean
but it doesn't work either, please help!
Q.3 What if we have times in 28:48 (minutes and seconds) format and not 28:48:00 the function gives out a value error.
ValueError: expected hh:mm:ss format
Convert your columns to dtype timedelta, e.g. like
for col in ("Gun_time", "Net_time", "Pace"):
df[col] = pd.to_timedelta(df[col])
Now you can do calculations like
df['Gun_time'].mean()
# Timedelta('1 days 05:34:48')
or
df['Difference'] = df['Gun_time'] - df['Net_time']
#df['Difference']
# 0 0 days 00:01:00
# 1 0 days 00:01:00
# 2 0 days 00:01:00
# 3 0 days 00:00:00
# 4 0 days 00:01:00
# Name: Difference, dtype: timedelta64[ns]
If you need nicer output to string, you can use
def timedeltaToString(td):
hours, remainder = divmod(td.total_seconds(), 3600)
minutes, seconds = divmod(remainder, 60)
return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"
df['diffString'] = df['Difference'].apply(timedeltaToString)
# df['diffString']
# 0 00:01:00
# 1 00:01:00
# 2 00:01:00
# 3 00:00:00
# 4 00:01:00
#Name: diffString, dtype: object
See also Format timedelta to string.

Count String Values in Column across 30 Minute Time Bins using Pandas

I am looking to determine the count of string variables in a column across a 3 month data sample. Samples were taken at random times throughout each day. I can group the data by hour, but I require the fidelity of 30 minute intervals (ex. 0500-0600, 0600-0630) on roughly 10k rows of data.
An example of the data:
datetime stringvalues
2018-06-06 17:00 A
2018-06-07 17:30 B
2018-06-07 17:33 A
2018-06-08 19:00 B
2018-06-09 05:27 A
I have tried setting the datetime column as the index, but I cannot figure how to group the data on anything other than 'hour' and I don't have fidelity on the string value count:
df['datetime'] = pd.to_datetime(df['datetime']
df.index = df['datetime']
df.groupby(df.index.hour).count()
Which returns an output similar to:
datetime stringvalues
datetime
5 0 0
6 2 2
7 5 5
8 1 1
...
I researched multi-indexing and resampling to some length the past two days but I have been unable to find a similar question. The desired result would look something like this:
datetime A B
0500 1 2
0530 3 5
0600 4 6
0630 2 0
....
There is no straightforward way to do a TimeGrouper on the time component, so we do this in two steps:
v = (df.groupby([pd.Grouper(key='datetime', freq='30min'), 'stringvalues'])
.size()
.unstack(fill_value=0))
v.groupby(v.index.time).sum()
stringvalues A B
05:00:00 1 0
17:00:00 1 0
17:30:00 1 1
19:00:00 0 1

Categories

Resources