Python Group by minutes in a day - python

I have log data that spans 30 days. I'm looking to group the data to see which 15-minute window has the lowest number of events in total over 24 hours. The data is formatted as so:
2021-04-26 19:12:03, upload
2021-04-26 11:32:03, download
2021-04-24 19:14:03, download
2021-04-22 1:9:03, download
2021-04-19 4:12:03, upload
2021-04-07 7:12:03, download
and I'm looking for a result like
19:15:00, 2
11:55:00, 1
7:15:00, 1
4:15:00, 1
1:15:00, 1
Currently, I use Grouper:
df['date'] = pd.to_datetime(df['date'])
df.groupby(pd.Grouper(key="date",freq='.25H')).Host.count()
and my results look like:
date
2021-04-08 16:15:00+00:00 1
2021-04-08 16:30:00+00:00 20
2021-04-08 16:45:00+00:00 6
2021-04-08 17:00:00+00:00 6
2021-04-08 17:15:00+00:00 0
..
2021-04-29 18:00:00+00:00 3
2021-04-29 18:15:00+00:00 9
2021-04-29 18:30:00+00:00 0
2021-04-29 18:45:00+00:00 3
2021-04-29 19:00:00+00:00 15
Is there any way I can group again on just the time, without including the date?

Do you want something like this?
Here, the idea is: if you're not concerned about the date, you can replace all the dates with one arbitrary date, and then group/count the data based on the time only.
df['Host'] = 1
# Replace every date with the same arbitrary date so that only the time of day differs
df.date = df.date.str.replace(r'(\d{4}-\d{1,2}-\d{1,2})', '2021-04-26', regex=True)
df.date = pd.to_datetime(df.date)
# Count events per 15-minute bin, drop the empty bins, and keep only the time part
new_df = df.groupby(pd.Grouper(key='date', freq='.25H')).agg({'Host': 'sum'}).reset_index()
new_df = new_df.loc[new_df['Host'] != 0]
new_df['date'] = new_df['date'].dt.time
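A shorter sketch (not part of the answer above, and assuming df still has the raw 'date' column from the question) that skips the string replacement: floor each timestamp to its 15-minute bin and group on the time of day alone.
import pandas as pd

# Sketch: count events per 15-minute window of the day, ignoring the date
df['date'] = pd.to_datetime(df['date'])
counts = df.groupby(df['date'].dt.floor('15min').dt.time).size()
print(counts.sort_values().head())  # the windows with the fewest events come first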

Say you want to group into a 5-minute window. For this, you need to extract the timestamp column. Let df be your pandas DataFrame. For each time in the timestamp column, round that time up to the nearest multiple of 5 minutes and add it to a counter map. See the code below.
timestamp = df["timestamp"]
counter = collections.defaultdict(int)
def get_time(time):
hh, mm, ss = map(int, time.split(':'))
total_seconds = hh * 3600 + mm * 60 + ss
roundup_seconds = math.ceil(total_seconds / (5*60)) * (5*60)
# I suggest you to try out the above formula on paper for better understanding
# '5 min' means '5*60 sec' roundup
new_hh = roundup_seconds // 3600
roundup_seconds %= 3600
new_mm = roundup_seconds // 60
roundup_seconds %= 60
new_ss = roundup_seconds
return f"{new_hh}:{new_mm}:{new_ss}" # f-strings for python 3.6 and above
for time in timestamp:
counter[get_time(time)] += 1
# Now counter will carry counts of rounded time stamp
# I've tested locally and it's same as the output you mentioned.
# Let me know if you need any further help :)
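The question's column is named date and holds full datetimes rather than bare times; a hypothetical adaptation (assuming values like '2021-04-26 19:12:03') strips off the date part before rounding:
# Hypothetical adaptation: keep only the time-of-day part of each entry
for stamp in df["date"]:
    counter[get_time(str(stamp).split(' ')[1])] += 1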

One approach is to use Timedelta instead of datetime, since the comparison happens only between hours and minutes, not dates.
import pandas as pd
import numpy as np
df = pd.DataFrame({'time': {0: '2021-04-26 19:12:03', 1: '2021-04-26 11:32:03',
                            2: '2021-04-24 19:14:03', 3: '2021-04-22 1:9:03',
                            4: '2021-04-19 4:12:03', 5: '2021-04-07 7:12:03'},
                   'event': {0: 'upload', 1: 'download', 2: 'download',
                             3: 'download', 4: 'upload', 5: 'download'}})
# Convert To TimeDelta (Ignore Day)
df['time'] = pd.to_timedelta(df['time'].str[-8:])
# Set TimeDelta as index
df = df.set_index('time')
# Get Count of events per 15 minute period
df = df.resample('.25H')['event'].count()
# Convert To Nearest 15 Minute Interval
ns15min = 15 * 60 * 1000000000 # 15 minutes in nanoseconds
df.index = pd.to_timedelta(((df.index.astype(np.int64) // ns15min + 1) * ns15min))
# Reset Index, Filter and Sort
df = df.reset_index()
df = df[df['event'] > 0]
df = df.sort_values(['event', 'time'], ascending=(False, False))
# Remove Day Part of Time Delta (Convert to str)
df['time'] = df['time'].astype(str).str[-8:]
# For Display
print(df.to_string(index=False))
Filtered Output:
time event
19:15:00 2
21:00:00 1
11:30:00 1
07:15:00 1
04:15:00 1

Related

Time elapsed since first log for each user

I'm trying to calculate the time difference between all the logs of a user and the first log of that same user. There are users with several logs.
The dataframe looks like this:
16 00000021601 2022-08-23 17:12:04
20 00000021601 2022-08-23 17:12:04
21 00000031313 2022-10-22 11:16:57
22 00000031313 2022-10-22 12:16:44
23 00000031313 2022-10-22 14:39:07
24 00000065137 2022-05-06 11:51:33
25 00000065137 2022-05-06 11:51:33
I know that I could do df['DELTA'] = df.groupby('ID')['DATE'].shift(-1) - df['DATE'] to get the difference between consecutive dates for each user, but since something like iat[0] doesn't work in this case I don't know how to get the difference in relation to the first date.
You can try this code
import pandas as pd
dates = ['2022-08-23 17:12:04',
         '2022-08-23 17:12:04',
         '2022-10-22 11:16:57',
         '2022-10-22 12:16:44',
         '2022-10-22 14:39:07',
         '2022-05-06 11:51:33',
         '2022-05-06 11:51:33',]
ids = [1,1,1,2,2,2,2]
df = pd.DataFrame({'id':ids, 'dates':dates})
df['dates'] = pd.to_datetime(df['dates'])
df.groupby('id').apply(lambda x: x['dates'] - x['dates'].iloc[0])
Out:
id
1   0     0 days 00:00:00
    1     0 days 00:00:00
    2    59 days 18:04:53
2   3     0 days 00:00:00
    4     0 days 02:22:23
    5   -170 days +23:34:49
    6   -170 days +23:34:49
Name: dates, dtype: timedelta64[ns]
If your dataframe is large and apply takes a long time, you can try parallel-pandas. It's very simple:
import pandas as pd
from parallel_pandas import ParallelPandas
ParallelPandas.initialize(n_cpu=8)
dates = ['2022-08-23 17:12:04',
         '2022-08-23 17:12:04',
         '2022-10-22 11:16:57',
         '2022-10-22 12:16:44',
         '2022-10-22 14:39:07',
         '2022-05-06 11:51:33',
         '2022-05-06 11:51:33',]
ids = [1,1,1,2,2,2,2]
df = pd.DataFrame({'id':ids, 'dates':dates})
df['dates'] = pd.to_datetime(df['dates'])
# p_apply is the parallel analogue of the apply method
df.groupby('id').p_apply(lambda x: x['dates'] - x['dates'].iloc[0])
It will be 5-10 times faster.
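A vectorized alternative that avoids apply altogether is groupby().transform('first'); this is a sketch under the same assumption as above, namely that the first row of each id is that user's first log (the column name 'delta' is mine):
# Sketch: subtract each user's first date without apply
df['delta'] = df['dates'] - df.groupby('id')['dates'].transform('first')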

Iterative time difference between two column entries in a data frame column

I have a pandas DataFrame with a Date column in a format like 2022-07-22. The table is also shown below for a better understanding. I would like to get the time elapsed between each entry in hours. So far I have managed to get the elapsed time using this code:
startTime = data.Date.loc[1]
endTime = data.Date.loc[2]
T= endTime-startTime
seconds = T.total_seconds()
hours = seconds / 3600
print('Difference in hours: ', hours)
Now I would like to do this iteratively over the entire column. Any help with this will be appreciated. Here is a small section of the table to see what I mean:
Date
2022-07-22 15:35:13
2022-07-22 15:35:18
2022-07-22 15:35:23
2022-07-22 15:35:28
The 'Date' column is converted from string to datetime64[ns]. The difference function diff is used to calculate the 'dif' column, and the differences are then divided by a pd.Timedelta of one hour. I think a loop is redundant here. (I added a larger hour difference between the sample values so the result is easier to see.)
import pandas as pd
df = pd.DataFrame(
    {'Date': ['2022-07-22 15:35:13', '2022-07-22 17:35:18', '2022-07-22 19:35:18', '2022-07-22 20:35:28']})
df['Date'] = pd.to_datetime(df['Date'], errors='raise')
df['dif'] = df['Date'].diff()
df['h'] = df['dif'] / pd.Timedelta('1 hour')
print(df)
Output
Date dif h
0 2022-07-22 15:35:13 NaT NaN
1 2022-07-22 17:35:18 0 days 02:00:05 2.001389
2 2022-07-22 19:35:18 0 days 02:00:00 2.000000
3 2022-07-22 20:35:28 0 days 01:00:10 1.002778
But if you still need to do it iteratively, you can do something like this:
a = 0
for i in range(1, len(df)):
    a = df.loc[i, 'Date'] - df.loc[i-1, 'Date']
    a = a / pd.Timedelta('1 hour')
    print(a)
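If you want to keep the iterative results instead of printing them, a small sketch (the column name h_loop is mine) collects them into a list and assigns it back:
# Sketch: store the per-row differences in a new column
hours = [float('nan')]  # the first row has no previous entry
for i in range(1, len(df)):
    delta = df.loc[i, 'Date'] - df.loc[i - 1, 'Date']
    hours.append(delta / pd.Timedelta('1 hour'))
df['h_loop'] = hours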

How to create a binary variable based on date ranges

I would like to flag all rows with dates 1 week before and 1 week after a specific holiday to be = 1; = 0 otherwise.
What's the best way to do so? Below is my code, which only flags New Year's Day with new_year = 1. What I want is all 3 rows to have new_year = 1 (since they fall within 1 week before or after New Year's Day).
Note: I would like the code to work for any holidays (e.g. Thanksgiving, Easter, etc.).
Thank you!
# importing libraries
import pandas as pd
import numpy as np
import holidays
# Creating the dataframe
df = pd.DataFrame({'Date': ['1/1/2019', '1/5/2019', '12/28/2018'],
                   'Event': ['Music', 'Poetry', 'Theatre'],
                   'Cost': [10000, 5000, 15000]})
df['newDate'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
new_year = holidays.HolidayBase()
new_year.append({"2018-01-01": "New Year's Day",
                 "2019-01-01": "New Year's Day"})
df['hol_new_year'] = np.where(df['newDate'] in new_year, 1, 0)
You can use pandas' time series offsets:
ye = pd.tseries.offsets.YearEnd()
yb = pd.tseries.offsets.YearBegin()
d = pd.to_timedelta('1w')
s = df['newDate']
df['hol_new_year'] = (s.between(s - ye - d, s - ye + d)
                      | s.between(s + yb - d, s + yb + d)
                      ).astype(int)
Output:
Date Event Cost newDate hol_new_year
0 1/1/2019 Music 10000 2019-01-01 1
1 1/5/2019 Poetry 5000 2019-01-05 1
2 12/28/2018 Theatre 15000 2018-12-28 1
3 1/15/2021 SO 0 2021-01-15 0
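To generalize this to any holiday (Thanksgiving, Easter, etc.), one possible sketch uses the holidays package directly; the UnitedStates calendar, the 7-day window, and the column name near_holiday are my assumptions, not part of the answer above:
import pandas as pd
import holidays

# Sketch: flag rows that fall within one week of any holiday in the chosen calendar
cal = holidays.UnitedStates(years=[2018, 2019])
hol_dates = pd.to_datetime(list(cal.keys()))

def near_any_holiday(date, window=pd.Timedelta('7D')):
    # True if `date` is within `window` of at least one holiday date
    return bool((abs(hol_dates - date) <= window).any())

df['near_holiday'] = df['newDate'].apply(near_any_holiday).astype(int)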

How to convert Pandas column into date type when values don't respect a pattern?

I have the following DataFrame:
Timestamp real time
0 17FEB20:23:59:50 0.003
1 17FEB20:23:59:55 0.003
2 17FEB20:23:59:57 0.012
3 17FEB20:23:59:57 02:54.8
4 17FEB20:24:00:00 0.03
5 18FEB20:00:00:00 0
6 18FEB20:00:00:02 54.211
7 18FEB20:00:00:02 0.051
How do I convert the columns to datetime64?
There are two things that make this challenging for me:
The column Timestamp, index 4, has the value 17FEB20:24:00:00, which does not seem to be a valid date-time (although it was output by a SAS program...).
The column real time doesn't follow a pattern and it seems it cannot be matched through a date_parser.
This is what I've tried to address the first column (Timestamp):
data['Timestamp'] = pd.to_datetime(
data['Timestamp'],
format='%d%b%y:%H:%M:%S')
But due the value of the index 4 (17FEB20:24:00:00) I get:
ValueError: time data '17FEB20:24:00:00' does not match format '%d%b%y:%H:%M:%S' (match). If I remove this line, it does work, but I have to find a way to address it, as my dataset has thousands of lines and I cannot simply ignore them. Perhaps there's a way to convert it to zero hours of the next day?
Here's a snippet to create the sample DataFrame above, to save some time working on the answer (if you need it):
data = pd.DataFrame({
    'Timestamp': [
        '17FEB20:23:59:50',
        '17FEB20:23:59:55',
        '17FEB20:23:59:57',
        '17FEB20:23:59:57',
        '17FEB20:24:00:00',
        '18FEB20:00:00:00',
        '18FEB20:00:00:02',
        '18FEB20:00:00:02'],
    'real time': [
        '0.003',
        '0.003',
        '0.012',
        '02:54.8',
        '0.03',
        '0',
        '54.211',
        '0.051',
    ]})
Appreciate your help!
If your data is not too big, you might want to consider looping through the dataframe. You can do something like this.
for index, row in data.iterrows():
    if row['Timestamp'][8:10] == '24':
        date = (pd.to_datetime(row['Timestamp'][:7]).date() + pd.DateOffset(1)).strftime('%d%b%y').upper()
        data.loc[index, 'Timestamp'] = date + ':00:00:00'
This is the result.
Timestamp real time
0 17FEB20:23:59:50 0.003
1 17FEB20:23:59:55 0.003
2 17FEB20:23:59:57 0.012
3 17FEB20:23:59:57 02:54.8
4 18FEB20:00:00:00 0.03
5 18FEB20:00:00:00 0
6 18FEB20:00:00:02 54.211
7 18FEB20:00:00:02 0.051
Here's how I addressed it:
For the column Timestamp, I've used this reply (Thanks #merit_2 for sharing it in the first comment).
For the column real time, I parse using some conditions.
Here's the code:
import os
import pandas as pd
from datetime import timedelta
# Parsing "real time" column:
## Apply mask '.000' to the microseconds
data['real time'] = [sub if len(sub.split('.')) == 1 else sub.split('.')[0]+'.'+'{:<03s}'.format(sub.split('.')[1]) for sub in data['real time'].values]
## apply mask over all '00:00:00.000'
placeholders = {
1: '00:00:00.00',
2: '00:00:00.0',
3: '00:00:00.',
4: '00:00:00',
5: '00:00:0',
6: '00:00:',
7: '00:00',
8: '00:0',
9: '00:',
10:'00',
11:'0'}
for cond_len in placeholders:
condition = data['real time'].str.len() == cond_len
data.loc[(condition),'real time'] = placeholders[cond_len] + data.loc[(condition),'real time']
# Parsing "Timestamp" column:
selrow = data['Timestamp'].str.contains('24:00')
data['Timestamp'] = data['Timestamp'].str.replace('24:00', '00:00')
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['Timestamp'] = data['Timestamp'] + selrow * timedelta(days=1)
# Convert to columns to datetime type:
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['real time'] = pd.to_datetime(data['real time'], format='%H:%M:%S.%f')
# check results:
display(data)
display(data.dtypes)
Here's the output:
Timestamp real time
0 2020-02-17 23:59:50 1900-01-01 00:00:00.003
1 2020-02-17 23:59:55 1900-01-01 00:00:00.003
2 2020-02-17 23:59:57 1900-01-01 00:00:00.012
3 2020-02-17 23:59:57 1900-01-01 00:02:54.800
4 2020-02-18 00:00:00 1900-01-01 00:00:00.030
5 2020-02-18 00:00:00 1900-01-01 00:00:00.000
6 2020-02-18 00:00:02 1900-01-01 00:00:54.211
7 2020-02-18 00:00:02 1900-01-01 00:00:00.051
Timestamp datetime64[ns]
real time datetime64[ns]
Perhaps there's a clever way to do that, but for now it suits.
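One possibly simpler sketch for the real time column (not part of the answer above): values without a colon are plain seconds, and values like 02:54.8 are mm:ss.f, so both forms can be handed straight to pd.to_timedelta. Note this yields a Timedelta column (named 'real time td' here, my choice) rather than the 1900-01-01-based datetime shown above:
import numpy as np
import pandas as pd

# Sketch: '0.003' -> '0.003s' (seconds), '02:54.8' -> '00:02:54.8' (hh:mm:ss.f)
rt = data['real time'].astype(str)
data['real time td'] = pd.to_timedelta(
    np.where(rt.str.contains(':'), '00:' + rt, rt + 's'))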

Python Optimization - Can I avoid the double for loop?

I have a df like the following:
import datetime as dt
import pandas as pd
import pytz
cols = ['utc_datetimes', 'zone_name']
data = [
    ['2019-11-13 14:41:26,2019-12-18 23:04:12', 'Europe/Stockholm'],
    ['2019-12-06 21:49:04,2019-12-11 22:52:57,2019-12-18 20:30:58,2019-12-23 18:49:53,2019-12-27 18:34:23,2020-01-07 21:20:51,2020-01-11 17:36:56,2020-01-20 21:45:47,2020-01-30 20:48:49,2020-02-03 21:04:52,2020-02-07 20:05:02,2020-02-10 21:07:21', 'Europe/London']
]
df = pd.DataFrame(data, columns=cols)
print(df)
# utc_datetimes zone_name
# 0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm
# 1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London
And I would like to count the number of nights and Wednesdays that the dates in the df represent, in each row's local time. This is the desired output:
utc_datetimes zone_name nights wednesdays
0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm 0 1
1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London 11 2
I've come up with the following double for loop, but it is not as efficient as I'd like for a sizable df:
# New columns.
df['nights'] = 0
df['wednesdays'] = 0

for row in range(df.shape[0]):
    date_list = df['utc_datetimes'].iloc[row].split(',')
    user_time_zone = df['zone_name'].iloc[row]
    for date in date_list:
        datetime_obj = dt.datetime.strptime(
            date, '%Y-%m-%d %H:%M:%S'
        ).replace(tzinfo=pytz.utc)
        local_datetime = datetime_obj.astimezone(pytz.timezone(user_time_zone))
        # Get day of the week count:
        if local_datetime.weekday() == 2:
            df['wednesdays'].iloc[row] += 1
        # Get time of the day count:
        if (local_datetime.hour > 17) & (local_datetime.hour <= 23):
            df['nights'].iloc[row] += 1
Any suggestions will be appreciated :)
PD. disregard the definition of 'night', just an example.
One way is to first create a reference df by exploding your utc_datetimes column and then get the TimeDelta for each zone:
df = pd.DataFrame(data, columns=cols)
s = (df.assign(utc_datetimes=df["utc_datetimes"].str.split(","))
       .explode("utc_datetimes"))
s["diff"] = [pd.Timestamp(a, tz=b).utcoffset() for a,b in zip(s["utc_datetimes"],s["zone_name"])]
With this helper df you can calculate the number of wednesdays and nights:
df["wednesdays"] = (pd.to_datetime(s["utc_datetimes"])+s["diff"]).dt.day_name().eq("Wednesday").groupby(level=0).sum()
df["nights"] = ((pd.to_datetime(s["utc_datetimes"])+s["diff"]).dt.hour>17).groupby(level=0).sum()
print (df)
utc_datetimes zone_name wednesdays nights
0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm 1.0 0.0
1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London 2.0 11.0
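An alternative sketch (not part of the answer above) converts each exploded timestamp to its local zone with tz_convert, reusing the exploded frame s; the night definition (hour 18 through 23) mirrors the question's:
# Sketch: convert each UTC timestamp to its row's local time zone, then count per original row
s["local"] = [pd.Timestamp(t, tz="UTC").tz_convert(z)
              for t, z in zip(s["utc_datetimes"], s["zone_name"])]
df["wednesdays"] = s["local"].apply(lambda t: t.weekday() == 2).groupby(level=0).sum()
df["nights"] = s["local"].apply(lambda t: 17 < t.hour <= 23).groupby(level=0).sum()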
