Iterative time difference between two column entries in a data frame column - python

I have a pandas data frame with a Date column in this format, for example 2022-07-22. A sample of the table is below for a better understanding. I would like to get the time elapsed between each entry in hours. So far I have managed to get the elapsed time between two rows using this code:
startTime = data.Date.loc[1]
endTime = data.Date.loc[2]
T= endTime-startTime
seconds = T.total_seconds()
hours = seconds / 3600
print('Difference in hours: ', hours)
Now I would like to do this iteratively over the entire column. Any help with this will be appreciated. Here is a small section of the table to see what I mean:
Date
2022-07-22 15:35:13
2022-07-22 15:35:18
2022-07-22 15:35:23
2022-07-22 15:35:28

First, the 'Date' column is converted from string to datetime64[ns]. Then the difference function diff is used to calculate the 'dif' column, and the differences are divided by a pd.Timedelta of one hour. A loop is redundant here. (I added a larger hour difference to each sample value so the result is easier to see.)
import pandas as pd

df = pd.DataFrame(
    {'Date': ['2022-07-22 15:35:13', '2022-07-22 17:35:18',
              '2022-07-22 19:35:18', '2022-07-22 20:35:28']})
df['Date'] = pd.to_datetime(df['Date'], errors='raise')
df['dif'] = df['Date'].diff()
df['h'] = df['dif'] / pd.Timedelta('1 hour')
print(df)
Output
                 Date             dif         h
0 2022-07-22 15:35:13             NaT       NaN
1 2022-07-22 17:35:18 0 days 02:00:05  2.001389
2 2022-07-22 19:35:18 0 days 02:00:00  2.000000
3 2022-07-22 20:35:28 0 days 01:00:10  1.002778
But if you still need to iterate, you can do something like this:
for i in range(1, len(df)):
    a = df.loc[i, 'Date'] - df.loc[i-1, 'Date']
    a = a / pd.Timedelta('1 hour')
    print(a)
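The hours column in the diff approach above can equivalently be computed with the .dt accessor; a small sketch:
# Same result as dividing by pd.Timedelta('1 hour')
df['h'] = df['dif'].dt.total_seconds() / 3600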

Related

Time elapsed since first log for each user

I'm trying to calculate the time difference between all the logs of a user and the first log of that same user. There are users with several logs.
The dataframe looks like this:
             ID                DATE
16  00000021601 2022-08-23 17:12:04
20  00000021601 2022-08-23 17:12:04
21  00000031313 2022-10-22 11:16:57
22  00000031313 2022-10-22 12:16:44
23  00000031313 2022-10-22 14:39:07
24  00000065137 2022-05-06 11:51:33
25  00000065137 2022-05-06 11:51:33
I know that I could do df['DELTA'] = df.groupby('ID')['DATE'].shift(-1) - df['DATE'] to get the difference between consecutive dates for each user, but since something like iat[0] doesn't work in this case I don't know how to get the difference in relation to the first date.
You can try this code:
import pandas as pd

dates = ['2022-08-23 17:12:04',
         '2022-08-23 17:12:04',
         '2022-10-22 11:16:57',
         '2022-10-22 12:16:44',
         '2022-10-22 14:39:07',
         '2022-05-06 11:51:33',
         '2022-05-06 11:51:33']
ids = [1, 1, 1, 2, 2, 2, 2]
df = pd.DataFrame({'id': ids, 'dates': dates})
df['dates'] = pd.to_datetime(df['dates'])
# x['dates'].iloc[0] is each group's first date; x.iloc[0, 0] would
# pick the 'id' column instead and break the subtraction
df.groupby('id').apply(lambda x: x['dates'] - x['dates'].iloc[0])
Out:
id
1   0     0 days 00:00:00
    1     0 days 00:00:00
    2    59 days 18:04:53
2   3     0 days 00:00:00
    4     0 days 02:22:23
    5   -170 days +23:34:49
    6   -170 days +23:34:49
Name: dates, dtype: timedelta64[ns]
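A vectorized alternative that avoids apply altogether is groupby.transform, which broadcasts each group's first date back to every row; a sketch (the 'delta' column name is just an example):
# subtract each user's first date from every one of their rows
df['delta'] = df['dates'] - df.groupby('id')['dates'].transform('first')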
If your dataframe is large and apply takes a long time, you can try parallel-pandas. It's very simple:
import pandas as pd
from parallel_pandas import ParallelPandas

ParallelPandas.initialize(n_cpu=8)

dates = ['2022-08-23 17:12:04',
         '2022-08-23 17:12:04',
         '2022-10-22 11:16:57',
         '2022-10-22 12:16:44',
         '2022-10-22 14:39:07',
         '2022-05-06 11:51:33',
         '2022-05-06 11:51:33']
ids = [1, 1, 1, 2, 2, 2, 2]
df = pd.DataFrame({'id': ids, 'dates': dates})
df['dates'] = pd.to_datetime(df['dates'])
# p_apply is the parallel analogue of the apply method
df.groupby('id').p_apply(lambda x: x['dates'] - x['dates'].iloc[0])
It will be 5-10 times faster.

Adding a datetime column in pandas dataframe from minute values

I have a data frame with a time column holding minutes from 0-1439, i.e. the 1440 minutes of a day. I want to add a datetime column representing the day 2021-3-21, including hh and mm, like 1980-03-01 11:00. I tried the following code:
from datetime import datetime, timedelta
date = datetime.date(2021, 3, 21)
days = date - datetime.date(1900, 1, 1)
df['datetime'] = pd.to_datetime(df['time'],format='%H:%M:%S:%f') + pd.to_timedelta(days, unit='d')
But I get the error: descriptor 'date' requires a 'datetime.datetime' object but received a 'int'.
Is there another way to solve this problem, or a fix for this code? Please help me figure this out.
>>df
time
0
1
2
3
..
1439
I want to convert these minutes to the particular format 1980-03-01 11:00, where I will use the date 2021-3-21 and convert the minutes to the hh:mm part. The dataframe will look like:
>df
datetime time
2021-3-21 00:00 0
2021-3-21 00:01 1
2021-3-21 00:02 2
...
How can I format my data in this way?
The error comes from from datetime import datetime: after that import, datetime.date is a method descriptor (the date() method of datetime objects), not the date class, so datetime.date(2021, 3, 21) fails. You don't need it anyway. Let's try pd.to_timedelta instead to get the duration in minutes from time, then add a Timestamp:
df['datetime'] = (
pd.Timestamp('2021-3-21') + pd.to_timedelta(df['time'], unit='m')
)
df.head():
time datetime
0 0 2021-03-21 00:00:00
1 1 2021-03-21 00:01:00
2 2 2021-03-21 00:02:00
3 3 2021-03-21 00:03:00
4 4 2021-03-21 00:04:00
Complete Working Example with Sample Data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': np.arange(0, 1440)})
df['datetime'] = (
pd.Timestamp('2021-3-21') + pd.to_timedelta(df['time'], unit='m')
)
print(df)
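If you need the exact string form from the question (e.g. 2021-03-21 00:01) rather than a datetime64 column, dt.strftime can format it; a sketch, with 'datetime_str' as a hypothetical column name:
# produces strings, not datetimes, so do this only for display
df['datetime_str'] = df['datetime'].dt.strftime('%Y-%m-%d %H:%M')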

Python Group by minutes in a day

I have log data that spans over 30 days. I am looking to group the data to see which 15-minute window has the lowest number of events in total over 24 hours. The data is formatted like so:
2021-04-26 19:12:03, upload
2021-04-26 11:32:03, download
2021-04-24 19:14:03, download
2021-04-22 1:9:03, download
2021-04-19 4:12:03, upload
2021-04-07 7:12:03, download
and I'm looking for a result like
19:15:00, 2
11:55:00, 1
7:15:00, 1
4:15:00, 1
1:15:00, 1
Currently, I use a Grouper:
df['date'] = pd.to_datetime(df['date'])
df.groupby(pd.Grouper(key="date",freq='.25H')).Host.count()
and my results look like
date
2021-04-08 16:15:00+00:00 1
2021-04-08 16:30:00+00:00 20
2021-04-08 16:45:00+00:00 6
2021-04-08 17:00:00+00:00 6
2021-04-08 17:15:00+00:00 0
..
2021-04-29 18:00:00+00:00 3
2021-04-29 18:15:00+00:00 9
2021-04-29 18:30:00+00:00 0
2021-04-29 18:45:00+00:00 3
2021-04-29 19:00:00+00:00 15
Is there any way to group on just the time, not including the date?
Do you want something like this?
The idea here is: if you're not concerned about the date, you can replace all the dates with one fixed date, and then group/count the data based on the time alone.
df.Host = 1
df.date = df.date.str.replace( r'(\d{4}-\d{1,2}-\d{1,2})','2021-04-26', regex=True)
df.date = pd.to_datetime(df.date)
new_df = df.groupby(pd.Grouper(key='date',freq='.25H')).agg({'Host' : sum}).reset_index()
new_df = new_df.loc[new_df['Host']!=0]
new_df['date'] = new_df['date'].dt.time
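A variant that skips the string replacement entirely is to floor each timestamp to its 15-minute bucket and group on the time of day alone; a sketch, assuming df['date'] holds the raw date strings:
df['date'] = pd.to_datetime(df['date'])
# dt.floor('15min') snaps each timestamp down to its 15-minute bucket;
# .dt.time drops the date part so all days are merged together
counts = df.groupby(df['date'].dt.floor('15min').dt.time).size()
print(counts.sort_values())  # smallest windows first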
Say you want to gather events into a 5-minute window. For this, you need to extract the timestamp column; let df be your pandas dataframe. For each time in timestamp, round that time up to the nearest multiple of 5 minutes and add it to a counter map. See the code below.
import collections
import math

timestamp = df["timestamp"]
counter = collections.defaultdict(int)

def get_time(time):
    hh, mm, ss = map(int, time.split(':'))
    total_seconds = hh * 3600 + mm * 60 + ss
    # '5 min' means '5*60 sec': round the total seconds up to the nearest
    # multiple of that window (try the formula on paper to see why)
    roundup_seconds = math.ceil(total_seconds / (5*60)) * (5*60)
    new_hh = roundup_seconds // 3600
    roundup_seconds %= 3600
    new_mm = roundup_seconds // 60
    roundup_seconds %= 60
    new_ss = roundup_seconds
    return f"{new_hh}:{new_mm}:{new_ss}"  # f-strings need Python 3.6+

for time in timestamp:
    counter[get_time(time)] += 1
# Now counter carries counts per rounded time stamp. I've tested locally
# and it matches the output you mentioned.
One approach is to use TimeDelta instead of DateTime, since the comparison happens only between hours and minutes, not dates.
import pandas as pd
import numpy as np
df = pd.DataFrame({'time': {0: '2021-04-26 19:12:03', 1: '2021-04-26 11:32:03',
                            2: '2021-04-24 19:14:03', 3: '2021-04-22 1:9:03',
                            4: '2021-04-19 4:12:03', 5: '2021-04-07 7:12:03'},
                   'event': {0: 'upload', 1: 'download', 2: 'download',
                             3: 'download', 4: 'upload', 5: 'download'}})
# Convert To TimeDelta (Ignore Day)
df['time'] = pd.to_timedelta(df['time'].str[-8:])
# Set TimeDelta as index
df = df.set_index('time')
# Get Count of events per 15 minute period
df = df.resample('.25H')['event'].count()
# Convert To Nearest 15 Minute Interval
ns15min = 15 * 60 * 1000000000 # 15 minutes in nanoseconds
df.index = pd.to_timedelta(((df.index.astype(np.int64) // ns15min + 1) * ns15min))
# Reset Index, Filter and Sort
df = df.reset_index()
df = df[df['event'] > 0]
df = df.sort_values(['event', 'time'], ascending=(False, False))
# Remove Day Part of Time Delta (Convert to str)
df['time'] = df['time'].astype(str).str[-8:]
# For Display
print(df.to_string(index=False))
Filtered Output:
time event
19:15:00 2
21:00:00 1
11:30:00 1
07:15:00 1
04:15:00 1

How to convert Pandas column into date type when values don't respect a pattern?

I have the following dataFrame:
Timestamp real time
0 17FEB20:23:59:50 0.003
1 17FEB20:23:59:55 0.003
2 17FEB20:23:59:57 0.012
3 17FEB20:23:59:57 02:54.8
4 17FEB20:24:00:00 0.03
5 18FEB20:00:00:00 0
6 18FEB20:00:00:02 54.211
7 18FEB20:00:00:02 0.051
How to convert the columns to datetime64?
There are two things that make this challenging for me:
The column Timestamp, index 4, has the value 17FEB20:24:00:00, which does not seem to be a valid date-time (although it was output by a SAS program...).
The column real time doesn't follow a pattern and seemingly cannot be matched through a date_parser.
This is what I've tried to address the first column (Timestamp):
data['Timestamp'] = pd.to_datetime(
data['Timestamp'],
format='%d%b%y:%H:%M:%S')
But due the value of the index 4 (17FEB20:24:00:00) I get:
ValueError: time data '17FEB20:24:00:00' does not match format '%d%b%y:%H:%M:%S' (match). If I remove this line, it does work, but I have to find a way to address it, as my dataset has thousands of lines and I cannot simply ignore them. Perhaps there's a way to convert it to zero hours of the next day?
Here's a snippet code to create the dataFrame sample as above to to gain some time working on the answer (if you need):
data = pd.DataFrame({
'Timestamp':[
'17FEB20:23:59:50',
'17FEB20:23:59:55',
'17FEB20:23:59:57',
'17FEB20:23:59:57',
'17FEB20:24:00:00',
'18FEB20:00:00:00',
'18FEB20:00:00:02',
'18FEB20:00:00:02'],
'real time': [
'0.003',
'0.003',
'0.012',
'02:54.8',
'0.03',
'0',
'54.211',
'0.051',
]})
Appreciate your help!
If your data is not too big, you might want to consider looping through the dataframe. You can do something like this:
for index, row in data.iterrows():
    if row['Timestamp'][8:10] == '24':
        # roll '24:00:00' over to 00:00:00 of the next day
        date = (pd.to_datetime(row['Timestamp'][:7]).date() + pd.DateOffset(1)).strftime('%d%b%y').upper()
        data.loc[index, 'Timestamp'] = date + ':00:00:00'
This is the result.
Timestamp real time
0 17FEB20:23:59:50 0.003
1 17FEB20:23:59:55 0.003
2 17FEB20:23:59:57 0.012
3 17FEB20:23:59:57 02:54.8
4 18FEB20:00:00:00 0.03
5 18FEB20:00:00:00 0
6 18FEB20:00:00:02 54.211
7 18FEB20:00:00:02 0.051
Here's how I addressed it:
For the column Timestamp, I've used this reply (thanks @merit_2 for sharing it in the first comment).
For the column real time, I parse using some conditions.
Here's the code:
import pandas as pd
from datetime import timedelta
# Parsing "real time" column:
## Apply mask '.000' to the microseconds
data['real time'] = [sub if len(sub.split('.')) == 1 else sub.split('.')[0]+'.'+'{:<03s}'.format(sub.split('.')[1]) for sub in data['real time'].values]
## apply mask over all '00:00:00.000'
placeholders = {
1: '00:00:00.00',
2: '00:00:00.0',
3: '00:00:00.',
4: '00:00:00',
5: '00:00:0',
6: '00:00:',
7: '00:00',
8: '00:0',
9: '00:',
10:'00',
11:'0'}
for cond_len in placeholders:
    condition = data['real time'].str.len() == cond_len
    data.loc[condition, 'real time'] = placeholders[cond_len] + data.loc[condition, 'real time']
# Parsing "Timestamp" column:
selrow = data['Timestamp'].str.contains('24:00')
data['Timestamp'] = data['Timestamp'].str.replace('24:00', '00:00')
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['Timestamp'] = data['Timestamp'] + selrow * timedelta(days=1)
# Convert to columns to datetime type:
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['real time'] = pd.to_datetime(data['real time'], format='%H:%M:%S.%f')
# check results:
display(data)
display(data.dtypes)
Here's the output:
Timestamp real time
0 2020-02-17 23:59:50 1900-01-01 00:00:00.003
1 2020-02-17 23:59:55 1900-01-01 00:00:00.003
2 2020-02-17 23:59:57 1900-01-01 00:00:00.012
3 2020-02-17 23:59:57 1900-01-01 00:02:54.800
4 2020-02-18 00:00:00 1900-01-01 00:00:00.030
5 2020-02-18 00:00:00 1900-01-01 00:00:00.000
6 2020-02-18 00:00:02 1900-01-01 00:00:54.211
7 2020-02-18 00:00:02 1900-01-01 00:00:00.051
Timestamp datetime64[ns]
real time datetime64[ns]
Perhaps there's a clever way to do that, but for now it suits.
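One such shorter route might be to parse real time into a timedelta64 column instead of datetime64, which arguably fits a duration better; a sketch, assuming the values are either plain seconds like '54.211' or MM:SS.f like '02:54.8':
import pandas as pd

rt = data['real time']
has_colon = rt.str.contains(':')
parsed = pd.Series(pd.NaT, index=rt.index, dtype='timedelta64[ns]')
# plain-second values: convert the float seconds directly
parsed[~has_colon] = pd.to_timedelta(pd.to_numeric(rt[~has_colon]), unit='s')
# MM:SS.f values: prefix the hours so to_timedelta can parse HH:MM:SS.f
parsed[has_colon] = pd.to_timedelta('00:' + rt[has_colon])
data['real time'] = parsed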

Pandas offset DatetimeIndex to next business if date is not a business day

I have a DataFrame which is indexed with the last day of the month. Sometimes this date is a weekday and sometimes it is a weekend. Ignoring holidays, I'm looking to offset the date to the next business date if the date is on a weekend and leave the result unchanged if it is already on a weekday.
Some example data would be
import pandas as pd
idx = [pd.to_datetime('20150430'), pd.to_datetime('20150531'),
pd.to_datetime('20150630')]
df = pd.DataFrame(0, index=idx, columns=['A'])
df
A
2015-04-30 0
2015-05-31 0
2015-06-30 0
df.index.weekday
array([3, 6, 1], dtype=int32)
Something like the following works; however, I would appreciate a solution that is a little more straightforward.
idx = df.index.copy()
wknds = (idx.weekday == 5) | (idx.weekday == 6)
idx2 = idx[~wknds]
idx2 = idx2.append(idx[wknds] + pd.datetools.BDay(1))
idx2 = idx2.order()
df.index = idx2
df
A
2015-04-30 0
2015-06-01 0
2015-06-30 0
You can add 0*BDay()
from pandas.tseries.offsets import BDay
df.index = df.index.map(lambda x : x + 0*BDay())
You can also use this with a Holiday calendar with CDay(calendar) in case there are holidays.
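The same roll-forward is also exposed directly on the offset object; a sketch:
from pandas.tseries.offsets import BDay
# rollforward leaves business days unchanged and moves weekend
# dates to the following Monday
df.index = df.index.map(BDay().rollforward)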
You can map the index with a lambda function, and set the result back to the index.
df.index = df.index.map(lambda x: x if x.dayofweek < 5 else x + pd.DateOffset(7-x.dayofweek))
df
A
2015-04-30 0
2015-06-01 0
2015-06-30 0
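The lambda above can also be vectorized with Index.where, keeping weekdays and replacing only weekend dates; a sketch:
import pandas as pd
# where() keeps entries that satisfy the condition and substitutes the rest
df.index = df.index.where(df.index.dayofweek < 5,
                          df.index + pd.offsets.BDay(1))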
Using DataFrame.resample
A more idiomatic method would be to resample to business days:
df.resample('B', label='right', closed='right').first().dropna()
A
2015-04-30 0.0
2015-06-01 0.0
2015-06-30 0.0
You can also use a variation of the logic: a) given an input date 'inputdate', go back one business day using pandas date_range, which accepts business-day frequencies; then b) go forward one business day the same way. In each step you generate a vector of two dates with date_range and select the min or max value to return the appropriate single value. So this could look as follows:
a) get business day before:
date_1b_bef = min(pd.date_range(start=inputdate, periods = 2, freq='-1B'))
b) get business day after the 'business day before':
date_1b_aft = max(pd.date_range(start=date_1b_bef, periods = 2, freq='1B'))
or substituting a) into b) to get one line:
date_1b_aft = max(pd.date_range(start=min(pd.date_range(start=inputdate, periods = 2, freq='-1B')), periods = 2, freq='1B'))
This can also be combined with relativedelta (from dateutil.relativedelta) to get the business day after some calendar-period offset from inputdate. For example:
a) get the business day (using 'following' convention if offset day is not a business day) for 1 calendar month prior to 'input date':
date_1mbef_fol = max(pd.date_range(min(pd.date_range(start=inputdate + relativedelta(months=-1), periods = 2, freq='-1B')), periods = 2, freq = '1B'))
b) get the business day (using 'preceding' convention if offset day is not a business day) for 1 year prior to 'input date':
date_1ybef_pre = min(pd.date_range(max(pd.date_range(start=inputdate + relativedelta(years=-1), periods = 2, freq='1B')), periods = 2, freq = '-1B'))
