Time elapsed since first log for each user - python

I'm trying to calculate the time difference between all the logs of a user and the first log of that same user. There are users with several logs.
The dataframe looks like this:
16 00000021601 2022-08-23 17:12:04
20 00000021601 2022-08-23 17:12:04
21 00000031313 2022-10-22 11:16:57
22 00000031313 2022-10-22 12:16:44
23 00000031313 2022-10-22 14:39:07
24 00000065137 2022-05-06 11:51:33
25 00000065137 2022-05-06 11:51:33
I know that I could do df['DELTA'] = df.groupby('ID')['DATE'].shift(-1) - df['DATE'] to get the difference between consecutive dates for each user, but since something like iat[0] doesn't work in this case I don't know how to get the difference in relation to the first date.

You can try this code:
import pandas as pd

dates = ['2022-08-23 17:12:04',
         '2022-08-23 17:12:04',
         '2022-10-22 11:16:57',
         '2022-10-22 12:16:44',
         '2022-10-22 14:39:07',
         '2022-05-06 11:51:33',
         '2022-05-06 11:51:33']
ids = [1, 1, 1, 2, 2, 2, 2]

df = pd.DataFrame({'id': ids, 'dates': dates})
df['dates'] = pd.to_datetime(df['dates'])

# subtract each group's first date from every date in that group
df.groupby('id').apply(lambda x: x['dates'] - x['dates'].iloc[0])
Out:
id
1   0      0 days 00:00:00
    1      0 days 00:00:00
    2     59 days 18:04:53
2   3      0 days 00:00:00
    4      0 days 02:22:23
    5   -170 days +23:34:49
    6   -170 days +23:34:49
Name: dates, dtype: timedelta64[ns]
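A vectorized alternative that avoids apply (a sketch, not part of the original answer) broadcasts each group's first date back to its rows with GroupBy.transform and subtracts; it assumes the rows are already ordered so each group's first entry is its earliest log:
import pandas as pd

dates = ['2022-08-23 17:12:04', '2022-08-23 17:12:04', '2022-10-22 11:16:57',
         '2022-10-22 12:16:44', '2022-10-22 14:39:07', '2022-05-06 11:51:33',
         '2022-05-06 11:51:33']
ids = [1, 1, 1, 2, 2, 2, 2]

df = pd.DataFrame({'id': ids, 'dates': pd.to_datetime(dates)})

# broadcast each group's first date back to its rows, then subtract
df['delta'] = df['dates'] - df.groupby('id')['dates'].transform('first')
print(df)
Because transform('first') preserves the original row order and index, the result can be assigned straight back as a column.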
If your dataframe is large and apply takes a long time, you can try parallel-pandas. It's very simple:
import pandas as pd
from parallel_pandas import ParallelPandas

ParallelPandas.initialize(n_cpu=8)

dates = ['2022-08-23 17:12:04',
         '2022-08-23 17:12:04',
         '2022-10-22 11:16:57',
         '2022-10-22 12:16:44',
         '2022-10-22 14:39:07',
         '2022-05-06 11:51:33',
         '2022-05-06 11:51:33']
ids = [1, 1, 1, 2, 2, 2, 2]

df = pd.DataFrame({'id': ids, 'dates': dates})
df['dates'] = pd.to_datetime(df['dates'])

# p_apply is the parallel analogue of the apply method
df.groupby('id').p_apply(lambda x: x['dates'] - x['dates'].iloc[0])
It can be 5-10 times faster.

Related

How to tackle a dataset that has multiple same date values

I have a large data set from which I'm trying to produce a time series using ARIMA. However, some of the data in the date column has multiple rows with the same date.
The dates were entered this way because the exact date of the event was not known, so unknown dates were entered as the first of that month (biased). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not the same. I have code to randomise a date, but I cannot find a way to replace the original dates with the randomised ones.
import random
import time

def str_time_prop(start, end, time_format, prop):
    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(time_format, time.localtime(ptime))

def random_date(start, end, prop):
    return str_time_prop(start, end, '%Y-%m-%d', prop)

# check that the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
I use the code above to generate a random date within a date range, but I'm struggling to find a way to replace the dates.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd

df = pd.DataFrame(
    {
        "date": [
            "2016-01-01",
            "2015-01-01",
            "2013-01-01",
            "2014-01-01",
            "2017-01-01",
            "2011-01-01",
        ],
        "value": [10035, 5397, 4567, 4343, 3981, 2049],
    }
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Output
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
Since the data in the date column has multiple rows with the same date, and you want to randomize the dates within the month, you could group by the year and month and select only the rows whose day equals 1. Then use calendar.monthrange to find the last day of the month for that particular year, and use that information when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np

np.random.seed(42)

df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])

# group multiple rows with the same year, month and day equal to 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day == 1])

FIRST_DAY = 2  # set for the desired range
df_list = []
for n, g in grouped:
    last_day = calendar.monthrange(n[0], n[1])[1]  # last day of this month and year
    g['New_Date'] = g['date'].apply(
        lambda d: d.replace(day=np.random.randint(FIRST_DAY, last_day + 1))
    )
    df_list.append(g)

new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12
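A fully vectorized alternative (a sketch, not part of the original answer; it assumes the date column is already datetime and that every unknown date sits on the 1st of its month) uses dt.days_in_month to pick a random day inside each row's own month:
import numpy as np
import pandas as pd

np.random.seed(42)

# number of days available in each row's month
days_in_month = df['date'].dt.days_in_month

# a random offset of 1 .. (days_in_month - 1) days keeps the result inside the month
offsets = (np.random.rand(len(df)) * (days_in_month - 1)).astype(int) + 1
df['New_Date'] = df['date'] + pd.to_timedelta(offsets, unit='D')
This avoids the groupby/apply loop, which matters if the real dataset is much larger than the toy example.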

Calculating values from time series in pandas multi-indexed pivot tables

I've got a dataframe in pandas that stores the Id of a person, the quality of interaction, and the date of the interaction. A person can have multiple interactions across multiple dates, so to help visualise and plot this I converted it into a pivot table grouping first by Id then by date to analyse the pattern over time.
e.g.
import pandas as pd
df = pd.DataFrame({'Id': ['A4G8', 'A4G8', 'A4G8', 'P9N3', 'P9N3', 'P9N3', 'P9N3', 'C7R5', 'L4U7'],
                   'Date': ['2016-1-1', '2016-1-15', '2016-1-30', '2017-2-12', '2017-2-28',
                            '2017-3-10', '2019-1-1', '2018-6-1', '2019-8-6'],
                   'Quality': [2, 3, 6, 1, 5, 10, 10, 2, 2]})
pt = df.pivot_table(values='Quality', index=['Id', 'Date'])
print(pt)
Leads to this:
                 Quality
Id   Date
A4G8 2016-1-1          2
     2016-1-15         4
     2016-1-30         6
P9N3 2017-2-12         1
     2017-2-28         5
     2017-3-10        10
     2019-1-1         10
C7R5 2018-6-1          2
L4U7 2019-8-6          2
However, I'd also like to...
Measure the time from the first interaction for each interaction per Id
Measure the time from the previous interaction with the same Id
So I'd get a table similar to the one below:
Id    Date       Quality  Time From First  Time To Prev
A4G8  2016-1-1         2  0 days           NA days
      2016-1-15        4  14 days          14 days
      2016-1-30        6  29 days          14 days
P9N3  2017-2-12        1  0 days           NA days
      2017-2-28        5  15 days          15 days
      2017-3-10       10  24 days          9 days
The Id column is a string, the Date column has been converted to datetime, and the Quality column to an integer.
The Id column is rather large (>10,000 unique Ids), so for performance reasons I'm trying to avoid using for loops. I'm guessing the solution somehow uses pd.eval, but I'm stuck as to how to apply it correctly.
Apologies, I'm a Python, pandas, and Stack Overflow noob, and I haven't found the answer anywhere yet, so even some pointers on where to look would be great :-).
Many thanks in advance
Convert Date to datetimes, then subtract the minimal datetime per group (computed with GroupBy.transform) from the Date column; for the second new column use DataFrameGroupBy.diff:
df['Date'] = pd.to_datetime(df['Date'])
df['Time From First'] = df['Date'].sub(df.groupby('Id')['Date'].transform('min'))
df['Time To Prev'] = df.groupby('Id')['Date'].diff()
print (df)
Id Date Quality Time From First Time To Prev
0 A4G8 2016-01-01 2 0 days NaT
1 A4G8 2016-01-15 3 14 days 14 days
2 A4G8 2016-01-30 6 29 days 15 days
3 P9N3 2017-02-12 1 0 days NaT
4 P9N3 2017-02-28 5 16 days 16 days
5 P9N3 2017-03-10 10 26 days 10 days
6 P9N3 2019-01-01 10 688 days 662 days
7 C7R5 2018-06-01 2 0 days NaT
8 L4U7 2019-08-06 2 0 days NaT
df["Date"] = pd.to_datetime(df.Date)
df = df.merge(
df.groupby(["Id"]).Date.first(),
on="Id",
how="left",
suffixes=["", "_first"]
)
df["Time From First"] = df.Date-df.Date_first
df['Time To Prev'] = df.groupby('Id').Date.diff()
df.set_index(["Id", "Date"], inplace=True)
df
output:

Python Group by minutes in a day

I have log data that spans 30 days. I am looking to group the data to see which 15-minute window has the lowest number of events in total over 24 hours. The data is formatted like so:
2021-04-26 19:12:03, upload
2021-04-26 11:32:03, download
2021-04-24 19:14:03, download
2021-04-22 1:9:03, download
2021-04-19 4:12:03, upload
2021-04-07 7:12:03, download
and I'm looking for a result like
19:15:00, 2
11:55:00, 1
7:15:00, 1
4:15:00, 1
1:15:00, 1
Currently, I use Grouper:
df['date'] = pd.to_datetime(df['date'])
df.groupby(pd.Grouper(key="date",freq='.25H')).Host.count()
and my results look like:
date
2021-04-08 16:15:00+00:00 1
2021-04-08 16:30:00+00:00 20
2021-04-08 16:45:00+00:00 6
2021-04-08 17:00:00+00:00 6
2021-04-08 17:15:00+00:00 0
..
2021-04-29 18:00:00+00:00 3
2021-04-29 18:15:00+00:00 9
2021-04-29 18:30:00+00:00 0
2021-04-29 18:45:00+00:00 3
2021-04-29 19:00:00+00:00 15
Is there any way I can group on just the time of day and not include the date?
Do you want something like this?
The idea is: if you're not concerned about the date, you can replace all the dates with a single dummy date, and then group/count the data based on the time component only.
df.Host = 1
df.date = df.date.str.replace(r'(\d{4}-\d{1,2}-\d{1,2})', '2021-04-26', regex=True)
df.date = pd.to_datetime(df.date)
new_df = df.groupby(pd.Grouper(key='date', freq='.25H')).agg({'Host': sum}).reset_index()
new_df = new_df.loc[new_df['Host'] != 0]
new_df['date'] = new_df['date'].dt.time
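A variant that skips the string replacement (a sketch, not taken from the original answers; it assumes the date column has already been parsed with pd.to_datetime) floors each timestamp to its 15-minute bucket and groups on the time-of-day component directly:
df['date'] = pd.to_datetime(df['date'])

# floor to the 15-minute bucket, keep only the time of day, and count events per bucket
counts = df.groupby(df['date'].dt.floor('15min').dt.time).size()
print(counts.sort_values())  # the smallest counts are the quietest windows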
Let's say you want to gather the data into 5-minute windows. For this, you need to extract the timestamp column. Let df be your pandas dataframe. For each time in the timestamp column, round it up to the nearest multiple of 5 minutes and add it to a counter map. See the code below.
timestamp = df["timestamp"]
counter = collections.defaultdict(int)
def get_time(time):
hh, mm, ss = map(int, time.split(':'))
total_seconds = hh * 3600 + mm * 60 + ss
roundup_seconds = math.ceil(total_seconds / (5*60)) * (5*60)
# I suggest you to try out the above formula on paper for better understanding
# '5 min' means '5*60 sec' roundup
new_hh = roundup_seconds // 3600
roundup_seconds %= 3600
new_mm = roundup_seconds // 60
roundup_seconds %= 60
new_ss = roundup_seconds
return f"{new_hh}:{new_mm}:{new_ss}" # f-strings for python 3.6 and above
for time in timestamp:
counter[get_time(time)] += 1
# Now counter will carry counts of rounded time stamp
# I've tested locally and it's same as the output you mentioned.
# Let me know if you need any further help :)
One approach is to use Timedelta instead of Datetime, since the comparison happens only between hours and minutes, not dates.
import pandas as pd
import numpy as np

df = pd.DataFrame({'time': {0: '2021-04-26 19:12:03', 1: '2021-04-26 11:32:03',
                            2: '2021-04-24 19:14:03', 3: '2021-04-22 1:9:03',
                            4: '2021-04-19 4:12:03', 5: '2021-04-07 7:12:03'},
                   'event': {0: 'upload', 1: 'download', 2: 'download',
                             3: 'download', 4: 'upload', 5: 'download'}})

# Convert to Timedelta (ignore the day)
df['time'] = pd.to_timedelta(df['time'].str[-8:])

# Set the Timedelta as index
df = df.set_index('time')

# Get the count of events per 15-minute period
df = df.resample('.25H')['event'].count()

# Round up to the nearest 15-minute interval
ns15min = 15 * 60 * 1000000000  # 15 minutes in nanoseconds
df.index = pd.to_timedelta(((df.index.astype(np.int64) // ns15min + 1) * ns15min))

# Reset index, filter and sort
df = df.reset_index()
df = df[df['event'] > 0]
df = df.sort_values(['event', 'time'], ascending=(False, False))

# Remove the day part of the Timedelta (convert to str)
df['time'] = df['time'].astype(str).str[-8:]

# For display
print(df.to_string(index=False))
Filtered Output:
time event
19:15:00 2
21:00:00 1
11:30:00 1
07:15:00 1
04:15:00 1

Count String Values in Column across 30 Minute Time Bins using Pandas

I am looking to determine the count of string variables in a column across a 3 month data sample. Samples were taken at random times throughout each day. I can group the data by hour, but I require the fidelity of 30 minute intervals (e.g. 0530-0600, 0600-0630) on roughly 10k rows of data.
An example of the data:
datetime stringvalues
2018-06-06 17:00 A
2018-06-07 17:30 B
2018-06-07 17:33 A
2018-06-08 19:00 B
2018-06-09 05:27 A
I have tried setting the datetime column as the index, but I cannot figure out how to group the data on anything other than 'hour', and I don't have fidelity on the string value count:
df['datetime'] = pd.to_datetime(df['datetime'])
df.index = df['datetime']
df.groupby(df.index.hour).count()
Which returns an output similar to:
datetime stringvalues
datetime
5 0 0
6 2 2
7 5 5
8 1 1
...
I researched multi-indexing and resampling at some length over the past two days, but I have been unable to find a similar question. The desired result would look something like this:
datetime A B
0500 1 2
0530 3 5
0600 4 6
0630 2 0
....
There is no straightforward way to do a TimeGrouper on the time component, so we do this in two steps:
v = (df.groupby([pd.Grouper(key='datetime', freq='30min'), 'stringvalues'])
       .size()
       .unstack(fill_value=0))

v.groupby(v.index.time).sum()
stringvalues A B
05:00:00 1 0
17:00:00 1 0
17:30:00 1 1
19:00:00 0 1
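An alternative in the same spirit (a sketch, not from the original answer; it assumes df['datetime'] has already been parsed with pd.to_datetime) floors each timestamp to its 30-minute bin and cross-tabulates the bin's time of day against the string values in one step:
import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime'])

# rows are 30-minute time-of-day bins, columns are the string values, cells are counts
pd.crosstab(df['datetime'].dt.floor('30min').dt.time, df['stringvalues'])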

Add months to a date in Pandas

I'm trying to figure out how to add 3 months to a date in a Pandas dataframe, while keeping it in the date format, so I can use it to lookup a range.
This is what I've tried:
#create dataframe
df = pd.DataFrame([pd.Timestamp('20161011'),
                   pd.Timestamp('20161101')], columns=['date'])
#create a future month period
plus_month_period = 3
#calculate date + future period
df['future_date'] = plus_month_period.astype("timedelta64[M]")
However, I get the following error:
AttributeError: 'int' object has no attribute 'astype'
You could use pd.DateOffset
In [1756]: df.date + pd.DateOffset(months=plus_month_period)
Out[1756]:
0 2017-01-11
1 2017-02-01
Name: date, dtype: datetime64[ns]
Details
In [1757]: df
Out[1757]:
date
0 2016-10-11
1 2016-11-01
In [1758]: plus_month_period
Out[1758]: 3
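To store the result in the new column the question asked for (a small sketch using the same df and plus_month_period as above):
df['future_date'] = df['date'] + pd.DateOffset(months=plus_month_period)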
Suppose you have a dataframe of the following format, where you have to add integer months to a date column.
Start_Date   Months_to_add
2014-06-01   23
2014-06-01   4
2000-10-01   10
2016-07-01   3
2017-12-01   90
2019-01-01   2
In such a scenario, using Zero's code or mattblack's code won't be useful. You have to use a lambda function over the rows, where the function takes two arguments:
a date to which months need to be added, and
a month value in integer format.
You can use the following function:
# Importing required modules
from dateutil.relativedelta import relativedelta

# Defining the function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
After this, you can use the following code snippet to add months to the Start_Date column. Use the progress_apply functionality of pandas; refer to this Stack Overflow answer on progress_apply: Progress indicator during pandas operations.
from tqdm import tqdm
tqdm.pandas()
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
Here's the full code from dataset creation, for your reference:
import pandas as pd
from dateutil.relativedelta import relativedelta
from tqdm import tqdm

tqdm.pandas()

# Initialize a new dataframe
df = pd.DataFrame()

# Add Start Date column
df["Start_Date"] = ['2014-06-01T00:00:00.000000000',
                    '2014-06-01T00:00:00.000000000',
                    '2000-10-01T00:00:00.000000000',
                    '2016-07-01T00:00:00.000000000',
                    '2017-12-01T00:00:00.000000000',
                    '2019-01-01T00:00:00.000000000']

# Convert the date column to a datetime format
df["Start_Date"] = pd.to_datetime(df["Start_Date"])

# Add months column
df["Months_to_add"] = [23, 4, 10, 3, 90, 2]

# Defining the Add Months function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date

# Apply the function on the dataframe using a lambda operation
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis=1)
You will have the final output dataframe as follows:
Start_Date   Months_to_add   End_Date
2014-06-01   23              2016-05-01
2014-06-01   4               2014-10-01
2000-10-01   10              2001-08-01
2016-07-01   3               2016-10-01
2017-12-01   90              2025-06-01
2019-01-01   2               2019-03-01
Please add to comments if there are any issues with the above code.
All the best!
I believe that the simplest and most efficient (fastest) way to solve this is to convert the date to monthly periods with to_period('M'), add the values of the Months_to_add column, and then convert back to datetime with .dt.to_timestamp().
Using the sample data created by #Aruparna Maity:
Start_Date   Months_to_add
2014-06-01   23
2014-06-20   4
2000-10-01   10
2016-07-05   3
2017-12-15   90
2019-01-01   2
df['End_Date'] = ((df['Start_Date'].dt.to_period('M')) + df['Months_to_add']).dt.to_timestamp()
df.head(6)
#output
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-01
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-01
4 2017-12-15 90 2025-06-01
5 2019-01-01 2 2019-03-01
If the exact day is needed, just repeat the process, but changing the periods to days:
df['End_Date'] = ((df['End_Date'].dt.to_period('D')) + df['Start_Date'].dt.day - 1).dt.to_timestamp()
#output:
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-20
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-05
4 2017-12-15 90 2025-06-15
5 2019-01-01 2 2019-03-01
Another way is using numpy timedelta64:
import numpy as np

df['date'] + np.timedelta64(plus_month_period, 'M')
0 2017-01-10 07:27:18
1 2017-01-31 07:27:18
Name: date, dtype: datetime64[ns]
