I have a DataFrame like this:
The desired result is one row per ID with the time difference between Start and End, like this:
I tried simple grouping and subtraction, but it does not work:
df[df['Name'] == 'Start'].groupby('ID')['Time']-\
df[df['Name'] == 'End'].groupby('ID')['Time']
How can this be done in pandas? Thanks!
A possible solution is to join the table with itself, like this:
df_start = df[df['Name'] == 'Start']
df_end = df[df['Name'] == 'End']
df_merge = df_start.merge(df_end, on='id', suffixes=('_start', '_end'))
df_merge['diff'] = df_merge['Time_end'] - df_merge['Time_start']
print(df_merge.to_string())
Output:
id Name_start Time_start Name_end Time_end diff
0 1 Start 2017-11-02 12:00:14 End 2017-11-07 22:45:13 5 days 10:44:59
1 2 Start 2018-01-28 06:53:09 End 2018-02-05 13:31:14 8 days 06:38:05
Here you go.
Generate data:
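import numpy as np
import pandas as pd
# pd.datetime was removed in pandas 2.0, so pd.Timestamp is used instead
df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Name': ['Start', 'End', 'Start', 'End'],
                   'Time': [pd.Timestamp(2020, 1, 1, 0, 1, 0), pd.Timestamp(2020, 1, 2, 0, 0, 0),
                            pd.Timestamp(2020, 1, 1, 0, 0, 0), pd.Timestamp(2020, 1, 2, 0, 0, 0)]})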
Get TimeDelta:
df_agg = df[df['Name'] == 'Start'].reset_index()[['ID', 'Time']]
df_agg = df_agg.rename(columns={"Time": "Start"})
df_agg['End'] = df[df['Name'] == 'End'].reset_index()['Time']
df_agg['TimeDelta'] = df_agg['End'] - df_agg['Start']
Get timediff as decimal value in days, like your example:
df_agg['TimeDiff_days'] = df_agg['TimeDelta'] / np.timedelta64(1,'D')
df_agg
Result:
ID Start End TimeDelta TimeDiff_days
0 1 2020-01-01 00:01:00 2020-01-02 0 days 23:59:00 0.999306
1 2 2020-01-01 00:00:00 2020-01-02 1 days 00:00:00 1.000000
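Both answers above pair each Start row with its End row. As a minimal sketch of a third option, assuming every ID has exactly one Start and one End row (pivot raises an error otherwise), the frame can be reshaped so the two timestamps sit side by side. The name wide is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Name': ['Start', 'End', 'Start', 'End'],
    'Time': pd.to_datetime(['2020-01-01 00:01:00', '2020-01-02 00:00:00',
                            '2020-01-01 00:00:00', '2020-01-02 00:00:00']),
})

# One row per ID, one column per Name value ('Start' and 'End')
wide = df.pivot(index='ID', columns='Name', values='Time')

# Timedelta between End and Start, plus the same value as fractional days
wide['TimeDelta'] = wide['End'] - wide['Start']
wide['TimeDiff_days'] = wide['TimeDelta'] / pd.Timedelta(days=1)
print(wide)
```

Unlike the reset_index approach, this does not depend on Start and End rows appearing in the same order.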
Related
I need to prepare data with time intervals for machine learning so that the timestamps are equally spaced. For example, with 3-hour spacing I would like the timestamps 00:00, 03:00, 06:00, 09:00, 12:00, 15:00... Sample data:
df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
    df[col] = df[col].apply(pd.to_datetime)
print(df)
Output:
Start End Val
0 2022-07-01 11:30:00 2022-07-01 18:30:00 a
1 2022-07-01 22:30:00 2022-07-02 03:30:00 b
I tried to generate the timestamps like this:
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
df['Datetime'] = df['Datetime'].dt.round('H')
print(df[['Datetime', 'Val']])
Output:
Datetime Val
0 2022-07-01 12:00:00 a
0 2022-07-01 14:00:00 a
0 2022-07-01 18:00:00 a
1 2022-07-01 22:00:00 b
1 2022-07-02 02:00:00 b
As you can see, those timestamps are not equally spaced. My expected result:
Datetime Val
4 2022-07-01 12:00:00 a
5 2022-07-01 15:00:00 a
6 2022-07-01 18:00:00 a
7 2022-07-01 21:00:00 NaN
8 2022-07-02 00:00:00 b
9 2022-07-02 03:00:00 b
We can use the function merge_asof:
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
date_min, date_max = df['Datetime'].dt.date.min(), df['Datetime'].dt.date.max() + pd.Timedelta('1D')
time_range = pd.date_range(date_min, date_max, freq='3H').to_series(name='Datetime')
df = pd.merge_asof(time_range, df, tolerance=pd.Timedelta('3H'))
df.truncate(df['Val'].first_valid_index(), df['Val'].last_valid_index())
Output:
Datetime Val
4 2022-07-01 12:00:00 a
5 2022-07-01 15:00:00 a
6 2022-07-01 18:00:00 a
7 2022-07-01 21:00:00 NaN
8 2022-07-02 00:00:00 b
9 2022-07-02 03:00:00 b
Annotated code
# Find min and max date of interval
s, e = df['Start'].min(), df['End'].max()
# Create a date range with freq=3H
# Create a output dataframe by assigning daterange to datetime column
df_out = pd.DataFrame({'datetime': pd.date_range(s.ceil('H'), e, freq='3H')})
# Create interval index from start and end date
idx = pd.IntervalIndex.from_arrays(df['Start'], df['End'], closed='both')
# Set the index of df to interval index and select Val column to create mapping series
# Then use this mapping series to substitute values in output dataframe
df_out['Val'] = df_out['datetime'].map(df.set_index(idx)['Val'])
Result
datetime Val
0 2022-07-01 12:00:00 a
1 2022-07-01 15:00:00 a
2 2022-07-01 18:00:00 a
3 2022-07-01 21:00:00 NaN
4 2022-07-02 00:00:00 b
5 2022-07-02 03:00:00 b
For this problem I really like to use pd.DataFrame.reindex. In particular, you can specify method='nearest' and tolerance='90m' to ensure you leave gaps where you need them.
You can create your regularly spaced time series using pd.date_range, with the start and end arguments snapped to the grid using the .floor('3H') and .ceil('3H') methods, respectively.
import pandas as pd
df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
    df[col] = df[col].apply(pd.to_datetime)
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
result = pd.DataFrame()
for name, group in df.groupby('Val'):
    group = group.set_index('Datetime')
    group.index = group.index.ceil('1H')
    idx = pd.date_range(group.index.min().floor('3H'), group.index.max().ceil('3H'), freq='3H')
    group = group.reindex(idx, tolerance = '90m', method='nearest')
    result = pd.concat([result, group])
result = result.sort_index()
which returns:
Val
2022-07-01 12:00:00 a
2022-07-01 15:00:00 a
2022-07-01 18:00:00 a
2022-07-01 21:00:00
2022-07-02 00:00:00 b
2022-07-02 03:00:00 b
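The uneven spacing in the question comes from stepping in 3-hour increments from an off-grid start (11:30) and then rounding each step to the nearest hour. As a minimal sketch, assuming the empty 21:00 gap row is not needed, each Start can instead be snapped up to the next multiple of 3 hours with Timestamp.ceil before the range is generated. The name out is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Start': pd.to_datetime(['2022-07-01 11:30', '2022-07-01 22:30']),
    'End': pd.to_datetime(['2022-07-01 18:30', '2022-07-02 03:30']),
    'Val': ['a', 'b'],
})

# Snap each start up to the 3-hour grid, then step in 3-hour increments until End
df['Datetime'] = [
    list(pd.date_range(start.ceil('3h'), end, freq='3h'))
    for start, end in zip(df['Start'], df['End'])
]

# One row per generated timestamp
out = df.explode('Datetime')[['Datetime', 'Val']].reset_index(drop=True)
print(out)
```

Every timestamp then lands on 00:00, 03:00, 06:00 and so on. The reindex or merge_asof approaches are still needed if rows for uncovered slots (such as 21:00) must appear.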
Another method would be to simply add the time step (in hours) to the start time in a loop.
import pandas as pd
from datetime import datetime, timedelta
# taking start time as current time just for example
start = datetime.now()
# taking end time as start time + 15 hours just for example
# (derived from start so the loop runs exactly 5 times)
end = start + timedelta(hours=15)
times = []
while end > start:
    start = start + timedelta(hours=3)
    print(start)
    times.append(start)
df = pd.DataFrame({'Times': times})
Output
Times
0 2022-07-15 01:28:56.912013
1 2022-07-15 04:28:56.912013
2 2022-07-15 07:28:56.912013
3 2022-07-15 10:28:56.912013
4 2022-07-15 13:28:56.912013
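The same series can probably be produced without an explicit loop. This is a minimal sketch using pd.date_range, with a fixed, made-up start time (instead of the current time) so that the result is reproducible. The name df_times is only illustrative.

```python
import pandas as pd

# Hypothetical fixed start; the loop above uses the current time instead
start = pd.Timestamp('2022-07-14 22:00:00')
end = start + pd.Timedelta(hours=15)

# First entry is start + 3h, matching the loop, which increments before appending
times = pd.date_range(start + pd.Timedelta(hours=3), end, freq='3h')
df_times = pd.DataFrame({'Times': times})
print(df_times)
```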
I am trying to compute the mean duration spent in a specific status, by ID.
To do this, I first sort my data frame by ID and date, then use apply and shift to subtract the date of row[i] from the date of row[i+1], provided both rows belong to the same ID.
I get the following exception: AttributeError: 'int' object has no attribute 'shift'
Below a code for simulation:
import pandas as pd
from datetime import datetime
today = datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['duration'] = frame_ordered.apply(lambda x: x['date'].shift(-1) - x['date'] if x['id'] == x['id'].shift(-1) else today - x['date'], axis=1)
Can anyone please advise how to solve the last line with the lambda function?
I was not able to get it done with lambda. You can try like this:
import datetime
import numpy as np
import pandas as pd
today = datetime.datetime.today() # you want it as real date, not string
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date']) #convert date column to datetime
frame_ordered = frame.sort_values(['id','date'], ascending=True)
#add column with shifted date values
frame_ordered['shifted'] = frame_ordered['date'].shift(-1)
# mask where the next row has same id as current one
mask = frame_ordered['id'] == frame_ordered['id'].shift(-1)
print(mask)
# subtract date and shifted date if mask is true, otherwise subtract date from today. ".dt.days" only displays the days, not necessary
frame_ordered['duration'] = np.where(mask, (frame_ordered['shifted']-frame_ordered['date']).dt.days, (today-frame_ordered['date']).dt.days)
#delete shifted date column if you want
frame_ordered = frame_ordered.drop('shifted', axis=1)
print(frame_ordered)
Output:
#mask
0 False
4 False
2 False
3 True
1 False
Name: id, dtype: bool
#frame_ordered
id status date duration
0 1245 1 2022-07-01 25.0
4 1248 6 2022-01-03 204.0
2 2345 4 2022-04-20 97.0
3 4556 5 2022-02-02 38.0
1 4556 2 2022-03-12 136.0
I think that the values were not interpreted as pandas Timestamps. With the right conversion it should be easy though:
import pandas as pd
from datetime import datetime
today = datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date'])
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['shifted'] = frame_ordered['date'].shift(1)
frame_ordered['Difference'] = frame_ordered['date']-frame_ordered['date'].shift(1)
print(frame_ordered)
which prints out
id status date shifted Difference
0 1245 1 2022-07-01 NaT NaT
4 1248 6 2022-01-03 2022-07-01 -179 days
2 2345 4 2022-04-20 2022-01-03 107 days
3 4556 5 2022-02-02 2022-04-20 -77 days
1 4556 2 2022-03-12 2022-02-02 38 days
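Note that a plain shift compares each row with its neighbour regardless of id, which is why negative differences appear above. As a minimal sketch, the shift can be done per id with groupby, which avoids the explicit mask. Here today is pinned to 2022-07-26 purely as an assumption, so the numbers are reproducible and match the np.where output shown earlier.

```python
import pandas as pd

frame = pd.DataFrame({
    'id': [1245, 4556, 2345, 4556, 1248],
    'status': [1, 2, 4, 5, 6],
    'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03'],
})
frame['date'] = pd.to_datetime(frame['date'])

# Assumed reference date; use pd.Timestamp.today().normalize() for the real current day
today = pd.Timestamp('2022-07-26')

frame_ordered = frame.sort_values(['id', 'date'])

# Date of the next row within the same id (NaT for the last row of each id)
next_date = frame_ordered.groupby('id')['date'].shift(-1)

# Where there is no next row, measure up to today instead
frame_ordered['duration'] = (next_date.fillna(today) - frame_ordered['date']).dt.days

# Mean duration per id
mean_duration = frame_ordered.groupby('id')['duration'].mean()
print(frame_ordered)
print(mean_duration)
```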
I have a df like the following:
import datetime as dt
import pandas as pd
import pytz
cols = ['utc_datetimes', 'zone_name']
data = [
['2019-11-13 14:41:26,2019-12-18 23:04:12', 'Europe/Stockholm'],
['2019-12-06 21:49:04,2019-12-11 22:52:57,2019-12-18 20:30:58,2019-12-23 18:49:53,2019-12-27 18:34:23,2020-01-07 21:20:51,2020-01-11 17:36:56,2020-01-20 21:45:47,2020-01-30 20:48:49,2020-02-03 21:04:52,2020-02-07 20:05:02,2020-02-10 21:07:21', 'Europe/London']
]
df = pd.DataFrame(data, columns=cols)
print(df)
# utc_datetimes zone_name
# 0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm
# 1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London
I would like to count, for each row, how many of its datetimes fall at night and how many fall on a Wednesday, in the row's local time zone. This is the desired output:
utc_datetimes zone_name nights wednesdays
0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm 0 1
1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London 11 2
I've come up with the following double for loop, but it is not as efficient as I'd like for a sizable df:
# New columns.
df['nights'] = 0
df['wednesdays'] = 0
for row in range(df.shape[0]):
    date_list = df['utc_datetimes'].iloc[row].split(',')
    user_time_zone = df['zone_name'].iloc[row]
    for date in date_list:
        datetime_obj = dt.datetime.strptime(
            date, '%Y-%m-%d %H:%M:%S'
        ).replace(tzinfo=pytz.utc)
        local_datetime = datetime_obj.astimezone(pytz.timezone(user_time_zone))
        # Get day of the week count:
        if local_datetime.weekday() == 2:
            df['wednesdays'].iloc[row] += 1
        # Get time of the day count:
        if (local_datetime.hour >17) & (local_datetime.hour <= 23):
            df['nights'].iloc[row] += 1
Any suggestions will be appreciated :)
P.S. Disregard the definition of 'night'; it is just an example.
One way is to first create a helper df by exploding your utc_datetimes column, and then get the UTC offset (as a Timedelta) for each datetime and zone:
df = pd.DataFrame(data, columns=cols)
s = (df.assign(utc_datetimes=df["utc_datetimes"].str.split(","))
.explode("utc_datetimes"))
s["diff"] = [pd.Timestamp(a, tz=b).utcoffset() for a,b in zip(s["utc_datetimes"],s["zone_name"])]
With this helper df you can calculate the number of wednesdays and nights:
df["wednesdays"] = (pd.to_datetime(s["utc_datetimes"])+s["diff"]).dt.day_name().eq("Wednesday").groupby(level=0).sum()
df["nights"] = ((pd.to_datetime(s["utc_datetimes"])+s["diff"]).dt.hour>17).groupby(level=0).sum()
print (df)
#
utc_datetimes zone_name wednesdays nights
0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm 1.0 0.0
1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London 2.0 11.0
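A variation on the idea above, as a minimal sketch: parse the exploded strings as UTC and convert each zone's rows with tz_convert, so the local time is derived from the UTC instant rather than by adding an offset. The names long, parts and counts are only illustrative, and 'night' keeps the question's placeholder definition (hour greater than 17).

```python
import pandas as pd

cols = ['utc_datetimes', 'zone_name']
data = [
    ['2019-11-13 14:41:26,2019-12-18 23:04:12', 'Europe/Stockholm'],
    ['2019-12-06 21:49:04,2019-12-11 22:52:57,2019-12-18 20:30:58,2019-12-23 18:49:53,2019-12-27 18:34:23,2020-01-07 21:20:51,2020-01-11 17:36:56,2020-01-20 21:45:47,2020-01-30 20:48:49,2020-02-03 21:04:52,2020-02-07 20:05:02,2020-02-10 21:07:21', 'Europe/London'],
]
df = pd.DataFrame(data, columns=cols)

# One row per datetime string, remembering the original row number in 'row'
long = (df.assign(utc=df['utc_datetimes'].str.split(','))
          .explode('utc')
          .rename_axis('row')
          .reset_index())
long['utc'] = pd.to_datetime(long['utc'], utc=True)

# A column can hold only one time zone, so convert zone by zone
parts = []
for zone, g in long.groupby('zone_name'):
    local = g['utc'].dt.tz_convert(zone)
    parts.append(pd.DataFrame({
        'row': g['row'],
        'wednesday': local.dt.weekday == 2,
        'night': local.dt.hour > 17,
    }))

# Count the True flags per original row and attach them to df
counts = pd.concat(parts).groupby('row')[['wednesday', 'night']].sum()
df['wednesdays'] = counts['wednesday']
df['nights'] = counts['night']
print(df[['zone_name', 'nights', 'wednesdays']])
```

The Python-level loop runs once per distinct zone rather than once per datetime, which should help on a sizable df with few zones.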
I have a dataset like this:
user_id lapsed_date start_date end_date
0 A123 2020-01-02 2019-01-02 2019-02-02
1 A123 2020-01-02 2019-02-02 2019-03-02
2 B456 2019-10-01 2019-08-01 2019-09-01
3 B456 2019-10-01 2019-09-01 2019-10-01
generated by this code:
import pandas as pd
sample = {'user_id': ['A123','A123','B456','B456'],
'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
'start_date' : ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
'end_date' : ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']
}
df = pd.DataFrame(sample,columns= ['user_id', 'lapsed_date', 'start_date', 'end_date'])
df['lapsed_date'] = pd.to_datetime(df['lapsed_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
I'm trying to write a function to achieve this:
user_id lapsed_date start_date end_date
0 A123 2020-01-02 2019-01-02 2019-02-02
1 A123 2020-01-02 2019-02-02 2019-03-02
2 A123 2020-01-02 2019-03-02 2019-04-02
3 A123 2020-01-02 2019-04-02 2019-05-02
4 A123 2020-01-02 2019-05-02 2019-06-02
5 A123 2020-01-02 2019-06-02 2019-07-02
6 A123 2020-01-02 2019-07-02 2019-08-02
7 A123 2020-01-02 2019-08-02 2019-09-02
8 A123 2020-01-02 2019-09-02 2019-10-02
9 A123 2020-01-02 2019-10-02 2019-11-02
10 A123 2020-01-02 2019-11-02 2019-12-02
11 A123 2020-01-02 2019-12-02 2020-01-02
12 B456 2019-10-01 2019-08-01 2019-09-01
13 B456 2019-10-01 2019-09-01 2019-10-01
Essentially, the function should keep adding rows for each user_id while max(end_date) is less than lapsed_date. Each newly added row takes the previous row's end_date as its start_date, and the previous row's end_date + 1 month as its end_date.
I have generated this function below.
def add_row(x):
    while x['end_date'].max() < x['lapsed_date'].max():
        next_month = x['end_date'].max() + pd.DateOffset(months=1)
        last_row = x.iloc[-1]
        last_row['start_date'] = x['end_date'].max()
        last_row['end_date'] = next_month
        return x.append(last_row)
    return x
All of the logic above works, except that the while loop does not keep iterating. So I have to run this apply command manually 10 times:
df = df.groupby('user_id').apply(add_row).reset_index(drop = True)
I'm not really sure what I did wrong with the while loop there. Any advice would be highly appreciated!
There are a few reasons your loop did not work; I will explain them as we go.
def add_row(x):
    while x['end_date'].max() < x['lapsed_date'].max():
        next_month = x['end_date'].max() + pd.DateOffset(months=1)
        last_row = x.iloc[-1]
        last_row['start_date'] = x['end_date'].max()
        last_row['end_date'] = next_month
        return x.append(last_row)
    return x
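In the code above, return sits inside the while loop. return hands the result back to the caller immediately, so the loop stops after its first iteration and you only get the result of the first append.
Another caveat concerns return x.append(last_row): DataFrame.append() does not modify the DataFrame in place. It returns a new DataFrame, so inside a loop you need to reassign it with x = x.append(last_row).
See the pandas documentation for DataFrame.append. Note that append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is the replacement.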
Secondly, I noted that this needs to be done separately for each unique user_id. Because of this, in the code below I have split the dataframe into one frame per unique user_id.
Here is how you can get this to work:
import pandas as pd
def add_row(df):
    while df['end_date'].max() < df['lapsed_date'].max():
        # .iloc[0] is positional, so it works even when the group's index does not start at 0
        new_row = {'user_id': df['user_id'].iloc[0],
                   'lapsed_date': df['lapsed_date'].max(),
                   'start_date': df['end_date'].max(),
                   'end_date': df['end_date'].max() + pd.DateOffset(months=1),
                   }
        # DataFrame.append was removed in pandas 2.0; pd.concat returns a new frame, so reassign df
        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
    return df ## Note the return is called OUTSIDE of the while loop, ensuring only the final result is returned.
sample = {'user_id': ['A123','A123','B456','B456'],
'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
'start_date' : ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
'end_date' : ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']
}
df = pd.DataFrame(sample,columns= ['user_id', 'lapsed_date', 'start_date', 'end_date'])
df['lapsed_date'] = pd.to_datetime(df['lapsed_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
ids = df['user_id'].unique()
g = df.groupby('user_id')
frames = []
for i in ids:
    group = g.get_group(i)
    frames.append(add_row(group))
result = pd.concat(frames, ignore_index=True)
print(result)
Split the frame into groups based on the unique user_id values
Create an empty list, frames, to collect the result for each group
Iterate over all user_id values
Run the same while loop, making sure df is reassigned with each added row
Concatenate the collected frames into result and print it
Hope this helps!
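Growing a DataFrame one row at a time copies it on every iteration. As a minimal sketch of an alternative, the rows can be collected as plain dicts and turned into a DataFrame once per user. The helper name extend_group is hypothetical, and the sketch assumes each user's rows are already sorted by end_date.

```python
import pandas as pd

sample = {'user_id': ['A123', 'A123', 'B456', 'B456'],
          'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
          'start_date': ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
          'end_date': ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']}
df = pd.DataFrame(sample)
for col in ['lapsed_date', 'start_date', 'end_date']:
    df[col] = pd.to_datetime(df[col])

def extend_group(group):
    """Hypothetical helper: add monthly rows until end_date reaches lapsed_date."""
    rows = group.to_dict('records')       # list of dicts, one per existing row
    lapsed = group['lapsed_date'].max()
    while rows[-1]['end_date'] < lapsed:  # assumes rows are sorted by end_date
        last = rows[-1]
        rows.append({**last,
                     'start_date': last['end_date'],
                     'end_date': last['end_date'] + pd.DateOffset(months=1)})
    return pd.DataFrame(rows)             # build the frame once, after the loop

result = pd.concat([extend_group(g) for _, g in df.groupby('user_id')],
                   ignore_index=True)
print(result)
```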
I have a pandas DataFrame with the following content:
df =
start end
01/April 02/May
12/April 12/April
I need to add a column with the difference (in days) between end and start values (end - start).
How can I do it?
I tried the following:
import pandas as pd
df.startdate = pd.datetime(df.start, format='%B/%d')
df.enddate = pd.datetime(df.end, format='%B/%d')
But I am not sure this is the right direction.
import pandas as pd
df = pd.DataFrame({"start":["01/April", "12/April"], "end": ["02/May", "12/April"]})
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
df["diff"] = (df["end"] - df["start"])
Output:
end start diff
0 2018-05-02 2018-04-01 31 days
1 2018-04-12 2018-04-12 0 days
This is one way.
df['start'] = pd.to_datetime(df['start']+'/2018', format='%d/%B/%Y')
df['end'] = pd.to_datetime(df['end']+'/2018', format='%d/%B/%Y')
df['diff'] = df['end'] - df['start']
# start end diff
# 0 2018-04-01 2018-05-02 31 days
# 1 2018-04-12 2018-04-12 0 days
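Since the question asks for the difference in days, the Timedelta column can also be reduced to a plain integer with .dt.days. This is a minimal sketch; the year 2018 is an assumption carried over from the answers above (the data has no year), and the column name diff_days is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"start": ["01/April", "12/April"],
                   "end": ["02/May", "12/April"]})

year = 2018  # assumed year, since the source strings do not include one

# Append the assumed year and parse with an explicit day/month-name/year format
for col in ["start", "end"]:
    df[col] = pd.to_datetime(df[col] + f"/{year}", format="%d/%B/%Y")

# Integer number of days between end and start
df["diff_days"] = (df["end"] - df["start"]).dt.days
print(df)
```

The %B directive matches full month names in the current locale, so this assumes English month names.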