I have a dataset like this:
user_id lapsed_date start_date end_date
0 A123 2020-01-02 2019-01-02 2019-02-02
1 A123 2020-01-02 2019-02-02 2019-03-02
2 B456 2019-10-01 2019-08-01 2019-09-01
3 B456 2019-10-01 2019-09-01 2019-10-01
generated by this code:
import pandas as pd
sample = {'user_id': ['A123','A123','B456','B456'],
'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
'start_date' : ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
'end_date' : ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']
}
df = pd.DataFrame(sample, columns=['user_id', 'lapsed_date', 'start_date', 'end_date'])
df['lapsed_date'] = pd.to_datetime(df['lapsed_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
I'm trying to write a function to achieve this:
user_id lapsed_date start_date end_date
0 A123 2020-01-02 2019-01-02 2019-02-02
1 A123 2020-01-02 2019-02-02 2019-03-02
2 A123 2020-01-02 2019-03-02 2019-04-02
3 A123 2020-01-02 2019-04-02 2019-05-02
4 A123 2020-01-02 2019-05-02 2019-06-02
5 A123 2020-01-02 2019-06-02 2019-07-02
6 A123 2020-01-02 2019-07-02 2019-08-02
7 A123 2020-01-02 2019-08-02 2019-09-02
8 A123 2020-01-02 2019-09-02 2019-10-02
9 A123 2020-01-02 2019-10-02 2019-11-02
10 A123 2020-01-02 2019-11-02 2019-12-02
11 A123 2020-01-02 2019-12-02 2020-01-02
12 B456 2019-10-01 2019-08-01 2019-09-01
13 B456 2019-10-01 2019-09-01 2019-10-01
Essentially, for each user_id, the function should keep adding rows while max(end_date) is less than lapsed_date. Each newly added row takes the previous row's end_date as its start_date, and the previous row's end_date + 1 month as its end_date.
I have generated this function below.
def add_row(x):
    while x['end_date'].max() < x['lapsed_date'].max():
        next_month = x['end_date'].max() + pd.DateOffset(months=1)
        last_row = x.iloc[-1]
        last_row['start_date'] = x['end_date'].max()
        last_row['end_date'] = next_month
        return x.append(last_row)
    return x
All the logic above works, except the while loop: it only adds one row per call, so I have to run this apply command manually 10 times:
df = df.groupby('user_id').apply(add_row).reset_index(drop = True)
I'm not really sure what I did wrong with the while loop there. Any advice would be highly appreciated!
There are a few reasons your loop did not work; I will explain them as we go!
def add_row(x):
    while x['end_date'].max() < x['lapsed_date'].max():
        next_month = x['end_date'].max() + pd.DateOffset(months=1)
        last_row = x.iloc[-1]
        last_row['start_date'] = x['end_date'].max()
        last_row['end_date'] = next_month
        return x.append(last_row)
    return x
In the above, the return statement sits inside the while loop, so control is handed back to the caller on the very first iteration. This stops your loop from ever iterating more than once: you simply get the result of the first append.
A second caveat concerns return x.append(last_row): dataframe.append() does not actually modify the dataframe in place; it returns a new one, so you need to assign the result back, i.e. x = x.append(last_row)
Pandas Append
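To see the non-mutation point in isolation, and because DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, here is a minimal sketch of the same operation using pd.concat, its replacement:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
row = pd.Series({'a': 3})

# Like append, pd.concat returns a NEW frame; the original is untouched
# unless you assign the result back.
new_df = pd.concat([df, row.to_frame().T], ignore_index=True)

print(len(df))      # still 2
print(len(new_df))  # 3
```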
Secondly, I noted that this may need to run over multiple unique user_id's. The code below therefore splits the dataframe into one frame per unique user_id stored in the frame.
Here is how you can get this to work:
import pandas as pd

def add_row(df):
    while df['end_date'].max() < df['lapsed_date'].max():
        new_row = {'user_id': df['user_id'].iloc[0],  # .iloc[0]: group indexes are not zero-based
                   'lapsed_date': df['lapsed_date'].max(),
                   'start_date': df['end_date'].max(),
                   'end_date': df['end_date'].max() + pd.DateOffset(months=1),
                   }
        df = df.append(new_row, ignore_index=True)
    return df  # Note the return is OUTSIDE of the while loop, ensuring only the final result is returned.
sample = {'user_id': ['A123','A123','B456','B456'],
'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
'start_date' : ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
'end_date' : ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']
}
df = pd.DataFrame(sample, columns=['user_id', 'lapsed_date', 'start_date', 'end_date'])
df['lapsed_date'] = pd.to_datetime(df['lapsed_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
ids = df['user_id'].unique()
g = df.groupby(['user_id'])
result = pd.DataFrame(columns=['user_id', 'lapsed_date', 'start_date', 'end_date'])

for i in ids:
    group = g.get_group(i)
    result = result.append(add_row(group), ignore_index=True)

print(result)
Split the frames based on unique user id's
Create empty data frame to store result in under result
Iterate over all user_id's
Run the same while loop, ensuring that df is updated with the append rows
Return the result and print
Hope this helps!
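On pandas 2.0 and later, where DataFrame.append no longer exists, the same loop can be sketched with pd.concat. This keeps the answer's logic; the exact row-building arrangement here is just one possible way to write it:

```python
import pandas as pd

def add_row(g):
    # DataFrame.append was removed in pandas 2.0, so this sketch collects
    # the new rows in a list and concatenates once at the end.
    rows = [g]
    end, lapsed = g['end_date'].max(), g['lapsed_date'].max()
    while end < lapsed:
        new_end = end + pd.DateOffset(months=1)
        rows.append(pd.DataFrame({'user_id': [g['user_id'].iloc[0]],
                                  'lapsed_date': [lapsed],
                                  'start_date': [end],
                                  'end_date': [new_end]}))
        end = new_end
    return pd.concat(rows, ignore_index=True)

sample = {'user_id': ['A123', 'A123', 'B456', 'B456'],
          'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
          'start_date': ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
          'end_date': ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']}
df = pd.DataFrame(sample)
for col in ['lapsed_date', 'start_date', 'end_date']:
    df[col] = pd.to_datetime(df[col])

# Iterating the groups explicitly avoids groupby.apply deprecation quirks
result = pd.concat([add_row(g) for _, g in df.groupby('user_id')],
                   ignore_index=True)
print(len(result))  # 14 rows: 12 for A123, 2 for B456
```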
I am trying to derive a mean value for the average duration spent in a specific status by ID.
To do this I first sort my data frame by ID and date, then try to use apply with shift to subtract row i's date from row i+1's date, provided rows i and i+1 belong to the same ID.
I get the following exception: AttributeError: 'int' object has no attribute 'shift'
Below a code for simulation:
import pandas as pd
from datetime import datetime
today = datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['duration'] = frame_ordered.apply(lambda x: x['date'].shift(-1) - x['date'] if x['id'] == x['id'].shift(-1) else today - x['date'], axis=1)
Can anyone please advise how to solve the last line with the lambda function?
I was not able to get it done with lambda. You can try like this:
import datetime
import numpy as np
import pandas as pd
today = datetime.datetime.today() # you want it as real date, not string
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date']) #convert date column to datetime
frame_ordered = frame.sort_values(['id','date'], ascending=True)
#add column with shifted date values
frame_ordered['shifted'] = frame_ordered['date'].shift(-1)
# mask where the next row has same id as current one
mask = frame_ordered['id'] == frame_ordered['id'].shift(-1)
print(mask)
# subtract date and shifted date if mask is true, otherwise subtract date from today. ".dt.days" only displays the days, not necessary
frame_ordered['duration'] = np.where(mask, (frame_ordered['shifted']-frame_ordered['date']).dt.days, (today-frame_ordered['date']).dt.days)
#delete shifted date column if you want
frame_ordered = frame_ordered.drop('shifted', axis=1)
print(frame_ordered)
Output:
#mask
0 False
4 False
2 False
3 True
1 False
Name: id, dtype: bool
#frame_ordered
id status date duration
0 1245 1 2022-07-01 25.0
4 1248 6 2022-01-03 204.0
2 2345 4 2022-04-20 97.0
3 4556 5 2022-02-02 38.0
1 4556 2 2022-03-12 136.0
I think that the values were not interpreted as pandas Timestamps. With the right conversion it should be easy though:
import pandas as pd
from datetime import datetime
today = datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date'])
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['shifted'] = frame_ordered['date'].shift(1)
frame_ordered['Difference'] = frame_ordered['date']-frame_ordered['date'].shift(1)
print(frame_ordered)
which prints out
id status date shifted Difference
0 1245 1 2022-07-01 NaT NaT
4 1248 6 2022-01-03 2022-07-01 -179 days
2 2345 4 2022-04-20 2022-01-03 107 days
3 4556 5 2022-02-02 2022-04-20 -77 days
1 4556 2 2022-03-12 2022-02-02 38 days
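The output above still compares rows belonging to different ids (note the negative day counts). A grouped shift keeps the comparison within each id; here is a sketch on the same sample data, with the fallback to today's date mirroring the question's intent:

```python
import pandas as pd

frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],
                      'status': [1, 2, 4, 5, 6],
                      'date': ['2022-07-01', '2022-03-12', '2022-04-20',
                               '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date'])
frame_ordered = frame.sort_values(['id', 'date'])

# shift(-1) inside each id group: a row is only compared with the next
# row of the SAME id; rows without a successor fall back to today
next_date = frame_ordered.groupby('id')['date'].shift(-1)
today = pd.Timestamp.today().normalize()
frame_ordered['duration'] = (next_date.fillna(today) - frame_ordered['date']).dt.days
print(frame_ordered)
```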
Let's say I have a dataframe like this
df1:
datetime1 datetime2
0 2021-05-09 19:52:14 2021-05-09 20:52:14
1 2021-05-09 19:52:14 2021-05-09 21:52:14
2 NaN NaN
3 2021-05-09 16:30:14 NaN
4 NaN NaN
5 2021-05-09 12:30:14 2021-05-09 14:30:14
I want to compare the timestamps in datetime1 and datetime2 and create a new column with the difference between them.
In some cases there are no values in datetime1 and datetime2, or there is a value in datetime1 but not in datetime2. Is there a way to get NaN in the "difference" column when both timestamps are missing, and, when there is a timestamp only in datetime1, get the difference compared to datetime.now() and put that in another column?
Desirable df output:
datetime1 datetime2 Difference in H:m:s Compared with datetime.now()
0 2021-05-09 19:52:14 2021-05-09 20:52:14 01:00:00 NaN
1 2021-05-09 19:52:14 2021-05-09 21:52:14 02:00:00 NaN
2 NaN NaN NaN NaN
3 2021-05-09 16:30:14 NaN NaN e.g(04:00:00)
4 NaN NaN NaN NaN
5 2021-05-09 12:30:14 2021-05-09 14:30:14 02:00:00 NaN
I tried a solution from @AndrejKesely, but it fails if there is no timestamp in datetime1 and datetime2:
def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)
# if datetime1/datetime2 aren't already datetime, apply `.to_datetime()`:
df["datetime1"] = pd.to_datetime(df["datetime1"])
df["datetime2"] = pd.to_datetime(df["datetime2"])
df["Difference in H:m:s"] = df.apply(
lambda x: strfdelta(
x["datetime2"] - x["datetime1"],
"{hours:02d}:{minutes:02d}:{seconds:02d}",
),
axis=1,
)
print(df)
Select only the rows that match each condition using boolean indexing (masks) to do what you need, and let Pandas fill the missing values with NaN:
def strfdelta(td: pd.Timedelta):
    seconds = td.total_seconds()
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = int(seconds % 60)
    return f"{hours:02}:{minutes:02}:{seconds:02}"
bm1 = df["datetime1"].notna() & df["datetime2"].notna()
bm2 = df["datetime1"].notna() & df["datetime2"].isna()
df["Difference in H:m:s"] = (df.loc[bm1, "datetime2"] - df.loc[bm1, "datetime1"]).apply(strfdelta)
df["Compared with datetime.now()"] = (datetime.now() - df.loc[bm2, "datetime1"]).apply(strfdelta)
>>> df
datetime1 datetime2 Diff... Comp...
0 2021-05-09 19:52:14 2021-05-09 20:52:14 01:00:00 NaN
1 2021-05-09 19:52:14 2021-05-09 21:52:14 02:00:00 NaN
2 NaT NaT NaN NaN
3 2021-05-09 16:30:14 NaT NaN 103:09:19
4 NaT NaT NaN NaN
5 2021-05-09 12:30:14 2021-05-09 14:30:14 02:00:00 NaN
You could start by replacing all NaN values in the datetime2 column with the current date. That makes it easy to compare datetime1 to now when datetime2 is NaN.
You can do it with :
df["datetime2"] = df["datetime2"].fillna(value=pandas.to_datetime('today').normalize(),axis=1)
Then you have only 2 conditions remaining:
If datetime1 column is empty, the result is NaN.
Otherwise, the result is the difference between datetime1 and datetime2 column (as there is no NaN remaining in datetime2 column).
You can perform this with :
import numpy as np
df["Difference in H:m:s"] = np.where(
df["datetime1"].isnull(),
pd.NA,
df["datetime2"] - df["datetime1"]
)
You can finally format your Difference in H:m:s in the required format with the function you provided :
def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)
df["Difference in H:m:s"] = df.apply(
lambda x: strfdelta(
x["Difference in H:m:s"],
"{hours:02d}:{minutes:02d}:{seconds:02d}",
),
axis=1,
)
The complete code is :
import numpy as np
import pandas as pd

# if datetime1/datetime2 aren't already datetime, apply `.to_datetime()`:
df["datetime1"] = pd.to_datetime(df["datetime1"])
df["datetime2"] = pd.to_datetime(df["datetime2"])
df["datetime2"] = df["datetime2"].fillna(value=pd.to_datetime('today').normalize())
df["Difference in H:m:s"] = np.where(
    df["datetime1"].isnull(),
    pd.NA,
    df["datetime2"] - df["datetime1"]
)

def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)

df["Difference in H:m:s"] = df.apply(
    lambda x: strfdelta(
        x["Difference in H:m:s"],
        "{hours:02d}:{minutes:02d}:{seconds:02d}",
    ) if pd.notna(x["Difference in H:m:s"]) else pd.NA,  # skip the rows left as NA
    axis=1,
)
I have a dataframe like to following (df1):
index,col1,col2
2020-01-01,A,Y
2020-01-02,B,Z
And another like the following (df2):
index,date, .....
1,2020-01-01 13:44
2,2020-01-01 15:22
3,2020-01-01 23:11
4,2020-01-01 13:44
5,2020-01-02 13:28
6,2020-01-02 17:55
I need to map df2['date'] year, month and day with df1.index year, month and day to get a final dataframe like the following:
index,col1,col2
2020-01-01 13:44,A,Y
2020-01-01 15:22,A,Y
2020-01-01 23:11,A,Y
2020-01-01 13:44,A,Y
2020-01-02 13:28,B,Z
2020-01-02 17:55,B,Z
Something like the following would do the job:
pd.DataFrame(mapped_values, index=df2['date'], columns=df1.columns)
How can I get mapped_values here?
You can try merge:
df2['day'] = df2['date'].dt.normalize()
df2.merge(df1, left_on='day', right_index=True)
Output:
date day col1 col2
index
1 2020-01-01 13:44:00 2020-01-01 A Y
2 2020-01-01 15:22:00 2020-01-01 A Y
3 2020-01-01 23:11:00 2020-01-01 A Y
4 2020-01-01 13:44:00 2020-01-01 A Y
5 2020-01-02 13:28:00 2020-01-02 B Z
6 2020-01-02 17:55:00 2020-01-02 B Z
The following does it.
Modules
import pandas as pd
import io
Data
df1 = pd.read_csv(io.StringIO("""
index,col1,col2
2020-01-01,A,Y
2020-01-02,B,Z
"""), sep=",", engine="python")
df2 = pd.read_csv(io.StringIO("""
index,date
1,2020-01-01 13:44
2,2020-01-01 15:22
3,2020-01-01 23:11
4,2020-01-01 13:44
5,2020-01-02 13:28
6,2020-01-02 17:55
"""), sep=",", engine="python")
Date formatting
df1['ndate'] = pd.to_datetime(df1['index'])
df2['ndate'] = pd.to_datetime(df2['date'])
df2['ndate'] = df2['ndate'].dt.normalize()  # keep only the date part
Merge
pd.merge(df2, df1, on=['ndate'])
Having a dataframe with ID, Name ('Start'/'End') and Time columns, the desirable result is one aggregated row per ID with the time diff between Start and End. Simple groupings and diffs do not work:
df[df['Name'] == 'Start'].groupby('ID')['Time']-\
df[df['Name'] == 'End'].groupby('ID')['Time']
How this task can be done in pandas? Thanks!
A possible solution is to join the table on itself like this:
df_start = df[df['Name'] == 'Start']
df_end = df[df['Name'] == 'End']
df_merge = df_start.merge(df_end, on='id', suffixes=('_start', '_end'))
df_merge['diff'] = df_merge['Time_end'] - df_merge['Time_start']
print(df_merge.to_string())
Output:
id Name_start Time_start Name_end Time_end diff
0 1 Start 2017-11-02 12:00:14 End 2017-11-07 22:45:13 5 days 10:44:59
1 2 Start 2018-01-28 06:53:09 End 2018-02-05 13:31:14 8 days 06:38:05
Here you go.
Generate data:
from datetime import datetime

# pd.datetime was removed in pandas 2.0; use the datetime class directly
df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Name': ['Start', 'End', 'Start', 'End'],
                   'Time': [datetime(2020, 1, 1, 0, 1, 0), datetime(2020, 1, 2, 0, 0, 0),
                            datetime(2020, 1, 1, 0, 0, 0), datetime(2020, 1, 2, 0, 0, 0)]})
Get TimeDelta:
df_agg = df[df['Name'] == 'Start'].reset_index()[['ID', 'Time']]
df_agg = df_agg.rename(columns={"Time": "Start"})
df_agg['End'] = df[df['Name'] == 'End'].reset_index()['Time']
df_agg['TimeDelta'] = df_agg['End'] - df_agg['Start']
Get timediff as decimal value in days, like your example:
import numpy as np

df_agg['TimeDiff_days'] = df_agg['TimeDelta'] / np.timedelta64(1, 'D')
df_agg
Result:
ID Start End TimeDelta TimeDiff_days
0 1 2020-01-01 00:01:00 2020-01-02 0 days 23:59:00 0.999306
1 2 2020-01-01 00:00:00 2020-01-02 1 days 00:00:00 1.000000
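If each ID has exactly one Start and one End row, a pivot reaches the same result without merging or index alignment; here is a sketch on data equivalent to the sample above:

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Name': ['Start', 'End', 'Start', 'End'],
                   'Time': [datetime(2020, 1, 1, 0, 1), datetime(2020, 1, 2),
                            datetime(2020, 1, 1), datetime(2020, 1, 2)]})

# One row per ID, one column per Name value; the diff is then column-wise
wide = df.pivot(index='ID', columns='Name', values='Time')
wide['TimeDelta'] = wide['End'] - wide['Start']
print(wide)
```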
I'm trying to figure out how to add 3 months to a date in a Pandas dataframe, while keeping it in the date format, so I can use it to lookup a range.
This is what I've tried:
#create dataframe
df = pd.DataFrame([pd.Timestamp('20161011'),
pd.Timestamp('20161101') ], columns=['date'])
#create a future month period
plus_month_period = 3
#calculate date + future period
df['future_date'] = plus_month_period.astype("timedelta64[M]")
However, I get the following error:
AttributeError: 'int' object has no attribute 'astype'
You could use pd.DateOffset
In [1756]: df.date + pd.DateOffset(months=plus_month_period)
Out[1756]:
0 2017-01-11
1 2017-02-01
Name: date, dtype: datetime64[ns]
Details
In [1757]: df
Out[1757]:
date
0 2016-10-11
1 2016-11-01
In [1758]: plus_month_period
Out[1758]: 3
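To keep the result on the frame, assign the offset series to a new column; a minimal sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame([pd.Timestamp('20161011'),
                   pd.Timestamp('20161101')], columns=['date'])
plus_month_period = 3

# DateOffset does calendar arithmetic, so the day of month is preserved
df['future_date'] = df['date'] + pd.DateOffset(months=plus_month_period)
print(df['future_date'].tolist())  # Timestamps for 2017-01-11 and 2017-02-01
```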
Suppose you have a dataframe of the following format, where you have to add integer months to a date column.
Start_Date    Months_to_add
2014-06-01    23
2014-06-01    4
2000-10-01    10
2016-07-01    3
2017-12-01    90
2019-01-01    2
In such a scenario, Zero's code or mattblack's code won't help. You have to apply a lambda function over the rows, where the function takes 2 arguments:
A date to which months need to be added to
A month value in integer format
You can use the following function:
# Importing required modules
from dateutil.relativedelta import relativedelta

# Defining the function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
After this you can use the following code snippet to add months to the Start_Date column, using the progress_apply functionality of Pandas. Refer to this Stack Overflow answer on progress_apply: Progress indicator during pandas operations.
from tqdm import tqdm
tqdm.pandas()
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
Here's the full code from dataset creation, for your reference:
import pandas as pd
from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()
# Initialize a new dataframe
df = pd.DataFrame()
# Add Start Date column
df["Start_Date"] = ['2014-06-01T00:00:00.000000000',
'2014-06-01T00:00:00.000000000',
'2000-10-01T00:00:00.000000000',
'2016-07-01T00:00:00.000000000',
'2017-12-01T00:00:00.000000000',
'2019-01-01T00:00:00.000000000']
# To convert the date column to a datetime format
df["Start_Date"] = pd.to_datetime(df["Start_Date"])
# Add months column
df["Months_to_add"] = [23, 4, 10, 3, 90, 2]
# Defining the Add Months function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
# Apply function on the dataframe using lambda operation.
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
You will have the final output dataframe as follows.
Start_Date    Months_to_add    End_Date
2014-06-01    23               2016-05-01
2014-06-01    4                2014-10-01
2000-10-01    10               2001-08-01
2016-07-01    3                2016-10-01
2017-12-01    90               2025-06-01
2019-01-01    2                2019-03-01
Please add to comments if there are any issues with the above code.
All the best!
I believe the simplest and most efficient (fastest) way to solve this is to convert the dates to monthly periods with to_period('M'), add the values of the Months_to_add column, and then convert back to datetime with .dt.to_timestamp().
Using the sample data created by #Aruparna Maity
Start_Date    Months_to_add
2014-06-01    23
2014-06-20    4
2000-10-01    10
2016-07-05    3
2017-12-15    90
2019-01-01    2
df['End_Date'] = ((df['Start_Date'].dt.to_period('M')) + df['Months_to_add']).dt.to_timestamp()
df.head(6)
#output
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-01
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-01
4 2017-12-15 90 2025-06-01
5 2019-01-01 2 2019-03-01
If the exact day is needed, just repeat the process, but changing the periods to days
df['End_Date'] = ((df['End_Date'].dt.to_period('D')) + df['Start_Date'].dt.day -1).dt.to_timestamp()
#output:
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-20
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-05
4 2017-12-15 90 2025-06-15
5 2019-01-01 2 2019-03-01
Another way, using numpy timedelta64. Note that numpy's 'M' unit is a fixed average month length rather than a calendar month, which is why the results below carry a time-of-day component:
df['date'] + np.timedelta64(plus_month_period, 'M')
0 2017-01-10 07:27:18
1 2017-01-31 07:27:18
Name: date, dtype: datetime64[ns]