comparing rows data frame | shift and apply functions throwing exception - python

I am trying to derive the mean duration spent in a specific status by ID.
To do this, I first sort my data frame by ID and date, then use apply and shift to subtract the date of row[i] from the date of row[i+1], provided both rows belong to the same ID.
I get the following exception: AttributeError: 'int' object has no attribute 'shift'
Below is code to reproduce the problem:
import pandas as pd
from datetime import datetime
today = datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['duration'] = frame_ordered.apply(lambda x: x['date'].shift(-1) - x['date'] if x['id'] == x['id'].shift(-1) else today - x['date'], axis=1)
Can anyone please advise how to solve the last line with the lambda function?

I was not able to get it done with lambda. You can try like this:
import datetime
import numpy as np
import pandas as pd
today = datetime.datetime.today() # you want it as real date, not string
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date']) #convert date column to datetime
frame_ordered = frame.sort_values(['id','date'], ascending=True)
#add column with shifted date values
frame_ordered['shifted'] = frame_ordered['date'].shift(-1)
# mask where the next row has same id as current one
mask = frame_ordered['id'] == frame_ordered['id'].shift(-1)
print(mask)
# subtract date and shifted date if mask is true, otherwise subtract date from today. ".dt.days" only displays the days, not necessary
frame_ordered['duration'] = np.where(mask, (frame_ordered['shifted']-frame_ordered['date']).dt.days, (today-frame_ordered['date']).dt.days)
#delete shifted date column if you want
frame_ordered = frame_ordered.drop('shifted', axis=1)
print(frame_ordered)
Output:
#mask
0 False
4 False
2 False
3 True
1 False
Name: id, dtype: bool
#frame_ordered
id status date duration
0 1245 1 2022-07-01 25.0
4 1248 6 2022-01-03 204.0
2 2345 4 2022-04-20 97.0
3 4556 5 2022-02-02 38.0
1 4556 2 2022-03-12 136.0

I think that the values were not interpreted as pandas Timestamps. With the right conversion it should be easy though:
import pandas as pd
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date'])
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['shifted'] = frame_ordered['date'].shift(1)
frame_ordered['Difference'] = frame_ordered['date']-frame_ordered['date'].shift(1)
print(frame_ordered)
which prints out
id status date shifted Difference
0 1245 1 2022-07-01 NaT NaT
4 1248 6 2022-01-03 2022-07-01 -179 days
2 2345 4 2022-04-20 2022-01-03 107 days
3 4556 5 2022-02-02 2022-04-20 -77 days
1 4556 2 2022-03-12 2022-02-02 38 days
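Note that the difference above is taken between consecutive rows regardless of id. A minimal sketch of a per-id variant follows, assuming the intent is "next date for the same id, otherwise today". It uses groupby('id') with shift(-1) so that the shifted value never crosses an id boundary. The fixed reference date 2022-07-26 is an assumption made only to keep the result reproducible; in practice pd.Timestamp.today().normalize() would take its place.

```python
import pandas as pd

frame = pd.DataFrame({
    'id': [1245, 4556, 2345, 4556, 1248],
    'status': [1, 2, 4, 5, 6],
    'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03'],
})
frame['date'] = pd.to_datetime(frame['date'])
frame_ordered = frame.sort_values(['id', 'date'])

# Assumed reference date, fixed here so the durations are reproducible.
today = pd.Timestamp('2022-07-26')

# Next date within the same id; NaT for the last row of each id.
next_date = frame_ordered.groupby('id')['date'].shift(-1)

# Where there is no later row for the id, measure up to the reference date.
frame_ordered['duration'] = (next_date.fillna(today) - frame_ordered['date']).dt.days
print(frame_ordered['duration'].tolist())
# [25, 204, 97, 38, 136]
```

Because the shift happens inside each group, no separate mask comparing neighbouring ids is needed.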


How to tackle a dataset that has multiple same date values

I have a large data set from which I'm trying to produce a time series using ARIMA. However, some of the data in the date column has multiple rows with the same date.
The dates were entered this way because the exact date of the event was not known, so unknown dates were entered as the first of that month (biased). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not the same. I have the code to randomise the date but I cannot find a way to replace the data with the date ranges.
import random
import time
def str_time_prop(start, end, time_format, prop):
    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(time_format, time.localtime(ptime))
def random_date(start, end, prop):
    return str_time_prop(start, end, '%Y-%m-%d', prop)
# check if the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
I use the code above to generate a random date within a date range, but I'm struggling to find a way to replace the dates in the data set.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd
df = pd.DataFrame(
    {
        "date": [
            "2016-01-01",
            "2015-01-01",
            "2013-01-01",
            "2014-01-01",
            "2017-01-01",
            "2011-01-01",
        ],
        "value": [10035, 5397, 4567, 4343, 3981, 2049],
    }
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Output
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
Since the date column has multiple rows with the same date, and you want to randomize the dates within the month, you could group by year, month, and whether the day equals 1. Then use calendar.monthrange to find the last day of that month for that particular year, and use it when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np
np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])
# group multiple rows with the same year, month and day equal 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day==1])
FIRST_DAY = 2 # set for the desired range
df_list = []
for n, g in grouped:
    g = g.copy()  # work on a copy so the assignment below does not trigger SettingWithCopyWarning
    last_day = calendar.monthrange(n[0], n[1])[1]  # get last day for this month and year
    g['New_Date'] = g['date'].apply(lambda d:
        d.replace(day=np.random.randint(FIRST_DAY, last_day + 1))
    )
    df_list.append(g)
new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12
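The same idea can be sketched without an explicit loop. This is an alternative under stated assumptions, not the answer's code: it treats every first-of-month date as a placeholder, uses Series.dt.days_in_month instead of calendar.monthrange, and draws one random day offset per row with NumPy's Generator.integers. The toy frame and the column name new_date are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical sample: three first-of-month placeholders and one known date.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-01-17", "2015-02-01"]),
    "value": [10035, 5397, 4567, 4343],
})

is_placeholder = df["date"].dt.day == 1
days_in_month = df["date"].dt.days_in_month.to_numpy()

# One offset per row, from 1 to days_in_month - 1 (the upper bound is exclusive),
# so a placeholder on day 1 moves to day 2..last day of its own month.
offsets = pd.Series(rng.integers(1, days_in_month), index=df.index)
randomized = df["date"] + pd.to_timedelta(offsets, unit="D")

# Keep known dates untouched; replace only the placeholders.
df["new_date"] = df["date"].where(~is_placeholder, randomized)
print(df)
```

Starting the offsets at 1 mirrors FIRST_DAY = 2 in the loop version; start at 0 instead if day 1 should remain a possible outcome.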

Pandas groupby month output is incorrect [duplicate]

My dataset has dates in the European format, and I'm struggling to convert it into the correct format before I pass it through a pd.to_datetime, so for all day < 12, my month and day switch.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted as dd/mm/yyyy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Pass an explicit format so that day and month cannot be swapped:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
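As a small worked example of the explicit format (the sample values are made up for illustration): with format='%d/%m/%Y' the first field is always read as the day, including for ambiguous values such as 01/04/2018.

```python
import pandas as pd

# Hypothetical day-first strings; the second one is ambiguous without a format.
dates = pd.Series(['31/03/2018', '01/04/2018', '28/02/2018'])

# The explicit format pins the field order to day/month/year.
parsed = pd.to_datetime(dates, format='%d/%m/%Y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2018-03-31', '2018-04-01', '2018-02-28']
```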
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30

Is there a quick way for checking whether a date lies within n days(say 7) from a list of dates

I'm working with the following dataset:
Date
2016-01-04
2016-01-05
2016-01-06
2016-01-07
2016-01-08
and a list holidays = ['2016-01-01','2016-01-18'....'2017-11-23','2017-12-25']
Objective: Create a column indicating whether a particular date is within +- 7 days of any holiday present in the list.
Mock output:
Date        Within a week of Holiday
2016-01-04  1
2016-01-05  1
2016-01-06  1
2016-01-07  1
2016-01-08  0
I'm working with a lot of date records and thus trying to find a quick(most optimized) way to do this.
My Current Solution:
One way I figured to do this quickly would be to create another list with only the unique dates for my desired duration(say 2 years). This way, I can implement a simple solution with 2 for loops to check if a date is within +-7days of a holiday, and it wouldn't be computationally heavy as both lists would be relatively small(730 unique dates and ~20 dates in the holiday list).
Once I have my desired list of dates, all I have to do is run a single check on my 'Date' column to see if that date is a part of this new list I created. However, any suggestions to do this even quicker?
Turn holidays into a DataFrame and then merge_asof with a tolerance of 6 days:
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
Complete Working Example:
import numpy as np
import pandas as pd
holidays = pd.DataFrame(pd.to_datetime(['2016-01-01', '2016-01-18']),
columns=['Holiday'])
df = pd.DataFrame({
'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
'2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
print(new_df)
new_df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
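One caveat, offered as a hedged note rather than a correction of the answer: merge_asof searches backward by default, so the version above only finds holidays on or before each date. Below is a sketch that also looks forward, assuming direction='nearest' is what a ± window calls for. The extra date 2016-01-13 is added purely for illustration, since it is 5 days before a holiday and 12 days after the previous one.

```python
import pandas as pd

holidays = pd.DataFrame({'Holiday': pd.to_datetime(['2016-01-01', '2016-01-18'])})
df = pd.DataFrame({'Date': pd.to_datetime(
    ['2016-01-04', '2016-01-08', '2016-01-13'])})

# Both frames must already be sorted by their key columns for merge_asof.
nearest = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
                        direction='nearest', tolerance=pd.Timedelta(days=6))

# A matched holiday means the date is within 6 days on either side.
df['Within a week of Holiday'] = nearest['Holiday'].notna().astype(int).to_numpy()
print(df['Within a week of Holiday'].tolist())
# [1, 0, 1]
```

With the default direction, 2016-01-13 would be flagged 0 because the only earlier holiday is 12 days away.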
Or turn holidays into a NumPy datetime array, broadcast the subtraction across the 'Date' column, compare the absolute difference to 7 days, and check whether there are any matches:
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df['Within a week of Holiday'] = (
abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
Complete Working Example:
import numpy as np
import pandas as pd
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df = pd.DataFrame({
'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
'2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
df['Within a week of Holiday'] = (
abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
print(df)
df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Write a function that checks every day from 6 days before to 6 days after the given date, returns True if any of those days is in holidays and False otherwise, then apply that function to the data frame:
import datetime
import pandas as pd
holidays = ['2016-01-01','2016-01-18','2017-11-23','2017-12-25']
def holiday_present(date):
    date = datetime.datetime.strptime(date, '%Y-%m-%d')
    # range(-6, 7) covers offsets -6..+6, a symmetric window around the date
    for i in range(-6, 7):
        candidate = (date - datetime.timedelta(days=i)).strftime('%Y-%m-%d')
        if candidate in holidays:
            return True
    return False
data = {
"Date":[
"2016-01-04",
"2016-01-05",
"2016-01-06",
"2016-01-07",
"2016-01-08"]
}
df= pd.DataFrame(data)
df["Within a week of Holiday"] = df["Date"].apply(holiday_present).astype(int)
Output:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Try this:
Sample:
import pandas as pd
df = pd.DataFrame({'Date': {0: '2016-01-04',
1: '2016-01-05',
2: '2016-01-06',
3: '2016-01-07',
4: '2016-01-08'}})
Code:
def get_date_range(holidays):
    h = [pd.to_datetime(x) for x in holidays]
    h = [pd.date_range(x - pd.DateOffset(6), x + pd.DateOffset(6)) for x in h]
    h = [x.strftime('%Y-%m-%d') for y in h for x in y]
    return h
df['Within a week of Holiday'] = df['Date'].isin(get_date_range(holidays))*1
Result:
Out[141]:
0 1
1 1
2 1
3 1
4 0
Name: Within a week of Holiday, dtype: int32

Python adding row into dataframe using while loop

I have a dataset like this:
user_id lapsed_date start_date end_date
0 A123 2020-01-02 2019-01-02 2019-02-02
1 A123 2020-01-02 2019-02-02 2019-03-02
2 B456 2019-10-01 2019-08-01 2019-09-01
3 B456 2019-10-01 2019-09-01 2019-10-01
generated by this code:
import pandas as pd
sample = {'user_id': ['A123','A123','B456','B456'],
'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
'start_date' : ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
'end_date' : ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']
}
df = pd.DataFrame(sample,columns= ['user_id', 'lapsed_date', 'start_date', 'end_date'])
df['lapsed_date'] = pd.to_datetime(df['lapsed_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
I'm trying to write a function to achieve this:
user_id lapsed_date start_date end_date
0 A123 2020-01-02 2019-01-02 2019-02-02
1 A123 2020-01-02 2019-02-02 2019-03-02
2 A123 2020-01-02 2019-03-02 2019-04-02
3 A123 2020-01-02 2019-04-02 2019-05-02
4 A123 2020-01-02 2019-05-02 2019-06-02
5 A123 2020-01-02 2019-06-02 2019-07-02
6 A123 2020-01-02 2019-07-02 2019-08-02
7 A123 2020-01-02 2019-08-02 2019-09-02
8 A123 2020-01-02 2019-09-02 2019-10-02
9 A123 2020-01-02 2019-10-02 2019-11-02
10 A123 2020-01-02 2019-11-02 2019-12-02
11 A123 2020-01-02 2019-12-02 2020-01-02
12 B456 2019-10-01 2019-08-01 2019-09-01
13 B456 2019-10-01 2019-09-01 2019-10-01
Essentially, the function should keep adding rows for each user_id while max(end_date) is less than lapsed_date. Each newly added row takes the previous row's end_date as its start_date, and the previous row's end_date + 1 month as its end_date.
I have generated this function below.
def add_row(x):
    while x['end_date'].max() < x['lapsed_date'].max():
        next_month = x['end_date'].max() + pd.DateOffset(months=1)
        last_row = x.iloc[-1]
        last_row['start_date'] = x['end_date'].max()
        last_row['end_date'] = next_month
        return x.append(last_row)
    return x
It works with all the logic above, except the while loop doesn't work. So I have to apply this function using this apply command manually 10 times:
df = df.groupby('user_id').apply(add_row).reset_index(drop = True)
I'm not really sure what I did wrong with the while loop there. Any advice would be highly appreciated!
There are a few reasons your loop did not work; I will explain them as we go.
def add_row(x):
    while x['end_date'].max() < x['lapsed_date'].max():
        next_month = x['end_date'].max() + pd.DateOffset(months=1)
        last_row = x.iloc[-1]
        last_row['start_date'] = x['end_date'].max()
        last_row['end_date'] = next_month
        return x.append(last_row)
    return x
In the code above, return is called inside the while loop, which hands the result back to the caller immediately. The loop therefore stops after its first iteration and returns the result of the first append.
Another caveat is that DataFrame.append() does not modify the data frame in place. It returns a new data frame, so you need to reassign the result: x = x.append(last_row). See the pandas documentation for DataFrame.append.
Secondly, I noted that this may need to be done for several distinct user_id values. For that reason, the code below splits the data frame into one frame per unique user_id.
Here is how you can get this to work:
import pandas as pd
from pandas import DataFrame
def add_row(df):
    while df['end_date'].max() < df['lapsed_date'].max():
        # .iloc[0] reads the first row by position, so it also works when the group's index does not start at 0
        new_row = {'user_id': df['user_id'].iloc[0],
                   'lapsed_date': df['lapsed_date'].max(),
                   'start_date': df['end_date'].max(),
                   'end_date': df['end_date'].max() + pd.DateOffset(months=1),
                   }
        df = df.append(new_row, ignore_index=True)
    return df  # Note: return is called OUTSIDE the while loop, ensuring only the final result is returned.
sample = {'user_id': ['A123','A123','B456','B456'],
'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
'start_date' : ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
'end_date' : ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']
}
df = pd.DataFrame(sample,columns= ['user_id', 'lapsed_date', 'start_date', 'end_date'])
df['lapsed_date'] = pd.to_datetime(df['lapsed_date'])
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
ids = df['user_id'].unique()
g = df.groupby(['user_id'])
result = pd.DataFrame(columns= ['user_id', 'lapsed_date', 'start_date', 'end_date'])
for i in ids:
    group = g.get_group(i)
    result = result.append(add_row(group), ignore_index=True)
print(result)
Split the frames based on unique user id's
Create empty data frame to store result in under result
Iterate over all user_id's
Run the same while loop, ensuring that df is updated with the append rows
Return the result and print
Hope this helps!
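DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the code above needs an older pandas. A minimal sketch of the same loop using pd.concat follows. The function name extend_periods is hypothetical, and it collects the new rows in a list and concatenates once per group rather than appending row by row.

```python
import pandas as pd


def extend_periods(group):
    """Return the group plus monthly rows until end_date reaches lapsed_date."""
    user_id = group['user_id'].iloc[0]
    lapsed = group['lapsed_date'].max()
    start = group['end_date'].max()
    new_rows = []
    while start < lapsed:
        end = start + pd.DateOffset(months=1)
        new_rows.append({'user_id': user_id, 'lapsed_date': lapsed,
                         'start_date': start, 'end_date': end})
        start = end  # the next period begins where this one ended
    if not new_rows:
        return group
    return pd.concat([group, pd.DataFrame(new_rows)], ignore_index=True)


sample = {'user_id': ['A123', 'A123', 'B456', 'B456'],
          'lapsed_date': ['2020-01-02', '2020-01-02', '2019-10-01', '2019-10-01'],
          'start_date': ['2019-01-02', '2019-02-02', '2019-08-01', '2019-09-01'],
          'end_date': ['2019-02-02', '2019-03-02', '2019-09-01', '2019-10-01']}
df = pd.DataFrame(sample)
for col in ['lapsed_date', 'start_date', 'end_date']:
    df[col] = pd.to_datetime(df[col])

# Extend each user's rows separately, then stack the groups back together.
result = pd.concat([extend_periods(g) for _, g in df.groupby('user_id')],
                   ignore_index=True)
print(result)
```

For the sample data this yields 14 rows: 12 for A123, ending on 2020-01-02, and the original 2 for B456.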

Add months to a date in Pandas

I'm trying to figure out how to add 3 months to a date in a Pandas dataframe, while keeping it in the date format, so I can use it to lookup a range.
This is what I've tried:
#create dataframe
df = pd.DataFrame([pd.Timestamp('20161011'),
pd.Timestamp('20161101') ], columns=['date'])
#create a future month period
plus_month_period = 3
#calculate date + future period
df['future_date'] = plus_month_period.astype("timedelta64[M]")
However, I get the following error:
AttributeError: 'int' object has no attribute 'astype'
You could use pd.DateOffset
In [1756]: df.date + pd.DateOffset(months=plus_month_period)
Out[1756]:
0 2017-01-11
1 2017-02-01
Name: date, dtype: datetime64[ns]
Details
In [1757]: df
Out[1757]:
date
0 2016-10-11
1 2016-11-01
In [1758]: plus_month_period
Out[1758]: 3
Suppose you have a dataframe of the following format, where you have to add integer months to a date column.
Start_Date  Months_to_add
2014-06-01  23
2014-06-01  4
2000-10-01  10
2016-07-01  3
2017-12-01  90
2019-01-01  2
In such a scenario, Zero's code or mattblack's code won't help, because the number of months differs per row. One option is to apply a lambda function over the rows, where the function takes two arguments:
A date to which the months are added
A number of months, as an integer
You can use the following function:
# Importing required modules
from dateutil.relativedelta import relativedelta
# Defining the function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
After this you can use the following code snippet to add months to the Start_Date column. Use progress_apply functionality of Pandas. Refer to this Stackoverflow answer on progress_apply : Progress indicator during pandas operations.
from tqdm import tqdm
tqdm.pandas()
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
Here's the full code, starting from dataset creation, for your reference:
import pandas as pd
from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()
# Initialize a new dataframe
df = pd.DataFrame()
# Add Start Date column
df["Start_Date"] = ['2014-06-01T00:00:00.000000000',
'2014-06-01T00:00:00.000000000',
'2000-10-01T00:00:00.000000000',
'2016-07-01T00:00:00.000000000',
'2017-12-01T00:00:00.000000000',
'2019-01-01T00:00:00.000000000']
# To convert the date column to a datetime format
df["Start_Date"] = pd.to_datetime(df["Start_Date"])
# Add months column
df["Months_to_add"] = [23, 4, 10, 3, 90, 2]
# Defining the Add Months function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
# Apply function on the dataframe using lambda operation.
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
You will have the final output dataframe as follows.
Start_Date  Months_to_add  End_Date
2014-06-01  23             2016-05-01
2014-06-01  4              2014-10-01
2000-10-01  10             2001-08-01
2016-07-01  3              2016-10-01
2017-12-01  90             2025-06-01
2019-01-01  2              2019-03-01
Please add to comments if there are any issues with the above code.
All the best!
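The progress bar is optional. A hedged sketch of the same row-wise addition using only pandas follows; it swaps in pd.DateOffset inside a list comprehension in place of relativedelta and progress_apply, so it is a variation rather than the answer's code. The three rows are a subset of the sample data above.

```python
import pandas as pd

df = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2014-06-01", "2000-10-01", "2017-12-01"]),
    "Months_to_add": [23, 10, 90],
})

# Add a per-row number of calendar months; int() guards against NumPy integer types.
df["End_Date"] = [
    start + pd.DateOffset(months=int(months))
    for start, months in zip(df["Start_Date"], df["Months_to_add"])
]
print(df["End_Date"].dt.strftime("%Y-%m-%d").tolist())
# ['2016-05-01', '2001-08-01', '2025-06-01']
```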
I believe that the simplest and fastest way to solve this is to convert the date to monthly periods with to_period('M'), add the values of the Months_to_add column, and then convert the result back to datetime with .dt.to_timestamp().
Using a variation of the sample data created by @Aruparna Maity:
Start_Date  Months_to_add
2014-06-01  23
2014-06-20  4
2000-10-01  10
2016-07-05  3
2017-12-15  90
2019-01-01  2
df['End_Date'] = ((df['Start_Date'].dt.to_period('M')) + df['Months_to_add']).dt.to_timestamp()
df.head(6)
#output
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-01
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-01
4 2017-12-15 90 2025-06-01
5 2019-01-01 2 2019-03-01
If the exact day is needed, repeat the process with daily periods, adding back the day of the month from Start_Date:
df['End_Date'] = ((df['End_Date'].dt.to_period('D')) + df['Start_Date'].dt.day -1).dt.to_timestamp()
#output:
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-20
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-05
4 2017-12-15 90 2025-06-15
5 2019-01-01 2 2019-03-01
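Another way, using NumPy timedelta64. Here the 'M' unit is converted to an average month length (about 30.44 days) rather than a calendar month, which is why the results below carry a 07:27:18 time component and land a day earlier than the pd.DateOffset results: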
df['date'] + np.timedelta64(plus_month_period, 'M')
0 2017-01-10 07:27:18
1 2017-01-31 07:27:18
Name: date, dtype: datetime64[ns]
