Error while subtracting 2 date columns in pandas - python

I have a dataframe and a function to generate random dates:
from datetime import date, timedelta
import pandas as pd
import random
def dates(start_date, end_date):
    start_date = date(start_date[0], start_date[1], start_date[2])
    end_date = date(end_date[0], end_date[1], end_date[2])
    days_delta = (end_date - start_date).days
    return start_date + timedelta(days=random.randrange(days_delta))
df = pd.DataFrame(index=range(100))
df['MOVE_OUT_DATE'] = date(9999, 12, 31)
df['MOVE_IN_DATE'] = [dates((2021, 1, 1), (2021, 6, 30)) for _ in range(df.shape[0])]
To get the difference in days I do this,
df['days_diff'] = df['MOVE_OUT_DATE'] - df['MOVE_IN_DATE']
and this works fine in VS Code, but it throws a "Python int too large to convert to C long" error in Databricks.
Any help or suggestion is appreciated. Thank you.

I was able to get the following to work, and I believe it is what you are trying to accomplish with your code:
df = pd.DataFrame(pd.date_range('2021-01-01', '2021-06-01', freq = 'D'), columns = ['START_DATE'])
df['MOVE_OUT_DATE'] = '2260-12-31'
df['START_DATE'] = pd.to_datetime(df['START_DATE'])
df['MOVE_OUT_DATE'] = pd.to_datetime(df['MOVE_OUT_DATE'])
df['DAYS_DIFF'] = df['MOVE_OUT_DATE'] - df['START_DATE']
df
However, note that 'MOVE_OUT_DATE' is only set to the year 2260, as anything much later than that produced an error for being too large. Could you take this and generate the results you want (for example, by converting it into a function)?
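For context, pandas timestamps at nanosecond resolution stop in the year 2262, which is why a sentinel such as 9999-12-31 cannot be converted. If the sentinel has to stay, one possible workaround is to keep both columns as plain datetime.date objects and subtract them pair by pair. This is only a sketch: the two sample move-in dates are made up, and it assumes the columns hold Python date objects as in the question.

```python
from datetime import date

import pandas as pd

# Hypothetical sample data; the question generates MOVE_IN_DATE randomly
df = pd.DataFrame({"MOVE_IN_DATE": [date(2021, 1, 1), date(2021, 6, 29)]})
df["MOVE_OUT_DATE"] = [date(9999, 12, 31)] * len(df)

# Subtract plain Python date objects pair by pair, so nothing is
# converted to a nanosecond-resolution pandas timestamp
df["days_diff"] = [
    (move_out - move_in).days
    for move_out, move_in in zip(df["MOVE_OUT_DATE"], df["MOVE_IN_DATE"])
]

print(pd.Timestamp.max)  # upper bound of nanosecond-resolution timestamps
print(df)
```

The result is an integer column of day counts rather than a timedelta column, which sidesteps the overflow at the cost of a Python-level loop.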

Related

Python Network Hours

Really struggling with this one so any help would be much appreciated.
GOAL - work out the hours between two datetime columns, excluding weekends and counting only the hours between the working times of 9 and 17.
I have reused a function that I use for network days, but the output is wrong and I can't figure out how to get it working.
As an example, my data has a start date and end date as follows:
Start_Date = 2017-07-11 19:33:00
End_Date = 2017/07/12 12:01:00
and the output I'm after is
3.02
However the function I do have is returning 16!
Function below -
start = pd.Series(start)
end = pd.Series(end)
mask = (pd.notnull(start) & pd.notnull(end)) & (start.dt.hour >= 9) & (end.dt.hour <= 17) & (start.dt.weekday < 5) & (end.dt.weekday < 5)
result = np.empty(len(start), dtype=float)
result.fill(np.nan)
result[mask] = np.where((start[mask].dt.hour >= 9) & (end[mask].dt.hour <= 17), (end[mask] - start[mask]).astype('timedelta64[h]').astype(float), 0)
return result
It looks like what you need is businesstimedelta
import datetime
import businesstimedelta
start = datetime.datetime.strptime("2017-07-11 19:33:00", "%Y-%m-%d %H:%M:%S")
end = datetime.datetime.strptime("2017-07-12 12:01:00", "%Y-%m-%d %H:%M:%S")
# Define a working day rule
workday = businesstimedelta.WorkDayRule(
    start_time=datetime.time(9),
    end_time=datetime.time(17),
    working_days=[0, 1, 2, 3, 4])
businesshours = businesstimedelta.Rules([workday])
# Calculate the difference
diff = businesshours.difference(start, end)
print(diff)
Output:
<BusinessTimeDelta 3 hours 60 seconds>
https://pypi.org/project/businesstimedelta/
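If adding a dependency is not an option, the same rule (Monday to Friday, 9 to 17) can be sketched with the standard library alone. The function name business_hours_between and its parameters are made up for this illustration, and it ignores holidays and time zones.

```python
from datetime import datetime, time, timedelta


def business_hours_between(start, end, open_t=time(9), close_t=time(17)):
    """Hours between start and end that fall on Mon-Fri between open_t and close_t."""
    if end <= start:
        return 0.0
    total = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # 0-4 are Monday to Friday
            # Clamp the interval to this day's working window
            window_start = max(start, datetime.combine(day, open_t))
            window_end = min(end, datetime.combine(day, close_t))
            if window_end > window_start:
                total += window_end - window_start
        day += timedelta(days=1)
    return total.total_seconds() / 3600


start = datetime(2017, 7, 11, 19, 33)
end = datetime(2017, 7, 12, 12, 1)
print(round(business_hours_between(start, end), 2))
```

For the example dates in the question, only the stretch from 09:00 to 12:01 on the second day is counted, which rounds to the 3.02 hours the asker expects.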
I really struggled to work out how to apply the above in a function, but after much banging of the head I came up with the code below. Sharing it for the next person in my situation. I wanted the result in minutes, so if that is not required just remove the * 60 in the return statement.
import datetime
import businesstimedelta
import numpy as np
import pandas as pd
# Define a working day
workday = businesstimedelta.WorkDayRule(
    start_time=datetime.time(9),
    end_time=datetime.time(17),
    working_days=[0, 1, 2, 3, 4])
# Combine the two
businesshrs = businesstimedelta.Rules([workday])
def business_Mins(df, start, end):
    try:
        # Only compute a difference where both timestamps are present
        mask = pd.notnull(df[start]) & pd.notnull(df[end])
        result = np.empty(len(df), dtype=object)
        result[mask] = df.loc[mask].apply(lambda x: businesshrs.difference(x[start], x[end]).hours, axis=1)
        result[~mask] = np.nan
        return result * 60
    except KeyError as e:
        print(f"Error: One or more columns not found in the dataframe - {e}")
        return None
df['Contact_SLA'] = business_Mins(df, 'Date and Time of Instruction', 'Date and Time of Attempted Contact')

How to exclude weekends and holidays from finding the difference between two dates in python

I need to find the difference between two dates where certain end dates are blank. I need to exclude weekends as well as holidays when calculating the difference, and I also need to take the blank end dates into account.
I have a data frame which looks like:
start_date    end_date
01-01-2020    05-01-2020
30-10-2021    NaT
15-08-2019    NaT
29-06-2020    15-07-2020
I wrote the following code to retrieve the holidays:
df = read_excel(r'dates.xlsx')
df.head()
us_holidays = holidays.UnitedStates()
The following code works around the null values and it excludes the weekends
def business_days(start, end):
    mask = pd.notnull(start) & pd.notnull(end)
    start = start.values.astype('datetime64[D]')[mask]
    end = end.values.astype('datetime64[D]')[mask]
    holi = us_holidays.values.astype('datetime64[D]')[mask]
    result = np.empty(len(mask), dtype=float)
    result[mask] = np.busday_count(start, end, holidays= holi)
    result[~mask] = np.nan
    return result
df['count'] = business_days(df['start_date'], df['end_date'])
The error I get is:
AttributeError: 'builtin_function_or_method' object has no attribute 'astype'
How can I fix the following error?
Any help will be greatly appreciated, thanks.
I'm not familiar with the holidays package, but holidays.UnitedStates() seems to return an object and not the needed list of dates. However, you can create a list of holiday dates for a certain range of years.
I'm not sure why you get "NaT"; usually you get NaNs. But you can handle both.
One way to do it:
import holidays
import pandas as pd
import numpy as np
import datetime
#Create Dummy DataFrame:
df = pd.DataFrame(columns=['start_date','end_date'])
df['start_date'] = np.array(["2020-01-01","2021-10-30","2019-08-15","2020-06-29"])
df['end_date'] = np.array(["2020-01-05","NaT","NaT", "2020-07-15"])
#Convert Columns to datetime
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
#Convert DateTime to Date
df['start_date'] = df['start_date'].dt.date
df['end_date'] = df['end_date'].dt.date
#Not sure why you get NaT when you read the file with pandas. So replace it with today:
df = df.replace({'NaT': datetime.date.today()})
#In case you get a NaN:
df = df.fillna(datetime.date.today())
#Get First and Last Year
max_year = df.max().max().year
min_year = df.min().min().year
#holidays.US returns an object, so you have to create a list
us_holidays = list()
for date,name in sorted(holidays.US(years=list(range(min_year,max_year+1))).items()):
    us_holidays.append(date)
start_dates = list(df['start_date'].values)
end_dates = list(df['end_date'].values)
df['count'] = np.busday_count(start_dates, end_dates, holidays = us_holidays)
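If the rows with a blank end date should stay blank instead of being counted up to today, a masked variant of the asker's function is one option. The sketch below uses a short hand-written holiday list purely for illustration (in practice it would come from the holidays package as shown above), and the names business_days and holiday_dates are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2020-01-01", "2021-10-30", "2019-08-15", "2020-06-29"]),
    "end_date": pd.to_datetime(["2020-01-05", None, None, "2020-07-15"]),
})

# Hypothetical holiday list for illustration only
holiday_dates = np.array(["2020-01-01", "2020-07-03"], dtype="datetime64[D]")


def business_days(start, end, holiday_dates):
    # Rows where either date is missing keep NaN in the result
    mask = (start.notna() & end.notna()).to_numpy()
    result = np.full(len(start), np.nan)
    result[mask] = np.busday_count(
        start.to_numpy()[mask].astype("datetime64[D]"),
        end.to_numpy()[mask].astype("datetime64[D]"),
        holidays=holiday_dates,
    )
    return result


df["count"] = business_days(df["start_date"], df["end_date"], holiday_dates)
print(df)
```

Note that np.busday_count treats the end date as exclusive, so add one day to the end dates if the last day should be counted.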

Create new column based on multiple conditions of existing column while manipulating existing column

I am new to Python/pandas coming from an R background. I am having trouble understanding how I can manipulate an existing column to create a new column based on multiple conditions of the existing column. There are 10 different conditions that need to be met, but for simplicity I will use a two-case scenario.
In R:
install.packages("lubridate")
library(lubridate)
df <- data.frame("Date" = c("2020-07-01", "2020-07-15"))
df$Date <- as.Date(df$Date, format = "%Y-%m-%d")
df$Fiscal <- ifelse(day(df$Date) > 14,
                    paste0(year(df$Date),"-",month(df$Date) + 1,"-01"),
                    paste0(year(df$Date),"-",month(df$Date),"-01")
)
df$Fiscal <- as.Date(df$Fiscal, format = "%Y-%m-%d")
In Python I have:
import pandas as pd
import datetime as dt
df = {'Date': ['2020-07-01', '2020-07-15']}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'], yearfirst = True, format = "%Y-%m-%d")
df.loc[df['Date'].dt.day > 14,
       'Fiscal'] = "-".join([dt.datetime.strftime(df['Date'].dt.year), dt.datetime.strftime(df['Date'].dt.month + 1),"01"])
df.loc[df['Date'].dt.day <= 14,
       'Fiscal'] = "-".join([dt.datetime.strftime(df['Date'].dt.year), dt.datetime.strftime(df['Date'].dt.month),"01"])
If I don't convert the 'Date' field it says that it expects a string, however if I do convert the date field, I still get an error as it seems it is applying to a 'Series' object.
TypeError: descriptor 'strftime' for 'datetime.date' objects doesn't apply to a 'Series' object
I understand I may have some terminology or concepts incorrect and apologize, however the answers I have seen dealing with creating a new column with multiple conditions do not seem to be manipulating the existing column they are checking the condition on, and simply taking on an assigned value. I can only imagine there is a more efficient way of doing this that is less 'R-ey' but I am not sure where to start.
This isn't intended as a full answer, just as an illustration of how strftime works: strftime is a method of a date(time) object that takes a format string as its argument:
import pandas as pd
import datetime as dt
df = {'Date': ['2020-07-01', '2020-07-15']}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'], yearfirst = True, format = "%Y-%m-%d")
s = [dt.date(df['Date'][i].year, df['Date'][i].month + 1, 1).strftime('%Y-%m-%d')
     for i in df['Date'].index]
print(s)
Result:
['2020-08-01', '2020-08-01']
Again: No full answer, just a hint.
EDIT: You can vectorise this, for example by:
import pandas as pd
import datetime as dt
df = {'Date': ['2020-07-01', '2020-07-15']}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'], yearfirst=True, format='%Y-%m-%d')
df['Fiscal'] = df['Date'].apply(lambda d: dt.date(d.year, d.month, 1)
                                if d.day < 15 else
                                dt.date(d.year, d.month + 1, 1))
print(df)
Result:
        Date      Fiscal
0 2020-07-01  2020-07-01
1 2020-07-15  2020-08-01
Here I'm using an on-the-fly lambda function. You could also do it with an externally defined function:
def to_fiscal(date):
    if date.day < 15:
        return dt.date(date.year, date.month, 1)
    return dt.date(date.year, date.month + 1, 1)
df['Fiscal'] = df['Date'].apply(to_fiscal)
In general, vectorisation is better than looping over rows because the looping is done at a lower level, which is much more efficient.
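Two caveats on the snippets above: Series.apply still calls the Python function once per row, and month + 1 raises a ValueError for dates in December. A column-wise alternative that also rolls over into the next year might look like the sketch below. The third sample date is added here only to show the December case, and the intermediate names month_start and next_month_start are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2020-07-01", "2020-07-15", "2020-12-20"])})

# First day of each date's own month
month_start = df["Date"].dt.to_period("M").dt.to_timestamp()
# First day of the following month; the offset handles December -> January
next_month_start = month_start + pd.DateOffset(months=1)

# Keep month_start where day <= 14, otherwise take next_month_start
df["Fiscal"] = month_start.where(df["Date"].dt.day <= 14, next_month_start)
print(df)
```

With ten conditions instead of two, the same idea should extend by building each candidate column once and selecting between them, for example with numpy.select.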
Until someone tells me otherwise I will do it this way. If there's a way to do it vectorized (or just a better way in general) I would greatly appreciate it
import pandas as pd
import datetime as dt
df = {'Date': ['2020-07-01', '2020-07-15']}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'], yearfirst=True, format='%Y-%m-%d')
test_list = list()
for i in df['Date'].index:
    mth = df['Date'][i].month
    yr = df['Date'][i].year
    dy = df['Date'][i].day
    if(dy > 14):
        new_date = dt.date(yr, mth + 1, 1)
    else:
        new_date = dt.date(yr, mth, 1)
    test_list.append(new_date)
df['New_Date'] = test_list

How to convert datetime to timestamp and calculate difference between dates using lambda function

I need to convert a variable I created from a datetime into a timestamp.
I need it in timestamp format so that I can apply a lambda function to my pandas series, which is stored as datetime64.
The lambda function should find the difference in months between startDate and every value in the pandas series. Please help?
I've tried using relativedelta to calculate the difference in months but I'm not sure how to implement it with a pandas series.
from datetime import datetime
import pandas as pd
from dateutil.relativedelta import relativedelta as rd
#open the data set and store in the series ('df')
file = pd.read_csv("test_data.csv")
df = pd.DataFrame(file)
#extract column "AccountOpenedDate" into a series
open_date_data = pd.to_datetime(df['AccountOpenedDate'], format = '%Y/%m/%d')
#set the variable startDate
dateformat = '%Y/%m/%d %H:%M:%S'
set_date = datetime.strptime('2017/07/01 00:00:00',dateformat)
startDate = datetime.timestamp(set_date)
#This function calculates the difference in months between two dates: ignore
def month_delta(start_date, end_date):
    delta = rd(end_date, start_date)
    # >>> relativedelta(years=+2, months=+3, days=+28)
    return 12 * delta.years + delta.months
d1 = datetime(2017, 7, 1)
d2 = datetime(2019, 10, 29)
total_months = month_delta(d1, d2)
# Apply a lambda function to each value in the series
dfobj = open_date_data.apply(lambda x: x + startDate)
print(dfobj)
I'm only using a single column from the loaded data set. It's a date column in the following format ("%Y/%m/%d %H:%M:%S"). I want to find the difference in months between startDate and all the dates in the series.
As I don't have your original csv, I've made up some sample data and hopefully managed to shorten your code quite a bit:
open_date_data = pd.Series(pd.date_range('2017/07/01', periods=10, freq='M'))
startDate = pd.Timestamp("2017/07/01")
Then, with help from this answer to get the appropriate month_diff formula:
def month_diff(a, b):
    return 12 * (a.year - b.year) + (a.month - b.month)
open_date_data.apply(lambda x: month_diff(x, startDate))
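The same formula can also be written without apply by using the .dt accessor, which operates on the whole series at once. A small sketch with made-up dates (the variable names are illustrative):

```python
import pandas as pd

# Hypothetical sample dates standing in for the AccountOpenedDate column
open_date_data = pd.Series(pd.to_datetime(["2017-07-31", "2018-01-31", "2019-10-29"]))
start_date = pd.Timestamp("2017-07-01")

# Whole calendar months between start_date and each value, ignoring the day of month
months = 12 * (open_date_data.dt.year - start_date.year) + (
    open_date_data.dt.month - start_date.month
)
print(months.tolist())
```

Like month_diff above, this counts calendar-month boundaries and does not look at the day of the month.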

Python error: cannot add integral value to Timestamp without freq

I am trying to calculate the difference between two dates as an integer number of days, but I get the following error: "Cannot add integral value to Timestamp without freq". Here is the code:
from __future__ import print_function
try:
    import argparse
    flags = argparse.ArgumentParser(parents=[tools.argparser]).parse_args()
except ImportError:
    flags = None
import os
import datetime
import pandas_datareader.data as web
import numpy as np
import pandas as pd
def main():
    count = 0
    df = pd.DataFrame([])
    start = datetime.datetime(2017, 10, 11)
    end = datetime.datetime(2017, 10, 27)
    index_date = datetime.datetime(2017, 10, 11)
    symbols_list = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
    length = len(symbols_list)
    for num, ticker in enumerate(symbols_list, start=1):
        f = web.DataReader(ticker, 'yahoo', start, end)['Adj Close']
        f.ix[index_date]
        if count == 0:
            f = f.to_frame().reset_index()
            df = f
            df.columns = ['Date', ticker]
            length_df = len(df)
            sDate = df.iloc[:,-2] # Date data list
            print ('sDate[0] is: ', (sDate[0]))
            j = 0
            while j < len(sDate[j] - 1):
                date_delta = timedelta(sDate[j] - index_date)
                j += 1
It crashes at the last line:
date_delta = timedelta(sDate[j] - index_reference_date)
The error message is: "Cannot add integral value to Timestamp without freq".
I cannot understand what the problem is. The data types are:
sDate[0] is: 2017-10-06 00:00:00, and
index_date is: 2017-10-11 00:00:00
index_date type is: <type 'datetime.datetime'>
But note that:
sDate[0] type is: <class 'pandas._libs.tslib.Timestamp'>
So: Maybe the problem is here? Thanks for any help!
There is a typing error on this line:
while j < len(sDate[j] - 1):
sDate is a date data list, thus sDate[j] is a date (probably of type pandas.tslib.Timestamp) and its length does not make sense. So you probably want something like:
while j < len(sDate) - 1:
Maybe it's more appropriate to use a for loop, something like:
for dat in sDate[:-1]:
Edit: and then you need the things I wrote in the first answer.
The important thing may be the type of the difference sDate[j] - index_reference_date and how to pass it to timedelta constructor.
I believe this could be the solution:
date_delta = timedelta(microseconds=(sDate[j] - index_reference_date).delta)
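As a possible alternative, subtracting two timestamps already yields a Timedelta, so wrapping the result in timedelta() should not be necessary, and the integer day count can be read from its days attribute. A sketch with made-up dates standing in for the downloaded price index (s_date and day_diffs are illustrative names):

```python
from datetime import datetime

import pandas as pd

index_date = datetime(2017, 10, 11)
# Hypothetical dates in place of the 'Date' column from the data reader
s_date = pd.Series(pd.to_datetime(["2017-10-06", "2017-10-11", "2017-10-27"]))

# Single value: Timestamp minus datetime gives a Timedelta with a .days attribute
first_diff = (s_date[0] - index_date).days

# Whole column at once, no explicit loop needed
day_diffs = (s_date - pd.Timestamp(index_date)).dt.days
print(first_diff, day_diffs.tolist())
```

This avoids both the while loop and the timedelta constructor from the question.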
