I'm trying to get the number of days between two dates using the function below:
df['date'] = pd.to_datetime(df.date)
# Creating a function that returns the number of days
def calculate_days(date):
    today = pd.Timestamp('today')
    return today - date
# Apply the function to the column date
df['days'] = df['date'].apply(lambda x: calculate_days(x))
The result looks like this:
153 days 10:16:46.294037
but I want it to say 153. How do I handle this?
For performance, subtract the values directly instead of using apply, which avoids a Python-level loop. Series.rsub subtracts from the right side:
df['date'] = pd.to_datetime(df.date)
df['days'] = df['date'].rsub(pd.Timestamp('today')).dt.days
This works the same as:
df['days'] = (pd.Timestamp('today') - df['date']).dt.days
If you want to keep your own solution:
df['date'] = pd.to_datetime(df.date)
def calculate_days(date):
    today = pd.Timestamp('today')
    return (today - date).days
df['days'] = df['date'].apply(lambda x: calculate_days(x))
Or:
df['date'] = pd.to_datetime(df.date)
def calculate_days(date):
    today = pd.Timestamp('today')
    return (today - date)
df['days'] = df['date'].apply(lambda x: calculate_days(x)).dt.days
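To make the difference between the approaches above easier to check, here is a minimal sketch with a fixed reference timestamp instead of pd.Timestamp('today'). The sample dates and the name `reference` are assumptions for illustration only. It also shows that normalizing both sides to midnight counts calendar days regardless of the time of day.

```python
import pandas as pd

# Hypothetical sample data; the real df comes from the question.
df = pd.DataFrame({'date': ['2022-01-01', '2022-03-15']})
df['date'] = pd.to_datetime(df['date'])

# Fixed stand-in for pd.Timestamp('today') so the result is reproducible.
reference = pd.Timestamp('2022-06-03 10:16:46')

# Vectorised subtraction yields a timedelta Series; .dt.days keeps whole days.
df['days'] = (reference - df['date']).dt.days

# Dropping the time of day on both sides counts calendar days.
df['calendar_days'] = (reference.normalize() - df['date'].dt.normalize()).dt.days

print(df)
```

Note that .dt.days truncates the time component, so '153 days 10:16:46' becomes 153.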
df['date'] = pd.to_datetime(df.date)
a) pandas
(pd.Timestamp("today") - df.date).dt.days
b) this NumPy built-in function lets you choose a weekmask
np.busday_count(df.date.values.astype("datetime64[D]"), pd.Timestamp("today").date(), weekmask=[1,1,1,1,1,1,1])
Related
I have the following DF :
Date
01/07/2022
10/07/2022
20/07/2022
The date x is
12/07/2022
So basically the function should return
10/07/2022
I am trying to avoid looping over the whole column but I don't know how to specify that I want the max date before a given date.
max(DF['Date']) # Returns 20/07/2022
Try this:
d = '12/07/2022'
f = '%d/%m/%Y'
(pd.to_datetime(df['Date'],format=f)
.where(lambda x: x.lt(pd.to_datetime(d,format=f)))
.max())
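You can filter the dates with a boolean mask (this assumes df.Date is already a datetime column; the explicit format stops '12/07/2022' from being parsed as December 7):
df[df.Date < pd.to_datetime('12/07/2022', format='%d/%m/%Y')]
Then find the max:
max(df[df.Date < pd.to_datetime('12/07/2022', format='%d/%m/%Y')].Date)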
# Setting some stuff up
Date = ["01/07/2022", "10/07/2022", "20/07/2022"]
df = pd.DataFrame({"Date":Date})
df.Date = pd.to_datetime(df.Date, format='%d/%m/%Y')
target_date = pd.to_datetime("12/07/2022", format='%d/%m/%Y')
df = df.sort_values(by=["Date"]) # Sort by date
# Find all dates that are before target date, then choose the last one (i.e. the most recent one)
df.Date[df.Date < target_date][-1:].dt.date.values[0]
Output:
datetime.date(2022, 7, 10)
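The answers above can be wrapped in a small helper. This is only a sketch: the function name `latest_before` is hypothetical, and it assumes the column has already been parsed to datetime with an explicit day-first format. It returns NaT when no date is earlier than the target.

```python
import pandas as pd

def latest_before(dates: pd.Series, target: pd.Timestamp):
    """Return the latest date strictly before target, or NaT if there is none."""
    earlier = dates[dates < target]  # boolean mask, no explicit loop
    if earlier.empty:
        return pd.NaT
    return earlier.max()

# Sample data from the question, parsed as day/month/year.
fmt = '%d/%m/%Y'
dates = pd.to_datetime(pd.Series(['01/07/2022', '10/07/2022', '20/07/2022']), format=fmt)

result = latest_before(dates, pd.to_datetime('12/07/2022', format=fmt))
print(result.strftime(fmt))
```

No sorting is needed here, because max() scans only the filtered values.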
I am new to Python/pandas, coming from an R background. I am having trouble understanding how I can manipulate an existing column to create a new column based on multiple conditions on the existing column. There are 10 different conditions that need to be met, but for simplicity I will use a two-case scenario.
In R:
install.packages("lubridate")
library(lubridate)
df <- data.frame("Date" = c("2020-07-01", "2020-07-15"))
df$Date <- as.Date(df$Date, format = "%Y-%m-%d")
df$Fiscal <- ifelse(day(df$Date) > 14,
paste0(year(df$Date),"-",month(df$Date) + 1,"-01"),
paste0(year(df$Date),"-",month(df$Date),"-01")
)
df$Fiscal <- as.Date(df$Fiscal, format = "%Y-%m-%d")
In Python I have:
import pandas as pd
import datetime as dt
df = {'Date': ['2020-07-01', '2020-07-15']}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'], yearfirst = True, format = "%Y-%m-%d")
df.loc[df['Date'].dt.day > 14,
'Fiscal'] = "-".join([dt.datetime.strftime(df['Date'].dt.year), dt.datetime.strftime(df['Date'].dt.month + 1),"01"])
df.loc[df['Date'].dt.day <= 14,
'Fiscal'] = "-".join([dt.datetime.strftime(df['Date'].dt.year), dt.datetime.strftime(df['Date'].dt.month),"01"])
If I don't convert the 'Date' field it says that it expects a string, however if I do convert the date field, I still get an error as it seems it is applying to a 'Series' object.
TypeError: descriptor 'strftime' for 'datetime.date' objects doesn't apply to a 'Series' object
I understand I may have some terminology or concepts incorrect and apologize, however the answers I have seen dealing with creating a new column with multiple conditions do not seem to be manipulating the existing column they are checking the condition on, and simply taking on an assigned value. I can only imagine there is a more efficient way of doing this that is less 'R-ey' but I am not sure where to start.
This isn't intended as a full answer, just as an illustration how strftime works: strftime is a method of a date(time) object that takes a format-string as argument:
import pandas as pd
import datetime as dt
df = {'Date': ['2020-07-01', '2020-07-15']}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'], yearfirst = True, format = "%Y-%m-%d")
s = [dt.date(df['Date'][i].year, df['Date'][i].month + 1, 1).strftime('%Y-%m-%d')
     for i in df['Date'].index]
print(s)
Result:
['2020-08-01', '2020-08-01']
Again: No full answer, just a hint.
EDIT: You can avoid the explicit loop by using apply, for example:
import pandas as pd
import datetime as dt
df = {'Date': ['2020-07-01', '2020-07-15']}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'], yearfirst=True, format='%Y-%m-%d')
df['Fiscal'] = df['Date'].apply(lambda d: dt.date(d.year, d.month, 1)
                                if d.day < 15 else
                                # month % 12 + 1 rolls December over to January of the next year
                                dt.date(d.year + d.month // 12, d.month % 12 + 1, 1))
print(df)
Result:
Date Fiscal
0 2020-07-01 2020-07-01
1 2020-07-15 2020-08-01
Here I'm using an on-the-fly lambda function. You could also do it with an externally defined function:
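def to_fiscal(date):
    if date.day < 15:
        return dt.date(date.year, date.month, 1)
    # Roll over to January of the next year when the month is December
    if date.month == 12:
        return dt.date(date.year + 1, 1, 1)
    return dt.date(date.year, date.month + 1, 1)
df['Fiscal'] = df['Date'].apply(to_fiscal)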
Note that apply still calls the Python function once per row. In general, truly vectorised operations are faster than looping over rows, because the loop then runs in compiled code at a lower level.
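For comparison, here is a sketch of a variant that uses only column-wise pandas operations and no per-row Python function. It assumes the same rule as in the question (day 15 or later moves to the first of the next month); the extra December row is an added test case, not part of the original data.

```python
import pandas as pd

# Dates from the question plus a hypothetical December row to show the year rollover.
df = pd.DataFrame({'Date': pd.to_datetime(['2020-07-01', '2020-07-15', '2020-12-20'])})

# First day of each date's own month.
month_start = df['Date'].dt.to_period('M').dt.to_timestamp()

# First day of the following month; DateOffset handles December -> January.
next_month_start = month_start + pd.DateOffset(months=1)

# Keep month_start where the day is 14 or less, otherwise take next_month_start.
df['Fiscal'] = month_start.where(df['Date'].dt.day <= 14, next_month_start)

print(df)
```

With 10 conditions instead of two, numpy.select with a list of boolean masks would be one way to extend the same idea.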
Until someone tells me otherwise I will do it this way. If there's a way to do it vectorized (or just a better way in general) I would greatly appreciate it
import pandas as pd
import datetime as dt
df = {'Date': ['2020-07-01', '2020-07-15']}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'], yearfirst=True, format='%Y-%m-%d')
test_list = list()
for i in df['Date'].index:
    mth = df['Date'][i].month
    yr = df['Date'][i].year
    dy = df['Date'][i].day
    if dy > 14:
        # Roll December over to January of the next year
        new_date = dt.date(yr + mth // 12, mth % 12 + 1, 1)
    else:
        new_date = dt.date(yr, mth, 1)
    test_list.append(new_date)
df['New_Date'] = test_list
I'm trying to compare 10 years of data. I would like to remove the 'year' from the datetime, so I can plot each January on top of each other.
I've tried the following
df_data = pd.read_csv("P11-B2.csv", skiprows=[i for i in range(1,35)], usecols=[1,2,4])
df = pd.DataFrame(columns = ['Datetime', 'FH'])
df1 = pd.to_datetime(df_data['YYYYMMDD'], format='%Y%m%d')
df2 = df_data[' HH'].astype('timedelta64[h]')
df['Datetime'] = df1 + df2
df['FH'] = pd.to_numeric(df_data[' FH'], errors ='coerce')
del df1
del df2
del df_data
df['month'] = pd.DatetimeIndex(df['Datetime']).month
df100 = pd.to_datetime(df['month'], format='%m')
df['day'] = pd.DatetimeIndex(df['Datetime']).day
df101 = pd.to_datetime(df['day'], format='%d')
df['hour'] = pd.DatetimeIndex(df['Datetime']).hour
df102 = df['hour'].astype('timedelta64[h]')
df['year'] = 1900
df104 = pd.to_datetime(df['year'], format='%Y')
#df['DATE'] = df104 + df100 + df101 + df102
df['DATE'] = df['year'] + df['month'] + df['day'] + df['hour']
Though this returns an integer.
Is there a different way to only remove the year and keep the %m%d%H format?
Or is there a simple way to override the x-axis and use the integer?
This is what I would like to plot:
I want to make a plot for each month, showing different lines for each year.
I have a python dataframe with 2 columns that contain dates as strings e.g. start_date "2002-06-12" and end_date "2009-03-01". I would like to calculate the difference (days) between these 2 columns for each row and save the results into a new column called for example time_diff of type float.
I have tried:
df["time_diff"] = (pd.Timestamp(df.end_date) - pd.Timestamp(df.start_date )).astype("timedelta64[d]")
pd.to_numeric(df["time_diff"])
based on some tutorials but this gives TypeError: Cannot convert input for the first line. What do I need to change to get this running?
Here is a working example that converts the string columns of a DataFrame to datetime and stores the difference between them in a new column as a float (number of seconds):
import pandas as pd
from datetime import timedelta
tmp = [("2002-06-12","2009-03-01"),("2016-04-28","2022-03-14")]
df = pd.DataFrame(tmp,columns=["col1","col2"])
df["col1"]=pd.to_datetime(df["col1"])
df["col2"]=pd.to_datetime(df["col2"])
df["time_diff"]=df["col2"]-df["col1"]
df["time_diff"]=df["time_diff"].apply(timedelta.total_seconds)
Time difference in seconds can be converted to minutes or days by using simple math.
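As a sketch of that conversion, using the column names from the question and the sample values from this answer: dividing the seconds by 86400 gives days as a float, and dividing the timedelta column by a one-day Timedelta is an equivalent shortcut.

```python
import pandas as pd

df = pd.DataFrame({
    'start_date': ['2002-06-12', '2016-04-28'],
    'end_date': ['2009-03-01', '2022-03-14'],
})

# Timedelta Series: end minus start.
delta = pd.to_datetime(df['end_date']) - pd.to_datetime(df['start_date'])

# Seconds as float, then 86400 seconds per day.
df['time_diff'] = delta.dt.total_seconds() / 86400

# Equivalent: divide the timedelta by one day directly.
time_diff_alt = delta / pd.Timedelta(days=1)

print(df['time_diff'].dtype)
```

Both results are float64, which matches the float column the question asks for.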
Try:
import numpy as np
import pandas as pd
# Convert both columns to datetime64 arrays, then divide by one day to get float days
enddates = pd.to_datetime(df.end_date).to_numpy()
startdates = pd.to_datetime(df.start_date).to_numpy()
df['time_diff'] = (enddates - startdates) / np.timedelta64(1, 'D')
First convert strings to datetime, then calculate difference in days.
df['start_date'] = pd.to_datetime(df['start_date'], format='%Y-%m-%d')
df['end_date'] = pd.to_datetime(df['end_date'], format='%Y-%m-%d')
df['time_diff'] = (df.end_date - df.start_date).dt.days
You can also do it by converting your columns into date and then computing the difference :
from datetime import datetime
df = pd.DataFrame({'Start Date' : ['2002-06-12', '2002-06-12' ], 'End date' : ['2009-03-01', '2009-03-06']})
df['Start Date'] = [ datetime.strptime(x, "%Y-%m-%d").date() for x in df['Start Date'] ]
df['End date'] = [ datetime.strptime(x, "%Y-%m-%d").date() for x in df['End date'] ]
df['Diff'] = df['End date'] - df['Start Date']
Out :
End date Start Date Diff
0 2009-03-01 2002-06-12 2454 days
1 2009-03-06 2002-06-12 2459 days
You should just use pd.to_datetime to convert your string values:
df["time_diff"] = (pd.to_datetime(df.end_date) - pd.to_datetime(df.start_date))
The result will automatically be a timedelta64 column.
You can try this :
df = pd.DataFrame()
df['Arrived'] = [pd.Timestamp('01-04-2017')]
df['Left'] = [pd.Timestamp('01-06-2017')]
diff = df['Left'] - df['Arrived']
days = pd.Series(delta.days for delta in diff)
result = days[0]
I am using a pandas DataFrame loaded from CSV files that contain dates. Let's say:
Assigned Date
1/15/2019
Resolved Date
1/20/2019
I am calculating the difference:
df0['ResDate'] = df0['Resolved Date'].apply(lambda t: pd.to_datetime(t).date())
df0['RepDate'] = df0['Assigned Date'].apply(lambda t: pd.to_datetime(t).date())
df0['Woda']=df0['ResDate']-df0['RepDate']
I am getting the correct difference, but I need to exclude the weekends from it.
How do I proceed?
Thanks
Use numpy.busday_count, which expects datetime64[D] values:
import numpy as np
df0['Assigned Date'] = pd.to_datetime(df0['Assigned Date'])
df0['Resolved Date'] = pd.to_datetime(df0['Resolved Date'])
df0['Woda'] = np.busday_count(df0['Assigned Date'].values.astype('datetime64[D]'),
                              df0['Resolved Date'].values.astype('datetime64[D]'))
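A self-contained sketch of this with the dates from the question; the second row is an added example, not original data. It assumes month/day/year strings and the default Monday to Friday weekmask. busday_count includes the start date and excludes the end date.

```python
import numpy as np
import pandas as pd

# First row is from the question; the second row is a hypothetical extra case.
df0 = pd.DataFrame({
    'Assigned Date': ['1/15/2019', '1/14/2019'],
    'Resolved Date': ['1/20/2019', '1/28/2019'],
})

fmt = '%m/%d/%Y'
# busday_count needs day-resolution datetime64 values, so cast explicitly.
start = pd.to_datetime(df0['Assigned Date'], format=fmt).values.astype('datetime64[D]')
end = pd.to_datetime(df0['Resolved Date'], format=fmt).values.astype('datetime64[D]')

# Number of Monday-Friday days in [start, end).
df0['Woda'] = np.busday_count(start, end)

print(df0)
```

Public holidays can be excluded as well through the holidays argument of busday_count.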
You can use the datetime module to find the difference between two dates:
import datetime
d1 = datetime.datetime.strptime('2019-01-15', '%Y-%m-%d')
d2 = datetime.datetime.strptime('2019-01-20', '%Y-%m-%d')
diff_days = (d2 - d1).days
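# Count only Monday-Friday from d1 (inclusive) to d2 (exclusive);
# subtracting two days per full week alone misses weekends in partial weeks.
diff_weekdays = sum((d1 + datetime.timedelta(days=i)).weekday() < 5 for i in range(diff_days))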
print(diff_weekdays)