I have a pandas DataFrame with the following content:
df =
start end
01/April 02/May
12/April 12/April
I need to add a column with the difference (in days) between end and start values (end - start).
How can I do it?
I tried the following:
import pandas as pd
df.startdate = pd.datetime(df.start, format='%B/%d')
df.enddate = pd.datetime(df.end, format='%B/%d')
But I'm not sure if this is the right direction.
import pandas as pd
df = pd.DataFrame({"start":["01/April", "12/April"], "end": ["02/May", "12/April"]})
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
df["diff"] = (df["end"] - df["start"])
Output:
end start diff
0 2018-05-02 2018-04-01 31 days
1 2018-04-12 2018-04-12 0 days
This is one way.
df['start'] = pd.to_datetime(df['start']+'/2018', format='%d/%B/%Y')
df['end'] = pd.to_datetime(df['end']+'/2018', format='%d/%B/%Y')
df['diff'] = df['end'] - df['start']
# start end diff
# 0 2018-04-01 2018-05-02 31 days
# 1 2018-04-12 2018-04-12 0 days
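If you want the difference as plain integers rather than Timedelta values, `.dt.days` should do it. This is a minimal sketch, assuming the month names are in English and that all dates fall in the same year (the year 2018 is an assumption, since the original strings carry no year):

```python
import pandas as pd

df = pd.DataFrame({"start": ["01/April", "12/April"], "end": ["02/May", "12/April"]})

# Assumption: the source strings have no year, so one is appended here.
# 2018 is arbitrary; the choice only matters if a range spans February.
year = "2018"
start = pd.to_datetime(df["start"] + "/" + year, format="%d/%B/%Y")
end = pd.to_datetime(df["end"] + "/" + year, format="%d/%B/%Y")

# .dt.days converts the Timedelta column into whole days (integers).
df["diff_days"] = (end - start).dt.days
print(df)
```

The original string columns are left untouched here, so the parsing can be rerun with a different year if needed.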
I'd like to get the difference between the end and start date columns, inclusive:
df = pd.DataFrame()
df['start'] = pd.to_datetime(['1/1/2020','1/2/2020'])
df['end'] = pd.to_datetime(['1/31/2020', '1/25/2020'])
df['diff'] = df['end'] - df['start']
So instead of
start end diff
0 2020-01-01 2020-01-31 30 days
1 2020-01-02 2020-01-25 23 days
I want to get 31 and 24 days. I can solve it by adding a 1-day Timedelta, but that seems a bit fragile. Is there any other way?
df['diff'] = df['end'] - df['start'] + pd.Timedelta(days=1)
Adding pd.Timedelta(days=1) is exactly what you need. I do the same when working with dates.
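If an integer day count is acceptable, the inclusive count can also be written as whole days plus one. This is a small sketch using the same dates as the question; the column name `diff_inclusive` is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "end": pd.to_datetime(["2020-01-31", "2020-01-25"]),
})

# Whole days between the dates, plus one so both endpoints are counted.
df["diff_inclusive"] = (df["end"] - df["start"]).dt.days + 1
print(df)
```

This is equivalent to adding a one-day Timedelta as long as the columns hold dates without a time component.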
I am trying to get the delta in months between a starting date and an ending date within Pandas DataFrame. The result is not totally satisfying...
First, the outcome is some sort of Datetime type in the form of <[value] * MonthEnds>, which I can't use in calculations. My first question is how to convert this to an integer. I tried the .n attribute, but then I get the following error:
AttributeError: 'Series' object has no attribute 'n'
Second, the outcome is 'missing' one month. Can this be avoided by using another solution/method? Or should I just add 1 month to the answer?
To support my questions I created some simplified code:
dates = [{'Start':'1-1-2020', 'End':'31-10-2020'}, {'Start':'1-2-2020', 'End':'30-11-2020'}]
df = pd.DataFrame(dates)
df['Start'] = pd.to_datetime(df['Start'], dayfirst=True)
df['End'] = pd.to_datetime(df['End'], dayfirst=True)
df['Duration'] = (df['End'].dt.to_period('M') - df['Start'].dt.to_period('M'))
df
This results in:
Start End Duration
0 2020-01-01 2020-10-31 <9 * MonthEnds>
1 2020-02-01 2020-11-30 <9 * MonthEnds>
The preferred result would be:
Start End Duration
0 2020-01-01 2020-10-31 10
1 2020-02-01 2020-11-30 10
Subtract the start-date from the end-date and convert the time delta to months.
import pandas as pd
dates = [{'Start':'1-1-2020', 'End':'31-10-2020'}, {'Start':'1-2-2020', 'End':'30-11-2020'}]
df = pd.DataFrame(dates)
df['Start'] = pd.to_datetime(df['Start'], dayfirst=True)
df['End'] = pd.to_datetime(df['End'], dayfirst=True)
df['Duration'] = (df['End']-df['Start']).astype('<m8[M]').astype(int)+1
print(df)
Output:
Start End Duration
0 2020-01-01 2020-10-31 10
1 2020-02-01 2020-11-30 10
Try this:
dates = [{'Start':'1-1-2020', 'End':'31-10-2020'}, {'Start':'1-2-2020', 'End':'30-11-2020'}]
df = pd.DataFrame(dates)
df['Start'] = pd.to_datetime(df['Start'], dayfirst=True)
df['End'] = pd.to_datetime(df['End'], dayfirst=True)
df['Duration'] = (df['End'] - df['Start']).apply(lambda x:x.days//30)
print(df)
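Another option, if you want calendar months rather than a day count divided by a month length, is to do the arithmetic on the year and month parts directly. This is only a sketch. The `+ 1` assumes that both the starting and the ending month should be counted, as in the preferred output, and the explicit `format` string is my assumption about the input layout:

```python
import pandas as pd

dates = [{'Start': '1-1-2020', 'End': '31-10-2020'},
         {'Start': '1-2-2020', 'End': '30-11-2020'}]
df = pd.DataFrame(dates)
df['Start'] = pd.to_datetime(df['Start'], format='%d-%m-%Y')
df['End'] = pd.to_datetime(df['End'], format='%d-%m-%Y')

# Calendar months between the two dates, ignoring the day of the month.
months_apart = ((df['End'].dt.year - df['Start'].dt.year) * 12
                + (df['End'].dt.month - df['Start'].dt.month))

# Count both the first and the last month.
df['Duration'] = months_apart + 1
print(df)
```

The result is an ordinary integer column, so it can be used in further calculations without touching the offset objects.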
I have a df like the following:
import datetime as dt
import pandas as pd
import pytz
cols = ['utc_datetimes', 'zone_name']
data = [
['2019-11-13 14:41:26,2019-12-18 23:04:12', 'Europe/Stockholm'],
['2019-12-06 21:49:04,2019-12-11 22:52:57,2019-12-18 20:30:58,2019-12-23 18:49:53,2019-12-27 18:34:23,2020-01-07 21:20:51,2020-01-11 17:36:56,2020-01-20 21:45:47,2020-01-30 20:48:49,2020-02-03 21:04:52,2020-02-07 20:05:02,2020-02-10 21:07:21', 'Europe/London']
]
df = pd.DataFrame(data, columns=cols)
print(df)
# utc_datetimes zone_name
# 0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm
# 1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London
For each row, I would like to count how many of its datetimes, converted to the row's local time, fall at night and how many fall on a Wednesday. This is the desired output:
utc_datetimes zone_name nights wednesdays
0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm 0 1
1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London 11 2
I've come up with the following double for loop, but it is not as efficient as I'd like it for the sizable df:
# New columns.
df['nights'] = 0
df['wednesdays'] = 0
for row in range(df.shape[0]):
    date_list = df['utc_datetimes'].iloc[row].split(',')
    user_time_zone = df['zone_name'].iloc[row]
    for date in date_list:
        datetime_obj = dt.datetime.strptime(
            date, '%Y-%m-%d %H:%M:%S'
        ).replace(tzinfo=pytz.utc)
        local_datetime = datetime_obj.astimezone(pytz.timezone(user_time_zone))
        # Get day of the week count:
        if local_datetime.weekday() == 2:
            df.loc[row, 'wednesdays'] += 1
        # Get time of the day count:
        if 17 < local_datetime.hour <= 23:
            df.loc[row, 'nights'] += 1
Any suggestions will be appreciated :)
P.S. Disregard the definition of 'night'; it's just an example.
One way is to first create a reference df by exploding your utc_datetimes column and then get the TimeDelta for each zone:
df = pd.DataFrame(data, columns=cols)
s = (df.assign(utc_datetimes=df["utc_datetimes"].str.split(","))
.explode("utc_datetimes"))
s["diff"] = [pd.Timestamp(a, tz=b).utcoffset() for a,b in zip(s["utc_datetimes"],s["zone_name"])]
With this helper df you can calculate the number of wednesdays and nights:
df["wednesdays"] = (pd.to_datetime(s["utc_datetimes"])+s["diff"]).dt.day_name().eq("Wednesday").groupby(level=0).sum()
df["nights"] = ((pd.to_datetime(s["utc_datetimes"])+s["diff"]).dt.hour>17).groupby(level=0).sum()
print (df)
#
utc_datetimes zone_name wednesdays nights
0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm 1.0 0.0
1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London 2.0 11.0
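A variation on the same explode idea is to parse the strings as timezone-aware UTC and let pandas convert them per zone with `tz_convert`, so the offset is resolved for each actual instant. This is a sketch rather than a benchmarked solution; the helper column names `row`, `utc`, `wednesday` and `night` are made up for illustration:

```python
import pandas as pd

cols = ["utc_datetimes", "zone_name"]
data = [
    ["2019-11-13 14:41:26,2019-12-18 23:04:12", "Europe/Stockholm"],
    ["2019-12-06 21:49:04,2019-12-11 22:52:57,2019-12-18 20:30:58,"
     "2019-12-23 18:49:53,2019-12-27 18:34:23,2020-01-07 21:20:51,"
     "2020-01-11 17:36:56,2020-01-20 21:45:47,2020-01-30 20:48:49,"
     "2020-02-03 21:04:52,2020-02-07 20:05:02,2020-02-10 21:07:21",
     "Europe/London"],
]
df = pd.DataFrame(data, columns=cols)

# One row per datetime; remember the original row number in "row".
s = df.assign(utc_datetimes=df["utc_datetimes"].str.split(",")).explode("utc_datetimes")
s = s.reset_index().rename(columns={"index": "row"})
s["utc"] = pd.to_datetime(s["utc_datetimes"], utc=True)

# Convert zone by zone: only one loop iteration per distinct time zone.
s["wednesday"] = False
s["night"] = False
for zone, part in s.groupby("zone_name"):
    local = part["utc"].dt.tz_convert(zone)
    s.loc[part.index, "wednesday"] = local.dt.dayofweek.eq(2)  # Monday is 0
    s.loc[part.index, "night"] = local.dt.hour.gt(17)          # 18:00 to 23:59

# Sum the boolean flags back onto the original rows.
counts = s.groupby("row")[["wednesday", "night"]].sum()
df["wednesdays"] = counts["wednesday"]
df["nights"] = counts["night"]
print(df[["zone_name", "wednesdays", "nights"]])
```

The Python-level loop runs once per distinct time zone rather than once per datetime, which is where most of the saving over the double loop should come from.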
I have a DataFrame with ID, Name and Time columns, where Name marks each row as Start or End.
The desired result is one row per ID with the time difference between Start and End.
I tried simple groupings and diffs, but they do not work:
df[df['Name'] == 'Start'].groupby('ID')['Time']-\
df[df['Name'] == 'End'].groupby('ID')['Time']
How can this be done in pandas? Thanks!
A possible solution is to join the table on itself like this:
df_start = df[df['Name'] == 'Start']
df_end = df[df['Name'] == 'End']
df_merge = df_start.merge(df_end, on='id', suffixes=('_start', '_end'))
df_merge['diff'] = df_merge['Time_end'] - df_merge['Time_start']
print(df_merge.to_string())
Output:
id Name_start Time_start Name_end Time_end diff
0 1 Start 2017-11-02 12:00:14 End 2017-11-07 22:45:13 5 days 10:44:59
1 2 Start 2018-01-28 06:53:09 End 2018-02-05 13:31:14 8 days 06:38:05
Here you go.
Generate data:
df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Name': ['Start', 'End', 'Start', 'End'],
                   'Time': [pd.Timestamp(2020, 1, 1, 0, 1, 0), pd.Timestamp(2020, 1, 2, 0, 0, 0),
                            pd.Timestamp(2020, 1, 1, 0, 0, 0), pd.Timestamp(2020, 1, 2, 0, 0, 0)]})
Get TimeDelta:
df_agg = df[df['Name'] == 'Start'].reset_index()[['ID', 'Time']]
df_agg = df_agg.rename(columns={"Time": "Start"})
df_agg['End'] = df[df['Name'] == 'End'].reset_index()['Time']
df_agg['TimeDelta'] = df_agg['End'] - df_agg['Start']
Get timediff as decimal value in days, like your example:
df_agg['TimeDiff_days'] = df_agg['TimeDelta'] / pd.Timedelta(days=1)
df_agg
Result:
ID Start End TimeDelta TimeDiff_days
0 1 2020-01-01 00:01:00 2020-01-02 0 days 23:59:00 0.999306
1 2 2020-01-01 00:00:00 2020-01-02 1 days 00:00:00 1.000000
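Yet another way to line up Start and End per ID is to pivot the Name values into columns. This is a sketch that assumes every ID has exactly one Start row and one End row (`pivot` raises an error on duplicates); the name `wide` is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Name': ['Start', 'End', 'Start', 'End'],
    'Time': pd.to_datetime(['2020-01-01 00:01:00', '2020-01-02 00:00:00',
                            '2020-01-01 00:00:00', '2020-01-02 00:00:00']),
})

# One row per ID, with the Start and End times side by side.
wide = df.pivot(index='ID', columns='Name', values='Time')
wide['TimeDelta'] = wide['End'] - wide['Start']

# Same difference as a decimal number of days.
wide['TimeDiff_days'] = wide['TimeDelta'] / pd.Timedelta(days=1)
print(wide)
```

Because the rows are matched on the ID index, this does not depend on the Start and End rows appearing in the same order.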
I have a pandas dataframe with three columns. A start and end date and a month.
I would like to add a column for how many days within the month are between the two dates. I started doing something with apply, the calendar library and some math, but it started to get really complex. I bet pandas has a simple solution, but am struggling to find it.
Input:
import pandas as pd
df1 = pd.DataFrame(data=[['2017-01-01', '2017-06-01', '2016-01-01'],
['2015-03-02', '2016-02-10', '2016-02-01'],
['2011-01-02', '2018-02-10', '2016-03-01']],
columns=['start date', 'end date date', 'Month'])
Desired Output:
start date end date date Month Days in Month
0 2017-01-01 2017-06-01 2016-01-01 0
1 2015-03-02 2016-02-10 2016-02-01 10
2 2011-01-02 2018-02-10 2016-03-01 31
One solution: build the list of dates between the start and end dates with pd.date_range, then count how many of them fall in the same year and month as the target month.
def overlap(row):
    # Target month as a (year, month) pair.
    md = pd.to_datetime(row['Month'])
    target = (md.year, md.month)
    # Every day from start to end, both inclusive.
    days = pd.date_range(row['start date'], row['end date date'])
    return sum((d.year, d.month) == target for d in days)
df1["Days in Month"] = df1.apply(overlap, axis=1)
You'll get:
start date end date date Month Days in Month
0 2017-01-01 2017-06-01 2016-01-01 0
1 2015-03-02 2016-02-10 2016-02-01 10
2 2011-01-02 2018-02-10 2016-03-01 31
You can convert every column to datetime with
df = df.apply(pd.to_datetime)
Then find the number of overlapping days with this function
def intersectionDaysInMonth(start, end, month):
    # First day of the following month (also works for December).
    end_month = month + pd.offsets.MonthBegin(1)
    # Overlap of [start, end] (end inclusive) with [month, end_month).
    overlap_start = max(start, month)
    overlap_end = min(end + pd.Timedelta(days=1), end_month)
    # No overlap gives a negative span, so clamp it to zero.
    return max(overlap_end - overlap_start, pd.Timedelta(0))
Then apply
df['Days in Month'] = df.apply(lambda row: intersectionDaysInMonth(*row).days, axis=1)
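For larger frames, the same overlap can be computed column-wise without `apply`. This is a sketch under two assumptions: the end date is inclusive, as in the desired output, and the Month column always holds the first day of a month. The variable names are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame(data=[['2017-01-01', '2017-06-01', '2016-01-01'],
                         ['2015-03-02', '2016-02-10', '2016-02-01'],
                         ['2011-01-02', '2018-02-10', '2016-03-01']],
                   columns=['start date', 'end date date', 'Month'])

start = pd.to_datetime(df1['start date'])
# Shift the end by one day so it acts as an exclusive upper bound.
end_excl = pd.to_datetime(df1['end date date']) + pd.Timedelta(days=1)
month_start = pd.to_datetime(df1['Month'])
next_month = month_start + pd.offsets.MonthBegin(1)

# Later of the two lower bounds, earlier of the two upper bounds.
lo = start.where(start > month_start, month_start)
hi = end_excl.where(end_excl < next_month, next_month)

# A negative span means no overlap, so clip it to zero.
df1['Days in Month'] = (hi - lo).dt.days.clip(lower=0)
print(df1)
```

This avoids building a full date range per row, which matters when the start and end dates are years apart.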