For example I have a pandas dataframe of names and dates:
name date
0 Tom 2021-12-05
1 Sue 2021-11-22
2 Steve 2021-10-17
I'm trying to round each date up to the 25th (not the nearest) to look like this:
name date
0 Tom 2021-12-25
1 Sue 2021-11-25
2 Steve 2021-10-25
My most recent attempt looks like this:
df['date'] = df['date'].apply(lambda x: x['date'] + pd.to_timedelta(1, unit='d') if x['date'].dt.strftime('%d') != '25' else x['date'])
I think my issue stems from not being able to first check what the 'day' is and then add days until the 25th is reached. Any help would be greatly appreciated!
EDIT:
jezrael's answer solves the problem while also factoring for dates beyond the 25th and is concise.
I also found this to work:
from pandas import Timedelta
def next_date(input_date):
    # step forward one day at a time until the 25th is reached
    while input_date.day != 25:
        input_date = input_date + Timedelta("1 day")
    return input_date
df['date'] = pd.to_datetime(df['date'].dt.date)
df['next_start_ship'] = df['date'].map(lambda x: next_date(x))
If you need to set the day of every date to the 25th, use:
df['date'] = pd.to_datetime(df['date'].dt.strftime('%Y-%m-25'))
Or:
df['date'] = df['date'].apply(lambda x: x.replace(day=25))
print (df)
name date
0 Tom 2021-12-25
1 Sue 2021-11-25
2 Steve 2021-10-25
If you need to round dates up to the 25th, with dates past the 25th moving to the 25th of the following month, use:
print (df)
name date
0 Tom 2021-12-05
1 Sue 2021-11-30
2 Steve 2021-10-17
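import numpy as np
# dates on or before the 25th stay in their month; later ones roll to the next month
mask = df['date'].dt.day <= 25
s = pd.to_datetime(df['date'].dt.strftime('%Y-%m-25'))
df['date'] = np.where(mask, s, s + pd.DateOffset(months=1))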
print (df)
name date
0 Tom 2021-12-25
1 Sue 2021-12-25
2 Steve 2021-10-25
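A self-contained sketch of the same rounding, for anyone who wants to run it end to end. The helper name `round_up_to_25` is made up for illustration, and it assumes a date that already falls on the 25th should be left alone:

```python
import pandas as pd

def round_up_to_25(dates: pd.Series) -> pd.Series:
    """Move each date forward to the next 25th (hypothetical helper)."""
    # the 25th of each date's own month
    this_month = dates.dt.to_period('M').dt.to_timestamp() + pd.Timedelta(days=24)
    # the 25th of the following month, used for dates past the 25th
    next_month = this_month + pd.DateOffset(months=1)
    # keep this month's 25th where day <= 25, otherwise take next month's
    return this_month.where(dates.dt.day <= 25, next_month)

df = pd.DataFrame({
    'name': ['Tom', 'Sue', 'Steve'],
    'date': pd.to_datetime(['2021-12-05', '2021-11-30', '2021-10-25']),
})
df['date'] = round_up_to_25(df['date'])
print(df)
```

Here Sue's 2021-11-30 rolls over to December, while Steve's date, already on the 25th, is unchanged.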
I have the following dataframe in Python:
ID  country_ID  visit_time
0   ESP         10 days 12:03:00
0   ENG         5 days 10:02:00
1   ENG         3 days 08:05:03
1   ESP         1 days 03:02:00
1   ENG         2 days 07:01:03
1   ENG         3 days 01:00:52
2   ENG         0 days 12:01:02
2   ENG         1 days 22:10:03
2   ENG         0 days 20:00:50
For each ID, I want to get:
avg_visit_ESP and avg_visit_ENG columns.
Average time visit with country_ID = ESP for each ID.
Average time visit with country_ID = ENG for each ID.
ID  avg_visit_ESP     avg_visit_ENG
0   10 days 12:03:00  5 days 10:02:00
1   1 days 03:02:00   (8 days 16:06:58) / 3
2   NaT               (3 days 06:11:55) / 3
I don't know how to specify in groupby a double grouping, first by ID and then by country_ID. If you can help me I would appreciate it.
P.S.: visit_time is a timedelta column, so addition and division work on it without any apparent problem:
import pandas as pd
date1 = pd.to_datetime('2022-02-04 10:10:21', format='%Y-%m-%d %H:%M:%S')
date2 = pd.to_datetime('2022-02-05 20:15:41', format='%Y-%m-%d %H:%M:%S')
date3 = pd.to_datetime('2022-02-07 20:15:41', format='%Y-%m-%d %H:%M:%S')
sum1date = date2-date1
sum2date = date3-date2
sum3date = date3-date1
print((sum1date+sum2date+sum3date)/3)
(df.groupby(['ID', 'country_ID'])['visit_time']
.mean(numeric_only=False)
.unstack()
.add_prefix('avg_visit_')
)
should do the trick
>>> df = pd.read_clipboard(sep=r'\s\s+')
>>> df.columns = [s.strip() for s in df]
>>> df['visit_time'] = pd.to_timedelta(df['visit_time'])
>>> df.groupby(['ID', 'country_ID'])['visit_time'].mean(numeric_only=False).unstack().add_prefix('avg_visit_')
country_ID avg_visit_ENG avg_visit_ESP
ID
0 5 days 10:02:00 10 days 12:03:00
1 2 days 21:22:19.333333333 1 days 03:02:00
2 1 days 02:03:58.333333333 NaT
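Since the session above depends on the clipboard, here is a reproducible sketch of the same groupby/unstack chain. The frame is typed in by hand from the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [0, 0, 1, 1, 1, 1, 2, 2, 2],
    'country_ID': ['ESP', 'ENG', 'ENG', 'ESP', 'ENG', 'ENG', 'ENG', 'ENG', 'ENG'],
    'visit_time': pd.to_timedelta([
        '10 days 12:03:00', '5 days 10:02:00', '3 days 08:05:03',
        '1 days 03:02:00', '2 days 07:01:03', '3 days 01:00:52',
        '0 days 12:01:02', '1 days 22:10:03', '0 days 20:00:50',
    ]),
})

# group on both keys, average the timedeltas, then pivot country_ID into columns
avg = (df.groupby(['ID', 'country_ID'])['visit_time']
         .mean()
         .unstack()
         .add_prefix('avg_visit_'))
print(avg)
```

ID 2 has no ESP visits, so its avg_visit_ESP cell comes out as NaT after unstacking.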
I'm working with the following dataset:
Date
2016-01-04
2016-01-05
2016-01-06
2016-01-07
2016-01-08
and a list holidays = ['2016-01-01','2016-01-18'....'2017-11-23','2017-12-25']
Objective: Create a column indicating whether a particular date is within +- 7 days of any holiday present in the list.
Mock output:
Date        Within a week of Holiday
2016-01-04  1
2016-01-05  1
2016-01-06  1
2016-01-07  1
2016-01-08  0
I'm working with a lot of date records, so I'm looking for the quickest (most optimized) way to do this.
My Current Solution:
One way to do this quickly would be to build another list holding only the unique dates in my desired period (say, 2 years). I could then use two simple for loops to check whether each date is within +-7 days of a holiday; this wouldn't be computationally heavy because both lists are relatively small (730 unique dates and ~20 holidays).
Once I have that list of dates, all I have to do is check whether each value in my 'Date' column is in it. Are there any suggestions for doing this even quicker?
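The lookup idea described above might be sketched roughly like this. The holiday list is shortened to two sample entries, and the cut-off (strictly less than 7 days either side) is an assumption inferred from the mock output, where 2016-01-08 is flagged 0:

```python
import pandas as pd

holidays = pd.to_datetime(['2016-01-01', '2016-01-18'])  # shortened sample list
df = pd.DataFrame({'Date': pd.to_datetime(
    ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08'])})

# every calendar day within 6 days of some holiday, built once up front
window = pd.DatetimeIndex([
    day
    for h in holidays
    for day in pd.date_range(h - pd.Timedelta(days=6), h + pd.Timedelta(days=6))
]).unique()

# a single vectorized membership test against the precomputed days
df['Within a week of Holiday'] = df['Date'].isin(window).astype(int)
print(df)
```

The window holds at most 13 days per holiday, so the expensive part no longer grows with the number of records.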
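Turn holidays into a DataFrame, then use merge_asof with direction='nearest' and a tolerance of 6 days (the default direction, 'backward', only matches holidays on or before each date; both frames must be sorted by their key):
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
                       direction='nearest', tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})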
Complete Working Example:
import numpy as np
import pandas as pd
holidays = pd.DataFrame(pd.to_datetime(['2016-01-01', '2016-01-18']),
                        columns=['Holiday'])
df = pd.DataFrame({
    'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
             '2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
# direction='nearest' looks at holidays both before and after each date
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
                       direction='nearest', tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
print(new_df)
new_df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Or turn holidays into a NumPy datetime array, broadcast the subtraction across the 'Date' column, compare the absolute differences to 7 days, and check whether any holiday matches:
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df['Within a week of Holiday'] = (
abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
Complete Working Example:
import numpy as np
import pandas as pd
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df = pd.DataFrame({
'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
'2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
df['Within a week of Holiday'] = (
abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
print(df)
df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Write a function that steps through the days around a date, returns True if any of them is in holidays and False otherwise, then apply that function to the DataFrame:
import datetime
import pandas as pd
holidays = ['2016-01-01','2016-01-18','2017-11-23','2017-12-25']
def holiday_present(date):
    date = datetime.datetime.strptime(date, '%Y-%m-%d')
    # check offsets of -6..+6 days, i.e. strictly less than a week either side
    for i in range(-6, 7):
        datte = (date - datetime.timedelta(days=i)).strftime('%Y-%m-%d')
        if datte in holidays:
            return True
    return False
data = {
"Date":[
"2016-01-04",
"2016-01-05",
"2016-01-06",
"2016-01-07",
"2016-01-08"]
}
df= pd.DataFrame(data)
df["Within a week of Holiday"] = df["Date"].apply(holiday_present).astype(int)
Output:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Try this:
Sample:
import pandas as pd
df = pd.DataFrame({'Date': {0: '2016-01-04',
1: '2016-01-05',
2: '2016-01-06',
3: '2016-01-07',
4: '2016-01-08'}})
Code:
def get_date_range(holidays):
    h = [pd.to_datetime(x) for x in holidays]
    h = [pd.date_range(x - pd.DateOffset(6), x + pd.DateOffset(6)) for x in h]
    h = [x.strftime('%Y-%m-%d') for y in h for x in y]
    return h
df['Within a week of Holiday'] = df['Date'].isin(get_date_range(holidays))*1
Result:
Out[141]:
0 1
1 1
2 1
3 1
4 0
Name: Within a week of Holiday, dtype: int32
I have a dataframe as shown below:
df = pd.DataFrame({'person_id': [11,11,11,21,21],
'offset' :['-131 days','29 days','142 days','20 days','-200 days'],
'date_1': ['05/29/2017', '01/21/1997', '7/27/1989','01/01/2013','12/31/2016'],
'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999','01/01/2015','12/31/1991'],
'vis_date':['05/29/2018', '01/27/1994', '7/29/2011','01/01/2018','12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject based on that subject's offset.
Though my code works (credit - SO), I am looking for a more elegant approach. As you can see, I am repeating almost the same line three times.
df['offset_to_shift'] = pd.to_timedelta(df['offset'],unit='d')
#am trying to make the below lines elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to look like the one shown below.
Use DataFrame.add along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], 0).add_prefix('shifted_'))
OR, it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], 0).add_prefix('shifted_')], axis=1)
OR, we can also directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], 0)
Result:
# print(df)
person_id offset date_1 dis_date vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0 11 -131 days 2017-05-29 2017-05-29 2018-05-29 -131 days 2017-01-18 2017-01-18 2018-01-18
1 11 29 days 1997-01-21 1999-01-24 1994-01-27 29 days 1997-02-19 1999-02-22 1994-02-25
2 11 142 days 1989-07-27 1999-07-22 2011-07-29 142 days 1989-12-16 1999-12-11 2011-12-18
3 21 20 days 2013-01-01 2015-01-01 2018-01-01 20 days 2013-01-21 2015-01-21 2018-01-21
4 21 -200 days 2016-12-31 1991-12-31 2014-12-31 -200 days 2016-06-14 1991-06-14 2014-06-14
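For a version that runs on its own, here is a small sketch on two of the rows above. It uses a per-column `apply` instead of `DataFrame.add`, which is just an alternative spelling of the same shift and not the approach from the answer:

```python
import pandas as pd

df = pd.DataFrame({
    'person_id': [11, 21],
    'offset': ['-131 days', '20 days'],
    'date_1': pd.to_datetime(['2017-05-29', '2013-01-01']),
    'dis_date': pd.to_datetime(['2017-05-29', '2015-01-01']),
    'vis_date': pd.to_datetime(['2018-05-29', '2018-01-01']),
})

offset = pd.to_timedelta(df['offset'])
cols = ['date_1', 'dis_date', 'vis_date']

# add the row-aligned offset to each date column, then prefix the new names
shifted = df[cols].apply(lambda col: col + offset).add_prefix('shifted_')
df = df.join(shifted)
print(df)
```

Adding another date column then only means extending `cols`.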
Assume I have data for two policies, like below.
enroll lapse
A 2010/2/1 2013/1/2
B 2012/3/1 2013/1/4
I would like to count the number of policies that are ongoing at the beginning of each year. To do that, I want to expand the data like this:
enroll lapse year
A 2010/2/1 2013/1/2 2011/1/1
A 2010/2/1 2013/1/2 2012/1/1
A 2010/2/1 2013/1/2 2013/1/1
B 2012/3/1 2013/1/4 2013/1/1
and count these ongoing policies.
year num
2011 1
2012 1
2013 2
I guess I need to use the query method, but I couldn't figure it out.
You need:
#convert columns to datetimes
df['enroll'] = pd.to_datetime(df['enroll'])
df['lapse'] = pd.to_datetime(df['lapse'])
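Apply a function to each row to expand it, reshape the result to a Series, and join it back to the original df:
def f(x):
    b = x['lapse'].year - x['enroll'].year
    # 'YS' (year start) replaces the 'AS' alias, which is deprecated in recent pandas
    return pd.Series(pd.date_range(x['enroll'], periods=b, freq='YS'))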
s = df.apply(f, axis=1).stack().reset_index(level=1, drop=True).rename('year')
df = df.join(s)
print (df)
enroll lapse year
A 2010-02-01 2013-01-02 2011-01-01
A 2010-02-01 2013-01-02 2012-01-01
A 2010-02-01 2013-01-02 2013-01-01
B 2012-03-01 2013-01-04 2013-01-01
Another solution:
#create start year
df['year'] = df['enroll'] + pd.offsets.YearBegin(0)
#count repeating
a = df['lapse'].dt.year - df['enroll'].dt.year
df = df.loc[np.repeat(df.index, a)]
#add year offset
df['a'] = df.groupby(level=0).cumcount()
df["year"] = df.apply(lambda x: x["year"] + pd.offsets.DateOffset(years=x['a']), axis=1)
df = df.drop(columns='a')
print (df)
enroll lapse year
A 2010-02-01 2013-01-02 2011-01-01
A 2010-02-01 2013-01-02 2012-01-01
A 2010-02-01 2013-01-02 2013-01-01
B 2012-03-01 2013-01-04 2013-01-01
And last:
df1 = df.groupby(df['year'].dt.year).size().reset_index(name='num')
print (df1)
year num
0 2011 1
1 2012 1
2 2013 2
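On newer pandas, the same expand-then-count idea could also be sketched with `DataFrame.explode` (available from pandas 0.25). This is an alternative to the two expansions above, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(
    {'enroll': pd.to_datetime(['2010-02-01', '2012-03-01']),
     'lapse': pd.to_datetime(['2013-01-02', '2013-01-04'])},
    index=['A', 'B'],
)

# one list of January 1st dates per policy, covering enroll..lapse
df['year'] = [list(pd.date_range(start, end, freq='YS'))
              for start, end in zip(df['enroll'], df['lapse'])]

# one row per (policy, year start); explode leaves an object column behind
expanded = df.explode('year')
years = pd.to_datetime(expanded['year']).dt.year

# number of policies ongoing at the start of each year
counts = years.value_counts().sort_index()
print(counts)
```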
First, read your policy data line by line:
enroll lapse
A 2010/2/1 2013/1/2
B 2012/3/1 2013/1/4
Then pass each line to the function count. The result dictionary may be what you want.
If I have misunderstood your question, please let me know.
import datetime
result = {}
def count(start, end):
    # parse 'YYYY/M/D' strings into date objects
    start = [int(i) for i in start.split('/')]
    start = datetime.date(*start)
    end = [int(i) for i in end.split('/')]
    end = datetime.date(*end)
    delta = end - start
    new = start + datetime.timedelta(delta.days)
    # add one to every year whose January 1st falls inside the policy period
    for i in range(1, new.year - start.year + 1):
        result[start.year + i] = result.setdefault(start.year + i, 0) + 1
count('2010/2/1', '2013/1/2')
count('2012/3/1', '2013/1/4')
print(result)  # {2011: 1, 2012: 1, 2013: 2}
You can use pd.date_range:
start = pd.Timestamp(year=df['enroll'].dt.year.min() + 1, month=1, day=1)
end = pd.Timestamp(year=df['lapse'].dt.year.max(), month=12, day=31)
for year in pd.date_range(start=start, end=end, freq='YS'):
    print(year, ((df['enroll'] < year) & (df['lapse'] > year)).sum())
2011-01-01 00:00:00 1
2012-01-01 00:00:00 1
2013-01-01 00:00:00 2
data = {year.year: ((df['enroll'] < year) & (df['lapse'] > year)).sum() for year in pd.date_range(start=start, end=end, freq='YS')}
pd.Series(data)
2011 1
2012 1
2013 2
dtype: int64
I have a pandas dataframe with three columns. A start and end date and a month.
I would like to add a column for how many days within the month are between the two dates. I started doing something with apply, the calendar library and some math, but it started to get really complex. I bet pandas has a simple solution, but am struggling to find it.
Input:
import pandas as pd
df1 = pd.DataFrame(data=[['2017-01-01', '2017-06-01', '2016-01-01'],
['2015-03-02', '2016-02-10', '2016-02-01'],
['2011-01-02', '2018-02-10', '2016-03-01']],
columns=['start date', 'end date date', 'Month'])
Desired Output:
start date end date date Month Days in Month
0 2017-01-01 2017-06-01 2016-01-01 0
1 2015-03-02 2016-02-10 2016-02-01 10
2 2011-01-02 2018-02-10 2016-03-01 31
Here is one solution:
Build the list of dates between the start and end dates with pd.date_range, then count how many of them share the year and month of the target month.
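def overlap(x):
    md = pd.to_datetime(x['Month'])
    # (year, month) of every day between the start and end dates, inclusive
    cand = [(ad.year, ad.month) for ad in pd.date_range(x['start date'], x['end date date'])]
    return len([ym for ym in cand if ym == (md.year, md.month)])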
df1["Days in Month"]= df1.apply(overlap, axis=1)
You'll get:
start date end date date Month Days in Month
0 2017-01-01 2017-06-01 2016-01-01 0
1 2015-03-02 2016-02-10 2016-02-01 10
2 2011-01-02 2018-02-10 2016-03-01 31
You can convert your cells to datetime with
df = df1.apply(pd.to_datetime)
Then find intersection days with function
def intersectionDaysInMonth(start, end, month):
    # first day of the following month (also works for December)
    end_month = month + pd.offsets.MonthBegin(1)
    # overlap of [start, end] with the month, counting both end points
    first = max(start, month)
    last = min(end, end_month - pd.Timedelta(days=1))
    # no overlap gives a negative span, so clamp at zero
    return max(last - first + pd.Timedelta(days=1), pd.Timedelta(0))
Then apply
df['Days in Month'] = df.apply(lambda row: intersectionDaysInMonth(*row).days, axis=1)
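If the row-wise apply turns out to be slow, the same overlap can be computed column-wise. This is a sketch under the assumption that 'Month' always holds the first day of a month, and that both the start and end dates count as days inside the range:

```python
import pandas as pd

df1 = pd.DataFrame(
    data=[['2017-01-01', '2017-06-01', '2016-01-01'],
          ['2015-03-02', '2016-02-10', '2016-02-01'],
          ['2011-01-02', '2018-02-10', '2016-03-01']],
    columns=['start date', 'end date date', 'Month'],
).apply(pd.to_datetime)

month_start = df1['Month']
# last day of the month, derived from the number of days in that month
month_end = month_start + pd.to_timedelta(month_start.dt.days_in_month - 1, unit='D')

# later of (start, month start) and earlier of (end, month end)
first = df1['start date'].where(df1['start date'] > month_start, month_start)
last = df1['end date date'].where(df1['end date date'] < month_end, month_end)

# inclusive day count; a negative span means no overlap, so clip to 0
df1['Days in Month'] = ((last - first).dt.days + 1).clip(lower=0)
print(df1)
```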