I have this dataframe object:
Date
2018-12-14
2019-01-11
2019-01-25
2019-02-08
2019-02-22
2019-07-26
What I want, if it's possible, is to add, for example, 3 months to the dates, then 3 months to the new date (original date + 3 months), and repeat this x times. I am using pd.offsets.MonthOffset, but this adds the months only once and I need to apply it repeatedly.
I don't know if it is possible, but any help would be appreciated.
Thank you so much for taking the time.
The expected output (adding 1 month, 2 times) is:
[[2019-01-14, 2019-02-11, 2019-02-25, 2019-03-08, 2019-03-22, 2019-08-26],[2019-02-14, 2019-03-11, 2019-03-25, 2019-04-08, 2019-04-22, 2019-09-26]]
I believe you need a loop with f-strings for the new column names:
for i in range(1, 4):
    df[f'Date_added_{i}_months'] = df['Date'] + pd.offsets.MonthBegin(i)

print(df)
Date Date_added_1_months Date_added_2_months Date_added_3_months
0 2018-12-14 2019-01-01 2019-02-01 2019-03-01
1 2019-01-11 2019-02-01 2019-03-01 2019-04-01
2 2019-01-25 2019-02-01 2019-03-01 2019-04-01
3 2019-02-08 2019-03-01 2019-04-01 2019-05-01
4 2019-02-22 2019-03-01 2019-04-01 2019-05-01
5 2019-07-26 2019-08-01 2019-09-01 2019-10-01
Or:
for i in range(1, 4):
    df[f'Date_added_{i}_months'] = df['Date'] + pd.offsets.MonthOffset(i)

print(df)
Date Date_added_1_months Date_added_2_months Date_added_3_months
0 2018-12-14 2019-01-14 2019-02-14 2019-03-14
1 2019-01-11 2019-02-11 2019-03-11 2019-04-11
2 2019-01-25 2019-02-25 2019-03-25 2019-04-25
3 2019-02-08 2019-03-08 2019-04-08 2019-05-08
4 2019-02-22 2019-03-22 2019-04-22 2019-05-22
5 2019-07-26 2019-08-26 2019-09-26 2019-10-26
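Note that pd.offsets.MonthOffset has been removed from newer pandas releases. Below is a sketch of the same loop using pd.DateOffset, which, as far as I know, shifts by whole calendar months in the same way (clamping to the month end where needed); the sample frame is rebuilt from the question's dates:

import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(
    ['2018-12-14', '2019-01-11', '2019-01-25',
     '2019-02-08', '2019-02-22', '2019-07-26'])})

for i in range(1, 4):
    # shift each date forward by i calendar months
    df[f'Date_added_{i}_months'] = df['Date'] + pd.DateOffset(months=i)

print(df)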
I hope this helps
from dateutil.relativedelta import relativedelta

month_offset = [3, 6, 9]
for i in month_offset:
    df[f'Date_plus_{i}_months'] = df['Date'].map(lambda x: x + relativedelta(months=i))
If your dates are date objects, it should be pretty easy; note that datetime.timedelta has no month unit, so use dateutil's relativedelta(months=3) and add it to each date.
Alternatively, if your dates are strings, you can convert them to datetime objects with .strptime(), add the offset, and convert them back to strings with .strftime().
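A minimal sketch of that round trip, assuming '%Y-%m-%d' strings like those in the question:

from datetime import datetime
from dateutil.relativedelta import relativedelta

s = '2018-12-14'
d = datetime.strptime(s, '%Y-%m-%d')   # string -> datetime
d = d + relativedelta(months=3)        # add 3 calendar months
print(d.strftime('%Y-%m-%d'))          # back to a string: '2019-03-14'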
Related
I want the time without the date in Pandas.
I want to keep the time as dtype datetime64[ns] and not as an object so that I can determine periods between times.
The closest I have gotten is as follows, but it gives back the date in a new column, not the time as dtype datetime that I need.
df_pres_mf['time'] = pd.to_datetime(df_pres_mf['time'], format ='%H:%M', errors = 'coerce') # returns date (1900-01-01) and actual time as a dtype datetime64[ns] format
df_pres_mf['just_time'] = df_pres_mf['time'].dt.date
df_pres_mf['normalised_time'] = df_pres_mf['time'].dt.normalize()
df_pres_mf.head()
This returns the date as 1900-01-01, not the time that is needed.
Edit: Data
time
1900-01-01 11:16:00
1900-01-01 15:20:00
1900-01-01 09:55:00
1900-01-01 12:01:00
You could do it as Vishnudev suggested, but then you would have dtype: object (or even strings, after using dt.strftime), which you said you didn't want.
What you are looking for doesn't exist in pandas, but the closest thing I can get you is converting to timedeltas, which won't seem like a solution at first but is actually very useful.
Convert it like this:
# sample df
df
>>
time
0 2021-02-07 09:22:00
1 2021-05-10 19:45:00
2 2021-01-14 06:53:00
3 2021-05-27 13:42:00
4 2021-01-18 17:28:00
df["timed"] = df.time - df.time.dt.normalize()
df
>>
time timed
0 2021-02-07 09:22:00 0 days 09:22:00 # this is just the time difference
1 2021-05-10 19:45:00 0 days 19:45:00 # since midnight, which is essentially the
2 2021-01-14 06:53:00 0 days 06:53:00 # same thing as regular time, except
3 2021-05-27 13:42:00 0 days 13:42:00 # that you can go over 24 hours
4 2021-01-18 17:28:00 0 days 17:28:00
This allows you to calculate periods between times like this:
# subtract the last time from the current
df["difference"] = df.timed - df.timed.shift()
df
>>
time timed difference
0 2021-02-07 09:22:00 0 days 09:22:00 NaT
1 2021-05-10 19:45:00 0 days 19:45:00 0 days 10:23:00
2 2021-01-14 06:53:00 0 days 06:53:00 -1 days +11:08:00 # <-- this is because the last
3 2021-05-27 13:42:00 0 days 13:42:00 0 days 06:49:00 # time was later than the current
4 2021-01-18 17:28:00 0 days 17:28:00 0 days 03:46:00 # (see below)
To get rid of odd differences, make them absolute:
df["abs_difference"] = df.difference.abs()
df
>>
time timed difference abs_difference
0 2021-02-07 09:22:00 0 days 09:22:00 NaT NaT
1 2021-05-10 19:45:00 0 days 19:45:00 0 days 10:23:00 0 days 10:23:00
2 2021-01-14 06:53:00 0 days 06:53:00 -1 days +11:08:00 0 days 12:52:00 ### <<--
3 2021-05-27 13:42:00 0 days 13:42:00 0 days 06:49:00 0 days 06:49:00
4 2021-01-18 17:28:00 0 days 17:28:00 0 days 03:46:00 0 days 03:46:00
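If you later want a clock-style string from those timedeltas for display, a minimal sketch using the .dt.components accessor (the display column name is made up here; the result is dtype object, so keep it for presentation only):

comp = df['timed'].dt.components       # integer columns: days, hours, minutes, ...
df['display'] = (comp.hours.astype(str).str.zfill(2)
                 + ':' + comp.minutes.astype(str).str.zfill(2))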
Use the proper format for your dates and convert the column to datetime:
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
Then format as preferred:
df['time'].dt.strftime('%H:%M')
Output
0 11:16
1 15:20
2 09:55
3 12:01
Name: time, dtype: object
I am trying to format dates in Python using Pandas. Basically, I want to identify all the date columns, convert them to YYYY-MM-DD format, overwrite, and save.
Input:

ID  NPC_code  Date1       Date2     Date3      Date4
1   10001     10-01-2020  11012019  27-Jan-18  27Jan2016
2   10002     11-01-2020  11012020  28-Jan-18  27Jan2017
3   10003     12-01-2020  11012021  29-Jan-18  27Jan2018
4   10004     13-01-2020  11012022  30-Jan-18  27Jan2019
5   10005     14-01-2020  11012023  31-Jan-18  27Jan2020
Output:

ID  NPC_code  Date1       Date2       Date3       Date4
1   10001     2020-01-10  2019-01-11  2018-01-27  2016-01-27
2   10002     2020-01-11  2020-01-11  2018-01-28  2017-01-27
3   10003     2020-01-12  2021-01-11  2018-01-29  2018-01-27
4   10004     2020-01-13  2022-01-11  2018-01-30  2019-01-27
5   10005     2020-01-14  2023-01-11  2018-01-31  2020-01-27
If your column is a string, you will need to first use pd.to_datetime:
df['Date'] = pd.to_datetime(df['Date'])
Then use the .dt datetime accessor with strftime:

df = pd.DataFrame({'Date': pd.date_range('2017-01-01', periods=60, freq='D')})
df.Date.dt.strftime('%Y-%m-%d')

Or use a lambda function:

df.Date.apply(lambda x: x.strftime('%Y-%m-%d'))
Refer to: strftime-and-strptime-behavior
Use pd.to_datetime with an explicit format for each column:
df['Date1'] = pd.to_datetime(df.Date1, format='%d-%m-%Y')  # e.g. 10-01-2020
pd.to_datetime(df.Date2, format='%d%m%Y')                  # e.g. 11012019
pd.to_datetime(df.Date3, format='%d-%b-%y')                # e.g. 27-Jan-18
pd.to_datetime(df.Date4, format='%d%b%Y')                  # e.g. 27Jan2016
Convert to string format:
pd.to_datetime(df.Date1, format='%d-%m-%Y').dt.strftime('%Y-%m-%d')
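To "identify, overwrite and save" as the question asks, a sketch that applies a per-column format map and writes the file back; the file name input.csv and the column-to-format mapping are assumptions, not given in the question:

import pandas as pd

df = pd.read_csv('input.csv')          # hypothetical file name
date_formats = {'Date1': '%d-%m-%Y', 'Date2': '%d%m%Y',
                'Date3': '%d-%b-%y', 'Date4': '%d%b%Y'}
for col, fmt in date_formats.items():
    # astype(str) guards against columns like Date2 being read in as integers
    df[col] = pd.to_datetime(df[col].astype(str), format=fmt).dt.strftime('%Y-%m-%d')
df.to_csv('input.csv', index=False)    # overwrite in place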
I am dealing with financial data which I need to extrapolate for different months. Here is my dataframe:
invoice_id,date_from,date_to
30492,2019-02-04,2019-09-18
I want to break this up into the different months between date_from and date_to. Hence I need to add a row for each month, running from the month's start date to its end date. The final output should look like:
invoice_id,date_from,date_to
30492,2019-02-04,2019-02-28
30492,2019-03-01,2019-03-31
30492,2019-04-01,2019-04-30
30492,2019-05-01,2019-05-31
30492,2019-06-01,2019-06-30
30492,2019-07-01,2019-07-31
30492,2019-08-01,2019-08-31
30492,2019-09-01,2019-09-18
It needs to take care of the leap year scenario as well. Is there any native method available in the pandas datetime functionality that I can use to achieve the desired output?
Use:
print (df)
invoice_id date_from date_to
0 30492 2019-02-04 2019-09-18
1 30493 2019-01-20 2019-03-10
import numpy as np

#added months between date_from and date_to
df1 = pd.concat([pd.Series(r.invoice_id, pd.date_range(r.date_from, r.date_to, freq='MS'))
                 for r in df.itertuples()]).reset_index()
df1.columns = ['date_from','invoice_id']

#added starts of months - sorting for correct positions
df2 = (pd.concat([df[['invoice_id','date_from']], df1], sort=False, ignore_index=True)
         .sort_values(['invoice_id','date_from'])
         .reset_index(drop=True))
#added MonthEnd and date_to to last rows
mask = df2['invoice_id'].duplicated(keep='last')
s = df2['invoice_id'].map(df.set_index('invoice_id')['date_to'])
df2['date_to'] = np.where(mask, df2['date_from'] + pd.offsets.MonthEnd(), s)
print (df2)
invoice_id date_from date_to
0 30492 2019-02-04 2019-02-28
1 30492 2019-03-01 2019-03-31
2 30492 2019-04-01 2019-04-30
3 30492 2019-05-01 2019-05-31
4 30492 2019-06-01 2019-06-30
5 30492 2019-07-01 2019-07-31
6 30492 2019-08-01 2019-08-31
7 30492 2019-09-01 2019-09-18
8 30493 2019-01-20 2019-01-31
9 30493 2019-02-01 2019-02-28
10 30493 2019-03-01 2019-03-10
You can use pandas.date_range with a start and end date, in combination with freq='MS' (the beginning of each month) and freq='M' (the end of each month):
x = pd.date_range(start=df.iloc[0]['date_from'], end=df.iloc[0]['date_to'], freq='MS')
y = pd.date_range(start=df.iloc[0]['date_from'], end=df.iloc[0]['date_to'], freq='M')
df_new = pd.DataFrame({'date_from': x,
                       'date_to': y})
df_new['invoice_id'] = df.iloc[0]['invoice_id']
print(df_new)
date_from date_to invoice_id
0 2019-03-01 2019-02-28 30492
1 2019-04-01 2019-03-31 30492
2 2019-05-01 2019-04-30 30492
3 2019-06-01 2019-05-31 30492
4 2019-07-01 2019-06-30 30492
5 2019-08-01 2019-07-31 30492
6 2019-09-01 2019-08-31 30492
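Note that this output pairs each month's start with the previous month's end (row 0 has date_to before date_from) and drops the partial first and last periods. A sketch of one way to realign the pairs and patch in the edges, assuming date_from/date_to are already datetimes:

start = df.iloc[0]['date_from']
end = df.iloc[0]['date_to']
# prepend the original start to the month starts and append the original end
# to the month ends, so each row runs from a start to its own month's end
df_new = pd.DataFrame({'date_from': [start] + list(x),
                       'date_to': list(y) + [end],
                       'invoice_id': df.iloc[0]['invoice_id']})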
Another way, using the resample method of a datetime index:
# melt, so we have start and end dates in 1 column
df = pd.melt(df, id_vars='invoice_id')
# now set the date column as index
df.set_index('value', inplace=True)
# resample to daily level
df = df.resample('D').ffill().reset_index()
# get the yr-month value of each daily row
df['yr_month'] = df['value'].dt.strftime("%Y-%m")
# Now group by month and take min/max day values
output = (df.groupby(['invoice_id', 'yr_month'])['value']
            .agg(date_from='min', date_to='max')   # named aggregation
            .reset_index()
            .drop(labels='yr_month', axis=1))
print(output)
invoice_id date_from date_to
0 30492 2019-02-04 2019-02-28
1 30492 2019-03-01 2019-03-31
2 30492 2019-04-01 2019-04-30
3 30492 2019-05-01 2019-05-31
4 30492 2019-06-01 2019-06-30
5 30492 2019-07-01 2019-07-31
6 30492 2019-08-01 2019-08-31
7 30492 2019-09-01 2019-09-18
I have a DataFrame with dates as indices:
VL
2018-02-05 101.56093
2018-12-31 95.87728
2019-01-04 96.29820
2019-01-11 97.23475
2019-01-18 98.39828
2019-01-25 98.66896
2019-01-31 99.12407
2019-02-01 99.13224
2019-02-08 99.06382
2019-02-15 99.79966
I need to filter the rows so that, for each row with date D, keep it if the row with D-7 exists in the DataFrame.
Example:
2019-02-15 would remain, because 2019-02-08 is present
2019-01-31 would be filtered as 2019-01-24 is not present.
I've implemented this already using a loop, but I'm wondering if there is a more pandas-oriented way of doing this kind of filtering.
IIUC, you can use pd.Timedelta and isin:
df[(df['date'] - pd.Timedelta(days=7)).isin(df['date'])]
Output:
date VL
3 2019-01-11 97.23475
4 2019-01-18 98.39828
5 2019-01-25 98.66896
7 2019-02-01 99.13224
8 2019-02-08 99.06382
9 2019-02-15 99.79966
If date is in the index use this:
df[(df.index - pd.Timedelta(days=7)).isin(df.index)]
Output:
VL
date
2019-01-11 97.23475
2019-01-18 98.39828
2019-01-25 98.66896
2019-02-01 99.13224
2019-02-08 99.06382
2019-02-15 99.79966
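The same pattern generalizes to any offset; for example, a sketch that keeps rows whose date 5 business days earlier is also present, using pandas' business-day offset:

df[(df.index - pd.offsets.BDay(5)).isin(df.index)]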
I'm having difficulty figuring out a way to count the occurrences of holidays between datetime ranges in a dataframe. The holidays are in a list while the datetime ranges are in the dataframe as shown below: (note that this is a subset of a very large data set)
df = pd.DataFrame({'Date': ['2018-12-19 18:47','2019-01-01 06:11','2019-01-12 10:05','2019-02-17 14:22','2019-03-08 16:17','2019-03-25 17:35','2019-02-14 17:35'],
                   'End Date': ['2018-12-28 18:47','2019-01-05 06:11','2019-01-16 10:05','2019-02-19 14:22','2019-03-12 16:17','2019-03-26 17:35','2019-05-27 17:35']})
df['Date'] = pd.to_datetime(df['Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
from datetime import date

Holidays = [date(2018,12,24), date(2018,12,25), date(2019,1,1), date(2019,1,21), date(2019,2,18), date(2019,3,8), date(2019,5,27)]
I've been able to find a way to determine whether or not a holiday falls within each datetime range, but not to get an actual count.
Is there a way to alter the code below to gather the count rather than boolean values?
This is what I've tried so far:
df['Holidays'] = [any([(z>=x)&(z<=y) for z in Holidays]) for x , y in zip(df['Date'].dt.date,df['End Date'].dt.date)]
The result I'm looking for is as follows:
result = pd.DataFrame({'Date': ['2018-12-19 18:47','2019-01-01 06:11','2019-01-12 10:05','2019-02-17 14:22','2019-03-08 16:17','2019-03-25 17:35','2019-02-14 17:35'],
                       'End Date': ['2018-12-28 18:47','2019-01-05 06:11','2019-01-16 10:05','2019-02-19 14:22','2019-03-12 16:17','2019-03-26 17:35','2019-05-27 17:35'],
                       'Holidays': [2,1,0,1,1,0,3]})
We can make a function that checks this condition and then apply it row-wise.
def fn(series):
    return sum([series.iloc[0] <= h <= series.iloc[1] for h in Holidays])

df.assign(Holidays=df.apply(fn, axis=1))
Date End Date Holidays
0 2018-12-19 18:47:00 2018-12-28 18:47:00 2
1 2019-01-01 06:11:00 2019-01-05 06:11:00 0
2 2019-01-12 10:05:00 2019-01-16 10:05:00 0
3 2019-02-17 14:22:00 2019-02-19 14:22:00 1
4 2019-03-08 16:17:00 2019-03-12 16:17:00 0
5 2019-03-25 17:35:00 2019-03-26 17:35:00 0
6 2019-02-14 17:35:00 2019-05-27 17:35:00 3
Your desired output is incorrect because the entries in the Holidays list carry no time of day (they sit at midnight), while your range bounds do. To get the output that you posted, we have to round the bounds down to the day.
def fn(series):
    return sum([series.iloc[0].floor('d') <= h <= series.iloc[1].floor('d') for h in Holidays])

df.assign(Holidays=df.apply(fn, axis=1))
Date End Date Holidays
0 2018-12-19 18:47 2018-12-28 18:47 2
1 2019-01-01 06:11 2019-01-05 06:11 1
2 2019-01-12 10:05 2019-01-16 10:05 0
3 2019-02-17 14:22 2019-02-19 14:22 1
4 2019-03-08 16:17 2019-03-12 16:17 1
5 2019-03-25 17:35 2019-03-26 17:35 0
6 2019-02-14 17:35 2019-05-27 17:35 3
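If the DataFrame is very large, the row-wise apply can get slow. A hedged vectorized alternative that broadcasts the holidays against the floored range bounds with NumPy:

import numpy as np

hol = pd.to_datetime(Holidays).to_numpy()     # holidays as datetime64
start = df['Date'].dt.floor('d').to_numpy()   # round bounds down to whole days
end = df['End Date'].dt.floor('d').to_numpy()
# one row per range, one column per holiday; sum the matches per row
df['Holidays'] = ((hol[None, :] >= start[:, None])
                  & (hol[None, :] <= end[:, None])).sum(axis=1)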