Adding a year to a period? - python

I have a column which I have converted to datetime:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
And I need to add 1 year to this time based on a conditional:
df.loc[df['quarter'] == "Q4_", 'date'] + pd.offsets.DateOffset(years=1)
but it's not working; the dates come back unchanged:
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
I have tried converting it to period since I only need the year to be used in a concatenation later:
df['year'] = df['date'].dt.to_period('Y')
but I cannot add any number to a period.

This appears to be working for me:
import pandas as pd
df = pd.DataFrame({'date':pd.date_range('1/1/2021', periods=50, freq='M')})
print(df.head(24))
Input:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2021-10-31
10 2021-11-30
11 2021-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2022-10-31
22 2022-11-30
23 2022-12-31
Add a year:
df.loc[df['date'].dt.quarter == 4, 'date'] += pd.offsets.DateOffset(years=1)
print(df.head(24))
Note that your original expression computed the shifted dates but never assigned them back; using += (as above) writes the result into the DataFrame. Also, per your logic, the year increases on the October-December (Q4) rows.
Output:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2022-10-31
10 2022-11-30
11 2022-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2023-10-31
22 2023-11-30
23 2023-12-31
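On the period idea: a yearly period does accept integer addition in reasonably recent pandas, so if you only need the (conditionally shifted) year for the concatenation later, a sketch along these lines should also work, assuming the quarter column from your question:
df['year'] = df['date'].dt.to_period('Y')
# Adding an integer to a yearly period shifts it by that many years:
df.loc[df['quarter'] == "Q4_", 'year'] += 1
df['year'].astype(str) then gives plain year strings for the concatenation.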

Related

Get rolling average without every timestamp

I have data about how many messages each account sends aggregated to an hourly level. For each row, I would like to add a column with the sum of the previous 7 days messages. I know I can groupby account and date and aggregate the number of messages to the daily level, but I'm having a hard time calculating the rolling average because there isn't a row in the data if the account didn't send any messages that day (and I'd like to not balloon my data by adding these in if at all possible). If I could figure out a way to calculate the rolling 7-day average for each day that each account sent messages, I could then re-join that number back to the hourly data (is my hope). Any suggestions?
Note: For any day not in the data, assume 0 messages sent.
Raw Data:
Account | Messages | Date | Hour
12 5 2022-07-11 09:00:00
12 6 2022-07-13 10:00:00
12 10 2022-07-13 11:00:00
12 9 2022-07-15 16:00:00
12 1 2022-07-19 13:00:00
15 2 2022-07-12 10:00:00
15 13 2022-07-13 11:00:00
15 3 2022-07-17 16:00:00
15 4 2022-07-22 13:00:00
Desired Output:
Account | Messages | Date | Hour | Rolling Previous 7 Day Average
12 5 2022-07-11 09:00:00 0
12 6 2022-07-13 10:00:00 0.714
12 10 2022-07-13 11:00:00 0.714
12 9 2022-07-15 16:00:00 3
12 1 2022-07-19 13:00:00 3.571
15 2 2022-07-12 10:00:00 0
15 13 2022-07-13 11:00:00 0.286
15 3 2022-07-17 16:00:00 2.143
15 4 2022-07-22 13:00:00 0.429
I hope I've understood your question right:
df["Date"] = pd.to_datetime(df["Date"])
df["Messages_tmp"] = df.groupby(["Account", "Date"])["Messages"].transform(
"sum"
)
df["Rolling Previous 7 Day Average"] = (
df.set_index("Date")
.groupby("Account")["Messages_tmp"]
.rolling("7D")
.apply(lambda x: x.loc[~x.index.duplicated()].shift().sum() / 7)
).values
df = df.drop(columns="Messages_tmp")
print(df)
Prints:
Account Messages Date Hour Rolling Previous 7 Day Average
0 12 5 2022-07-11 09:00:00 0.000000
1 12 6 2022-07-13 10:00:00 0.714286
2 12 10 2022-07-13 11:00:00 0.714286
3 12 9 2022-07-15 16:00:00 3.000000
4 12 1 2022-07-19 13:00:00 3.571429
5 15 2 2022-07-12 10:00:00 0.000000
6 15 13 2022-07-13 11:00:00 0.285714
7 15 3 2022-07-17 16:00:00 2.142857
8 15 4 2022-07-22 13:00:00 0.428571
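If you prefer the resample-and-rejoin route you described, here is a sketch under the same column names, starting again from the raw hourly frame; it does temporarily materialise one row per account per day:
daily = (
    df.groupby("Account")
      .resample("D", on="Date")["Messages"]
      .sum()  # days with no rows become 0
)
prev7 = (
    daily.groupby(level="Account")
         .transform(lambda s: s.rolling(7, min_periods=1).sum().shift(1))
         .fillna(0)
         .div(7)
         .rename("Rolling Previous 7 Day Average")
)
df = df.merge(prev7.reset_index(), on=["Account", "Date"], how="left")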

Facing problem in converting into datetime

df_non_holidays=pd.read_csv("holidays.csv")
print(df_non_holidays["date"])
0 25-06-21
1 28-06-21
2 29-06-21
3 30-06-21
4 01-07-21
5 02-07-21
6 05-07-21
7 06-07-21
8 07-07-21
9 08-07-21
10 09-07-21
11 12-07-21
12 13-07-21
13 14-07-21
14 15-07-21
15 16-07-21
16 19-07-21
17 20-07-21
18 22-07-21
19 23-07-21
20 26-07-21
21 27-07-21
22 28-07-21
23 29-07-21
24 30-07-21
Name: date, dtype: object
df_non_holidays["date"]= pd.to_datetime(df_non_holidays["date"])
0 2021-06-25
1 2021-06-28
2 2021-06-29
3 2021-06-30
4 2021-01-07
5 2021-02-07
6 2021-05-07
7 2021-06-07
8 2021-07-07
9 2021-08-07
10 2021-09-07
11 2021-12-07
12 2021-07-13
13 2021-07-14
14 2021-07-15
15 2021-07-16
16 2021-07-19
17 2021-07-20
18 2021-07-22
19 2021-07-23
20 2021-07-26
21 2021-07-27
22 2021-07-28
23 2021-07-29
24 2021-07-30
Name: date, dtype: datetime64[ns]
After converting, from index 4 onward the day and month are swapped.
Is anything wrong with my approach to converting object into datetime?
Please guide me.
You need to pass the format string, or you can pass dayfirst=True if you don't want to pass the format. Passing dayfirst may not always work for every kind of datetime string, but passing the right format always will.
>>> pd.to_datetime(df_non_holidays["date"], format='%d-%m-%y')
0 2021-06-25
1 2021-06-28
2 2021-06-29
3 2021-06-30
4 2021-07-01
5 2021-07-02
6 2021-07-05
7 2021-07-06
8 2021-07-07
9 2021-07-08
10 2021-07-09
11 2021-07-12
12 2021-07-13
13 2021-07-14
14 2021-07-15
15 2021-07-16
16 2021-07-19
17 2021-07-20
18 2021-07-22
19 2021-07-23
20 2021-07-26
21 2021-07-27
22 2021-07-28
23 2021-07-29
24 2021-07-30
Name: date, dtype: datetime64[ns]
Per the documentation, you can use one of the following parameters:
dayfirst, to indicate that the first value is the day and not the month
df_non_holidays["date"] = pd.to_datetime(df_non_holidays["date"], dayfirst=True)
format, to state explicitly that the format is %d-%m-%y
df_non_holidays["date"] = pd.to_datetime(df_non_holidays["date"], format='%d-%m-%y')
That gives the proper conversion
# from
date
0 25-06-21
1 01-07-21
# to
date
0 2021-06-25
1 2021-07-01
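If you want to catch silent day/month swaps like this early, one option (just a sketch, not from the answers above) is to round-trip the default parse back into the original format and compare:
parsed = pd.to_datetime(df_non_holidays["date"])  # default parser, guesses the format
# A mismatch after re-formatting means day and month were swapped somewhere:
if not parsed.dt.strftime('%d-%m-%y').equals(df_non_holidays["date"]):
    print("ambiguous dates detected; pass format= or dayfirst=")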

idxmax() returns index instead of timestamp

a python beginner here,
I am trying to get the highest price of a particular stock per month, and what date the maximum value occurred.
Getting the maximum value per month is okay using max(),
but when I try to get the corresponding dates of the max price using idxmax(), my code returns the index instead of the date. My code looks like this:
Max_Date = Daily_High.groupby(pd.Grouper(key="Date", freq="M")).High.idxmax()
Output
Date High
0 2020-04-30 9929
1 2020-05-31 9946
2 2020-06-30 9966
3 2020-07-31 9993
4 2020-08-31 10014
5 2020-09-30 10016
6 2020-10-31 10044
7 2020-11-30 10063
8 2020-12-31 10097
9 2021-01-31 10114
10 2021-02-28 10125
11 2021-03-31 10139
12 2021-04-30 10180
13 2021-05-31 10182
Output should be like this:
Date High Max Date
0 2020-04-30 2020-04-30
1 2020-05-31 2020-05-26
2 2020-06-30 2020-06-23
3 2020-07-31 2020-07-31
4 2020-08-31 2020-08-31
5 2020-09-30 2020-09-02
6 2020-10-31 2020-10-13
7 2020-11-30 2020-11-09
8 2020-12-31 2020-12-29
9 2021-01-31 2021-01-25
10 2021-02-28 2021-02-09
11 2021-03-31 2021-03-02
12 2021-04-30 2021-04-29
13 2021-05-31 2021-05-03
Hope you can help me to get the correct date. Thank you!
Create a DatetimeIndex and remove key="Date" from pd.Grouper. idxmax() returns the label of the row where the maximum occurs; with the default RangeIndex that label is just an integer position, but once the dates are the index, the label is the date itself:
Max_Date = Daily_High.set_index('Date').groupby(pd.Grouper(freq="M")).High.idxmax()
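To get both the monthly maximum and the date it occurred on in one frame, like your desired output, a sketch using named aggregation (column names assumed from your example):
out = (
    Daily_High.set_index('Date')
              .groupby(pd.Grouper(freq="M"))['High']  # recent pandas may prefer freq="ME"
              .agg(High='max', **{'Max Date': 'idxmax'})
              .reset_index()
)
print(out)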

How to prevent the .diff() function from producing ridiculous values when applied to a dataframe of datetimes and NaT values in Pandas?

I have got a dataframe loc_df where all the values are datetime and some of them are NaT. This is what loc_df looks like:
loc_df = pd.DataFrame({'10101':['2020-01-03','2019-11-06','2019-10-09','2019-09-26','2019-09-19','2019-08-19','2019-08-08','2019-07-05','2019-07-04','2019-06-27','2019-05-21','2019-04-21','2019-04-15','2019-04-06','2019-03-28','2019-02-28'], '10102':['2020-01-03','2019-11-15','2019-11-11','2019-10-23','2019-10-10','2019-10-06','2019-09-26','2019-07-14','2019-05-21','2019-03-15','2019-03-11','2019-02-27','2019-02-25',None,None,None], '10103':['2019-08-27','2019-07-14','2019-06-24','2019-05-21','2019-04-11','2019-03-06','2019-02-11',None,None,None,None,None,None,None,None,None]})
loc_df = loc_df.apply(pd.to_datetime)
print(loc_df)
10101 10102 10103
0 2020-01-03 2020-01-03 2019-08-27
1 2019-11-06 2019-11-15 2019-07-14
2 2019-10-09 2019-11-11 2019-06-24
3 2019-09-26 2019-10-23 2019-05-21
4 2019-09-19 2019-10-10 2019-04-11
5 2019-08-19 2019-10-06 2019-03-06
6 2019-08-08 2019-09-26 2019-02-11
7 2019-07-05 2019-07-14 NaT
8 2019-07-04 2019-05-21 NaT
9 2019-06-27 2019-03-15 NaT
10 2019-05-21 2019-03-11 NaT
11 2019-04-21 2019-02-27 NaT
12 2019-04-15 2019-02-25 NaT
13 2019-04-06 NaT NaT
14 2019-03-28 NaT NaT
15 2019-02-28 NaT NaT
I want to know the days between the dates in each column, so I have used:
loc_df = loc_df.diff(periods = -1)
The result was:
print(loc_df)
10101 10102 10103
0 58 days 49 days 00:00:00 44 days 00:00:00
1 28 days 4 days 00:00:00 20 days 00:00:00
2 13 days 19 days 00:00:00 34 days 00:00:00
3 7 days 13 days 00:00:00 40 days 00:00:00
4 31 days 4 days 00:00:00 36 days 00:00:00
5 11 days 10 days 00:00:00 23 days 00:00:00
6 34 days 74 days 00:00:00 -88814 days +00:12:43.145224
7 1 days 54 days 00:00:00 0 days 00:00:00
8 7 days 67 days 00:00:00 0 days 00:00:00
9 37 days 4 days 00:00:00 0 days 00:00:00
10 30 days 12 days 00:00:00 0 days 00:00:00
11 6 days 2 days 00:00:00 0 days 00:00:00
12 9 days -88800 days +00:12:43.145224 0 days 00:00:00
13 9 days 0 days 00:00:00 0 days 00:00:00
14 28 days 0 days 00:00:00 0 days 00:00:00
15 NaT NaT NaT
Do you know why I get these huge values near the end of each column? I guess it has something to do with subtracting a NaT from a datetime.
Is there an alternative to my code that prevents this?
Thanks in advance
If you have some initial data:
print(loc_df)
10101 10102 10103
0 2020-01-03 2020-01-03 2019-08-27
1 2019-11-06 2019-11-15 2019-07-14
2 2019-10-09 2019-11-11 2019-06-24
3 2019-09-26 2019-10-23 2019-05-21
4 2019-09-19 2019-10-10 2019-04-11
5 2019-08-19 2019-10-06 2019-03-06
6 2019-08-08 2019-09-26 2019-02-11
7 2019-07-05 2019-07-14 NaT
8 2019-07-04 2019-05-21 NaT
9 2019-06-27 2019-03-15 NaT
10 2019-05-21 2019-03-11 NaT
11 2019-04-21 2019-02-27 NaT
12 2019-04-15 2019-02-25 NaT
13 2019-04-06 NaT NaT
14 2019-03-28 NaT NaT
15 2019-02-28 NaT NaT
You could use DataFrame.ffill to fill in the NaT values before you use diff():
loc_df = loc_df.ffill()
loc_df = loc_df.diff(periods=-1)
print(loc_df)
10101 10102 10103
0 58 days 49 days 44 days
1 28 days 4 days 20 days
2 13 days 19 days 34 days
3 7 days 13 days 40 days
4 31 days 4 days 36 days
5 11 days 10 days 23 days
6 34 days 74 days 0 days
7 1 days 54 days 0 days
8 7 days 67 days 0 days
9 37 days 4 days 0 days
10 30 days 12 days 0 days
11 6 days 2 days 0 days
12 9 days 0 days 0 days
13 9 days 0 days 0 days
14 28 days 0 days 0 days
15 NaT NaT NaT
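An alternative sketch (not part of the answer above) that keeps the trailing NaT rows as NaT instead of turning them into 0-day gaps is to diff only the valid dates in each column:
# dropna() keeps the original index, so the per-column results realign
# and the rows that were NaT stay missing:
diffs = loc_df.apply(lambda col: col.dropna().diff(periods=-1))
print(diffs)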

Iterating through datetime64 columns in pandas dataframe [duplicate]

I have two columns in a Pandas data frame that are dates.
I am looking to subtract one column from the other, with the result being the difference in number of days as an integer.
A peek at the data:
df_test.head(10)
Out[20]:
First_Date Second Date
0 2016-02-09 2015-11-19
1 2016-01-06 2015-11-30
2 NaT 2015-12-04
3 2016-01-06 2015-12-08
4 NaT 2015-12-09
5 2016-01-07 2015-12-11
6 NaT 2015-12-12
7 NaT 2015-12-14
8 2016-01-06 2015-12-14
9 NaT 2015-12-15
I have created a new column successfully with the difference:
df_test['Difference'] = df_test['First_Date'].sub(df_test['Second Date'], axis=0)
df_test.head()
Out[22]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82 days
1 2016-01-06 2015-11-30 37 days
2 NaT 2015-12-04 NaT
3 2016-01-06 2015-12-08 29 days
4 NaT 2015-12-09 NaT
However I am unable to get a numeric version of the result:
df_test['Difference'] = df_test[['Difference']].apply(pd.to_numeric)
df_test.head()
Out[25]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 7.084800e+15
1 2016-01-06 2015-11-30 3.196800e+15
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 2.505600e+15
4 NaT 2015-12-09 NaN
How about:
df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days
This will return the difference as int if there are no missing values (NaT), and float if there are.
pandas has rich documentation on Time series / date functionality and Time deltas.
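If you want an integer dtype even when NaT rows are present, pandas' nullable Int64 can hold the missing values (a sketch):
# .dt.days yields float with NaN for NaT; Int64 keeps them as <NA>:
df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days.astype('Int64')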
You can divide a column of dtype timedelta by np.timedelta64(1, 'D'), but the output is float, not int, because of the NaN values:
df_test['Difference'] = df_test['Difference'] / np.timedelta64(1, 'D')
print (df_test)
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82.0
1 2016-01-06 2015-11-30 37.0
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 29.0
4 NaT 2015-12-09 NaN
5 2016-01-07 2015-12-11 27.0
6 NaT 2015-12-12 NaN
7 NaT 2015-12-14 NaN
8 2016-01-06 2015-12-14 23.0
9 NaT 2015-12-15 NaN
See also the pandas docs on frequency conversion.
You can use the datetime module to help here. Also, as a side note, a simple date subtraction should work, as below:
import datetime as dt
import numpy as np
import pandas as pd
#Assume we have df_test:
In [222]: df_test
Out[222]:
first_date second_date
0 2016-01-31 2015-11-19
1 2016-02-29 2015-11-20
2 2016-03-31 2015-11-21
3 2016-04-30 2015-11-22
4 2016-05-31 2015-11-23
5 2016-06-30 2015-11-24
6 NaT 2015-11-25
7 NaT 2015-11-26
8 2016-01-31 2015-11-27
9 NaT 2015-11-28
10 NaT 2015-11-29
11 NaT 2015-11-30
12 2016-04-30 2015-12-01
13 NaT 2015-12-02
14 NaT 2015-12-03
15 2016-04-30 2015-12-04
16 NaT 2015-12-05
17 NaT 2015-12-06
In [223]: df_test['Difference'] = df_test['first_date'] - df_test['second_date']
In [224]: df_test
Out[224]:
first_date second_date Difference
0 2016-01-31 2015-11-19 73 days
1 2016-02-29 2015-11-20 101 days
2 2016-03-31 2015-11-21 131 days
3 2016-04-30 2015-11-22 160 days
4 2016-05-31 2015-11-23 190 days
5 2016-06-30 2015-11-24 219 days
6 NaT 2015-11-25 NaT
7 NaT 2015-11-26 NaT
8 2016-01-31 2015-11-27 65 days
9 NaT 2015-11-28 NaT
10 NaT 2015-11-29 NaT
11 NaT 2015-11-30 NaT
12 2016-04-30 2015-12-01 151 days
13 NaT 2015-12-02 NaT
14 NaT 2015-12-03 NaT
15 2016-04-30 2015-12-04 148 days
16 NaT 2015-12-05 NaT
17 NaT 2015-12-06 NaT
Now, change the type to datetime.timedelta, then use the .days attribute on valid timedelta objects.
In [226]: df_test['Diffference'] = df_test['Difference'].astype(dt.timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days)
In [227]: df_test
Out[227]:
first_date second_date Difference Diffference
0 2016-01-31 2015-11-19 73 days 73
1 2016-02-29 2015-11-20 101 days 101
2 2016-03-31 2015-11-21 131 days 131
3 2016-04-30 2015-11-22 160 days 160
4 2016-05-31 2015-11-23 190 days 190
5 2016-06-30 2015-11-24 219 days 219
6 NaT 2015-11-25 NaT NaN
7 NaT 2015-11-26 NaT NaN
8 2016-01-31 2015-11-27 65 days 65
9 NaT 2015-11-28 NaT NaN
10 NaT 2015-11-29 NaT NaN
11 NaT 2015-11-30 NaT NaN
12 2016-04-30 2015-12-01 151 days 151
13 NaT 2015-12-02 NaT NaN
14 NaT 2015-12-03 NaT NaN
15 2016-04-30 2015-12-04 148 days 148
16 NaT 2015-12-05 NaT NaN
17 NaT 2015-12-06 NaT NaN
Hope that helps.
I feel that the answers above do not handle dates that 'wrap' around a year, which matters when you care about proximity by day of year. To do these row operations, I did the following (I used this in a business setting for renewing customer subscriptions).
def get_date_difference(row, x, y):
    try:
        # Calculating the smallest date difference between the start and the
        # close date. There's some tricky logic in here for determining the
        # difference the other way around (Dec -> Jan is 1 month rather than 11).
        sub_start_date = int(row[x].strftime('%j'))  # day of year (1-366)
        close_date = int(row[y].strftime('%j'))      # day of year (1-366)
        later_date_of_year = max(sub_start_date, close_date)
        earlier_date_of_year = min(sub_start_date, close_date)
        days_diff = later_date_of_year - earlier_date_of_year
        # Calculates the difference going across the year boundary (December -> January):
        days_diff_reversed = (365 - later_date_of_year) + earlier_date_of_year
        return min(days_diff, days_diff_reversed)
    except ValueError:
        return None
Then the function can be applied row by row:
dfAC_Renew['date_difference'] = dfAC_Renew.apply(get_date_difference, x='customer_since_date', y='renewal_date', axis=1)
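The same wrap-around logic can also be expressed without apply; a sketch, assuming both columns are already datetime64:
import numpy as np
doy_start = dfAC_Renew['customer_since_date'].dt.dayofyear
doy_close = dfAC_Renew['renewal_date'].dt.dayofyear
straight = (doy_start - doy_close).abs()
# Going the other way around the year boundary is 365 minus the straight gap:
dfAC_Renew['date_difference'] = np.minimum(straight, 365 - straight)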
Create a vectorized method:
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

def calc_xb_minus_xa(df):
    time_dict = {
        '<Minute>': 'm',
        '<Hour>': 'h',
        '<Day>': 'D',
        '<Week>': 'W',
        '<Month>': 'M',
        '<Year>': 'Y'
    }
    # Infer the time unit from the first row's interval:
    time_delta = df.at[df.index[0], 'end_time'] - df.at[df.index[0], 'open_time']
    offset_base_name = str(to_offset(time_delta).base)
    time_term = time_dict.get(offset_base_name)
    # Note: 'M' and 'Y' are ambiguous timedelta units, and recent
    # pandas/numpy versions may refuse them in np.timedelta64:
    result = (df.end_time - df.open_time) / np.timedelta64(1, time_term)
    return result
Then in your df do:
df['x'] = calc_xb_minus_xa(df)
This will work for minutes, hours, days, weeks, months and years.
open_time and end_time need to be changed according to your df.
