For a dataframe of:
import pandas as pd
df = pd.DataFrame({
    'dt': [
        '2019-01-01',
        '2019-01-02',
        '2019-01-03',
        '2020-01-01',
        '2020-01-02',
        '2020-01-03',
        '2019-01-01',
        '2019-01-02',
        '2019-01-03',
        '2020-01-01',
        '2020-01-02',
        '2020-01-03'
    ],
    'foo': [1, 2, 3, 4, 5, 6, 1, 5, 3, 4, 10, 6],
    'category': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
})
How can I find the lagged value from the previous year for each category?
df['dt'] = pd.to_datetime(df['dt'])
display(df)
Shifting the dates only returns an empty selection, so the assignment fails:
df['last_year'] = df[df.dt == df.dt - pd.offsets.Day(365)]
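A quick check confirms the mask is never True, since each dt is compared to itself minus 365 days:
print((df['dt'] == df['dt'] - pd.offsets.Day(365)).any())  # False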
Obviously, a join with the 2019 data on the month and day would work, but that seems rather cumbersome. Is there a better way?
Edit: the desired result is:
dt foo category last_year
2020-01-01 4 1 1
2020-01-02 5 1 2
2020-01-03 6 1 3
2020-01-01 4 2 1
2020-01-02 10 2 5
2020-01-03 6 2 3
You can merge df with itself after reassigning the dt column, shifted by the difference you want with pd.DateOffset:
print(df.merge(df.assign(dt=lambda x: x['dt'] + pd.DateOffset(years=1)),
               on=['dt', 'category'],
               suffixes=('', '_lastYear'),
               how='left'))
dt foo category foo_lastYear
0 2019-01-01 1 1 NaN
1 2019-01-02 2 1 NaN
2 2019-01-03 3 1 NaN
3 2020-01-01 4 1 1.0
4 2020-01-02 5 1 2.0
5 2020-01-03 6 1 3.0
6 2019-01-01 1 2 NaN
7 2019-01-02 5 2 NaN
8 2019-01-03 3 2 NaN
9 2020-01-01 4 2 1.0
10 2020-01-02 10 2 5.0
11 2020-01-03 6 2 3.0
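If you only want the 2020 rows from the desired result above, a possible follow-up (a minimal sketch; the variable name out is my own) is to drop the rows with no previous-year match and rename the suffixed column:
out = df.merge(df.assign(dt=lambda x: x['dt'] + pd.DateOffset(years=1)),
               on=['dt', 'category'],
               suffixes=('', '_lastYear'),
               how='left')
# keep only rows that found a previous-year value, under the asked-for name
print(out.dropna(subset=['foo_lastYear'])
         .rename(columns={'foo_lastYear': 'last_year'}))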
I have this data:
data = {
    'id': [1, 2, 3, 4, 5, 6],
    'number': [2, 3, 5, 6, 7, 8],
    'date': ['2010-01-01', '2010-01-01', '2020-01-04', '2020-01-04', '2020-01-04', '2020-01-05']
}
df = pd.DataFrame(data)
I need to get the mean of the number column over the last 1 day.
df.index = pd.to_datetime(df['date'])
df['mean_number'] = df['number'].rolling('1D').mean().shift()
PS: I use .shift() so that the mean does not include the current row.
This results in:
id number date mean_number
date
2010-01-01 1 2 2010-01-01 NaN
2010-01-01 2 3 2010-01-01 2.0
2020-01-04 3 5 2020-01-04 2.5
2020-01-04 4 6 2020-01-04 5.0
2020-01-04 5 7 2020-01-04 5.5
2020-01-05 6 8 2020-01-05 6.0
Id 1 is right, because there is no earlier data.
Id 2 is right, because it takes the mean of id 1 only.
Id 3 is wrong: I set a 1D rolling window, so only data from 2020-01-03 and 2020-01-04 was supposed to be included.
Id 4 is right, because it takes the mean of id 3 only.
Id 5 is right, because it takes the mean of id 3 and id 4 (both are within the 1D range).
Id 6 is right, because it takes the mean of id 3, id 4 and id 5 (all three are within the 1D range).
What am I doing wrong, and how can I fix it?
Try:
df['mean_number'] = df['number'].rolling('1D', closed='left').mean()
Result:
id number date mean_number
date
2010-01-01 1 2 2010-01-01 NaN
2010-01-01 2 3 2010-01-01 2.0
2020-01-04 3 5 2020-01-04 NaN
2020-01-04 4 6 2020-01-04 5.0
2020-01-04 5 7 2020-01-04 5.5
2020-01-05 6 8 2020-01-05 6.0
🤔 Hmm, not 100% sure what you are trying to do, but you can try setting the closed parameter instead of using .shift().
For more detail, you can check this out: Windowing operations.
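To illustrate the closed parameter with a minimal sketch (the toy series is my own example, not from the question): closed='right', the default, includes the current row in the window, while closed='left' excludes it:
import pandas as pd

s = pd.Series([1, 2, 3],
              index=pd.to_datetime(['2020-01-03', '2020-01-04', '2020-01-04']))
print(s.rolling('1D').mean())                 # default closed='right': current row included
print(s.rolling('1D', closed='left').mean())  # closed='left': current row excluded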
I have a data frame that consists of a number of different item numbers from different locations. The problem is that I am missing dates for all the different combos. For example, for item number 1, I want all the dates that are missing across all the locations. What is the best way to add dates with quantity 0 for every single item at every single location for the days that don't exist in the data set? Please and thank you!
I tried the following
df.set_index(data["DATE", "ITEMNUMBER"], inplace=True)
df = data.resample('D').sum().fillna(0)
which gives me the following error: ValueError: Length mismatch: Expected 1 rows, received array of length 749629
So I tried the following:
df.set_index(data["DATE", "ITEMNUMBER"], inplace=True)
df = data.resample('D').sum().fillna(0)
That results in a KeyError raised from inside pandas' indexing code (the traceback ends at the line 'if tolerance is not None:').
To get all combinations of DATE, ITEMNUMBER, and LOCATION you can try:
import itertools

df2 = df.set_index(["DATE", "ITEMNUMBER", "LOCATION"])
df2 = df2.reindex(itertools.product(df['DATE'].unique(),
                                    df['ITEMNUMBER'].unique(),
                                    df['LOCATION'].unique())
                  ).fillna(0).reset_index()
df2
example input:
DATE ITEMNUMBER LOCATION QUANTITY
0 2021-07-28 1 A 0
1 2021-07-28 2 B 1
2 2021-07-28 1 B 2
3 2021-07-29 1 A 3
4 2021-07-30 2 A 4
output:
DATE ITEMNUMBER LOCATION QUANTITY
0 2021-07-28 1 A 0.0
1 2021-07-28 1 B 2.0
2 2021-07-28 2 A 0.0
3 2021-07-28 2 B 1.0
4 2021-07-29 1 A 3.0
5 2021-07-29 1 B 0.0
6 2021-07-29 2 A 0.0
7 2021-07-29 2 B 0.0
8 2021-07-30 1 A 0.0
9 2021-07-30 1 B 0.0
10 2021-07-30 2 A 4.0
11 2021-07-30 2 B 0.0
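As a side note, pandas can build the same full index without itertools; here is a sketch of the same idea using pd.MultiIndex.from_product and reindex's fill_value (which also keeps QUANTITY as an integer instead of the 0.0 floats above):
full_idx = pd.MultiIndex.from_product(
    [df['DATE'].unique(), df['ITEMNUMBER'].unique(), df['LOCATION'].unique()],
    names=['DATE', 'ITEMNUMBER', 'LOCATION'])
df2 = (df.set_index(['DATE', 'ITEMNUMBER', 'LOCATION'])
         .reindex(full_idx, fill_value=0)
         .reset_index())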
Using a toy data frame:
>>> df = pd.DataFrame([{'date': '2014-07-14', 'id': 1, 'q': 1}, {'date': '2014-07-15', 'id': 1, 'q': 1}, {'date': '2014-07-17', 'id': 1, 'q': 1}, {'date': '2014-07-18', 'id': 1, 'q': 2}, {'date': '2014-07-14', 'id': 5, 'q': 2}])
>>> df
date id q
0 2014-07-14 1 1
1 2014-07-15 1 1
2 2014-07-17 1 1
3 2014-07-18 1 2
4 2014-07-14 5 2
I convert the dates to datetimes, then, within each id, reindex between the index minimum and maximum, creating empty rows. I then fill the quantity column q with 0 in place of np.nan and forward-fill the remaining nulls.
>>> df.assign(date=lambda df: pd.to_datetime(df['date'])) \
    .set_index('date').groupby('id') \
    .apply(lambda df: df.reindex(pd.date_range(df.index.min(), df.index.max(), freq='D'))) \
    .assign(q=lambda df: df['q'].fillna(0)) \
    .groupby(level=0).ffill()
id q
id
1 2014-07-14 1.0 1.0
2014-07-15 1.0 1.0
2014-07-16 1.0 0.0
2014-07-17 1.0 1.0
2014-07-18 1.0 2.0
5 2014-07-14 5.0 2.0
I'm not sure how you want to deal with the location column; my answer is simplified by removing that column entirely.
If you don't know yourself, do not ffill at the end. Instead, group by and assign an ffill of the id column only back to id, leaving the location as NaN, as sketched below.
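A sketch of that variant, reusing the toy frame above (the filled variable name is mine): fill q with 0 but forward-fill only the id column, so any other columns such as location would stay NaN on the inserted rows:
>>> filled = df.assign(date=lambda df: pd.to_datetime(df['date'])) \
    .set_index('date').groupby('id') \
    .apply(lambda df: df.reindex(pd.date_range(df.index.min(), df.index.max(), freq='D')))
>>> filled['q'] = filled['q'].fillna(0)
>>> filled['id'] = filled.groupby(level=0)['id'].ffill()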
I'm trying to decipher some inherited pandas code and cannot determine what the list [['DemandRate','DemandRateQtr','AcceptRate']] is doing in this line of code:
plot_data = (my_dataframe.query("quote_date>'2020-02-01'")
             .groupby(['quote_date'])[['DemandRate', 'DemandRateQtr', 'AcceptRate']]
             .mean()
             .reset_index()
             )
Can anyone tell me what the list does?
It filters by column names: only the columns in the list are aggregated.
['DemandRate', 'DemandRateQtr', 'AcceptRate']
Any other columns outside this list and the groupby list (here ['quote_date']) are omitted:
my_dataframe = pd.DataFrame({
    'quote_date': pd.date_range('2020-02-01', periods=3).tolist() * 2,
    'DemandRate': [4, 5, 4, 5, 5, 4],
    'DemandRateQtr': [7, 8, 9, 4, 2, 3],
    'AcceptRate': [1, 3, 5, 7, 1, 0],
    'column': [5, 3, 6, 9, 2, 4]
})
print(my_dataframe)
quote_date DemandRate DemandRateQtr AcceptRate column
0 2020-02-01 4 7 1 5
1 2020-02-02 5 8 3 3
2 2020-02-03 4 9 5 6
3 2020-02-01 5 4 7 9
4 2020-02-02 5 2 1 2
5 2020-02-03 4 3 0 4
plot_data = (my_dataframe.query("quote_date>'2020-02-01'")
             .groupby(['quote_date'])[['DemandRate', 'DemandRateQtr', 'AcceptRate']]
             .mean()
             .reset_index())
print(plot_data)
# note that 'column' is not in the output
quote_date DemandRate DemandRateQtr AcceptRate
0 2020-02-02 5.0 5.0 2.0
1 2020-02-03 4.0 6.0 2.5
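A related detail: selecting with a list after groupby returns a DataFrame of those columns, while selecting a single column name returns a Series. A quick sketch:
g = my_dataframe.groupby('quote_date')
print(type(g[['DemandRate']].mean()))  # DataFrame
print(type(g['DemandRate'].mean()))    # Series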
My dataframe is shown as follows:
User Date Unit
1 A 2000-10-31 1
2 A 2001-10-31 2
3 A 2002-10-31 1
4 A 2003-10-31 2
5 B 2000-07-31 1
6 B 2000-08-31 2
7 B 2001-07-31 1
8 B 2002-06-30 1
9 B 2002-07-31 1
10 B 2002-08-31 1
I want to make the following judgement:
(1) If a 'User' had the same 'Unit' in the same month in the past two consecutive years, the row should be classified as 'Routine' with a dummy variable 1.
(2) Otherwise, the row should be classified as 0 in the 'Routine' column.
(3) For rows that do not have two past consecutive years of data, the 'Routine' column should show NaN.
My desired output is:
User Date Unit Routine
1 A 2000-10-31 1 NaN
2 A 2001-10-31 2 NaN
3 A 2002-10-31 1 1
4 A 2003-10-31 2 1
5 B 2000-07-31 1 NaN
6 B 2000-08-31 2 NaN
7 B 2001-07-31 1 NaN
8 B 2002-06-30 1 0
9 B 2002-07-31 1 1
10 B 2002-08-31 1 0
The code of the dataframe is shown as follows:
df = pd.DataFrame({'User': list('AAAABBBBBB'),
                   'Date': ['2000-10-31', '2001-10-31', '2002-10-31', '2003-10-31', '2000-07-31',
                            '2000-08-31', '2001-07-31', '2002-06-30', '2002-07-31', '2002-08-31'],
                   'Unit': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1]})
df['Date']=pd.to_datetime(df['Date'])
I want to use groupby function since there are many users in the dataframe. Thank you.
The code:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'User': list('AAAABBBBBB'),
        'Date': [
            '2000-10-31', '2001-10-31', '2002-10-31', '2003-10-31',
            '2000-07-31', '2000-08-31', '2001-07-31', '2002-06-30',
            '2002-07-31', '2002-08-31'],
        'Unit': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1]})
df['Date'] = pd.to_datetime(df['Date'])
def routine(user, cdate, unit):
    # default: not enough history, so NaN
    result = np.nan
    # the two consecutive years preceding the current row's year
    two_years = [cdate.year - 1, cdate.year - 2]
    # rows of the same user that fall in those two years
    mask = df.User == user
    mask = mask & df.Date.dt.year.isin(two_years)
    sdf = df[mask]
    years = sdf.Date.dt.year.to_list()
    # both preceding years must be present in the data
    got_years = all([y in years for y in two_years])
    result = 0 if (sdf.shape[0] > 0) & got_years else result
    # same month and same unit in either preceding year makes the row 'Routine'
    mask2 = (sdf.Date.dt.month == cdate.month) & (sdf.Unit == unit)
    sdf = sdf[mask2]
    result = 1 if (sdf.shape[0] > 0) & got_years else result
    return result

df['Routine'] = df.apply(
    lambda row: routine(row['User'], row['Date'], row['Unit']), axis=1)
print(df)
Output:
User Date Unit Routine
0 A 2000-10-31 1 NaN
1 A 2001-10-31 2 NaN
2 A 2002-10-31 1 1.0
3 A 2003-10-31 2 1.0
4 B 2000-07-31 1 NaN
5 B 2000-08-31 2 NaN
6 B 2001-07-31 1 NaN
7 B 2002-06-30 1 0.0
8 B 2002-07-31 1 1.0
9 B 2002-08-31 1 0.0
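Since the question mentions wanting a groupby-based approach, here is a hedged alternative sketch that reproduces the same output; classify and row_value are my own helper names. It precomputes, per user, the set of years seen and the (year, month, Unit) combinations:
def classify(group):
    # years in which this user has any data
    years = set(group['Date'].dt.year)
    # (year, month, unit) combinations observed for this user
    combos = set(zip(group['Date'].dt.year, group['Date'].dt.month, group['Unit']))

    def row_value(row):
        y, m, u = row['Date'].year, row['Date'].month, row['Unit']
        # rule (3): both preceding years must exist, otherwise NaN
        if not {y - 1, y - 2} <= years:
            return np.nan
        # rules (1)/(2): same month and unit in either preceding year -> 1, else 0
        return 1.0 if ((y - 1, m, u) in combos or (y - 2, m, u) in combos) else 0.0

    return group.apply(row_value, axis=1)

df['Routine'] = df.groupby('User', group_keys=False).apply(classify)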
I'm trying to replace/update the price column's values with np.nan where the date equals 2019-09-01. I have tried two methods, but neither has worked so far:
price pct date
0 10379.00000 0.0242 2019/6/1
1 10608.25214 NaN 2019/9/1
2 10400.00000 0.0658 2019/6/1
3 10258.48471 NaN 2019/9/1
4 12294.00000 0.1633 2019/6/1
5 11635.07402 NaN 2019/9/1
6 12564.00000 -0.0066 2019/6/1
7 13615.10992 NaN 2019/9/1
Solution 1: df.price.where(df.date == '2019-09-01', np.nan, inplace=True), but it replaced all price values with NaN
price pct date
0 NaN 0.0242 2019-06-01
1 NaN NaN 2019-09-01
2 NaN 0.0658 2019-06-01
3 NaN NaN 2019-09-01
4 NaN 0.1633 2019-06-01
5 NaN NaN 2019-09-01
6 NaN -0.0066 2019-06-01
7 NaN NaN 2019-09-01
Solution 2: df.loc[df.date == '2019-09-01', 'price'] = np.nan, this didn't replace values.
price pct date
0 10379.00000 0.0242 2019-06-01
1 10608.25214 NaN 2019-09-01
2 10400.00000 0.0658 2019-06-01
3 10258.48471 NaN 2019-09-01
4 12294.00000 0.1633 2019-06-01
5 11635.07402 NaN 2019-09-01
6 12564.00000 -0.0066 2019-06-01
7 13615.10992 NaN 2019-09-01
Please note that the date in the Excel file is in 2019/9/1 format before read_excel; I converted it with df['date'] = pd.to_datetime(df['date']).dt.date.
Can someone explain why this doesn't work? Thanks.
'2019-09-01' is a string, while df.date holds datetime.date objects, so the equality comparison never matches.
You should convert df.date to str to match:
df.loc[df.date.astype(str) == '2019-09-01', 'price'] = np.nan
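Alternatively, since .dt.date leaves datetime.date objects in the column, a sketch comparing against a date object directly (no string round-trip):
import datetime
df.loc[df['date'] == datetime.date(2019, 9, 1), 'price'] = np.nan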
Actually the first solution works (kind of) for me, try this:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [3, 2, 1], [5, 6, 7]]),
    columns=['a', 'b', 'c']
)
The df should be:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 3 2 1
4 5 6 7
Then, using similar code:
df.a.where(df.c != 7, np.nan, inplace=True)
I got the df as:
a b c
0 1.0 2 3
1 4.0 5 6
2 7.0 8 9
3 3.0 2 1
4 NaN 6 7
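For the original question, note that Series.where keeps the values where the condition is True and replaces the rest (by default with NaN), so the condition must select the rows to keep, not the rows to replace. A sketch of the two inverted spellings, assuming the date column is string-comparable as above:
df['price'] = df['price'].where(df['date'].astype(str) != '2019-09-01')
# or select the rows to replace instead of the rows to keep:
df['price'] = df['price'].mask(df['date'].astype(str) == '2019-09-01')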