Moving average with several non-constant conditions - python

I hope there are experts here who can help.
I have the following table:
X2 X3 X4 Y Y1
01.02.2019 1 1 1
02.02.2019 2 2 0
02.02.2019 2 3 0
02.02.2019 2 1 1
03.02.2019 1 2 1
04.02.2019 2 3 0
05.02.2019 1 1 1
06.02.2019 2 2 0
07.02.2019 1 3 1
08.02.2019 2 1 1
09.02.2019 1 2 0
10.02.2019 2 3 1
11.02.2019 1 3 0
12.02.2019 2 2 1
13.02.2019 1 3 0
14.02.2019 2 1 1
15.02.2019 2 2 1
16.02.2019 2 3 0
17.02.2019 1 1 1
18.02.2019 2 2 0
In column Y1 I need to calculate the moving average of column Y over the last 5 days, but filtered by conditions on X3 and X4: the filter values equal the current row's values in those columns.
For example, for the row
04.02.2019 2 3 0
the average equals 0, because only this row matches the condition:
02.02.2019 2 3 0
I do not understand how to do this. I know it will be something like
filtered_X4 = df['X4'].where(condition_1 & condition_2 & condition_3)
but I do not understand how to set up the conditions condition_1, condition_2, condition_3 themselves.
I have seen many examples where the filter is known in advance, for example
condition_1 = df['X2'].isin([2, 3, 5])
but that is not what I need, because my condition values change with each row.
I know how to calculate the mean:
df['Y1'] = filtered_X4.shift(1).rolling(window=999999, min_periods=1).mean()
but I cannot configure the filtering.
Update 1: This is the result I'm trying to get:
X2 X3 X4 Y Y1
01.02.2019 1 1 1 NaN
02.02.2019 2 2 0 NaN
02.02.2019 2 3 0 NaN
02.02.2019 2 1 1 NaN
03.02.2019 1 2 1 NaN
04.02.2019 2 3 0 0
05.02.2019 1 1 1 1
06.02.2019 2 2 0 0
07.02.2019 1 3 1 NaN
08.02.2019 2 1 1 NaN
09.02.2019 1 2 0 NaN
10.02.2019 2 3 1 NaN
11.02.2019 1 3 0 1
12.02.2019 2 2 1 NaN
13.02.2019 1 3 0 0
14.02.2019 2 1 1 NaN
15.02.2019 2 2 1 1
16.02.2019 2 3 0 NaN
17.02.2019 1 1 1 NaN
18.02.2019 2 2 0 1
For example, to calculate the average (Y1) of this row:
X2 X3 X4 Y Y1
04.02.2019 2 3 0
I need to take only the rows from the dataframe with X3 = 2 and X4 = 3 and X2 from 30.01.2019 to 03.02.2019.

To do this, use .apply()
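For a reproducible setup, the question's table can be loaded like this (a sketch that simply copies the data above into a dataframe; numpy is imported for the np.nan used further down):
import io

import numpy as np
import pandas as pd

raw = """X2 X3 X4 Y
01.02.2019 1 1 1
02.02.2019 2 2 0
02.02.2019 2 3 0
02.02.2019 2 1 1
03.02.2019 1 2 1
04.02.2019 2 3 0
05.02.2019 1 1 1
06.02.2019 2 2 0
07.02.2019 1 3 1
08.02.2019 2 1 1
09.02.2019 1 2 0
10.02.2019 2 3 1
11.02.2019 1 3 0
12.02.2019 2 2 1
13.02.2019 1 3 0
14.02.2019 2 1 1
15.02.2019 2 2 1
16.02.2019 2 3 0
17.02.2019 1 1 1
18.02.2019 2 2 0
"""
df = pd.read_csv(io.StringIO(raw), sep=' ')  # X2 is still a string at this point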
Convert date to datetime.
df['X2'] = pd.to_datetime(df['X2'], format='%d.%m.%Y')
print(df)
X2 X3 X4 Y
0 2019-02-01 1 1 1
1 2019-02-02 2 2 0
2 2019-02-02 2 3 0
3 2019-02-02 2 1 1
4 2019-02-03 1 2 1
5 2019-02-04 2 3 0
6 2019-02-05 1 1 1
7 2019-02-06 2 2 0
8 2019-02-07 1 3 1
9 2019-02-08 2 1 1
10 2019-02-09 1 2 0
11 2019-02-10 2 3 1
12 2019-02-11 1 3 0
13 2019-02-12 2 2 1
14 2019-02-13 1 3 0
15 2019-02-14 2 1 1
16 2019-02-15 2 2 1
17 2019-02-16 2 3 0
18 2019-02-17 1 1 1
19 2019-02-18 2 2 0
Using apply and a lambda, build a df.loc filter for each row: restrict by date to the days strictly before the current row (here a four-day lookback) and require equality in columns X3 and X4, then take the mean of 'Y'.
df['Y1'] = df.apply(
    lambda x: df.loc[
        (df.X2 < x.X2)                                # strictly before the current date
        & (df.X2 >= (x.X2 + pd.DateOffset(days=-4)))  # at most 4 days back
        & (df.X3 == x.X3)                             # same X3 as the current row
        & (df.X4 == x.X4),                            # same X4 as the current row
        "Y",
    ].mean(),
    axis=1,
)
print(df)
X2 X3 X4 Y Y1
0 2019-02-01 1 1 1 NaN
1 2019-02-02 2 2 0 NaN
2 2019-02-02 2 3 0 NaN
3 2019-02-02 2 1 1 NaN
4 2019-02-03 1 2 1 NaN
5 2019-02-04 2 3 0 0.0
6 2019-02-05 1 1 1 1.0
7 2019-02-06 2 2 0 0.0
8 2019-02-07 1 3 1 NaN
9 2019-02-08 2 1 1 NaN
10 2019-02-09 1 2 0 NaN
11 2019-02-10 2 3 1 NaN
12 2019-02-11 1 3 0 1.0
13 2019-02-12 2 2 1 NaN
14 2019-02-13 1 3 0 0.0
15 2019-02-14 2 1 1 NaN
16 2019-02-15 2 2 1 1.0
17 2019-02-16 2 3 0 NaN
18 2019-02-17 1 1 1 NaN
19 2019-02-18 2 2 0 1.0
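Note that apply rescans the entire frame for every row, which is quadratic in the number of rows. For larger data, a time-based rolling window per (X3, X4) group gives the same result. This is a sketch, assuming the (X3, X4, X2) triples are unique, as they are here:
roll = (
    df.sort_values('X2')
      .set_index('X2')
      .groupby(['X3', 'X4'])['Y']
      .rolling('4D', closed='left')  # window [t - 4 days, t): the current row is excluded
      .mean()
      .rename('Y1')
      .reset_index()
)
df = df.drop(columns='Y1', errors='ignore').merge(roll, on=['X3', 'X4', 'X2'], how='left')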
The Y1 result has dtype float, since np.nan is not compatible with an integer series. If you need integers, use the following workaround.
col = 'Y1'

df[col] = df[col].fillna(-1)             # temporary sentinel for the missing values
df[col] = df[col].astype(int)            # drop the decimal point
df[col] = df[col].astype(str)            # store as strings so NaN can coexist
df[col] = df[col].replace('-1', np.nan)  # restore the missing values

print(df)
X2 X3 X4 Y Y1
0 2019-02-01 1 1 1 NaN
1 2019-02-02 2 2 0 NaN
2 2019-02-02 2 3 0 NaN
3 2019-02-02 2 1 1 NaN
4 2019-02-03 1 2 1 NaN
5 2019-02-04 2 3 0 0
6 2019-02-05 1 1 1 1
7 2019-02-06 2 2 0 0
8 2019-02-07 1 3 1 NaN
9 2019-02-08 2 1 1 NaN
10 2019-02-09 1 2 0 NaN
11 2019-02-10 2 3 1 NaN
12 2019-02-11 1 3 0 1
13 2019-02-12 2 2 1 NaN
14 2019-02-13 1 3 0 0
15 2019-02-14 2 1 1 NaN
16 2019-02-15 2 2 1 1
17 2019-02-16 2 3 0 NaN
18 2019-02-17 1 1 1 NaN
19 2019-02-18 2 2 0 1
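Alternatively, newer pandas versions offer a nullable integer dtype that avoids the string round-trip; a sketch, not part of the original answer:
df['Y1'] = df['Y1'].astype('Int64')  # capital 'I': nullable integer; NaN becomes <NA>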
EDIT
Follow-up question: how do you apply the above daily with new data, without recomputing the old data?
You just need to filter your data to the date range you want to include.
Create a startdate as a datetime:
startdate = pd.to_datetime('2019-02-13')
Modify the apply function by adding an if condition:
df['Y1'] = df.apply(
    lambda x: df.loc[
        (df.X2 < x.X2)
        & (df.X2 >= (x.X2 + pd.DateOffset(days=-4)))
        & (df.X3 == x.X3)
        & (df.X4 == x.X4),
        "Y",
    ].mean() if x.X2 >= startdate else x.Y1,  # keep the existing Y1 for rows before startdate
    axis=1,
)
**This will only work after the first time you run the apply statement, because otherwise the Y1 column does not exist yet and the else branch fails.**
So run it first without the if condition, and thereafter run it with the if condition.
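A small guard (an addition of mine, not part of the original answer) avoids the first-run problem entirely:
# Create Y1 up front so the else branch always has something to return
if 'Y1' not in df.columns:
    df['Y1'] = np.nan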

Related

pandas groupby then filter by date to get mean

I'm using pandas dataframes and attempting to get the average number of purchases over the last 90 days for each row (not including the current row itself), grouped by CustId, and then add a new column "PurchaseMeanLast90Days".
This is the code I tried, which is incorrect:
group = df.groupby(['CustId'])
df['PurchaseMeanLast90Days'] = group.apply(lambda g: g[g['Date'] > (pd.DatetimeIndex(g['Date']) + pd.DateOffset(-90))])['Purchases'].mean()
Here's my data:
Index CustId Date      Purchases
0     1      1/01/2021 5
1     1      1/12/2021 1
2     1      3/28/2021 2
3     1      4/01/2021 4
4     1      4/20/2021 2
5     1      5/01/2021 5
6     2      1/01/2021 1
7     2      2/01/2021 1
8     2      3/01/2021 2
9     2      4/01/2021 3
For example, row index 5 would include these rows in its mean() = 2.67:
Index CustId Date      Purchases
2     1      3/28/2021 2
3     1      4/01/2021 4
4     1      4/20/2021 2
The new dataframe would look like this (I didn't do the calcs for CustId = 2):
Index CustId Date      Purchases PurchaseMeanLast90Days
0     1      1/01/2021 5         0
1     1      1/12/2021 1         5
2     1      3/28/2021 2         3
3     1      4/01/2021 4         2.67
4     1      4/20/2021 2         3.0
5     1      5/01/2021 5         2.67
6     2      1/01/2021 1         ...
7     2      2/01/2021 1         ...
8     2      3/01/2021 2         ...
9     2      4/01/2021 3         ...
You can do a rolling computation:
df["Date"] = pd.to_datetime(df["Date"], dayfirst=False)
df["PurchaseMeanLast90Days"] = (
    df.groupby("CustId")
      .rolling("90D", min_periods=1, on="Date", closed="both")["Purchases"]
      # the closed="both" window includes the current row, so shift it out of
      # the sum and divide by the remaining count
      .apply(lambda x: x.shift(1).sum() / (len(x) - 1))
      .fillna(0)  # the first row of each group divides 0 by 0, giving NaN
      .values
)
print(df)
Prints:
Index CustId Date Purchases PurchaseMeanLast90Days
0 0 1 2021-01-01 5 0.000000
1 1 1 2021-01-12 1 5.000000
2 2 1 2021-03-28 2 3.000000
3 3 1 2021-04-01 4 2.666667
4 4 1 2021-04-20 2 3.000000
5 5 1 2021-05-01 5 2.666667
6 6 2 2021-01-01 1 0.000000
7 7 2 2021-02-01 1 1.000000
8 8 2 2021-03-01 2 1.000000
9 9 2 2021-04-01 3 1.333333

Fill column with nan if sum of multiple columns is 0

Task
I have a df where I compute some ratios, grouped by date and id. I want to fill column c with NaN if the sum of a and b is 0. Any help would be awesome!!
df
date id a b c
0 2001-09-06 1 3 1 1
1 2001-09-07 1 3 1 1
2 2001-09-08 1 4 0 1
3 2001-09-09 2 6 0 1
4 2001-09-10 2 0 0 2
5 2001-09-11 1 0 0 2
6 2001-09-12 2 1 1 2
7 2001-09-13 2 0 0 2
8 2001-09-14 1 0 0 2
Try this:
df['new_c'] = df.c.where(df[['a','b']].sum(1).ne(0))
Out[75]:
date id a b c new_c
0 2001-09-06 1 3 1 1 1.0
1 2001-09-07 1 3 1 1 1.0
2 2001-09-08 1 4 0 1 1.0
3 2001-09-09 2 6 0 1 1.0
4 2001-09-10 2 0 0 2 NaN
5 2001-09-11 1 0 0 2 NaN
6 2001-09-12 2 1 1 2 2.0
7 2001-09-13 2 0 0 2 NaN
8 2001-09-14 1 0 0 2 NaN
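Equivalently, with boolean indexing; a sketch, noting that assigning NaN upcasts c to float:
mask = df[['a', 'b']].sum(axis=1).eq(0)  # rows where a + b == 0
df.loc[mask, 'c'] = np.nan               # overwrite c in place instead of adding new_c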
Alternatively, you can build a new dataframe with the same shape and fill it row by row (the vectorized .where above is preferable, but this makes the logic explicit):
rows = []
for _, line in df.iterrows():  # iterate over rows, not column labels
    c = np.nan if line['a'] + line['b'] == 0 else line['c']
    rows.append({'date': line['date'], 'id': line['id'],
                 'a': line['a'], 'b': line['b'], 'c': c})
new_df = pd.DataFrame(rows)

Merging data into an existing pandas dataframe column conditionally

I have the following data:
import numpy as np
import pandas as pd

one_dict = {0: "zero", 1: "one", 2: "two", 3: "three", 4: "four"}
two_dict = {0: "light", 1: "calc", 2: "line", 3: "blur", 4: "color"}

np.random.seed(2)
n = 15
a_df = pd.DataFrame(dict(a=np.random.randint(0, 4, n), b=np.random.randint(0, 3, n)))
a_df["c"] = np.nan
a_df = a_df.sort_values("b").reset_index(drop=True)
where the dataframe looks as:
In [45]: a_df
Out[45]:
a b c
0 3 0 NaN
1 1 0 NaN
2 0 0 NaN
3 2 0 NaN
4 3 0 NaN
5 1 0 NaN
6 2 1 NaN
7 2 1 NaN
8 3 1 NaN
9 0 2 NaN
10 3 2 NaN
11 3 2 NaN
12 0 2 NaN
13 3 2 NaN
14 1 2 NaN
I would like to replace values in c with those from dictionaries one_dict
and two_dict, with the result as follows:
In [45]: a_df
Out[45]:
a b c
0 3 0 three
1 1 0 one
2 0 0 zero
3 2 0 .
4 3 0 .
5 1 0 .
6 2 1 calc
7 2 1 calc
8 3 1 blur
9 0 2 NaN
10 3 2 NaN
11 3 2 NaN
12 0 2 NaN
13 3 2 NaN
14 1 2 NaN
Attempt
I'm not sure what a good approach to this would be though.
I thought that I might do something along the following lines:
merge_df = pd.DataFrame(dict(one=one_dict, two=two_dict)).reset_index()
merge_df['zeros'] = 0
merge_df['ones'] = 1
giving
In [62]: merge_df
Out[62]:
index one two zeros ones
0 0 zero light 0 1
1 1 one calc 0 1
2 2 two line 0 1
3 3 three blur 0 1
4 4 four color 0 1
Then merge this into the a_df, but I'm not sure how to merge in and update
at the same time, or if this is a good approach.
Edit
The dictionary keys correspond to the values of column a.
. is just shorthand; those cells should be filled in with values like the others.
This is just a matter of creating a new dataframe with the correct structure and merging:
(a_df.drop('c', axis=1)
     .merge(pd.DataFrame([one_dict, two_dict])       # row 0 = one_dict, row 1 = two_dict
              .rename_axis(index='b', columns='a')   # index matches column b, columns match column a
              .stack()
              .reset_index(name='c'),                # long lookup table with columns b, a, c
            on=['a', 'b'],
            how='left')
)
Output:
a b c
0 3 0 three
1 1 0 one
2 0 0 zero
3 2 0 two
4 3 0 three
5 1 0 one
6 2 1 line
7 2 1 line
8 3 1 blur
9 0 2 NaN
10 3 2 NaN
11 3 2 NaN
12 0 2 NaN
13 3 2 NaN
14 1 2 NaN
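An alternative sketch (my illustration on the same data, not from the original answer) selects the dictionary by b and maps a through it:
# Pick the dict via column b, look up column a, default to NaN (covers b == 2)
dicts = {0: one_dict, 1: two_dict}
a_df['c'] = a_df.apply(lambda r: dicts.get(r['b'], {}).get(r['a'], np.nan), axis=1)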

Fetch first non-zero value in previous rows in pandas

Following is what my dataframe looks like. Expected_Output is my desired column:
Group Signal Ready Value Expected_Output
0 1 0 0 3 NaN
1 1 0 1 72 NaN
2 1 0 0 0 NaN
3 1 4 0 0 72.0
4 1 4 0 0 72.0
5 1 4 0 0 72.0
6 2 0 0 0 NaN
7 2 7 0 0 NaN
8 2 7 0 0 NaN
9 2 7 0 0 NaN
If Signal > 1, then I am trying to fetch the most recent non-zero Value in the previous rows within the Group where Ready = 1. So in row 3, Signal = 4, so I want to fetch the most recent non-zero Value of 72 from row 1 where Ready = 1.
Once I can fetch the value, I can do df.groupby(['Group','Signal']).Value.transform('first') since Signals appear repeatedly (e.g. 4 4 4), but I am not sure how to fetch the Value in the first place.
IIUC, groupby + ffill with a Boolean assign:
df['Help'] = df.Value.where(df.Ready == 1).replace(0, np.nan)  # keep only non-zero Values where Ready == 1
df['New'] = df.groupby('Group').Help.ffill()[df.Signal > 1]    # carry them forward within each Group
df
Out[1006]:
Group Signal Ready Value Expected_Output Help New
0 1 0 0 3 NaN 3.0 NaN
1 1 0 1 72 NaN 72.0 NaN
2 1 0 0 0 NaN NaN NaN
3 1 4 0 0 72.0 NaN 72.0
4 1 4 0 0 72.0 NaN 72.0
5 1 4 0 0 72.0 NaN 72.0
6 2 0 0 0 NaN NaN NaN
7 2 7 0 0 NaN NaN NaN
8 2 7 0 0 NaN NaN NaN
9 2 7 0 0 NaN NaN NaN
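To finish, you could copy the helper into the desired column and drop the intermediates (a sketch, not part of the original answer):
df['Expected_Output'] = df['New']
df = df.drop(columns=['Help', 'New'])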
Create a series via GroupBy + ffill, then mask the resultant series:
s = df.assign(Value_mask=df['Value'].where(df['Ready'].eq(1)))\
.groupby('Group')['Value_mask'].ffill()
df['Value'] = s.where(df['Signal'].gt(1))
Group Signal Ready Value
0 1 0 0 NaN
1 1 0 1 NaN
2 1 0 0 NaN
3 1 4 0 72.0
4 1 4 0 72.0
5 1 4 0 72.0
6 2 0 0 NaN
7 2 7 0 NaN
8 2 7 0 NaN
9 2 7 0 NaN

Use fillna-method per specific segments in dataframe

Currently I have the following dataframe, where F1-F4 are segment indicator columns:
A B C D E F1 F2 F3 F4
06:00 2 4 6 8 1 1 0 0 0
06:15 3 5 7 9 NaN 1 0 0 0
06:30 4 6 8 7 3 1 0 0 0
06:45 1 3 5 7 NaN 1 0 0 0
07:00 2 4 6 8 6 0 1 0 0
07:15 4 4 8 8 NaN 0 1 0 0
---------------------------------------------
20:00 2 4 6 8 NaN 0 0 1 0
20:15 1 2 3 4 5 0 0 1 0
20:30 8 1 5 9 NaN 0 0 1 0
20:45 1 3 5 7 NaN 0 0 0 1
21:00 5 4 6 5 6 0 0 0 1
What is the best approach to achieve the next dataset after manipulations like these?
E(06:15) = MEAN( AVG[E(06:00-06:30)], AVG[06:15(A-E)] ) #F1==1
E(20:45) = MEAN( AVG[E(20:45-21:00)], AVG[20:45(A-E)] ) #F4==1
A B C D E F1 F2 F3 F4
06:00 2 4 6 8 1 1 0 0 0
06:15 3 5 7 9 [X0] 1 0 0 0
06:30 4 6 8 7 3 1 0 0 0
06:45 1 3 5 7 [X1] 1 0 0 0
07:00 2 4 6 8 6 0 1 0 0
07:15 4 4 8 8 [X2] 0 1 0 0
---------------------------------------------
20:00 2 4 6 8 [X3] 0 0 1 0
20:15 1 2 3 4 5 0 0 1 0
20:30 8 1 5 9 [X4] 0 0 1 0
20:45 1 3 5 7 [X5] 0 0 0 1
21:00 5 4 6 5 6 0 0 0 1
I was trying an idea like the one below, but without success so far:
In[89]: df.groupby(['F1', 'F2', 'F3', 'F4'], as_index=False).median()
Out[89]:
F1 F2 F3 F4 A B C D E
0 0 0 0 1 2.0 3.0 2.0 2.0 0.0
1 0 0 1 0 1.5 2.0 3.0 3.5 1.0
2 0 1 0 0 6.0 7.0 6.0 7.0 9.0
3 1 0 0 0 3.0 4.0 3.0 4.0 4.0
and now I am struggling with accessing the value E == 0.0 via the key F4 == 1.
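For that last step, one option (a sketch against the grouped result above) is a boolean lookup on the aggregated frame:
med = df.groupby(['F1', 'F2', 'F3', 'F4'], as_index=False).median()
e_f4 = med.loc[med['F4'].eq(1), 'E'].iloc[0]  # the aggregated E value for the F4 segment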
