Following is what my dataframe looks like. Expected_Output is my desired column:
Group Signal Ready Value Expected_Output
0 1 0 0 3 NaN
1 1 0 1 72 NaN
2 1 0 0 0 NaN
3 1 4 0 0 72.0
4 1 4 0 0 72.0
5 1 4 0 0 72.0
6 2 0 0 0 NaN
7 2 7 0 0 NaN
8 2 7 0 0 NaN
9 2 7 0 0 NaN
If Signal > 1, then I am trying to fetch the most recent non-zero Value in the previous rows within the Group where Ready = 1. So in row 3, Signal = 4, so I want to fetch the most recent non-zero Value of 72 from row 1 where Ready = 1.
Once I can fetch the value, I can broadcast it with df.groupby(['Group','Signal']).Value.transform('first'), since Signals repeat (e.g. 4 4 4), but I'm not sure how to fetch the Value in the first place.
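For reference, a minimal sketch that reconstructs the sample frame (imports assumed; values copied from the table above):
import numpy as np
import pandas as pd

# Rebuild the sample frame shown above
df = pd.DataFrame({
    'Group':  [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'Signal': [0, 0, 0, 4, 4, 4, 0, 7, 7, 7],
    'Ready':  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    'Value':  [3, 72, 0, 0, 0, 0, 0, 0, 0, 0],
})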
IIUC, groupby + ffill with a Boolean mask:
# keep Value only where Ready == 1, and treat 0 as missing
df['Help'] = df.Value.where(df.Ready==1).replace(0, np.nan)
# forward-fill within each Group, keeping only the Signal > 1 rows
df['New'] = df.groupby('Group').Help.ffill()[df.Signal>1]
df
Out[1006]:
Group Signal Ready Value Expected_Output Help New
0 1 0 0 3 NaN 3.0 NaN
1 1 0 1 72 NaN 72.0 NaN
2 1 0 0 0 NaN NaN NaN
3 1 4 0 0 72.0 NaN 72.0
4 1 4 0 0 72.0 NaN 72.0
5 1 4 0 0 72.0 NaN 72.0
6 2 0 0 0 NaN NaN NaN
7 2 7 0 0 NaN NaN NaN
8 2 7 0 0 NaN NaN NaN
9 2 7 0 0 NaN NaN NaN
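Note that df.groupby('Group').Help.ffill()[df.Signal>1] is a filtered series; assignment aligns on the index, so rows where Signal <= 1 simply stay NaN. Group 2 stays NaN as well, since it has no Ready == 1 row before its Signal rows.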
Create a series via GroupBy + ffill, then mask the resultant series:
s = df.assign(Value_mask=df['Value'].where(df['Ready'].eq(1)))\
      .groupby('Group')['Value_mask'].ffill()
df['Value'] = s.where(df['Signal'].gt(1))
Group Signal Ready Value
0 1 0 0 NaN
1 1 0 1 NaN
2 1 0 0 NaN
3 1 4 0 72.0
4 1 4 0 72.0
5 1 4 0 72.0
6 2 0 0 NaN
7 2 7 0 NaN
8 2 7 0 NaN
9 2 7 0 NaN
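Note that this assigns the result back to Value, overwriting the original column; assign s.where(df['Signal'].gt(1)) to a new column instead if you need to keep the input values.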
I'm aiming to subset a pandas df using a condition and append those rows to the right of the df. For example, where Num2 is equal to 1, I want to take the following row and append it to the right of the df. The code below appends every row, whereas I just want to append the row that follows a 1 in Num2. I'd also like to be able to append only specific columns; using the example below, that could be just Num1 and Num2.
df = pd.DataFrame({
'Num1' : [0,1,2,3,4,4,0,1,2,3,1,1,2,3,4,0],
'Num2' : [0,0,0,0,0,1,3,0,1,2,0,0,0,0,1,4],
'Value' : [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
})
df1 = df.add_suffix('1').join(df.shift(-1).add_suffix('2'))
# grab all rows after a 1 in Num2
ones = df.loc[df["Num2"].shift().isin([1])]
# append these to the right
Intended output:
Num1 Num2 Value Num12 Num22
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 4 1 0 0 3
6 0 3 0
7 1 0 0
8 2 1 0 3 2
9 3 2 0
10 1 0 0
11 1 0 0
12 2 0 0
13 3 0 0
14 4 1 0 0 4
15 0 4 0
You can try:
df = df.join(df.shift(-1).mask(df['Num2'].ne(1)).drop(columns='Value').add_suffix('2'))
OR
ones.index = ones.index - 1
df = df.join(ones.drop(columns='Value').add_suffix('2'))
# OR (use either one; both methods do the same thing)
df = pd.concat([df, ones.drop(columns='Value').add_suffix('2')], axis=1)
If needed use fillna():
df[["Num12", "Num22"]]=df[["Num12", "Num22"]].fillna('')
We can do this by making new columns that are the -1 shifts of Num1 and Num2, then setting them equal to "" where Num2 isn't 1.
mask = df.Num2 != 1
df[["Num12", "Num22"]] = df[["Num1", "Num2"]].shift(-1)
df.loc[mask, ["Num12", "Num22"]] = ""
This raises a warning, but works nevertheless:
>>> df[["Num12", "Num22"]] = np.where(df[['Num1', "Num2"]]['Num2'][:,np.newaxis] == 1, df[['Num1', 'Num2']].shift(-1), [np.nan, np.nan])
<stdin>:1: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
>>> df
Num1 Num2 Value Num12 Num22
0 0 0 0 NaN NaN
1 1 0 0 NaN NaN
2 2 0 0 NaN NaN
3 3 0 0 NaN NaN
4 4 0 0 NaN NaN
5 4 1 0 0.0 3.0
6 0 3 0 NaN NaN
7 1 0 0 NaN NaN
8 2 1 0 3.0 2.0
9 3 2 0 NaN NaN
10 1 0 0 NaN NaN
11 1 0 0 NaN NaN
12 2 0 0 NaN NaN
13 3 0 0 NaN NaN
14 4 1 0 0.0 4.0
15 0 4 0 NaN NaN
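The warning itself points at the fix: convert the condition column to a NumPy array before adding the axis. A sketch of the warning-free equivalent:
# Same computation without the deprecated multi-dimensional Series indexing
cond = df['Num2'].to_numpy()[:, np.newaxis] == 1
df[["Num12", "Num22"]] = np.where(cond, df[['Num1', 'Num2']].shift(-1), np.nan)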
I have a df
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
I need to group by a and b, and then, if c or d contains one or more NaNs within a group, I want that group's entire column to be NaN:
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 nan
1 3 1 nan
1 1 nan 3
1 1 nan 3
1 1 nan 4
and then combine c and d into e so that there are no NaNs anymore:
a b c d e
0 1 nan 1 1
0 2 2 nan 2
0 2 3 nan 3
1 3 1 nan 1
1 1 nan 3 3
1 1 nan 3 3
1 1 nan 4 4
You will want to check each group for NaNs, set the appropriate value (NaN or the existing values), and then use combine_first() to combine the columns.
from io import StringIO
import pandas as pd
import numpy as np
df = pd.read_csv(StringIO("""
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
"""), sep=' ')
for col in ['c', 'd']:
    df[col] = df.groupby(['a','b'])[col].transform(lambda x: np.nan if any(x.isna()) else x)
df['e'] = df['c'].combine_first(df['d'])
df
a b c d e
0 0 1 NaN 1.0 1.0
1 0 2 2.0 NaN 2.0
2 0 2 3.0 NaN 3.0
3 1 3 1.0 NaN 1.0
4 1 1 NaN 3.0 3.0
5 1 1 NaN 3.0 3.0
6 1 1 NaN 4.0 4.0
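If the lambda becomes a bottleneck on larger frames, the same check can be done with two built-in transforms (a sketch: a group contains a NaN exactly when its non-null count is smaller than its size):
for col in ['c', 'd']:
    g = df.groupby(['a', 'b'])[col]
    # count ignores NaN, size does not, so count < size flags groups with a NaN
    df[col] = df[col].mask(g.transform('count') < g.transform('size'))
df['e'] = df['c'].combine_first(df['d'])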
I have a dataframe:
data=[[0,1,5],
[0,1,6],
[0,0,8],
[0,0,10],
[0,1,12],
[0,0,14],
[0,1,16],
[0,1,18],
[1,0,2],
[1,1,0],
[1,0,1],
[1,0,2]]
df = pd.DataFrame(data,columns=['KEY','COND','VAL'])
For RES1, I want to create a counter variable RES that increments where COND == 1. The value of RES for the first row of each KEY group stays the same as VAL (can I use cumcount() in some way?). For RES2, I then just want to fill the missing values with the previous value (df.fillna(method='ffill'), I am thinking).
KEY COND VAL RES1 RES2
0 0 1 5 5 5
1 0 1 6 6 6
2 0 0 8 6
3 0 0 10 6
4 0 1 12 7 7
5 0 0 14 7
6 0 1 16 8 8
7 0 1 18 9 9
8 1 0 2 2 2
9 1 1 0 3 3
10 1 0 1 3
11 1 0 2 3
The aim is a vectorized solution that performs well over a million rows.
IIUC
con = (df.COND==1) | (df.index.isin(df.drop_duplicates('KEY').index))
df['res1'] = (df.groupby('KEY').VAL.transform('first')
              + df.groupby('KEY').COND.cumsum()[con]
              - df.groupby('KEY').COND.transform('first'))
df['res2'] = df.res1.ffill()
df
Out[148]:
KEY COND VAL res1 res2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
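The [con] selection works through index alignment: the filtered cumsum only carries values on rows where con is True, so the arithmetic with the full-length transforms yields NaN everywhere else, and ffill then propagates the last computed counter forward.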
You want:
s = (df[df.KEY.duplicated()] # Ignore first row in each KEY group
.groupby('KEY').COND.cumsum() # Counter within KEY
.add(df.groupby('KEY').VAL.transform('first')) # Add first value per KEY
.where(df.COND.eq(1)) # Set only where COND == 1
.add(df.loc[~df.KEY.duplicated(), 'VAL'], fill_value=0) # Set 1st row by KEY
)
df['RES1'] = s
df['RES2'] = df['RES1'].ffill()
KEY COND VAL RES1 RES2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
I hope there are experts here who can help.
I have the following table:
X2 X3 X4 Y Y1
01.02.2019 1 1 1
02.02.2019 2 2 0
02.02.2019 2 3 0
02.02.2019 2 1 1
03.02.2019 1 2 1
04.02.2019 2 3 0
05.02.2019 1 1 1
06.02.2019 2 2 0
07.02.2019 1 3 1
08.02.2019 2 1 1
09.02.2019 1 2 0
10.02.2019 2 3 1
11.02.2019 1 1 0
12.02.2019 2 2 1
13.02.2019 1 3 0
14.02.2019 2 1 1
15.02.2019 1 2 1
16.02.2019 2 3 0
17.02.2019 1 1 1
18.02.2019 2 2 0
In column Y1 I need to calculate the moving average of column Y over the last 5 days, but filtered on X3 and X4: the filter values are the current row's own values in those columns.
For example, for the row
04.02.2019 2 3 0
the average will be equal to 0, because only one row matches the condition:
02.02.2019 2 3 0
I do not understand how to do this. I know that it will be something like
filtered_X4 = df['X4'].where(condition_1 & condition_2 & condition_3)
but I do not understand how to set up condition_1, condition_2 and condition_3 themselves.
I have seen many examples where the filter is known in advance, for example
condition_1 = df['X2'].isin([2, 3, 5])
but that's not what I need, because my condition values change with each row.
I know how to calculate the mean:
df['Y1'] = filtered_X4.shift(1).rolling(window=999999, min_periods=1).mean()
but I can't configure the filtering.
Update 1: This is the result I'm trying to get:
X2 X3 X4 Y Y1
01.02.2019 1 1 1 NAN
02.02.2019 2 2 0 NAN
02.02.2019 2 3 0 NAN
02.02.2019 2 1 1 NAN
03.02.2019 1 2 1 NAN
04.02.2019 2 3 0 0
05.02.2019 1 1 1 1
06.02.2019 2 2 0 0
07.02.2019 1 3 1 NAN
08.02.2019 2 1 1 NAN
09.02.2019 1 2 0 NAN
10.02.2019 2 3 1 NAN
11.02.2019 1 3 0 1
12.02.2019 2 2 1 NAN
13.02.2019 1 3 0 0
14.02.2019 2 1 1 NAN
15.02.2019 2 2 1 1
16.02.2019 2 3 0 NAN
17.02.2019 1 1 1 NAN
18.02.2019 2 2 0 1
For example, to calculate the average (Y1) for this row:
X2 X3 X4 Y Y1
04.02.2019 2 3 0
I need to take only the rows from the dataframe with X3 = 2, X4 = 3, and X2 from 30.01.2019 to 03.02.2019.
To do this, use .apply().
First, convert the date column to datetime:
df['X2'] = pd.to_datetime(df['X2'], format='%d.%m.%Y')
print(df)
X2 X3 X4 Y
0 2019-02-01 1 1 1
1 2019-02-02 2 2 0
2 2019-02-02 2 3 0
3 2019-02-02 2 1 1
4 2019-02-03 1 2 1
5 2019-02-04 2 3 0
6 2019-02-05 1 1 1
7 2019-02-06 2 2 0
8 2019-02-07 1 3 1
9 2019-02-08 2 1 1
10 2019-02-09 1 2 0
11 2019-02-10 2 3 1
12 2019-02-11 1 3 0
13 2019-02-12 2 2 1
14 2019-02-13 1 3 0
15 2019-02-14 2 1 1
16 2019-02-15 2 2 1
17 2019-02-16 2 3 0
18 2019-02-17 1 1 1
19 2019-02-18 2 2 0
Using apply and a lambda, create a df.loc filter for each row, restricting by date to the previous 5 days and requiring equality in columns X3 and X4, then calculate the mean of 'Y'.
df['Y1'] = df.apply(
lambda x: df.loc[
(
(df.X2 < x.X2)
& (df.X2 >= (x.X2 + pd.DateOffset(days=-4)))
& (df.X3 == x.X3)
& (df.X4 == x.X4)
),
"Y",
].mean(),
axis=1,
)
print(df)
X2 X3 X4 Y Y1
0 2019-02-01 1 1 1 NaN
1 2019-02-02 2 2 0 NaN
2 2019-02-02 2 3 0 NaN
3 2019-02-02 2 1 1 NaN
4 2019-02-03 1 2 1 NaN
5 2019-02-04 2 3 0 0.0
6 2019-02-05 1 1 1 1.0
7 2019-02-06 2 2 0 0.0
8 2019-02-07 1 3 1 NaN
9 2019-02-08 2 1 1 NaN
10 2019-02-09 1 2 0 NaN
11 2019-02-10 2 3 1 NaN
12 2019-02-11 1 3 0 1.0
13 2019-02-12 2 2 1 NaN
14 2019-02-13 1 3 0 0.0
15 2019-02-14 2 1 1 NaN
16 2019-02-15 2 2 1 1.0
17 2019-02-16 2 3 0 NaN
18 2019-02-17 1 1 1 NaN
19 2019-02-18 2 2 0 1.0
The Y1 result has float dtype, since np.nan is not compatible with an integer series. If you need the values displayed as integers, use the following workaround (note that it converts the column to strings):
col = 'Y1'
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)
print(df)
X2 X3 X4 Y Y1
0 2019-02-01 1 1 1 NaN
1 2019-02-02 2 2 0 NaN
2 2019-02-02 2 3 0 NaN
3 2019-02-02 2 1 1 NaN
4 2019-02-03 1 2 1 NaN
5 2019-02-04 2 3 0 0
6 2019-02-05 1 1 1 1
7 2019-02-06 2 2 0 0
8 2019-02-07 1 3 1 NaN
9 2019-02-08 2 1 1 NaN
10 2019-02-09 1 2 0 NaN
11 2019-02-10 2 3 1 NaN
12 2019-02-11 1 3 0 1
13 2019-02-12 2 2 1 NaN
14 2019-02-13 1 3 0 0
15 2019-02-14 2 1 1 NaN
16 2019-02-15 2 2 1 1
17 2019-02-16 2 3 0 NaN
18 2019-02-17 1 1 1 NaN
19 2019-02-18 2 2 0 1
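Alternatively, recent pandas versions offer a nullable integer dtype that can hold integers and missing values together; a sketch (assuming a pandas version new enough to cast a float column with NaN directly):
# 'Int64' (capital I) is pandas' nullable integer dtype; missing values become <NA>
df['Y1'] = df['Y1'].astype('Int64')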
EDIT
Follow-up question: how to apply the above daily with new data, without recomputing the old data?
You just need to filter your data to the date range you want to include.
Create a startdate as a datetime:
startdate = pd.to_datetime('2019-02-13')
Modify the apply function by adding an if condition:
df['Y1'] = df.apply(
lambda x: (df.loc[
(
(df.X2 < x.X2)
& (df.X2 >= (x.X2 + pd.DateOffset(days=-4)))
& (df.X3 == x.X3)
& (df.X4 == x.X4)
),
"Y",
].mean()) if x['X2'] >= startdate else x['Y1']
, axis=1
)
Note that this will only work after the first time you run the apply statement; before that, Y1 does not exist yet and you will get an out-of-index error. So run it first without the if condition, and thereafter run it with the if condition.
I have a data frame like this:
A B C D
0 1 0 nan nan
1 8 0 nan nan
2 8 1 nan nan
3 2 1 nan nan
4 0 0 nan nan
5 1 1 nan nan
and i have a dictionary like this:
dc = {'C': 5, 'D' : 10}
I want to fill the NaN values in the data frame with the dictionary, but only for the cells in which the column B value is 0. I want to obtain this:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 nan nan
3 2 1 nan nan
4 0 0 5 10
5 1 1 nan nan
I know how to subset the dataframe, but I can't find a way to fill the values with the dictionary; any ideas?
You could use fillna with loc and pass your dict to it:
In [13]: df.loc[df.B==0,:].fillna(dc)
Out[13]:
A B C D
0 1 0 5 10
1 8 0 5 10
4 0 0 5 10
To do it for your dataframe, you need to slice with the same mask and assign the result above back to it:
df.loc[df.B==0, :] = df.loc[df.B==0,:].fillna(dc)
In [15]: df
Out[15]:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 NaN NaN
3 2 1 NaN NaN
4 0 0 5 10
5 1 1 NaN NaN
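A column-wise sketch of the same idea, looping over the dictionary explicitly (df and dc as above):
mask = df['B'].eq(0)
for col, val in dc.items():
    # fill only the NaN cells of this column on rows where B == 0
    df.loc[mask & df[col].isna(), col] = val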