I would like to create a new column each time a 1 appears in the 'Signal' column; the new column should carry the corresponding value from the 'Value' column forward from that row on (please see the expected output below).
Initial data:
Index Value Signal
0 3 0
1 8 0
2 8 0
3 7 1
4 9 0
5 10 0
6 14 1
7 10 0
8 10 0
9 4 1
10 10 0
11 10 0
Expected Output:
Index Value Signal New_Col_1 New_Col_2 New_Col_3
0 3 0 0 0 0
1 8 0 0 0 0
2 8 0 0 0 0
3 7 1 7 0 0
4 9 0 7 0 0
5 10 0 7 0 0
6 14 1 7 14 0
7 10 0 7 14 0
8 10 0 7 14 0
9 4 1 7 14 4
10 10 0 7 14 4
11 10 0 7 14 4
What would be a way to do it?
You can use a pivot:
out = df.join(df
    # keep only the values where signal is 1
    # and get Signal's cumsum
    .assign(val=df['Value'].where(df['Signal'].eq(1)),
            col=df['Signal'].cumsum()
            )
    # pivot cumsumed Signal to columns
    .pivot(index='Index', columns='col', values='val')
    # ensure column 0 is absent (using loc to avoid KeyError)
    .loc[:, 1:]
    # forward fill the values
    .ffill()
    # rename columns
    .add_prefix('New_Col_')
)
output:
Index Value Signal New_Col_1 New_Col_2 New_Col_3
0 0 3 0 NaN NaN NaN
1 1 8 0 NaN NaN NaN
2 2 8 0 NaN NaN NaN
3 3 7 1 7.0 NaN NaN
4 4 9 0 7.0 NaN NaN
5 5 10 0 7.0 NaN NaN
6 6 14 1 7.0 14.0 NaN
7 7 10 0 7.0 14.0 NaN
8 8 10 0 7.0 14.0 NaN
9 9 4 1 7.0 14.0 4.0
10 10 10 0 7.0 14.0 4.0
11 11 10 0 7.0 14.0 4.0
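The output above has NaN where the expected output has 0. A minimal sketch of one way to close that gap, assuming the sample data from the question: the same pivot chain with `fillna(0)` and `astype(int)` appended. The variable names `is_signal` and `new_cols` are illustrative only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Index": range(12),
    "Value": [3, 8, 8, 7, 9, 10, 14, 10, 10, 4, 10, 10],
    "Signal": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

is_signal = df["Signal"].eq(1)
new_cols = (
    df.assign(val=df["Value"].where(is_signal),   # keep Value only on signal rows
              col=df["Signal"].cumsum())          # running signal count = target column
      .pivot(index="Index", columns="col", values="val")
      .loc[:, 1:]                                 # drop the "no signal yet" column 0
      .ffill()                                    # carry each value down
      .fillna(0)                                  # rows before the signal become 0
      .astype(int)
      .add_prefix("New_Col_")
)
out = df.join(new_cols)
print(out)
```

The integer cast is only safe once every NaN has been filled, which is why `fillna(0)` comes before `astype(int)`.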
# build a column label from the running count of signals
df['new_col'] = 'new_col_' + df['Signal'].cumsum().astype(str)
# rows without a signal get the placeholder label '0'
df['new_col'] = df['new_col'].mask(df['Signal']==0, '0')
# pivot table
df2 = (df.pivot(index=['Index','Signal', 'Value'], columns='new_col', values='Value')
       .reset_index()
       .ffill().fillna(0)  # forward fill, then fill the remaining NaN with 0
       .drop(columns=['0','Index'])  # drop the extra columns
       .rename_axis(columns={'new_col':'Index'})  # rename the axis
       .astype(int))  # change values to int, removing decimals
df2
Index Signal Value new_col_1 new_col_2 new_col_3
0 0 3 0 0 0
1 0 8 0 0 0
2 0 8 0 0 0
3 1 7 7 0 0
4 0 9 7 0 0
5 0 10 7 0 0
6 1 14 7 14 0
7 0 10 7 14 0
8 0 10 7 14 0
9 1 4 7 14 4
10 0 10 7 14 4
11 0 10 7 14 4
Related
I have the following dataframe
print(A)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
And I have the following code (from "Pandas Dataframe count occurrences that only happen immediately"), which counts runs of identical values that occur immediately one after another.
ser = A["1or0"].ne(A["1or0"].shift().bfill()).cumsum()
B = (
A.groupby(ser, as_index=False)
.agg({"Index": ["first", "last", "count"],
"1or0": "unique"})
.set_axis(["StartNum", "EndNum", "Size", "Value"], axis=1)
.assign(Value= lambda d: d["Value"].astype(str).str.strip("[]"))
)
print(B)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
The issue is that when NaN values occur, the code doesn't group them into one interval; it always counts each NaN as its own interval of size 1 instead of, e.g., one interval of size 3.
print(A2)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
10 11 NaN
11 12 NaN
12 13 NaN
print(B2)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 11 1 NaN
5 12 12 1 NaN
6 13 13 1 NaN
But I want B2 to be the following
print(B2Wanted)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 13 3 NaN
What do I need to change so that it works also with NaN?
First fill the NaNs with a value that cannot occur in the data (here -1) before creating your grouper:
group = A['1or0'].fillna(-1).diff().ne(0).cumsum()
# or
# s = A['1or0'].fillna(-1)
# group = s.ne(s.shift()).cumsum()
B = (A.groupby(group, as_index=False)
.agg(**{'StartNum': ('Index', 'first'),
'EndNum': ('Index', 'last'),
'Size': ('1or0', 'size'),
'Value': ('1or0', 'first')
})
)
Output:
StartNum EndNum Size Value
0 1 3 3 0.0
1 4 7 4 1.0
2 8 8 1 0.0
3 9 10 2 1.0
4 11 13 3 NaN
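The reason the sentinel is needed is that NaN never compares equal to NaN, so every NaN row looks like the start of a new run. A small sketch, assuming the A2 data from the question, that compares the run ids with and without the fill (the names `without_fill` and `with_fill` are illustrative only):

```python
import numpy as np
import pandas as pd

# A2 from the question: ten 0/1 values followed by three NaNs
A = pd.DataFrame({
    "Index": range(1, 14),
    "1or0": [0, 0, 0, 1, 1, 1, 1, 0, 1, 1, np.nan, np.nan, np.nan],
})

raw = A["1or0"]
# NaN != NaN is True, so each NaN opens its own run
without_fill = raw.ne(raw.shift()).cumsum()

# Replace NaN by a sentinel that cannot occur in the data, then compare
filled = raw.fillna(-1)
with_fill = filled.ne(filled.shift()).cumsum()

print(without_fill.tolist())
print(with_fill.tolist())
```

Because only the grouper is built from the filled series, the aggregated `Value` column still reports NaN for the last run.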
I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each value contains the count of non-NaN values in the same column up to and including that row. The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following code:
for i in range(len(df)):
df.iloc[i] = df.iloc[0:i].isna().sum()
However, I can only do so with an individual column. My real DataFrame contains thousands of columns so iterating between them is impossible due to the low processing speed. What can I do? Maybe it should be something related to using the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna for the non-NaN values and cumsum for the counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
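For anyone who wants to run this, here is a self-contained sketch that rebuilds the sample frame from the question (the construction of `df` is an assumption based on the printed table) and applies the same one-liner:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame reconstructed from the table in the question
df = pd.DataFrame({
    "a": [nan, nan, nan, 7, 3, 5, 7, 8, nan, nan],
    "b": [8, 7, 5, 3, 5, 4, 1, 9, 5, 6],
    "c": [nan] * 7 + [3, 5, 4],
})

# notna() gives True/False per cell; cumsum() treats True as 1, so the
# running total is the number of non-NaN values seen so far in each column
out = df.notna().cumsum()
print(out)
```

Both steps are column-wise vectorized operations, so no explicit loop over rows or columns is involved.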
Check with notna with cumsum
out = df.notna().cumsum()
Out[220]:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have a pandas dataframe with, let's say, two columns, for example:
value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0
Now I want to add a third column (new_boolean) with the following criteria:
I specify a period, for this example period = 4.
Now I take a look at all rows where boolean == 1.
new_boolean will be 1 for the row holding the maximum value among the last period rows (up to and including the current one).
For example, I have boolean == 1 for row 2 (counting rows from 1). So I look at the last period rows; only two exist, with values [1, 5]. 5 is the maximum, so new_boolean in row 2 will be 1.
Second example: row 8 (value = 7): I get values [7, 4, 12, 9]; 12 is the maximum, so new_boolean in the row with value 12 will be 1.
result:
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
How can I do this algorithmically?
Compute the rolling max of the 'value' column
>>> rolling_max_value = df.rolling(window=4, min_periods=1)['value'].max()
>>> rolling_max_value
0 1.0
1 5.0
2 5.0
3 5.0
4 9.0
5 12.0
6 12.0
7 12.0
8 12.0
9 8.0
10 17.0
11 17.0
12 17.0
Name: value, dtype: float64
Select only the relevant values, i.e. where 'boolean' = 1
>>> on_values = rolling_max_value[df.boolean == 1].unique()
>>> on_values
array([ 5., 9., 12., 17.])
The rows where 'new_boolean' = 1 are the ones where 'value' belongs to on_values
>>> df['new_boolean'] = df.value.isin(on_values).astype(int)
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
EDIT:
OP raised a good point
Does this also work if I have multiple columns with the same value and they have different booleans?
The previous solution doesn't account for that. To solve this, instead of computing the rolling max, we gather the row labels associated with the rolling max values, i.e. the rolling argmax or idxmax. To my knowledge, Rolling objects don't have an idxmax method, but we can easily compute it via apply.
def idxmax(values):
    return values.idxmax()

rolling_idxmax_value = (
    df.rolling(min_periods=1, window=4)['value']
    .apply(idxmax)
    .astype(int)
)
on_idx = rolling_idxmax_value[df.boolean == 1].unique()
df['new_boolean'] = 0
df.loc[on_idx, 'new_boolean'] = 1
Results:
>>> rolling_idxmax_value
0 0
1 1
2 1
3 1
4 4
5 5
6 5
7 5
8 5
9 8
10 10
11 10
12 10
Name: value, dtype: int64
>>> on_idx
[ 1 4 5 10]
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
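To make the difference between the two approaches concrete, here is a small sketch on made-up data (not from the question) in which the value 5 appears twice but only the second occurrence lies inside a flagged window. The column names `by_value` and `by_label` are illustrative only.

```python
import pandas as pd

period = 4
# Hypothetical data: 5 occurs at rows 0 and 5, only row 5 has boolean == 1
df = pd.DataFrame({
    "value":   [5, 0, 0, 0, 0, 5, 0],
    "boolean": [0, 0, 0, 0, 0, 1, 0],
})
flagged = df["boolean"].eq(1)
roll = df["value"].rolling(window=period, min_periods=1)

# Value-based approach: marks every row whose value equals a flagged rolling max
on_values = roll.max()[flagged].unique()
df["by_value"] = df["value"].isin(on_values).astype(int)

# Label-based approach: marks only the row that actually produced the max
roll_idx = roll.apply(lambda w: w.idxmax(), raw=False).astype(int)
on_idx = roll_idx[flagged].unique()
df["by_label"] = 0
df.loc[on_idx, "by_label"] = 1

print(df)
```

Row 0 is marked by the value-based version even though it is outside the window of row 5; the label-based version marks row 5 only. Note that `raw=False` is what makes each window arrive as a Series carrying the original row labels.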
I did this in 2 steps, but I think the solution is much clearer:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO('''
id value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0'''), sep=r'\s+', index_col=0)
df['new_bool'] = df['value'].rolling(min_periods=1, window=4).max()
df['new_bool'] = df.apply(lambda x: 1 if ((x['value'] == x['new_bool']) & (x['boolean'] == 1)) else 0, axis=1)
df
Result:
value boolean new_bool
id
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 0
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 0
11 15 1 0
12 6 0 0
I have a dataframe as follows:
data=[[0,1,5],
[0,1,6],
[0,0,8],
[0,0,10],
[0,1,12],
[0,0,14],
[0,1,16],
[0,1,18],
[1,0,2],
[1,1,0],
[1,0,1],
[1,0,2]]
df = pd.DataFrame(data,columns=['KEY','COND','VAL'])
For RES1, I want to create a counter variable that increments wherever COND == 1. For the first row of each KEY group, RES1 stays the same as VAL. (Can I use cumcount() in some way?)
For RES2, I then just want to fill the missing values with the previous value (I am thinking of df.fillna(method='ffill')).
    KEY  COND  VAL  RES1  RES2
0     0     1    5     5     5
1     0     1    6     6     6
2     0     0    8           6
3     0     0   10           6
4     0     1   12     7     7
5     0     0   14           7
6     0     1   16     8     8
7     0     1   18     9     9
8     1     0    2     2     2
9     1     1    0     3     3
10    1     0    1           3
11    1     0    2           3
The aim is to find a vectorized solution that performs well over a million rows.
IIUC
con = (df.COND==1) | (df.index.isin(df.drop_duplicates('KEY').index))
# parentheses are required so the expression can span several lines
df['res1'] = (df.groupby('KEY').VAL.transform('first')
              + df.groupby('KEY').COND.cumsum()[con]
              - df.groupby('KEY').COND.transform('first'))
df['res2'] = df.res1.ffill()
df
Out[148]:
KEY COND VAL res1 res2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
You want:
s = (df[df.KEY.duplicated()] # Ignore first row in each KEY group
.groupby('KEY').COND.cumsum() # Counter within KEY
.add(df.groupby('KEY').VAL.transform('first')) # Add first value per KEY
.where(df.COND.eq(1)) # Set only where COND == 1
.add(df.loc[~df.KEY.duplicated(), 'VAL'], fill_value=0) # Set 1st row by KEY
)
df['RES1'] = s
df['RES2'] = df['RES1'].ffill()
KEY COND VAL RES1 RES2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
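As a cross-check, here is an alternative sketch that is not taken from either answer. It assumes the rule is "first VAL of the KEY plus the number of COND == 1 rows seen so far, not counting the first row of the group", computes RES2 directly, and derives RES1 by masking. The helper names `first` and `step` are illustrative only.

```python
import pandas as pd

data = [[0, 1, 5], [0, 1, 6], [0, 0, 8], [0, 0, 10], [0, 1, 12], [0, 0, 14],
        [0, 1, 16], [0, 1, 18], [1, 0, 2], [1, 1, 0], [1, 0, 1], [1, 0, 2]]
df = pd.DataFrame(data, columns=["KEY", "COND", "VAL"])

first = ~df["KEY"].duplicated()        # True on the first row of each KEY
step = df["COND"].where(~first, 0)     # the first row never increments the counter

# Start value per KEY plus the running count of increments within the KEY
res2 = df.groupby("KEY")["VAL"].transform("first") + step.groupby(df["KEY"]).cumsum()
# RES1 is only defined on the first row of a KEY and on COND == 1 rows
res1 = res2.where(first | df["COND"].eq(1))

df["RES1"] = res1
df["RES2"] = res2
print(df)
```

Computing RES2 first avoids the forward fill, so a NaN can never leak from one KEY group into the next.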
I want to do a cumulative sum on a pandas dataframe without carrying the sum over into the trailing zero rows. For example, given a dataframe:
A B
1 1 2
2 5 0
3 10 0
4 10 1
5 0 1
6 5 2
7 0 0
8 0 0
9 0 0
The expected result is the cumulative sum over index 1 to 6 only:
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
If you don't want to apply cumsum to the trailing rows that are 0 in all columns:
Test which rows contain any non-zero value, shift that mask and take its cumulative sum. Then compare the result against its maximum and use that as a filter:
a = df.ne(0).any(axis=1).shift().cumsum()
m = a != a.max()
df[m] = df[m].cumsum()
print (df)
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
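The same "everything up to the last non-zero row" mask can also be built with a reversed cumulative maximum. This is an alternative sketch, not the method used above; `nonzero_row` and `keep` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 5, 10, 10, 0, 5, 0, 0, 0],
     "B": [2, 0, 0, 1, 1, 2, 0, 0, 0]},
    index=range(1, 10),
)

# 1 where the row has at least one non-zero entry, else 0
nonzero_row = df.ne(0).any(axis=1).astype(int)
# Scanning from the bottom, cummax turns to 1 at the last non-zero row and
# stays 1 for every row above it; flipping back restores the original order
keep = nonzero_row[::-1].cummax()[::-1].astype(bool)

out = df.cumsum()
out.loc[~keep, :] = 0   # trailing all-zero rows stay zero
print(out)
```

A zero row in the middle of the data is still included in the running sum here, because only the trailing block is masked out.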
A similar solution if you want to process each column separately; just omit any:
print (df)
A B
1 1 2
2 5 0
3 10 0
4 10 1
5 0 1
6 5 0
7 0 0
8 0 0
9 0 0
a = df.ne(0).shift().cumsum()
m = a != a.max()
df[m] = df[m].cumsum()
print (df)
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 0
7 0 0
8 0 0
9 0 0
Use
In [262]: s = df.ne(0).all(axis=1)
In [263]: l = s[s].index[-1]
In [264]: df[:l] = df.cumsum()
In [265]: df
Out[265]:
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
I will use last_valid_index
v=df.replace(0,np.nan).apply(lambda x : x.last_valid_index())
df[pd.DataFrame(df.index.values<=v.values[:,None],columns=df.index,index=df.columns).T].cumsum().fillna(0)
Out[890]:
A B
1 1.0 2.0
2 6.0 2.0
3 16.0 2.0
4 26.0 3.0
5 26.0 4.0
6 31.0 6.0
7 0.0 0.0
8 0.0 0.0
9 0.0 0.0
To skip all rows after the first 0, 0 row, get the index of the first row where both df['A'] and df['B'] are 0 using idxmax(0):
>>> m = ((df["A"]==0) & (df["B"]==0)).idxmax(0)
>>> df[:m] = df[:m].cumsum()
>>> df
A B
0 1 2
1 6 2
2 16 2
3 26 3
4 26 4
5 31 6
6 0 0
7 0 0
8 0 0
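One caveat with idxmax on a boolean mask: if no row is all zeros, it returns the first label, so nothing would be summed. A hedged sketch of a positional variant with a guard for that case; the function name `cumsum_until_first_zero_row` is hypothetical.

```python
import pandas as pd

def cumsum_until_first_zero_row(df: pd.DataFrame) -> pd.DataFrame:
    """Cumulative sum of the rows before the first all-zero row."""
    zero_row = df.eq(0).all(axis=1).to_numpy()
    # Position of the first all-zero row, or len(df) when there is none
    stop = int(zero_row.argmax()) if zero_row.any() else len(df)
    out = df.copy()
    out.iloc[:stop] = df.iloc[:stop].cumsum().to_numpy()
    return out

df = pd.DataFrame({"A": [1, 5, 10, 10, 0, 5, 0, 0, 0],
                   "B": [2, 0, 0, 1, 1, 2, 0, 0, 0]})
result = cumsum_until_first_zero_row(df)

no_zero = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
result_no_zero = cumsum_until_first_zero_row(no_zero)
print(result)
```

Using `iloc` keeps the slicing positional, so the sketch does not depend on the index starting at 0.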