Pandas: vectorized operations on maximum values per row - python

I have the following pandas dataframe df:
index A B C
1 1 2 3
2 9 5 4
3 7 12 8
... ... ... ...
I want the maximum value of each row to remain unchanged, and all the other values to become -1. The output would thus look like this:
index A B C
1 -1 -1 3
2 9 -1 -1
3 -1 12 -1
... ... ... ...
By using df.max(axis = 1), I get a pandas Series with the maximum values per row. However, I'm not sure how to use these maximums optimally to create the result I need. I'm looking for a vectorized, fast implementation.

Consider using where:
>>> df.where(df.eq(df.max(1), 0), -1)
A B C
index
1 -1 -1 3
2 9 -1 -1
3 -1 12 -1
Here df.eq(df.max(1), 0) is a boolean DataFrame marking the row maximums; True values (the maximums) are left untouched whereas False values become -1. You can also use a Series or another DataFrame instead of a scalar if you like.
The operation can also be done inplace (by passing inplace=True).
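For instance, a minimal sketch of the in-place variant, rebuilding the example frame from the question (column and index names as above):
import pandas as pd

df = pd.DataFrame({'A': [1, 9, 7], 'B': [2, 5, 12], 'C': [3, 4, 8]},
                  index=pd.Index([1, 2, 3], name='index'))

# Keep only the row-wise maximums; every other cell becomes -1, modifying df directly.
df.where(df.eq(df.max(axis=1), axis=0), -1, inplace=True)
print(df)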

You can create a boolean mask by comparing with eq against the row-wise max, then assign through the inverted mask:
print(df)
A B C
index
1 1 2 3
2 9 5 4
3 7 12 8
print(df.max(axis=1))
index
1 3
2 9
3 12
dtype: int64
mask = df.eq(df.max(axis=1), axis=0)
print(mask)
A B C
index
1 False False True
2 True False False
3 False True False
df[~mask] = -1
print(df)
A B C
index
1 -1 -1 3
2 9 -1 -1
3 -1 12 -1
All together:
df[~df.eq(df.max(axis=1), axis=0)] = -1
print(df)
A B C
index
1 -1 -1 3
2 9 -1 -1
3 -1 12 -1
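If you prefer not to mutate df, the same mask works with DataFrame.mask, which replaces values where the condition is True; a small sketch using the mask computed above:
mask = df.eq(df.max(axis=1), axis=0)
df2 = df.mask(~mask, -1)   # returns a new DataFrame; df is left unchanged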

Create a new DataFrame the same size as df, filled with -1. Then use enumerate over np.argmax to find the (first) column holding the maximum in each row, and copy that value across using integer-based scalar access (iat).
df2 = pd.DataFrame(-np.ones(df.shape), columns=df.columns, index=df.index)
for row, col in enumerate(np.argmax(df.values, axis=1)):
    df2.iat[row, col] = df.iat[row, col]
>>> df2
A B C
index
1 -1.0 -1.0 3.0
2 9.0 -1.0 -1.0
3 -1.0 12.0 -1.0
Timings
df = pd.DataFrame(np.random.randn(10000, 10000))
%%timeit
df2 = pd.DataFrame(-np.ones(df.shape))
for row, col in enumerate(np.argmax(df.values, axis=1)):
    df2.iat[row, col] = df.iat[row, col]
1 loops, best of 3: 1.19 s per loop
%timeit df.where(df.eq(df.max(1), 0), -1)
1 loops, best of 3: 6.27 s per loop
# Using inplace=True
%timeit df.where(df.eq(df.max(1), 0), -1, inplace=True)
1 loops, best of 3: 5.58 s per loop
%timeit df[~df.eq(df.max(axis=1), axis=0)] = -1
1 loops, best of 3: 5.65 s per loop
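Dropping to NumPy and broadcasting the row maximums is another option worth timing; a sketch (df3 is just an illustrative name), which, like the mask-based answers, keeps all tied maximums in a row rather than only the first:
vals = df.values
df3 = pd.DataFrame(np.where(vals == vals.max(axis=1, keepdims=True), vals, -1),
                   index=df.index, columns=df.columns)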

Related

How to use lambda function on a pandas data frame via map/apply where lambda takes different values for each column

The idea is to transform a data frame in the fastest way according to the values specific to each column.
For simplicity, here is an example where each element of a column is compared to the mean of the column it belongs to, and replaced with 1 if greater than mean(column) or 0 otherwise.
In [26]: df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
In [27]: df
Out[27]:
0 1 2
0 1 2 3
1 4 5 6
In [28]: df.mean().values.tolist()
Out[28]: [2.5, 3.5, 4.5]
The snippet below is not real code, just an illustration of the desired behavior. I used the apply method, but it can be whatever works fastest.
In [29]: f = lambda x: 0 if x < means else 1
In [30]: df.apply(f)
In [27]: df
Out[27]:
0 1 2
0 0 0 0
1 1 1 1
This is a toy example but the solution has to be applied to a big data frame, therefore, it has to be fast.
Cheers!
You can create a boolean mask of the dataframe by comparing each element with the mean of that column. It can be easily achieved using
df > df.mean()
0 1 2
0 False False False
1 True True True
Since True equates to 1 and False to 0, a boolean dataframe can be easily converted to integer using astype.
(df > df.mean()).astype(int)
0 1 2
0 0 0 0
1 1 1 1
If you need the output to be strings rather than 0 and 1, use np.where, which works as np.where(condition, value_if_true, value_if_false):
pd.DataFrame(np.where(df > df.mean(), 'm', 'n'))
0 1 2
0 n n n
1 m m m
Edit: addressing the question in the comments: what if 'm' and 'n' are column-dependent?
df = pd.DataFrame(np.arange(12).reshape(4,3))
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
pd.DataFrame(np.where(df > df.mean(), df.min(), df.max()))
0 1 2
0 9 10 11
1 9 10 11
2 0 1 2
3 0 1 2
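Going back to the original mean comparison, the apply route mentioned in the question can also be written per column, since apply hands each column to the function as a Series; a minimal sketch, typically slower than the vectorized mask above:
df.apply(lambda col: (col > col.mean()).astype(int))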

Checking condition in future rows in pandas with group by

Following is what my dataframe looks like and Expected_Output is my desired column.
Group Signal Value1 Value2 Expected_Output
0 1 0 3 1 NaN
1 1 1 4 2 NaN
2 1 0 7 4 NaN
3 1 0 8 9 1.0
4 1 0 5 3 NaN
5 2 1 3 6 NaN
6 2 1 1 2 1.0
7 2 0 3 4 1.0
For a given Group, if Signal == 1, then I am attempting to look at the next three rows (and not the current row) and check if Value1 < Value2. If that condition is true, then I return a 1 in the Expected_Output column. If, for example, the Value1 < Value2 condition is satisfied for multiple reasons because a row comes within the next 3 rows of Signal == 1 in both row 5 and row 6 (Group 2), then I still return a 1 in Expected_Output.
I am assuming the right combination of a groupby object, np.where, any, and shift could be the solution, but I can't quite get there.
N.B.: Alexander pointed out a conflict in the comments. Ideally, a value being set due to a signal in a prior row will supersede the current-row rule conflict in a given row.
If you are going to be checking lots of previous rows, multiple shifts can quickly get messy, but here it's not too bad:
s = df.groupby('Group').Signal
condition = ((s.shift(1).eq(1) | s.shift(2).eq(1) | s.shift(3).eq(1))
             & df.Value1.lt(df.Value2))
df.assign(out=np.where(condition, 1, np.nan))
Group Signal Value1 Value2 out
0 1 0 3 1 NaN
1 1 1 4 2 NaN
2 1 0 7 4 NaN
3 1 0 8 9 1.0
4 1 0 5 3 NaN
5 2 1 3 6 NaN
6 2 1 1 2 1.0
7 2 0 3 4 1.0
If you're concerned about the performance of using so many shifts, I wouldn't worry too much, here's a sample on 1 million rows:
In [401]: len(df)
Out[401]: 960000
In [402]: %%timeit
...: s = df.groupby('Group').Signal
...:
...: condition = ((s.shift(1).eq(1) | s.shift(2).eq(1) | s.shift(3).eq(1))
...: & df.Value1.lt(df.Value2))
...:
...: np.where(condition, 1, np.nan)
...:
...:
94.5 ms ± 524 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
@Alexander identified a conflict in the rules; here is a version using a mask that fits that requirement:
s = (df.Signal.mask(df.Signal.eq(0)).groupby(df.Group)
       .ffill(limit=3).mask(df.Signal.eq(1)).fillna(0))
Now you can simply use this column along with your other condition:
np.where((s.eq(1) & df.Value1.lt(df.Value2)).astype(int), 1, np.nan)
array([nan, nan, nan, 1., nan, nan, nan, 1.])
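A variation on the original shift approach (without the conflict handling) that avoids chaining shifts is to take a rolling maximum over the shifted signal within each group; a sketch assuming the same column names as above:
prior = (df.groupby('Group')['Signal']
           .transform(lambda s: s.shift(1).rolling(3, min_periods=1).max()))
df.assign(out=np.where(prior.eq(1) & df.Value1.lt(df.Value2), 1, np.nan))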
You can create an index that matches your criteria, and then use it to set the expected output to 1.
It is not clear how to treat the expected output when the rules conflict. For example, on row 6, the expected output would be 1 because it satisfied the signal criteria from row five and fits 'the subsequent three rows where value 1 < value 2'. However, it possibly conflicts with the rule that the first signal row is ignored.
idx = (df
       .assign(
           grp=df['Signal'].eq(1).cumsum(),
           cond=df.eval('Value1 < Value2'))
       .pipe(lambda df: df[df['grp'] > 0])    # Ignore data preceding the first signal.
       .groupby(['Group', 'grp'], as_index=False)
       .apply(lambda df: df.iloc[1:4, :])     # Ignore the current row, take the next three rows.
       .pipe(lambda df: df[df['cond']])       # Keep rows where the condition is met.
       .index.get_level_values(1)
       )
df['Expected_Output'] = np.nan
df.loc[idx, 'Expected_Output'] = 1
>>> df
Group Signal Value1 Value2 Expected_Output
0 1 0 3 1 NaN
1 1 1 4 2 NaN
2 1 0 7 4 NaN
3 1 0 8 9 1.0
4 1 0 5 3 NaN
5 2 1 3 6 NaN
6 2 1 1 2 NaN # <<< Intended difference vs. "expected"
7 2 0 3 4 1.0

Return All Values of Column A and Put them in Column B until Specific Value Is reached

I am still having trouble with this, and nothing seems to work for me. I have a data frame with two columns. I am trying to return all of the values in column A in a new column, B. However, I want to loop through column A and stop returning those values, and instead return 0, once the cumulative sum reaches 8 or the next value would make it greater than 8.
df (max_val = 8)
A
1
2
2
3
4
5
1
The output should look something like this
df (max_val = 8)
A B
1 1
2 2
2 2
3 3
4 0
5 0
1 0
I thought something like this
def func(x):
    if df['A'].cumsum() <= max_val:
        return x
    else:
        return 0
This doesn't work:
df['B'] = df['A'].apply(func, axis =1 )
Neither does this:
df['B'] = func(df['A'])
You can use Series.where:
df['B'] = df['A'].where(df['A'].cumsum() <= max_val, 0)
print (df)
A B
0 1 1
1 2 2
2 2 2
3 3 3
4 4 0
5 5 0
6 1 0
Approach #1 One approach using np.where -
df['B']= np.where((df.A.cumsum()<=max_val), df.A ,0)
Sample output -
In [145]: df
Out[145]:
A B
0 1 1
1 2 2
2 2 2
3 3 3
4 4 0
5 5 0
6 1 0
Approach #2 Another using array-initialization -
def app2(df, max_val):
    a = df.A.values
    colB = np.zeros(df.shape[0], dtype=a.dtype)
    idx = np.searchsorted(a.cumsum(), max_val, 'right')
    colB[:idx] = a[:idx]
    df['B'] = colB
Runtime test
Seems like @jezrael's Series.where-based solution is the closest one, so timing against it on a bigger dataset -
In [293]: df = pd.DataFrame({'A':np.random.randint(0,9,(1000000))})
In [294]: max_val = 1000000
# #jezrael's soln
In [295]: %timeit df['B1'] = df['A'].where(df['A'].cumsum() <= max_val, 0)
100 loops, best of 3: 8.22 ms per loop
# Proposed in this post
In [296]: %timeit df['B2']= np.where((df.A.cumsum()<=max_val), df.A ,0)
100 loops, best of 3: 6.45 ms per loop
# Proposed in this post
In [297]: %timeit app2(df, max_val)
100 loops, best of 3: 4.47 ms per loop
df['B']=[x if x<=8 else 0 for x in df['A'].cumsum()]
df
Out[7]:
A B
0 1 1
1 2 3
2 2 5
3 3 8
4 4 0
5 5 0
6 1 0
Why don't you accumulate the values into a variable like this:
A = 0
B = []
for x in df['A']:
    A = A + x
    B.append(x if A <= max_val else 0)
df['B'] = B
Splitting it into multiple lines:
import pandas as pd
A=[1,2,2,3,4,5,1]
MAXVAL=8
df=pd.DataFrame(data=A,columns=['A'])
df['cumsumA']=df['A'].cumsum()
df['B'] = df['A'] * (df['cumsumA'] <= MAXVAL).astype(int)
You can then drop the 'cumsumA' column
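For example, a one-liner for that (the columns keyword assumes a reasonably recent pandas):
df = df.drop(columns=['cumsumA'])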
The below will work fine -
import numpy as np
max_val = 8
df['B'] = np.where(df['A'].cumsum() <= max_val , df['A'],0)
I hope this helps.
just a way to do it with .loc:
df['c'] = df['a'].cumsum()
df['b'] = df['a']
df.loc[df['c'] > 8, 'b'] = 0

When slicing a 1 row pandas dataframe the slice becomes a series

Why, when I slice a pandas DataFrame containing only 1 row, does the slice become a pandas Series?
How can I keep it a dataframe?
df=pd.DataFrame(data=[[1,2,3]],columns=['a','b','c'])
df
Out[37]:
a b c
0 1 2 3
a=df.iloc[0]
a
Out[39]:
a 1
b 2
c 3
Name: 0, dtype: int64
To avoid the intermediate step of re-converting back to a DataFrame, use double brackets when indexing:
a = df.iloc[[0]]
print(a)
a b c
0 1 2 3
Speed:
%timeit df.iloc[[0]]
192 µs per loop
%timeit df.loc[0].to_frame().T
468 µs per loop
Or you can slice by index
a=df.iloc[df.index==0]
a
Out[1782]:
a b c
0 1 2 3
Use to_frame() and T to transpose:
df.loc[0].to_frame()
0
a 1
b 2
c 3
and
df.loc[0].to_frame().T
a b c
0 1 2 3
OR
Option #2 use double brackets [[]]
df.iloc[[0]]
a b c
0 1 2 3
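A related option: slicing with a range of positions instead of a single position also keeps the result two-dimensional:
a = df.iloc[0:1]   # a one-row DataFrame, not a Series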

How to drop rows from a data frame if all columns equal to each other?

I understand how to drop rows if a given column is equal to some value, e.g.
df = df.drop(df[<some boolean condition>].index)
but how do you drop a row if the columns are all equal to each other? Is there a way to do this without specifying the column names?
You can use the apply method to loop through the rows and build a boolean Series indicating whether each row contains more than one unique value, then use that Series to remove the matching rows:
df[df.apply(lambda r: r.nunique() != 1, 1)]
df = pd.DataFrame({"A": [1,2,3,3,3,4,5], "B": [1,3,4,4,3,5,1]})
In [867]:
df[df.apply(lambda r: r.nunique() != 1, 1)]
Out[867]:
A B
1 2 3
2 3 4
3 3 4
5 4 5
6 5 1
You can just compare the first column against the entire df using .eq with axis=0, call all on the result, and invert it using ~:
In [158]:
df = pd.DataFrame({'a':np.arange(5), 'b':[0,0,2,2,4]})
df
Out[158]:
a b
0 0 0
1 1 0
2 2 2
3 3 2
4 4 4
In [159]:
df[~df.eq(df['a'], axis=0).all(axis=1)]
Out[159]:
a b
1 1 0
3 3 2
If you look at the boolean mask:
In [160]:
df.eq(df['a'], axis=0)
Out[160]:
a b
0 True True
1 True False
2 True True
3 True False
4 True True
You can see it is True for the rows that meet the condition, so calling all(axis=1) returns a 1-D boolean mask:
In [161]:
df.eq(df['a'], axis=0).all(axis=1)
Out[161]:
0 True
1 False
2 True
3 False
4 True
dtype: bool
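A fully vectorized alternative to the apply version above is to compute nunique row-wise directly; a short sketch:
df[df.nunique(axis=1) > 1]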
