Boolean slicing by comparing one to many columns in pandas - python

How can I compare one column to all other columns and obtain a boolean Series to slice the DataFrame with loc?
import numpy as np
import pandas as pd
a = np.random.normal(1,10,(10,1))
b = np.random.normal(1,5,(10,1))
c = np.random.normal(1,5,(10,1))
d = np.random.normal(1,5,(10,1))
e = np.append(a,b, axis = 1)
e = np.append(e,c, axis = 1)
e = np.append(e,d, axis = 1)
df = pd.DataFrame(data = e, columns=['a','b','c','d'])
a b c d
0 4.043832 -1.672865 -0.401864 3.073481
1 4.828796 -0.830688 3.652347 -1.780346
2 13.055145 5.730707 -2.305093 -4.566279
3 6.589498 -0.525029 -1.077942 -3.850963
4 5.273932 -1.003112 0.393002 -0.415573
5 -7.872004 -2.506250 1.725281 6.676886
6 -4.797119 6.448990 0.254142 -7.374601
7 8.610763 8.075350 13.043584 12.768633
8 -10.871154 2.152322 2.093089 11.570059
9 -22.148239 1.493870 3.649696 2.455621
df.loc[df.a > df.b]
This gives the desired result, but only for a 1:1 comparison:
a b c d
0 4.043832 -1.672865 -0.401864 3.073481
1 4.828796 -0.830688 3.652347 -1.780346
2 13.055145 5.730707 -2.305093 -4.566279
3 6.589498 -0.525029 -1.077942 -3.850963
4 5.273932 -1.003112 0.393002 -0.415573
7 8.610763 8.075350 13.043584 12.768633
My Approach was like this:
S = ['b','c','d']
(df.a > df[S]).any(axis = 1)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
dtype: bool
Unfortunately, the Series is False for all rows. How can I solve this problem?

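Use lt with axis=0 so that df.a is aligned on the row index. In df.a > df[S], pandas instead tries to match the index of df.a against the column labels of df[S], so nothing lines up: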
df[S].lt(df.a,0).any(axis=1)
Out[808]:
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 True
9 True
dtype: bool

Given that you're using this for a mask, you could simply add another axis to the underlying ndarray to allow for broadcasting. This should be somewhat faster, depending on the size of your DataFrame.
(df[S].values < df.a.values[:,None]).any(1)
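A minimal, self-contained sketch of both masks on a small hand-made frame. The values are illustrative, not the random data from the question, and the sketch assumes a pandas version that provides to_numpy():

```python
import pandas as pd

# Illustrative data (not the random frame from the question)
df = pd.DataFrame({"a": [4, -7, 8],
                   "b": [-1, -2, 9],
                   "c": [0, 1, 13],
                   "d": [3, 6, 12]})
S = ["b", "c", "d"]

# axis=0 aligns df["a"] on the row index instead of the column labels
mask_lt = df[S].lt(df["a"], axis=0).any(axis=1)

# Same comparison on the raw arrays; [:, None] turns a into an (n, 1) column
# so that it broadcasts against the (n, 3) block of other columns
mask_np = (df[S].to_numpy() < df["a"].to_numpy()[:, None]).any(axis=1)

# Keep only the rows where a exceeds at least one of the other columns
print(df.loc[mask_lt])
```

Either mask can be passed to df.loc; the NumPy version skips index alignment, which is where any speed difference would come from.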

Related

Pandas: How to create a column that indicates when a value is present in another column a set number of rows in advance?

I'm trying to work out how to create a column with pandas that flags, X rows in advance, the next occurrence of a value in another column. In essence it should do the following (in this instance X = 3):
df
rowid event indicator
1 True 1 # Event occurs
2 False 0
3 False 0
4 False 1 # Starts indicator
5 False 1
6 True 1 # Event occurs
7 False 0
Currently I do this by iterating over every event row:
i = df.index[df['event']==True]
dfx = [df.index[z-X:z] for z in i]
df['indicator'][dfx]=1
df['indicator'].fillna(0)
However, this seems inefficient. Is there a more succinct way of achieving the example above? Thanks.
Here's a NumPy based approach using flatnonzero:
X = 3
# ndarray of indices where indicator should be set to one
nd_ixs = np.flatnonzero(df.event)[:,None] - np.arange(X-1, -1, -1)
# flatten the indices
ixs = nd_ixs.ravel()
# filter out negative indices and set the rest to 1
df['indicator'] = 0
df.loc[ixs[ixs>=0], 'indicator'] = 1
print(df)
rowid event indicator
0 1 True 1
1 2 False 0
2 3 False 0
3 4 False 1
4 5 False 1
5 6 True 1
6 7 False 0
Where nd_ixs is obtained through the broadcasted subtraction of the indices where event is True and an arange up to X:
print(nd_ixs)
array([[-2, -1, 0],
[ 3, 4, 5]], dtype=int64)
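The same idea as a self-contained sketch, assuming the seven-row frame from the question. This variant writes into a plain NumPy array by position rather than through df.loc, so it does not rely on the index being the default 0..n-1 range:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"rowid": range(1, 8),
                   "event": [True, False, False, False, False, True, False]})
X = 3

# Positions where an event occurs
pos = np.flatnonzero(df["event"].to_numpy())

# For each event, the X positions ending at the event itself
window = pos[:, None] - np.arange(X - 1, -1, -1)

# Flatten and drop positions that fall before the first row
ixs = window.ravel()
ixs = ixs[ixs >= 0]

# Build the indicator positionally and attach it to the frame
indicator = np.zeros(len(df), dtype=int)
indicator[ixs] = 1
df["indicator"] = indicator
print(df)
```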
A pandas and numpy solution:
# Make a variable shift:
def var_shift(series, X):
    return [series] + [series.shift(i) for i in range(-X + 1, 0, 1)]
X = 3
# Set indicator to default to 1
df["indicator"] = 1
# Use pd.Series.where and np.logical_or with the
# var_shift function to get a bool array, setting
# 0 when False
df["indicator"] = df["indicator"].where(
    np.logical_or.reduce(var_shift(df["event"], X)),
    0,
)
# rowid event indicator
# 0 1 True 1
# 1 2 False 0
# 2 3 False 0
# 3 4 False 1
# 4 5 False 1
# 5 6 True 1
# 6 7 False 0
In [77]: np.logical_or.reduce(var_shift(df["event"], 3))
Out[77]: array([True, False, False, True, True, True, nan], dtype=object)
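The nan in the last position comes from shift padding the end of the Series with missing values. A hedged variant of the same approach, assuming a pandas version where shift accepts fill_value, pads with False instead so the reduction stays a plain boolean array (var_shift_filled is a name made up for this sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"rowid": range(1, 8),
                   "event": [True, False, False, False, False, True, False]})
X = 3

def var_shift_filled(series, X):
    # The series itself plus copies shifted up by 1 .. X-1 rows,
    # padded with False so the dtype stays bool
    return [series] + [series.shift(i, fill_value=False)
                       for i in range(-X + 1, 0)]

# True wherever an event occurs on this row or within the next X-1 rows
mask = np.logical_or.reduce(var_shift_filled(df["event"], X))
df["indicator"] = mask.astype(int)
print(df)
```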

Pandas change values in a groupby

I've a df like
a flag
0 1 False
1 0 False
2 1 False
3 0 False
4 0 False
and let's say I want to randomly set some rows to True within every group of column a, in order to obtain
a flag
0 1 True
1 0 True
2 1 True
3 0 False
4 0 True
So far I'm able to do so with the following code
import pandas as pd
import numpy as np
def rndm_flag(ds, n):
    l = len(ds)
    n = min([l, n])
    vec = ds.sample(n).index
    ds["flag"] = np.where(ds.index.isin(vec),
                          True, ds["flag"])
    return(ds)
N = 5
df = pd.DataFrame({"a":np.random.randint(0,2,N),
"flag":[False]*N})
dfs = list(df.groupby("a"))
dfs = [x[1] for x in dfs]
df = pd.concat([rndm_flag(x, 2) for x in dfs])
df.sort_index(inplace=True)
But I'm wondering if there is an alternative (more elegant) way to do so.
This should give you some idea:
## create dataframe
df = pd.DataFrame({'a':[1,0,1,0,0], 'b':False})
## create flag
df['b'] = df.groupby('a')['b'].transform(lambda x: np.random.choice([True, False], len(x), p=[0.65, 0.35]))
print(df)
a b
0 1 False
1 0 True
2 1 False
3 0 True
4 0 True
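Note that the random choice above flags each row independently, so the number of True values per group varies. If the goal is to flag at most n random rows per group, as in the question's rndm_flag, one possible sketch is to shuffle the rows and keep the first n of each group. The column name flag and n = 2 follow the question; random_state is fixed here only to make the run repeatable:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 0, 1, 0, 0], "flag": False})
n = 2

# Shuffle the rows, then number them within each group in shuffled order
shuffled = df.sample(frac=1, random_state=0)
picked = shuffled.groupby("a").cumcount() < n

# Flag the first n shuffled rows of every group (fewer if the group is smaller)
df.loc[picked[picked].index, "flag"] = True
print(df)
```

This avoids splitting the frame into a list of groups and concatenating them again.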

Adding a count to prior cell value in Pandas

In pandas, I want to fill a column 'B' based on the boolean values in another column 'A'. When 'A' is True, the count starts at 1; it then increases by one on each following row for as long as 'A' is False. The next True resets the count to 1. I managed to do this with a for loop, but it is very slow. Is there a more time-efficient solution?
The result should look like this:
Date A B
01.2010 False 0
02.2010 True 1
03.2010 False 2
04.2010 False 3
05.2010 True 1
06.2010 False 2
You can use cumsum with groupby and cumcount:
print(df)
Date A
0 1.201 False
1 1.201 True
2 1.201 False
3 2.201 True
4 3.201 False
5 4.201 False
6 5.201 True
7 6.201 False
roll = df.A.cumsum()
print(roll)
0 0
1 1
2 1
3 2
4 2
5 2
6 3
7 3
Name: A, dtype: int32
df['B'] = df.groupby(roll).cumcount() + 1
# rows before the first True belong to group 0 and get 0
df.loc[roll == 0, 'B'] = 0
print(df)
Date A B
0 1.201 False 0
1 1.201 True 1
2 1.201 False 2
3 2.201 True 1
4 3.201 False 2
5 4.201 False 3
6 5.201 True 1
7 6.201 False 2
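For reference, a self-contained sketch of the same steps applied to the six rows from the question (the Date values are kept as strings here, which is an assumption about how they are stored):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01.2010", "02.2010", "03.2010", "04.2010", "05.2010", "06.2010"],
    "A": [False, True, False, False, True, False],
})

# Each True starts a new group: the running sum is the group id
roll = df["A"].cumsum()

# Count rows within each group, starting at 1
df["B"] = df.groupby(roll).cumcount() + 1

# Rows before the first True (group id 0) get 0
df.loc[roll == 0, "B"] = 0
print(df)
```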
Thanks, I got the solution from another post similar to this:
rolling_count = 0
def set_counter(val):
    # Increment the module-level counter on False, reset it to 1 on True
    global rolling_count
    if val:
        rolling_count = 1
    else:
        rolling_count += 1
    return rolling_count
df['B'] = df['A'].map(set_counter)

Making new column in pandas DataFrame based on filter

Given this DataFrame:
df = pandas.DataFrame({"a": [1,10,20,3,10], "b": [50,60,55,0,0], "c": [1,30,1,0,0]})
What is the best way to make a new column, "filter" that has value "pass" if the values at columns a and b are both greater than x and value "fail" otherwise?
It can be done by iterating through rows but it's inefficient and inelegant:
c = []
for x, v in df.iterrows():
    if v["a"] >= 20 and v["b"] >= 20:
        c.append("pass")
    else:
        c.append("fail")
df["filter"] = c
One way would be to create a column of boolean values like this:
>>> df['filter'] = (df['a'] >= 20) & (df['b'] >= 20)
a b c filter
0 1 50 1 False
1 10 60 30 False
2 20 55 1 True
3 3 0 0 False
4 10 0 0 False
You can then change the boolean values to 'pass' or 'fail' using replace:
>>> df['filter'].astype(object).replace({False: 'fail', True: 'pass'})
0 fail
1 fail
2 pass
3 fail
4 fail
You can extend this to more columns using all. For example, to find rows across the columns with entries greater than 0:
>>> cols = ['a', 'b', 'c'] # a list of columns to test
>>> df[cols] > 0
a b c
0 True True True
1 True True True
2 True True True
3 True False False
4 True False False
Using all across axis 1 of this DataFrame creates the new column:
>>> (df[cols] > 0).all(axis=1)
0 True
1 True
2 True
3 False
4 False
dtype: bool
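As an alternative to replace, the labels can be produced directly from the boolean mask with numpy.where. A short sketch, assuming the threshold of 20 used in the question's loop:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 10, 20, 3, 10],
                   "b": [50, 60, 55, 0, 0],
                   "c": [1, 30, 1, 0, 0]})

# True where both a and b reach the threshold
passed = (df[["a", "b"]] >= 20).all(axis=1)

# Map the mask straight to the two labels
df["filter"] = np.where(passed, "pass", "fail")
print(df)
```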

Assign value to subset of rows in Pandas dataframe

I want to assign values based on a condition on index in Pandas DataFrame.
class test():
    def __init__(self):
        self.l = 1396633637830123000
        self.dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = arange(self.l,self.l+10))
        self.dfb = pd.DataFrame([[self.l+1,self.l+3], [self.l+6,self.l+9]], columns = ['beg', 'end'])
    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = np.nan
        for i, beg, end in zip(self.dfb.index, self.dfb['beg'], self.dfb['end']):
            self.dfa.ix[beg:end]['true'] = True
            self.dfa.ix[beg:end]['idx'] = i
    def do(self):
        self.update()
        print self.dfa
t = test()
t.do()
Result:
A B true idx
1396633637830123000 0 1 False NaN
1396633637830123001 2 3 True NaN
1396633637830123002 4 5 True NaN
1396633637830123003 6 7 True NaN
1396633637830123004 8 9 False NaN
1396633637830123005 10 11 False NaN
1396633637830123006 12 13 True NaN
1396633637830123007 14 15 True NaN
1396633637830123008 16 17 True NaN
1396633637830123009 18 19 True NaN
The true column is correctly assigned, while the idx column is not. Furthermore, this seems to depend on how the columns are initialized, because if I do:
    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = False
then the true column is not properly assigned either.
What am I doing wrong?
p.s. the expected result is:
A B true idx
1396633637830123000 0 1 False NaN
1396633637830123001 2 3 True 0
1396633637830123002 4 5 True 0
1396633637830123003 6 7 True 0
1396633637830123004 8 9 False NaN
1396633637830123005 10 11 False NaN
1396633637830123006 12 13 True 1
1396633637830123007 14 15 True 1
1396633637830123008 16 17 True 1
1396633637830123009 18 19 True 1
Edit: I tried assigning using both loc and iloc but it doesn't seem to work:
loc:
self.dfa.loc[beg:end]['true'] = True
self.dfa.loc[beg:end]['idx'] = i
iloc:
self.dfa.iloc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['true'] = True
self.dfa.iloc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['idx'] = i
You are using chained indexing; see the pandas documentation on returning a view versus a copy. The SettingWithCopy warning is not guaranteed to be raised.
You should probably just do this. There is no real need to track the index in b, by the way.
In [44]: dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = np.arange(l,l+10))
In [45]: dfb = pd.DataFrame([[l+1,l+3], [l+6,l+9]], columns = ['beg', 'end'])
In [46]: dfa['in_b'] = False
In [47]: for i, s in dfb.iterrows():
   ....:     dfa.loc[s['beg']:s['end'],'in_b'] = True
   ....:
or this if you have non-integer dtypes:
In [36]: for i, s in dfb.iterrows():
   ....:     dfa.loc[(dfa.index>=s['beg']) & (dfa.index<=s['end']),'in_b'] = True
In [48]: dfa
Out[48]:
A B in_b
1396633637830123000 0 1 False
1396633637830123001 2 3 True
1396633637830123002 4 5 True
1396633637830123003 6 7 True
1396633637830123004 8 9 False
1396633637830123005 10 11 False
1396633637830123006 12 13 True
1396633637830123007 14 15 True
1396633637830123008 16 17 True
1396633637830123009 18 19 True
[10 rows x 3 columns]
If b is HUGE this might not be THAT performant.
As an aside, these look like nanosecond timestamps. They can be made friendlier by converting them:
In [49]: pd.to_datetime(dfa.index)
Out[49]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-04-04 17:47:17.830123, ..., 2014-04-04 17:47:17.830123009]
Length: 10, Freq: None, Timezone: None
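Returning to the assignment itself: if the idx column from the question is still wanted, a sketch along the same lines is to select rows and column in a single loc call, which avoids the chained indexing. The setup is copied from the question; only the two assignments differ:

```python
import numpy as np
import pandas as pd

l = 1396633637830123000
dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["A", "B"],
                   index=np.arange(l, l + 10))
dfb = pd.DataFrame([[l + 1, l + 3], [l + 6, l + 9]], columns=["beg", "end"])

dfa["true"] = False
dfa["idx"] = np.nan

for i, beg, end in zip(dfb.index, dfb["beg"], dfb["end"]):
    # One loc call per assignment: row slice and column label together,
    # so the write goes to dfa itself and not to a temporary copy
    dfa.loc[beg:end, "true"] = True
    dfa.loc[beg:end, "idx"] = i

print(dfa)
```

Label slices with loc include both endpoints, which is why rows beg through end are all set.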
