Pandas change values in a groupby - python

I have a df like
   a   flag
0  1  False
1  0  False
2  1  False
3  0  False
4  0  False
and let's say I want to randomly set some True values within every group of column a, in order to obtain
   a   flag
0  1   True
1  0   True
2  1   True
3  0  False
4  0   True
So far I'm able to do so with the following code:
import pandas as pd
import numpy as np

def rndm_flag(ds, n):
    l = len(ds)
    n = min([l, n])
    vec = ds.sample(n).index
    ds["flag"] = np.where(ds.index.isin(vec), True, ds["flag"])
    return ds

N = 5
df = pd.DataFrame({"a": np.random.randint(0, 2, N),
                   "flag": [False] * N})
dfs = list(df.groupby("a"))
dfs = [x[1] for x in dfs]
df = pd.concat([rndm_flag(x, 2) for x in dfs])
df.sort_index(inplace=True)
But I'm wondering if there is an alternative (more elegant) way to do so.

This should give you some idea:
## create dataframe
df = pd.DataFrame({'a': [1, 0, 1, 0, 0], 'b': False})
## create flag (select column 'b' so that transform returns a Series)
df['b'] = df.groupby('a')['b'].transform(lambda x: np.random.choice([True, False], len(x), p=[0.65, 0.35]))
print(df)
   a      b
0  1  False
1  0   True
2  1  False
3  0   True
4  0   True
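Note that the transform above flips flags probabilistically, so a group is not guaranteed to receive exactly n True values. If you need at most n flags per group, a minimal sketch using per-group sampling (min() guards against groups smaller than n) could look like this:
import numpy as np
import pandas as pd

N, n = 5, 2
df = pd.DataFrame({"a": np.random.randint(0, 2, N), "flag": [False] * N})

# Sample up to n rows per group; group_keys=False keeps the original index,
# so the sampled indices can be flagged in place.
idx = df.groupby("a", group_keys=False).apply(lambda g: g.sample(min(len(g), n))).index
df.loc[idx, "flag"] = True
This keeps the original index, so the flags land on the right rows without the concat/sort round-trip.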

Related

Create a column counting the consecutive True values on a multi-index

Let df be a dataframe of boolean values with a two-level index (id, Week). I want to calculate the running count of consecutive True values for every id. For example, this is how it would look in this specific case:
         value  consecutive
id Week
1  1      True            1
1  2      True            2
1  3     False            0
1  4      True            1
1  5      True            2
2  1     False            0
2  2     False            0
2  3      True            1
This is my solution:
def func(id, week):
    M = df.loc[id][:week + 1]
    consecutive_list = list()
    S = 0
    for index, row in M.iterrows():
        if row['value']:
            S += 1
        else:
            S = 0
        consecutive_list.append(S)
    return consecutive_list[-1]
Then we generate the column "consecutive" as a list in the following way:
Consecutive_list = list()
for k in df.index:
    id = k[0]
    week = k[1]
    Consecutive_list.append(func(id, week))
df['consecutive'] = Consecutive_list
I would like to know if there is a more Pythonic way to do this.
EDIT: I wrote the "consecutive" column in order to show what I expect this to be.
If you are trying to add the consecutive column to the df, this should work:
df.assign(consecutive = df['value'].groupby(df['value'].diff().ne(0).cumsum()).cumsum())
Note that the grouper only looks at where value changes, so a run of True that crosses an id boundary is counted as a single run (visible below, where the count carries over from 2 b to 3 a); include the id level in the groupby if runs should reset per id.
Output:
     value  consecutive
1 a   True            1
  b   True            2
2 a  False            0
  b   True            1
3 a   True            2
  b  False            0
4 a  False            0
  b   True            1
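To see why the grouper works, it can help to print the intermediate block ids on a plain boolean series (a minimal sketch; .astype(int) sidesteps boolean-diff quirks in some pandas versions):
import pandas as pd

s = pd.Series([True, True, False, True, True])
blocks = s.astype(int).diff().ne(0).cumsum()  # new block id whenever the value changes
print(blocks.tolist())                        # [1, 1, 2, 3, 3]
print(s.groupby(blocks).cumsum().tolist())    # [1, 2, 0, 1, 2]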

Count number of consecutive True in column, restart when False

I work with the following column in a pandas df:
A
True
True
True
False
True
True
I want to add column B that counts the number of consecutive "True" in A. I want to restart every time a "False" comes up. Desired output:
A B
True 1
True 2
True 3
False 0
True 1
True 2
Using cumsum, identify the blocks of rows where the value in column A stays True; then group column A by these blocks and take the cumulative sum within each block to assign the ordinal numbers:
df['B'] = df['A'].groupby((~df['A']).cumsum()).cumsum()
A B
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
5 True 2
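For intuition, here is what the intermediate grouper looks like: every False increments the block id, so each False plus the run of True values after it forms one group, and the cumulative sum restarts inside it (a minimal sketch):
import pandas as pd

df = pd.DataFrame({'A': [True, True, True, False, True, True]})
print((~df['A']).cumsum().tolist())                            # [0, 0, 0, 1, 1, 1] -- block ids
print(df['A'].groupby((~df['A']).cumsum()).cumsum().tolist())  # [1, 2, 3, 0, 1, 2]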
Using a simple & native approach (for a small code sample it works fine):
import pandas as pd

df = pd.DataFrame({'A': [True, False, True, True, True, False, True, True]})

class ToNums:
    counter = 0

    @staticmethod
    def convert(bool_val):
        if bool_val:
            ToNums.counter += 1
        else:
            ToNums.counter = 0
        return ToNums.counter

df['B'] = df.A.map(ToNums.convert)
df
A B
0 True 1
1 False 0
2 True 1
3 True 2
4 True 3
5 False 0
6 True 1
7 True 2
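The same running count can be written without a mutable class by threading the state through itertools.accumulate (a sketch; the initial argument needs Python 3.8+):
from itertools import accumulate
import pandas as pd

df = pd.DataFrame({'A': [True, False, True, True, True, False, True, True]})
# Carry the running count: reset to 0 on False, increment on True.
df['B'] = list(accumulate(df['A'], lambda run, flag: run + 1 if flag else 0, initial=0))[1:]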
Here's an example
v = 0
for i, val in enumerate(df['A']):
    if val:  # val is already a boolean, not the string "True"
        df.loc[i, "C"] = v = v + 1
    else:
        df.loc[i, "C"] = v = 0
df.head()
This will give the desired output
A C
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
You can use a combination of groupby, cumsum, and cumcount:
df['B'] = (df.groupby((df['A'] &
                       ~df['A'].shift(1).fillna(False)  # row is True and previous row is False
                       ).cumsum()                       # make group id
                      )
           .cumcount().add(1)  # make cumulated count, starting at 1
           * df['A']           # multiply by 0 where initially False, 1 otherwise
           )
output:
A B
0 True 1
1 True 2
2 True 3
3 False 0
4 True 1
5 True 2

Pandas: How to create a column that indicates when a value is present in another column a set number of rows in advance?

I'm trying to ascertain how, with pandas, I can create a column that indicates in advance (X rows ahead) when the next occurrence of a value in another column will occur, in essence performing the following (in this instance X = 3):
df
rowid  event  indicator
1      True   1          # Event occurs
2      False  0
3      False  0
4      False  1          # Starts indicator
5      False  1
6      True   1          # Event occurs
7      False  0
Apart from doing an iterative/recursive loop through every row:
X = 3
i = df.index[df['event'] == True]
dfx = [ix for z in i for ix in df.index[max(z - X, 0):z]]
df.loc[dfx, 'indicator'] = 1
df['indicator'] = df['indicator'].fillna(0)
However, this seems inefficient. Is there a more succinct method of achieving the above? Thanks
Here's a NumPy-based approach using flatnonzero:
X = 3
# ndarray of indices where the indicator should be set to one
nd_ixs = np.flatnonzero(df.event)[:, None] - np.arange(X - 1, -1, -1)
# flatten the indices
ixs = nd_ixs.ravel()
# filter out negative indices and set to 1
df['indicator'] = 0
df.loc[ixs[ixs >= 0], 'indicator'] = 1
print(df)
   rowid  event  indicator
0      1   True          1
1      2  False          0
2      3  False          0
3      4  False          1
4      5  False          1
5      6   True          1
6      7  False          0
Where nd_ixs is obtained through the broadcasted subtraction of the indices where event is True and an arange up to X:
print(nd_ixs)
array([[-2, -1,  0],
       [ 3,  4,  5]], dtype=int64)
A pandas and numpy solution:
# Make a variable shift:
def var_shift(series, X):
    return [series] + [series.shift(i) for i in range(-X + 1, 0, 1)]

X = 3
# Set indicator to default to 1
df["indicator"] = 1
# Use pd.Series.where and np.logical_or with the
# var_shift function to get a bool array, setting
# 0 when False
df["indicator"] = df["indicator"].where(
    np.logical_or.reduce(var_shift(df["event"], X)),
    0,
)
# rowid event indicator
# 0 1 True 1
# 1 2 False 0
# 2 3 False 0
# 3 4 False 1
# 4 5 False 1
# 5 6 True 1
# 6 7 False 0
In [77]: np.logical_or.reduce(var_shift(df["event"], 3))
Out[77]: array([True, False, False, True, True, True, nan], dtype=object)
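A pandas-only alternative, sketched on the same data: reversing the event column turns a trailing rolling window into a look-ahead window, so a rolling max over X rows marks each event row and the X - 1 rows before it (assumes the default RangeIndex):
import pandas as pd

df = pd.DataFrame({'rowid': range(1, 8),
                   'event': [True, False, False, False, False, True, False]})
X = 3
# Reverse, take a trailing rolling max (acts as a look-ahead), reverse back.
df['indicator'] = (df['event'].astype(int)[::-1]
                   .rolling(X, min_periods=1).max()[::-1]
                   .astype(int))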

Boolean slicing by comparing one to many columns in pandas

How can I compare a column to all other columns and obtain a boolean series to slice the dataframe using .loc?
import numpy as np
import pandas as pd
a = np.random.normal(1,10,(10,1))
b = np.random.normal(1,5,(10,1))
c = np.random.normal(1,5,(10,1))
d = np.random.normal(1,5,(10,1))
e = np.append(a,b, axis = 1)
e = np.append(e,c, axis = 1)
e = np.append(e,d, axis = 1)
df = pd.DataFrame(data = e, columns=['a','b','c','d'])
a b c d
0 4.043832 -1.672865 -0.401864 3.073481
1 4.828796 -0.830688 3.652347 -1.780346
2 13.055145 5.730707 -2.305093 -4.566279
3 6.589498 -0.525029 -1.077942 -3.850963
4 5.273932 -1.003112 0.393002 -0.415573
5 -7.872004 -2.506250 1.725281 6.676886
6 -4.797119 6.448990 0.254142 -7.374601
7 8.610763 8.075350 13.043584 12.768633
8 -10.871154 2.152322 2.093089 11.570059
9 -22.148239 1.493870 3.649696 2.455621
df.loc[df.a > df.b]
Will give the desired result, but only for a 1:1 comparison
a b c d
0 4.043832 -1.672865 -0.401864 3.073481
1 4.828796 -0.830688 3.652347 -1.780346
2 13.055145 5.730707 -2.305093 -4.566279
3 6.589498 -0.525029 -1.077942 -3.850963
4 5.273932 -1.003112 0.393002 -0.415573
7 8.610763 8.075350 13.043584 12.768633
My approach was like this:
S = ['b','c','d']
(df.a > df[S]).any(axis = 1)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
dtype: bool
But unfortunately, the series is False for all rows. How can I solve this problem?
Using lt with axis=0, so that the comparison aligns on the index rather than on the column labels (df.a > df[S] aligns on columns, which is why every row came back False):
df[S].lt(df.a, axis=0).any(axis=1)
Out[808]:
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 True
9 True
dtype: bool
Given that you're using this for a mask, you could simply add another axis to the underlying ndarray to allow for broadcasting. This should be somewhat faster, depending on the size of your DataFrame.
(df[S].values < df.a.values[:,None]).any(1)
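As a usage sketch, the broadcasted mask plugs straight into .loc to produce the sliced frame (S as defined above):
mask = (df[S].values < df.a.values[:, None]).any(1)
print(df.loc[mask])  # rows where column a exceeds at least one of b, c, d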

Making new column in pandas DataFrame based on filter

Given this DataFrame:
df = pandas.DataFrame({"a": [1,10,20,3,10], "b": [50,60,55,0,0], "c": [1,30,1,0,0]})
What is the best way to make a new column, "filter" that has value "pass" if the values at columns a and b are both greater than x and value "fail" otherwise?
It can be done by iterating through rows but it's inefficient and inelegant:
c = []
for x, v in df.iterrows():
    if v["a"] >= 20 and v["b"] >= 20:
        c.append("pass")
    else:
        c.append("fail")
df["filter"] = c
One way would be to create a column of boolean values like this:
>>> df['filter'] = (df['a'] >= 20) & (df['b'] >= 20)
a b c filter
0 1 50 1 False
1 10 60 30 False
2 20 55 1 True
3 3 0 0 False
4 10 0 0 False
You can then change the boolean values to 'pass' or 'fail' using replace:
>>> df['filter'].astype(object).replace({False: 'fail', True: 'pass'})
0 fail
1 fail
2 pass
3 fail
4 fail
You can extend this to more columns using all. For example, to find rows across the columns with entries greater than 0:
>>> cols = ['a', 'b', 'c'] # a list of columns to test
>>> df[cols] > 0
a b c
0 True True True
1 True True True
2 True True True
3 True False False
4 True False False
Using all across axis 1 of this DataFrame creates the new column:
>>> (df[cols] > 0).all(axis=1)
0 True
1 True
2 True
3 False
4 False
dtype: bool
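To get back to the 'pass'/'fail' labels the question asks for, one option is to feed the combined boolean into numpy.where (a sketch using the cols list above):
import numpy as np

df['filter'] = np.where((df[cols] > 0).all(axis=1), 'pass', 'fail')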
