Groupby with boolean condition True in one of the columns in Pandas - python

This is the dataframe I want to apply groupby to:
Value Boolean1 Boolean2
5.175603 False False
5.415855 False False
5.046997 False False
4.607749 True False
5.140482 False False
1.796552 False False
0.139924 False True
4.157981 False True
4.893860 False False
5.091573 False False
6 True False
6.05 False False
I want to group by the Boolean1 and Boolean2 columns. A group runs through the False rows until it reaches a True in either column, and the next group starts at the next row where both columns are False again. If there are no more True values, the remaining False rows (and their values) can either be ignored or kept.
I want to achieve something similar to this:
Value Boolean1 Boolean2
This is one group
5.175603 False False
5.415855 False False
5.046997 False False
4.607749 True False
This is another one
5.140482 False False
1.796552 False False
0.139924 False True
4.157981 False True
And this is another one
4.893860 False False
5.091573 False False
6 True False

My idea is to check for rows that are False in both columns, which precede at least one row with a True:
# chain the conditions together with OR, then invert
m = ~(df['Boolean1'] | df['Boolean2'])
# flag the start of each run of Trues in m (rows where both columns are False)
# and take the cumulative sum to get group labels
s = (m.ne(m.shift()) & m).cumsum()
for i, x in df.groupby(s):
    print(x)
Value Boolean1 Boolean2
0 5.175603 False False
1 5.415855 False False
2 5.046997 False False
3 4.607749 True False
Value Boolean1 Boolean2
4 5.140482 False False
5 1.796552 False False
6 0.139924 False True
7 4.157981 False True
Value Boolean1 Boolean2
8 4.893860 False False
9 5.091573 False False
10 6.000000 True False
Value Boolean1 Boolean2
11 6.05 False False
Detail:
print (m)
0 True
1 True
2 True
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 False
11 True
dtype: bool
print (s)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 4
dtype: int32
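For anyone who wants to run the answer above end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and applies the same mask and cumsum. The last step is an assumption about what "ignore the rest" should mean: `has_true` (a name made up for this sketch) drops a trailing group that never reaches a True.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Value": [5.175603, 5.415855, 5.046997, 4.607749, 5.140482, 1.796552,
              0.139924, 4.157981, 4.893860, 5.091573, 6.0, 6.05],
    "Boolean1": [False, False, False, True, False, False,
                 False, False, False, False, True, False],
    "Boolean2": [False, False, False, False, False, False,
                 True, True, False, False, False, False],
})

any_true = df["Boolean1"] | df["Boolean2"]
m = ~any_true                                # True where both columns are False
s = (m.ne(m.shift()) & m).cumsum()           # new label at the start of each all-False run

# Assumption: groups that never contain a True are not wanted.
has_true = any_true.groupby(s).transform("any")
groups = [g for _, g in df[has_true].groupby(s[has_true])]

for g in groups:
    print(g)
```

Leaving out the `has_true` filter keeps the trailing all-False rows as a fourth group, as in the answer's output.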

Related

How to satisfy the condition of 2 columns different rows at the same time

My logic is this:
`expected` can be True on a row only if cond2 was True on an earlier row, and cond1 was True before that cond2.
input
import pandas as pd
import numpy as np
d={'cond1':[False,False,True,False,False,False,False,True,False,False],'cond2':[False,True,False,True,True,False,False,False,True,False]}
df = pd.DataFrame(d)
expected result table
cond1 cond2 expected
0 FALSE FALSE
1 FALSE TRUE
2 TRUE FALSE
3 FALSE TRUE
4 FALSE TRUE
5 FALSE FALSE TRUE
6 FALSE FALSE TRUE
7 TRUE FALSE
8 FALSE TRUE
9 FALSE FALSE TRUE
I have an idea:
count the rows since cond1 was last True, then use cumsum to check whether the number of rows in that span where cond2 is True is greater than 0.
But how do I get the number of rows since cond1 was last True?
The description is not fully clear. It looks like you need a cummax per group starting with True in cond1:
m = df.groupby(df['cond1'].cumsum())['cond2'].cummax()
df['expected'] = df['cond2'].ne(m)
Output:
cond1 cond2 expected
0 False False False
1 False True False
2 True False False
3 False True False
4 False True False
5 False False True
6 False False True
7 True False False
8 False True False
9 False False True
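It's not very clear what you're looking for, but this reproduces the expected column. idxmax gives the first True in each column, so the rows flagged are those after the later of the two first-True positions where neither condition holds: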
df['expected'] = ((df.index > df.idxmax().max())
& ~df.any(axis=1))
# Output:
cond1 cond2 expected
0 False False False
1 False True False
2 True False False
3 False True False
4 False True False
5 False False True
6 False False True
7 True False False
8 False True False
9 False False True
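A self-contained sketch of the first approach, written so that the stated rule is explicit. The names `grp` and `seen_cond2` are made up for this example. The `grp.gt(0)` guard is an assumption: it excludes rows that come before the first True in cond1, which the sample data never exercises. The cast to int before `cummax` is only a precaution in case a given pandas version is picky about boolean input there.

```python
import pandas as pd

df = pd.DataFrame({
    "cond1": [False, False, True, False, False, False, False, True, False, False],
    "cond2": [False, True, False, True, True, False, False, False, True, False],
})

# Each True in cond1 starts a new block; block 0 is everything before the first True.
grp = df["cond1"].cumsum()

# Within a block, has cond2 been True on this row or an earlier one?
seen_cond2 = df["cond2"].astype(int).groupby(grp).cummax().astype(bool)

# expected: cond2 was seen earlier in the block, is not True on this row,
# and the block was actually opened by a True in cond1 (assumed requirement).
df["expected"] = seen_cond2 & ~df["cond2"] & grp.gt(0)
print(df)
```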

Selecting Dataframe row from a condition to another condition

I have a dataframe with two columns:
A B
0 False False
1 False False
2 False False
3 True False
4 False False
5 False False
6 False True
7 False False
8 False False
9 False False
10 True False
11 False False
12 False False
I would like to create a new Boolean column "C" that turns on (True) each time B turns on and turns off each time A turns on (e.g. here from index 6 to index 10).
For this df, the output would be:
A B C
0 False False False
1 False False False
2 False False False
3 True False False
4 False False False
5 False False False
6 False True True
7 False False True
8 False False True
9 False False True
10 True False True
11 False False False
12 False False False
I wrote this code with a for loop and a "switch", but I'm pretty sure there is a faster and easier way to do the same thing for large dataframes. I appreciate your help.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [False,False,False,True,False,False,False,False,False,False,True,False,False],
'B': [False,False,False,False,False,False,True,False,False,False,False,False,False]
})
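df["C"] = False
switch = False
for i in df.index:
    if df.loc[i, "B"]:
        switch = True  # B turns the switch on
    df.loc[i, "C"] = switch  # avoids chained assignment (df.C.iloc[i] = ...)
    if df.loc[i, "A"]:
        switch = False  # A turns it off, starting with the next row
print(df)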
Alternative approach using ffill
df.loc[df['A'],'C'] = False
df.loc[df['B'],'C'] = True
df['C'] = df['C'].ffill().fillna(False) #start "off"
Combine the two columns with OR, take the cumulative sum, subtract 1, and keep only the odd non-negative values:
x = (df['A'] | df['B']).cumsum().sub(1)
df['C'] = (x >= 0) & (x % 2 == 1)
Output:
>>> df
A B C
0 False False False
1 False False False
2 False False False
3 True False False
4 False False False
5 False False False
6 False True True <
7 False False True <
8 False False True <
9 False False True <
10 True False False
11 False False False
12 False False False
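Note that the output above turns off on the row where A is True (index 10), while the question's expected output keeps that row True. The following is a sketch of a vectorized version that mirrors the question's loop, including that row. The name `after` is made up for this example, and letting A win when A and B are both True on the same row is an assumption taken from the order of the checks in the loop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [False, False, False, True, False, False, False,
          False, False, False, True, False, False],
    "B": [False, False, False, False, False, False, True,
          False, False, False, False, False, False],
})

# State of the switch *after* each row: A sets it off (0), B sets it on (1),
# other rows carry the previous state forward; it starts off.
events = np.select([df["A"], df["B"]], [0.0, 1.0], default=np.nan)
after = pd.Series(events, index=df.index).ffill().fillna(0.0).astype(bool)

# C is True if B fires on this row or the switch was still on after the previous row.
df["C"] = df["B"] | after.shift(fill_value=False).astype(bool)
print(df)
```

Unlike the cumsum parity trick, this does not rely on A and B firing in strict alternation.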

Is there a way to select interior True values for portions of a DataFrame?

I have a DataFrame that looks like the following:
df = pd.DataFrame({'a':[True]*5+[False]*5+[True]*5,'b':[False]+[True]*3+[False]+[True]*5+[False]*4+[True]})
a b
0 True False
1 True True
2 True True
3 True True
4 True False
5 False True
6 False True
7 False True
8 False True
9 False True
10 True False
11 True False
12 True False
13 True False
14 True True
How can I select blocks where column a is True only when the interior values over the same rows for column b are True?
I know that I could break the DataFrame apart into consecutive True regions and apply a function to each chunk, but this is for a much larger problem with 10 million+ rows, and I don't think such a solution would scale very well.
My expected output would be the following:
a b c
0 True False True
1 True True True
2 True True True
3 True True True
4 True False True
5 False True False
6 False True False
7 False True False
8 False True False
9 False True False
10 True False False
11 True False False
12 True False False
13 True False False
14 True True False
You can do a groupby on the a values and then look at the b values in a function, like this:
groupby_consec_a = df.groupby(df.a.diff().ne(0).cumsum())
all_interior = lambda x: x.iloc[1:-1].all()
df['c'] = df.a & groupby_consec_a.b.transform(all_interior)
Try out whether it's fast enough on your data. If not, the lambda will have to be replaced by pandas functions, but that will be more code.
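If the lambda does turn out to be too slow, one possible lambda-free variant is sketched below. It is not from the answer above; the names `run`, `interior`, and `bad` are made up for this example. The idea is to mark the interior rows of each run with shifts and then ask, per run, whether any interior row has b False.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [True] * 5 + [False] * 5 + [True] * 5,
    "b": [False] + [True] * 3 + [False] + [True] * 5 + [False] * 4 + [True],
})

# Label consecutive runs of equal values in a.
run = df["a"].ne(df["a"].shift()).cumsum()

# A row is interior if it is neither the first nor the last row of its run.
first = run.ne(run.shift())
last = run.ne(run.shift(-1))
interior = ~(first | last)

# A run is disqualified if any interior row has b False.
bad = (interior & ~df["b"]).groupby(run).transform("any")

df["c"] = df["a"] & ~bad
print(df)
```

Runs of length one or two have no interior rows and therefore pass, which matches the behaviour of `all()` on an empty slice in the lambda version.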

Pandas get one hot encodings from a column as booleans

I have a pandas DataFrame and would like an efficient way to create the second DataFrame shown below from it.
import pandas as pd
data = {"column":[0,1,2,0,1,2,0]}
df = pd.DataFrame(data)
column
0
1
2
0
1
2
0
column0 column1 column2
true false false
false true false
false false true
true false false
false true false
false false true
true false false
This is a get_dummies problem, but you will additionally need to specify dtype=bool to get columns of bools:
pd.get_dummies(df['column'], dtype=bool)
0 1 2
0 True False False
1 False True False
2 False False True
3 True False False
4 False True False
5 False False True
6 True False False
pd.get_dummies(df['column'], dtype=bool).dtypes
0 bool
1 bool
2 bool
dtype: object
# carbon copy of expected output (lowercase strings, not booleans)
import numpy as np
dummies = pd.get_dummies(df['column'], dtype=bool)
pd.DataFrame(np.where(dummies, 'true', 'false'), index=dummies.index, columns=dummies.columns).add_prefix('column')
column0 column1 column2
0 true false false
1 false true false
2 false false true
3 true false false
4 false true false
5 false false true
6 true false false
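If real booleans are acceptable and only the column0/column1/column2 names are needed, `get_dummies` can add the prefix itself through its `prefix` and `prefix_sep` parameters. A minimal sketch (the variable name `onehot` is made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"column": [0, 1, 2, 0, 1, 2, 0]})

# prefix_sep="" avoids the default underscore, giving column0 instead of column_0.
onehot = pd.get_dummies(df["column"], prefix="column", prefix_sep="", dtype=bool)
print(onehot)
```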
Like cs95, I also use get_dummies. However, I use str.get_dummies after prepending the word "column" to each value, and finally replace 1 and 0 with 'true' and 'false':
('column'+df.column.astype(str)).str.get_dummies().replace({1:'true', 0:'false'})
Out[2164]:
column0 column1 column2
0 true false false
1 false true false
2 false false true
3 true false false
4 false true false
5 false false true
6 true false false
factorize and slice assignment
i, u = pd.factorize(df.column)
a = np.empty((len(i), len(u)), '<U5')
a.fill('false')
a[np.arange(len(i)), i] = 'true'
pd.DataFrame(a).add_prefix('column')
column0 column1 column2
0 true false false
1 false true false
2 false false true
3 true false false
4 false true false
5 false false true
6 true false false

How to change the bool values in pd.Series if the count is more than some value in pandas

I have the following bool series and need to check the runs of consecutive False values. If a run of False values is longer than 2, it should stay False; if its length is less than or equal to 2, its values should be inverted from False to True.
Expected output: the first run of False has length two, so it changes to True; the later run that follows the True values is longer than two, so those values stay False.
How can I do this using Pandas functions?
True
True
True
True
True
False
False
True
True
True
True
True
True
True
False
False
False
False
False
False
True
True
True
True
Let us try something different (here df is the boolean Series itself):
s=df.cumsum().mask(df)
df=df.mask(s.isin(s.value_counts()[s.value_counts()<=2].index),True)
df
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 False
15 False
16 False
17 False
18 False
19 False
20 True
21 True
22 True
23 True
Name: a, dtype: bool
Try using groupby and cumsum to generate unique groups of False values, then get the count of each group. If that count is less than three, invert that group using ~ and apply it back to the series with mask:
s.mask(s.groupby((s).cumsum().where(~s)).transform('count') < 3, ~s)
Output:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 False
15 False
16 False
17 False
18 False
19 False
20 True
21 True
22 True
23 True
Name: 0, dtype: bool
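A self-contained sketch of the same idea, written with explicit run labels so it can be run as is. The names `run_id` and `run_len` are made up for this example, and the series is rebuilt from the values in the question. Because True values stay True regardless, flipping every run of length 2 or less only affects the short False runs.

```python
import pandas as pd

# The series from the question: 5 True, 2 False, 7 True, 6 False, 4 True.
s = pd.Series([True] * 5 + [False] * 2 + [True] * 7 + [False] * 6 + [True] * 4)

# Label each run of consecutive equal values, then measure its length.
run_id = s.ne(s.shift()).cumsum()
run_len = s.groupby(run_id).transform("size")

# Keep True as True; turn False into True only inside runs of length <= 2.
result = s | run_len.le(2)
print(result)
```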
