Group boolean values in Pandas Dataframe - python

I have a DataFrame with a random series of True and False values in a column:
import pandas as pd
df = pd.DataFrame(data={'A':[True, False, True, True, False, False, True, False, True, False, False]})
df
A
0 True
1 False
2 True
3 True
4 False
5 False
6 True
7 False
8 True
9 False
10 False
and I want this (I don't know how to explain it in simple words):
A B
0 True 1
1 False 2
2 True 2
3 True 2
4 False 3
5 False 3
6 True 3
7 False 4
8 True 4
9 False 5
10 False 5
I've tried something with the following commands, but without success:
df['A'].shift()
df['A'].diff()
df['A'].eq()
Many thanks for your help.
Matthias

IIUC, you can try:
df['B'] = (df.A.shift() & ~df.A).cumsum() + 1
# OR df['B'] = (df.A.shift() & ~df.A).cumsum().add(1)
OUTPUT:
A B
0 True 1
1 False 2
2 True 2
3 True 2
4 False 3
5 False 3
6 True 3
7 False 4
8 True 4
9 False 5
10 False 5
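The expression above flags every row where the previous value was True and the current one is False, then counts those transitions with a running sum. A minimal sketch of the same idea, assuming `fill_value=False` is passed to `shift()` so the shifted column stays boolean rather than picking up a leading NaN (the names `prev` and `starts_group` are illustrative and not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'A': [True, False, True, True, False, False,
                         True, False, True, False, False]})

# Previous row's value; the first row has no predecessor, so treat it as False.
prev = df['A'].shift(fill_value=False)

# A new group starts on each True -> False transition.
starts_group = prev & ~df['A']

# Running count of transitions, offset so that numbering starts at 1.
df['B'] = starts_group.cumsum() + 1
print(df)
```

Splitting the one-liner into named steps makes it easier to inspect each intermediate mask when the grouping does not come out as expected.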

A bit of logic with diff:
(~df.A.astype(int).diff().ne(-1)).cumsum()+1
Out[234]:
0 1
1 2
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 5
10 5
Name: A, dtype: int32
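To see why the diff version works, it may help to look at the intermediate values. The sketch below is an illustration under the assumption that the column is held in a standalone Series named `A`; it uses `eq(-1)`, which should be equivalent to the `~...ne(-1)` form in the answer:

```python
import pandas as pd

A = pd.Series([True, False, True, True, False, False,
               True, False, True, False, False], name='A')

# diff() on 0/1 integers: -1 marks True -> False, 1 marks False -> True,
# 0 means no change, and the first row is NaN.
step = A.astype(int).diff()

# NaN never equals -1, so the first row does not start a new group.
B = step.eq(-1).cumsum() + 1
print(B.tolist())
```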

Related

Pandas, create column using previous new column value

I am using Python and have the following Pandas Dataframe:
idx result grouping
1 False
2 True
3 True
4 False
5 True
6 True
7 True
8 False
9 True
10 True
11 True
12 True
What I would like is to do the following logic...
If the result is False, then I want grouping to be the idx value.
If the result is True, then I want grouping to be the previous grouping value.
So the end result will be:
idx result grouping
1 False 1
2 True 1
3 True 1
4 False 4
5 True 4
6 True 4
7 True 4
8 False 8
9 True 8
10 True 8
11 True 8
12 True 8
I have tried all sorts to get this working from using the Pandas shift() command to using lambda, but I am just not getting it.
I know I could iterate through the dataframe and perform the calculation but there has to be a better method.
Examples of what I have tried and failed with are:
df['grouping'] = df['idx'] if not df['result'] else df['grouping'].shift(1)
df['grouping'] = df.apply(lambda x: x['idx'] if not x['result'] else x['grouping'].shift(1), axis=1)
Many Thanks for any assistance you can provide.
Mask the True values, then forward fill:
df['grouping'] = df['idx'].mask(df['result']).ffill(downcast='infer')
idx result grouping
0 1 False 1
1 2 True 1
2 3 True 1
3 4 False 4
4 5 True 4
5 6 True 4
6 7 True 4
7 8 False 8
8 9 True 8
9 10 True 8
10 11 True 8
11 12 True 8
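The mask-then-fill approach can be reproduced end to end as below. This is a sketch rather than the answer's exact code: it drops the `downcast` argument (which, as far as I know, recent pandas releases deprecate) and restores the integer dtype with `astype` instead, which assumes the first row is False so that no NaN is left after the fill.

```python
import pandas as pd

df = pd.DataFrame({
    'idx': range(1, 13),
    'result': [False, True, True, False, True, True, True,
               False, True, True, True, True],
})

# Blank out idx wherever result is True, leaving NaN placeholders.
masked = df['idx'].mask(df['result'])

# Carry the idx of the last False row forward, then restore the integer dtype.
# astype('int64') assumes the first row is False, so no NaN remains.
df['grouping'] = masked.ffill().astype('int64')
print(df)
```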

Fill grouping variable pandas dataframe

I have a pandas dataframe with an id column called doc_ID and a boolean column that reports if a certain value is below a threshold, like so:
df = pd.DataFrame({'doc_ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
'below_threshold': [False, False, False, False, True, False, False, True, False, False,
False, False, False, False, False, False, False, True, False, False]})
I'm trying to create a new grouping id within each doc_ID that would extend from the first False value in order until and including the first True value. Something like this
doc_ID below_threshold new_group
0 1 False 1
1 1 False 1
2 1 False 1
3 1 False 1
4 1 True 1
5 1 False 2
6 2 False 3
7 2 True 3
8 2 False 4
9 2 False 4
10 2 False 4
11 2 False 4
12 3 False 5
13 3 False 5
14 3 False 5
15 3 False 5
16 3 False 5
17 3 True 5
18 3 False 6
19 3 False 6
IIUC, use:
m1 = ~df['below_threshold']
m2 = df.groupby('doc_ID')['below_threshold'].shift(fill_value=True)
df['new_group'] = (m1&m2).cumsum()
Output:
doc_ID below_threshold new_group
0 1 False 1
1 1 False 1
2 1 False 1
3 1 False 1
4 1 True 1
5 1 False 2
6 2 False 3
7 2 True 3
8 2 False 4
9 2 False 4
10 2 False 4
11 2 False 4
12 3 False 5
13 3 False 5
14 3 False 5
15 3 False 5
16 3 False 5
17 3 True 5
18 3 False 6
19 3 False 6
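Spelled out, a new group starts on a False row whose predecessor within the same doc_ID is True, or which has no predecessor at all (hence `fill_value=True`). A self-contained sketch of that rule follows; the names `is_false` and `prev_true` and the defensive `astype(bool)` are my additions, not part of the answer:

```python
import pandas as pd

df = pd.DataFrame({
    'doc_ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
    'below_threshold': [False, False, False, False, True, False, False, True,
                        False, False, False, False, False, False, False, False,
                        False, True, False, False],
})

# Rows that are allowed to open a new group.
is_false = ~df['below_threshold']

# Previous value within each doc_ID; the first row of each doc counts as True.
prev_true = (df.groupby('doc_ID')['below_threshold']
               .shift(fill_value=True)
               .astype(bool))

# Each (False row after a True or at the start of a doc) increments the counter.
df['new_group'] = (is_false & prev_true).cumsum()
print(df)
```

Note that under this rule a doc_ID whose first row is True would not open a new group, so it would continue the numbering of the preceding document.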

Pandas expanding count when column value changes

I have a dataframe
df = pd.DataFrame({'A': [True, True, False, False, False, False, False, True, True, True, False]})
A
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 True
8 True
9 True
10 False
I want to apply a count which expands using two criteria: each time column A changes value, or when succeeding rows are False. If succeeding rows are True the count should hold static. The desired output would be:
A B
0 True 1
1 True 1
2 False 2
3 False 3
4 False 4
5 False 5
6 False 6
7 True 7
8 True 7
9 True 7
10 False 8
I've faffed with a whole range of pandas functions and can't seem to figure it out.
Try:
1st condition: each time column A changes value: df.ne(df.shift())
2nd condition: when succeeding rows are False: df.eq(False)
and do a cumsum over the boolean mask:
>>> (df.ne(df.shift()) | df.eq(False)).cumsum()
A
0 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 7
9 7
10 8
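The two conditions can also be built as separate named masks, which makes it easier to check each one. A minimal sketch, assuming the column is worked on as a Series (the names `changed` and `is_false` are illustrative):

```python
import pandas as pd

A = pd.Series([True, True, False, False, False, False, False,
               True, True, True, False], name='A')

# Condition 1: the value differs from the previous row
# (the first row compares against NaN and therefore counts as a change).
changed = A.ne(A.shift())

# Condition 2: the row itself is False.
is_false = ~A

# The counter advances whenever either condition holds.
B = (changed | is_false).cumsum()
print(B.tolist())
```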

Get list of column names for columns that contain negative values

This is a simple question but I have found "slicing" DataFrames in Pandas frustrating, coming from R.
I have a DataFrame df below with 7 columns:
df
Out[77]:
fld1 fld2 fld3 fld4 fld5 fld6 fld7
0 8 8 -1 2 1 7 4
1 6 6 1 7 5 -1 3
2 2 5 4 2 2 8 1
3 -1 -1 7 2 3 2 0
4 6 6 4 2 0 5 2
5 -1 5 7 1 5 8 2
6 7 1 -1 0 1 8 1
7 6 2 4 1 2 6 1
8 3 4 4 5 8 -1 4
9 4 4 3 7 7 4 5
How do I slice df in such a way that it produces a list of columns that contain at least one negative number?
You can select them by building an appropriate Series and then using it to index into df:
>>> df < 0
fld1 fld2 fld3 fld4 fld5 fld6 fld7
0 False False True False False False False
1 False False False False False True False
2 False False False False False False False
3 True True False False False False False
4 False False False False False False False
5 True False False False False False False
6 False False True False False False False
7 False False False False False False False
8 False False False False False True False
9 False False False False False False False
>>> (df < 0).any()
fld1 True
fld2 True
fld3 True
fld4 False
fld5 False
fld6 True
fld7 False
dtype: bool
and then
>>> df.columns[(df < 0).any()]
Index(['fld1', 'fld2', 'fld3', 'fld6'], dtype='object')
or
>>> df.columns[(df < 0).any()].tolist()
['fld1', 'fld2', 'fld3', 'fld6']
depending on what data structure you want. We can also use this to index into df directly:
>>> df.loc[:,(df < 0).any()]
fld1 fld2 fld3 fld6
0 8 8 -1 7
1 6 6 1 -1
2 2 5 4 8
3 -1 -1 7 2
4 6 6 4 5
5 -1 5 7 8
6 7 1 -1 8
7 6 2 4 6
8 3 4 4 -1
9 4 4 3 4
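For anyone who wants to run the steps above, here is a self-contained sketch that rebuilds the DataFrame from the question and collects the column names. The variable names `has_negative` and `negative_cols` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'fld1': [8, 6, 2, -1, 6, -1, 7, 6, 3, 4],
    'fld2': [8, 6, 5, -1, 6, 5, 1, 2, 4, 4],
    'fld3': [-1, 1, 4, 7, 4, 7, -1, 4, 4, 3],
    'fld4': [2, 7, 2, 2, 2, 1, 0, 1, 5, 7],
    'fld5': [1, 5, 2, 3, 0, 5, 1, 2, 8, 7],
    'fld6': [7, -1, 8, 2, 5, 8, 8, 6, -1, 4],
    'fld7': [4, 3, 1, 0, 2, 2, 1, 1, 4, 5],
})

# One boolean per column: True if any value in that column is negative.
has_negative = (df < 0).any()

# Column names as a plain Python list.
negative_cols = df.columns[has_negative].tolist()

# The matching sub-DataFrame, selected by the same boolean Series.
negative_df = df.loc[:, has_negative]
print(negative_cols)
```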

creating a boolean indexing in for loop in pandas

I would like to get a subset of a pandas dataframe with boolean indexing.
The condition I want to test is like (df[var_0] == value_0) & ... & (df[var_n] == value_n), where the number n of variables involved can change. As a result, I am not able to write:
df = df[(df[var_0] == value_0) & ... & (df[var_n] == value_n)]
I could do something like:
for k in range(0, n+1):
    df = df[df[var_k] == value_k]
(with some try/except to make sure it works if the dataframe becomes empty), but that does not seem very efficient.
Does anyone have an idea of how to write that in a clean pandas formulation?
The isin method should work for you here.
In [7]: df
Out[7]:
a b c d e
0 6 3 1 9 6
1 8 9 5 7 2
2 6 4 7 4 3
3 4 8 0 0 5
4 4 4 2 3 4
5 2 5 9 0 9
6 4 8 2 9 1
7 3 0 8 9 7
8 0 5 9 9 6
9 0 7 8 4 8
[10 rows x 5 columns]
In [8]: vals = {'a': [3], 'b': [0], 'c': [8], 'd': [9], 'e': [7]}
In [9]: df.isin(vals)
Out[9]:
a b c d e
0 False False False True False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
5 False False False False False
6 False False False True False
7 True True True True True
8 False False False True False
9 False False True False False
[10 rows x 5 columns]
In [10]: df[df.isin(vals).all(axis=1)]
Out[10]:
a b c d e
7 3 0 8 9 7
[1 rows x 5 columns]
The values in the vals dict need to be a collection, so I put them into length-1 lists. It's possible that query can do this too.
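When only some of the columns are constrained, another option is to build one mask per condition and combine them with `&`. This is a different technique from the `isin` answer above, sketched under the assumption that the conditions live in a dict mapping column name to required value (the dict `conditions` and its contents are hypothetical):

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame({
    'a': [6, 8, 6, 4, 4, 2, 4, 3, 0, 0],
    'b': [3, 9, 4, 8, 4, 5, 8, 0, 5, 7],
    'c': [1, 5, 7, 0, 2, 9, 2, 8, 9, 8],
    'd': [9, 7, 4, 0, 3, 0, 9, 9, 9, 4],
    'e': [6, 2, 3, 5, 4, 9, 1, 7, 6, 8],
})

# Hypothetical conditions: any number of column/value pairs.
conditions = {'a': 4, 'b': 8}

# One boolean Series per condition, AND-ed together into a single mask.
masks = (df[col] == val for col, val in conditions.items())
mask = reduce(operator.and_, masks)

# A single indexing operation instead of repeated filtering in a loop.
subset = df[mask]
print(subset)
```

An empty result needs no special handling here: the mask is simply all False and `subset` is an empty DataFrame.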
