Fill grouping variable pandas dataframe - python

I have a pandas dataframe with an id column called doc_ID and a boolean column that reports if a certain value is below a threshold, like so:
df = pd.DataFrame({'doc_ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
'below_threshold': [False, False, False, False, True, False, False, True, False, False,
False, False, False, False, False, False, False, True, False, False]})
I'm trying to create a new grouping id within each doc_ID that extends from the first False value up to and including the first True value, like this:
doc_ID below_threshold new_group
0 1 False 1
1 1 False 1
2 1 False 1
3 1 False 1
4 1 True 1
5 1 False 2
6 2 False 3
7 2 True 3
8 2 False 4
9 2 False 4
10 2 False 4
11 2 False 4
12 3 False 5
13 3 False 5
14 3 False 5
15 3 False 5
16 3 False 5
17 3 True 5
18 3 False 6
19 3 False 6

IIUC, use:
m1 = ~df['below_threshold']
m2 = df.groupby('doc_ID')['below_threshold'].shift(fill_value=True)
df['new_group'] = (m1&m2).cumsum()
Output:
doc_ID below_threshold new_group
0 1 False 1
1 1 False 1
2 1 False 1
3 1 False 1
4 1 True 1
5 1 False 2
6 2 False 3
7 2 True 3
8 2 False 4
9 2 False 4
10 2 False 4
11 2 False 4
12 3 False 5
13 3 False 5
14 3 False 5
15 3 False 5
16 3 False 5
17 3 True 5
18 3 False 6
19 3 False 6
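Putting the answer together as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'doc_ID': [1] * 6 + [2] * 6 + [3] * 8,
    'below_threshold': [False, False, False, False, True, False,
                        False, True, False, False, False, False,
                        False, False, False, False, False, True, False, False],
})

# A new group starts where the current row is False and the previous row
# within the same doc_ID was True; fill_value=True also starts a group at
# the first row of each doc_ID.
m1 = ~df['below_threshold']
m2 = df.groupby('doc_ID')['below_threshold'].shift(fill_value=True)
df['new_group'] = (m1 & m2).cumsum()
```

The cumulative sum over the combined mask numbers the groups consecutively across all doc_IDs.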

Related

Pandas expanding count when column value changes

I have a dataframe
df = pd.DataFrame({'A': [True, True, False, False, False, False, False, True, True, True, False]})
A
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 True
8 True
9 True
10 False
I want to apply a count which expands using two criteria: each time column A changes value, or when succeeding rows are False. If succeeding rows are True the count should hold static. The desired output would be:
A B
0 True 1
1 True 1
2 False 2
3 False 3
4 False 4
5 False 5
6 False 6
7 True 7
8 True 7
9 True 7
10 False 8
I've faffed with a whole range of pandas functions and can't seem to figure it out.
Try:
1st condition: each time column A changes value: df.ne(df.shift())
2nd condition: when succeeding rows are False: df.eq(False)
and do a cumsum over the boolean mask:
>>> (df.ne(df.shift()) | df.eq(False)).cumsum()
A
0 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 7
9 7
10 8
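As a runnable check, with the mask assigned to a new column B:

```python
import pandas as pd

df = pd.DataFrame({'A': [True, True, False, False, False, False, False,
                         True, True, True, False]})

# Increment wherever the value changes OR the row itself is False;
# consecutive True rows fail both conditions, so the count holds static.
change = df['A'].ne(df['A'].shift())
is_false = ~df['A']
df['B'] = (change | is_false).cumsum()
```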

Group boolean values in Pandas Dataframe

I have a DataFrame with a random series of True/False values in a column:
import pandas as pd
df = pd.DataFrame(data={'A':[True, False, True, True, False, False, True, False, True, False, False]})
df
        A
0    True
1   False
2    True
3    True
4   False
5   False
6    True
7   False
8    True
9   False
10  False
and I want this (I don't know how to explain it in simple words):
        A  B
0    True  1
1   False  2
2    True  2
3    True  2
4   False  3
5   False  3
6    True  3
7   False  4
8    True  4
9   False  5
10  False  5
I've tried something with the following commands, but without success:
df['A'].shift()
df['A'].diff()
df['A'].eq()
Many thanks for your help.
Matthias
IIUC, you can try:
df['B'] = (df.A.shift() & ~df.A).cumsum() + 1
# OR df['B'] = (df.A.shift() & ~df.A).cumsum().add(1)
OUTPUT:
A B
0 True 1
1 False 2
2 True 2
3 True 2
4 False 3
5 False 3
6 True 3
7 False 4
8 True 4
9 False 5
10 False 5
A little bit of logic with diff:
(~df.A.astype(int).diff().ne(-1)).cumsum()+1
Out[234]:
0 1
1 2
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 5
10 5
Name: A, dtype: int32
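Both answers count the True -> False transitions; a runnable sketch of each (using fill_value=False so the shifted column stays boolean):

```python
import pandas as pd

df = pd.DataFrame({'A': [True, False, True, True, False, False,
                         True, False, True, False, False]})

# B increments on each True -> False transition, starting from 1.
prev = df['A'].shift(fill_value=False)
df['B'] = (prev & ~df['A']).cumsum() + 1

# Equivalent with diff: after casting to int, a True -> False step is -1.
alt = df['A'].astype(int).diff().eq(-1).cumsum() + 1
```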

How to create a Boolean Series based on the logic: (0 followed by a 1 is True. A 1 preceded by a 0 is True. All others are False)

I have the following DF. I'm trying to make a Boolean Series where the logic is:
(0 followed by a 1 is True. A 1 preceded by a 0 is True. All others are False)
Here is the DataFrame
df = pd.DataFrame({'A': {0: 1, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 1, 7: 1, 8: 1, 9: 1, 10: 0, 11: 0, 12: 1}})
A
0 1
1 0
2 1
3 1
4 1
5 0
6 1
7 1
8 1
9 1
10 0
11 0
12 1
Expected Output (0 followed by a 1 is True. A 1 preceded by a 0 is True. All others are False):
A Truth
0 1 False
1 0 True
2 1 True
3 1 False
4 1 False
5 0 True
6 1 True
7 1 False
8 1 False
9 1 False
10 0 False
11 0 True
12 1 True
My output using: df['Truth'] = (df['A'] == 0) | ((df['A'].shift() == 0) & (df['A'] == 1))
A Truth
0 1 False
1 0 True
2 1 True
3 1 False
4 1 False
5 0 True
6 1 True
7 1 False
8 1 False
9 1 False
10 0 True
11 0 True
12 1 True
I'm getting True on a zero, but a zero should only be True if it is followed by a one, not by another zero. Any help would be appreciated. Thanks.
Try:
cond1 = df['A'].diff().shift(-1).eq(1).where(df['A']==0)
df['Truth'] = df['A'].diff().eq(1).where(df['A'] == 1).fillna(cond1).astype('bool')
print(df)
Output:
A Truth
0 1 False
1 0 True
2 1 True
3 1 False
4 1 False
5 0 True
6 1 True
7 1 False
8 1 False
9 1 False
10 0 False
11 0 True
12 1 True
Check condition 1 and only set it where A == 0, then check condition 2 and only set it where A == 1; use fillna to combine the two conditions.
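The same rule can also be spelled out literally with shift; a sketch for comparison, which reproduces the expected output above:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1]})

# Literal translation of the rule: a 0 is True when the next value is 1;
# a 1 is True when the previous value is 0. NaN at the edges compares False.
nxt = df['A'].shift(-1)
prev = df['A'].shift()
df['Truth'] = (df['A'].eq(0) & nxt.eq(1)) | (df['A'].eq(1) & prev.eq(0))
```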
In your case the rolling sum of each adjacent pair should be 1:
df.A.rolling(2).sum()==1
0 False
1 True
2 True
3 False
4 False
5 True
6 True
7 False
8 False
9 False
10 True
11 False
12 True
You can use your logic of comparing each value with the previous one:
df['A'] != df['A'].shift(fill_value=df['A'].iloc[0])
Note that, like the rolling-sum answer, this flags every change point, so for the consecutive zeros it marks row 10 rather than row 11, which differs from the expected output above.
Output:
0 False
1 True
2 True
3 False
4 False
5 True
6 True
7 False
8 False
9 False
10 True
11 False
12 True
Name: A, dtype: bool

deleting rows based on number of true values per group - Python

I am trying to delete rows based on groupby and number of True values.
Per group, if there is only one True value (sum() == 1), I would like that single True row deleted.
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3,3,3], 'value': [True, True, False, True, False, False, False, False, True]})
print (df)
id value
0 1 True
1 1 True
2 1 False
3 2 True
4 2 False
5 2 False
6 3 False
7 3 False
8 3 True
df.groupby('id')['value'].sum()
Out[571]:
id
1 2.0
2 1.0
3 1.0
ids 2 & 3 match the criteria, but how do I delete those single True rows so that the dataframe becomes:
print (df)
id value
0 1 True
1 1 True
2 1 False
3 2 False
4 2 False
5 3 False
6 3 False
You can use a Boolean mask:
m1 = df.groupby('id')['value'].transform('sum') == 1
m2 = df['value']
df = df[~(m1 & m2)].reset_index(drop=True)
print(df)
id value
0 1 True
1 1 True
2 1 False
3 2 False
4 2 False
5 3 False
6 3 False
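The full sketch, end to end:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'value': [True, True, False, True, False, False,
                             False, False, True]})

# transform('sum') broadcasts each group's count of True values back to
# every row, so we can drop a True row only when it is the lone True
# within its id group.
lone_true = df.groupby('id')['value'].transform('sum').eq(1)
df = df[~(lone_true & df['value'])].reset_index(drop=True)
```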

Get list of column names for columns that contain negative values

This is a simple question but I have found "slicing" DataFrames in Pandas frustrating, coming from R.
I have a DataFrame df below with 7 columns:
df
Out[77]:
fld1 fld2 fld3 fld4 fld5 fld6 fld7
0 8 8 -1 2 1 7 4
1 6 6 1 7 5 -1 3
2 2 5 4 2 2 8 1
3 -1 -1 7 2 3 2 0
4 6 6 4 2 0 5 2
5 -1 5 7 1 5 8 2
6 7 1 -1 0 1 8 1
7 6 2 4 1 2 6 1
8 3 4 4 5 8 -1 4
9 4 4 3 7 7 4 5
How do I slice df in such a way that it produces a list of columns that contain at least one negative number?
You can select them by building an appropriate Series and then using it to index into df:
>>> df < 0
fld1 fld2 fld3 fld4 fld5 fld6 fld7
0 False False True False False False False
1 False False False False False True False
2 False False False False False False False
3 True True False False False False False
4 False False False False False False False
5 True False False False False False False
6 False False True False False False False
7 False False False False False False False
8 False False False False False True False
9 False False False False False False False
>>> (df < 0).any()
fld1 True
fld2 True
fld3 True
fld4 False
fld5 False
fld6 True
fld7 False
dtype: bool
and then
>>> df.columns[(df < 0).any()]
Index(['fld1', 'fld2', 'fld3', 'fld6'], dtype='object')
or
>>> df.columns[(df < 0).any()].tolist()
['fld1', 'fld2', 'fld3', 'fld6']
depending on what data structure you want. We can also use this to index into df directly:
>>> df.loc[:,(df < 0).any()]
fld1 fld2 fld3 fld6
0 8 8 -1 7
1 6 6 1 -1
2 2 5 4 8
3 -1 -1 7 2
4 6 6 4 5
5 -1 5 7 8
6 7 1 -1 8
7 6 2 4 6
8 3 4 4 -1
9 4 4 3 4
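Since the frame in the question was generated ad hoc, here is the same pattern on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'fld1': [8, -1, 2],
                   'fld2': [8, 5, 5],
                   'fld3': [-1, 4, 7],
                   'fld4': [2, 7, 2]})

# (df < 0) is an elementwise boolean frame; .any() reduces it per column,
# and indexing df.columns with that mask keeps only matching column names.
neg_cols = df.columns[(df < 0).any()].tolist()
```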
