I am trying to delete rows based on a groupby and the number of True values.
For each group that has exactly one True value (sum() == 1), I would like that single True row deleted.
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3,3,3], 'value': [True, True, False, True, False, False, False, False, True]})
print (df)
id value
0 1 True
1 1 True
2 1 False
3 2 True
4 2 False
5 2 False
6 3 False
7 3 False
8 3 True
df.groupby('id')['value'].sum()
Out[571]:
id
1 2.0
2 1.0
3 1.0
ids 2 and 3 match the criteria, but how do I delete those single True rows so that the dataframe becomes:
print (df)
id value
0 1 True
1 1 True
2 1 False
3 2 False
4 2 False
5 3 False
6 3 False
You can use a Boolean mask:
m1 = df.groupby('id')['value'].transform('sum') == 1
m2 = df['value']
df = df[~(m1 & m2)].reset_index(drop=True)
print(df)
id value
0 1 True
1 1 True
2 1 False
3 2 False
4 2 False
5 3 False
6 3 False
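If you need this in more than one place, the same mask can be wrapped in a small helper. This is only a sketch: the name drop_lone_true and its parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd

def drop_lone_true(frame, group_col, flag_col):
    """Hypothetical helper: drop the True row of every group that has exactly one True."""
    # Broadcast each group's count of True values back onto its own rows.
    true_count = frame.groupby(group_col)[flag_col].transform('sum')
    # A row is dropped only if it is True and it is the sole True in its group.
    lone_true = (true_count == 1) & frame[flag_col]
    return frame[~lone_true].reset_index(drop=True)

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'value': [True, True, False, True, False, False, False, False, True]})
result = drop_lone_true(df, 'id', 'value')
print(result)
```

transform is used rather than sum() alone because it returns one value per original row, so the result lines up with df['value'] and the two masks can be combined directly.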
I have a pandas dataframe like below.
id A B C
0 1 1 1 1
1 1 5 7 2
2 2 6 9 3
3 3 1 5 4
4 3 4 6 2
After evaluating conditions,
id A B C a_greater_than_b b_greater_than_c c_greater_than_a
0 1 1 1 1 False False False
1 1 5 7 2 False True False
2 2 6 9 3 False True False
3 3 1 5 4 False True True
4 3 4 6 2 False True False
After evaluating the conditions, I want to aggregate the results per id:
id a_greater_than_b b_greater_than_c c_greater_than_a
1 False False False
2 False True False
3 False True False
The logic is not fully clear, but you can combine pandas.get_dummies with a per-group aggregation (here I am assuming min, since your example showed that 1/1/0 -> 0 and 1/1/1 -> 1, but you can use other logic, e.g. last if you want the last row per group after sorting by date):
out = (pd
.get_dummies(df[['color', 'size']])
.groupby(df['id'])
.min()
)
print(out)
Output:
color_blue color_yellow size_l
id
A1 0 0 1
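For the A/B/C frame shown in the question, the same per-group idea might look like the sketch below. It assumes the intended rule is "True only if every row of the id is True", which is what min means for Booleans and which reproduces the expected table; all() is used here as the Boolean equivalent of min.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 3, 3],
    'A': [1, 5, 6, 1, 4],
    'B': [1, 7, 9, 5, 6],
    'C': [1, 2, 3, 4, 2],
})

# Evaluate the three row-wise conditions from the question.
conditions = pd.DataFrame({
    'a_greater_than_b': df['A'] > df['B'],
    'b_greater_than_c': df['B'] > df['C'],
    'c_greater_than_a': df['C'] > df['A'],
})

# Assumption: a condition holds for an id only if it holds on every row of that id.
out = conditions.groupby(df['id']).all()
print(out)
```

If "at least one row" is the rule you want instead, any() is the counterpart of all().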
I have a dataframe
df = pd.DataFrame({'A': [True, True, False, False, False, False, False, True, True, True, False]})
A
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 True
8 True
9 True
10 False
I want to apply a count which expands using two criteria: each time column A changes value, or when succeeding rows are False. If succeeding rows are True the count should hold static. The desired output would be:
A B
0 True 1
1 True 1
2 False 2
3 False 3
4 False 4
5 False 5
6 False 6
7 True 7
8 True 7
9 True 7
10 False 8
I've faffed with a whole range of pandas functions and can't seem to figure it out.
Try:
1st condition: each time column A changes value: df.ne(df.shift())
2nd condition: when succeeding rows are False: df.eq(False)
and do a cumsum over the boolean mask:
>>> (df.ne(df.shift()) | df.eq(False)).cumsum()
A
0 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 7
9 7
10 8
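Spelled out step by step on the column itself, the same idea can be written as follows. This is a sketch; the intermediate names changed and is_false are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [True, True, False, False, False, False, False, True, True, True, False]})

# True wherever A differs from the previous row (the first row always counts as a change).
changed = df['A'].ne(df['A'].shift())
# True on every False row, so runs of False keep incrementing.
is_false = ~df['A']

# Each True in the combined mask bumps the running count by one.
df['B'] = (changed | is_false).cumsum()
print(df)
```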
I have a DataFrame with a random series of True and False values in a column:
import pandas as pd
df = pd.DataFrame(data={'A':[True, False, True, True, False, False, True, False, True, False, False]})
df
A
0 True
1 False
2 True
3 True
4 False
5 False
6 True
7 False
8 True
9 False
10 False
and I want this (I don't know how to explain it in simple words):
A B
0 True 1
1 False 2
2 True 2
3 True 2
4 False 3
5 False 3
6 True 3
7 False 4
8 True 4
9 False 5
10 False 5
I've tried something with the following commands, but without success:
df['A'].shift()
df['A'].diff()
df['A'].eq()
Many thanks for your help.
Matthias
IIUC, you can try:
df['B'] = (df.A.shift() & ~df.A).cumsum() + 1
# OR df['B'] = (df.A.shift() & ~df.A).cumsum().add(1)
OUTPUT:
A B
0 True 1
1 False 2
2 True 2
3 True 2
4 False 3
5 False 3
6 True 3
7 False 4
8 True 4
9 False 5
10 False 5
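One caveat: a plain shift() leaves a missing value in the first row, so the shifted column is no longer a clean Boolean column. A variant that stays Boolean throughout is sketched below; it assumes the first row should be treated as if it were preceded by False, which gives the same result on this data.

```python
import pandas as pd

df = pd.DataFrame({'A': [True, False, True, True, False, False, True, False, True, False, False]})

# Previous row's value, with the missing first entry filled as False.
prev = df['A'].shift(fill_value=False)

# A new group starts wherever A goes from True to False.
starts_new_group = prev & ~df['A']
df['B'] = starts_new_group.cumsum() + 1
print(df)
```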
A bit of logic with diff:
(~df.A.astype(int).diff().ne(-1)).cumsum()+1
Out[234]:
0 1
1 2
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 5
10 5
Name: A, dtype: int32
I am trying to create a count that accumulates streaks but can be cancelled by a different column. There are three outcomes in this count:
The streak accumulates when flag == true.
The streak resets when cancel == true.
Otherwise the streak does nothing and repeats its current value.
I have tried several different approaches to combining the flag and cancel columns, using np.where, masking groupby with where, multiple cumsums, fills, and ngroup, but cannot get the result I want.
import numpy as np
import pandas as pd
df = pd.DataFrame(
{
"cond1": [True, False, True, False, True, False, True],
"cond2": [False, False, False, True, False, False, False]
})
df['flag'] = np.where(df['cond1'], 1, 0)
df['cancel'] = np.where(df['cond2'], 1, 0)
# Combined
df['combined'] = df['flag'] - df['cancel']
# Cumsum only
df['cumsum'] = df['combined'].cumsum()
# Cumcount masked by where
df['cumsum_cumcount'] = df.where(df['cond1']).groupby((df['cond2']).cumsum()).cumcount()
# Cumcount then cumsum
df['cumsum_cumcount_cumsum'] = df.where(df['cancel'] == False).groupby(df['flag'].cumsum()).cumcount().cumsum()
cond1 cond2 flag cancel c2 c3 c1
0 True False 1 0 0 0 1
1 False False 0 0 1 1 1
2 True False 1 0 2 1 2
3 False True 0 1 0 2 1
4 True False 1 0 1 2 2
5 False False 0 0 2 3 2
6 True False 1 0 3 3 3
cond1 cond2 streak
0 True False 1
1 False False 1
2 True False 2
3 False True 0
4 True False 1
5 False False 1
6 True False 2
7 True False 3
8 False False 3
9 True False 4
10 False True 0
11 False False 0
12 True False 1
The current streak repeats, accumulates when cond1 is true and resets when cond2 is true. Big bonus points if this could accumulate in the opposite direction too without too much hassle, with cancels counting as negatives and flags as positives.
It seems like you need a cumsum of cond2 to create the group key, then a cumsum of cond1 within each group:
df.groupby(df.cond2.cumsum()).cond1.cumsum()
Out[155]:
0 1.0
1 1.0
2 2.0
3 0.0
4 1.0
5 1.0
6 2.0
7 3.0
8 3.0
9 4.0
10 0.0
11 0.0
12 1.0
Name: cond1, dtype: float64
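Here is a self-contained sketch of that approach. The 13 rows are taken from the desired-output table in the question, and the cast to int is my own addition so the result is an integer count regardless of pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'cond1': [True, False, True, False, True, False, True,
              True, False, True, False, False, True],
    'cond2': [False, False, False, True, False, False, False,
              False, False, False, True, False, False],
})

# Every True in cond2 starts a new group, so the running sum works as a group key.
group_key = df['cond2'].cumsum()

# Within each group, count how many cond1 flags have been seen so far.
df['streak'] = df['cond1'].astype(int).groupby(group_key).cumsum()
print(df)
```

Note that a row where cond1 and cond2 are both True would start a new group and count 1 straight away; the sample data has no such row, so decide separately how you want that case handled.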
I want to check whether each value of column 'c' is smaller than any previous value in that column.
In my current approach I am using pandas diff(), but it only lets me compare to the immediately preceding value.
import pandas as pd
df = pd.DataFrame({'c': [1, 4, 9, 7, 8, 36]})
df['diff'] = df['c'].diff() < 0
print(df)
Current result:
c diff
0 1 False
1 4 False
2 9 False
3 7 True
4 8 False
5 36 False
Wanted result:
c diff
0 1 False
1 4 False
2 9 False
3 7 True
4 8 True
5 36 False
So row 4 should also result in a True, as 8 is smaller than 9.
Thanks
This should work:
df['diff'] = df['c'] < df['c'].cummax()
Output is just as you mentioned:
c diff
0 1 False
1 4 False
2 9 False
3 7 True
4 8 True
5 36 False
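Why this works: cummax() includes the current row, but a value can never be smaller than a running maximum it set itself, so the comparison is only True when an earlier row was larger. The sketch below checks that against a version that looks strictly at earlier rows; the column names diff_strict and prev_max are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'c': [1, 4, 9, 7, 8, 36]})

# Running maximum including the current row.
df['diff'] = df['c'] < df['c'].cummax()

# Running maximum of strictly earlier rows; the first row has none (NaN),
# and a comparison against NaN evaluates to False.
prev_max = df['c'].cummax().shift()
df['diff_strict'] = df['c'] < prev_max

print(df)
```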