My dataframe df:
SCHOOL CLASS GRADE
A Spanish nan
A Spanish nan
A Math 4000
A Math 7830
A Math 3893
B . nan
B . nan
B Biology 1929
B Biology 4839
B Biology 8195
C Spanish nan
C English 2003
C English 1000
C Biology 4839
C Biology 8191
If I do:
school_has_only_two_classes = df.groupby('SCHOOL').CLASS
.transform(lambda series: series.nunique()) == 2
I get
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 False
11 False
12 False
13 False
14 False
15 False
The transform works fine for the school C. BUT, if I do:
school_has_spanish = df.groupby('SCHOOL').CLASS.transform(lambda series: series.str.contains('^Spanish$',regex=True))
or
school_has_spanish = df.groupby('SCHOOL').CLASS.transform(lambda series: series=='Spanish')
I get the following result which is not what I was expecting:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 False
12 False
13 False
14 False
15 False
The transform just does not spread all True's to the other rows of the group. Result I was expecting:
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 True
13 True
14 True
15 True
Any help is appreciated.
Check any with contains
df.CLASS.str.contains('Spanish').groupby(df.SCHOOL).transform('any')
Out[230]:
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 True
13 True
14 True
Name: CLASS, dtype: bool
Related
I am using Python and have the following Pandas Dataframe:
idx
result
grouping
1
False
2
True
3
True
4
False
5
True
6
True
7
True
8
False
9
True
10
True
11
True
12
True
What I would like is to do the following logic...
if the result is False then I want grouping to be the idx value.
if the result is True then I want the grouping to be the previous grouping value
So the end result will be:
idx
result
grouping
1
False
1
2
True
1
3
True
1
4
False
4
5
True
4
6
True
4
7
True
4
8
False
8
9
True
8
10
True
8
11
True
8
12
True
8
I have tried all sorts to get this working from using the Pandas shift() command to using lambda, but I am just not getting it.
I know I could iterate through the dataframe and perform the calculation but there has to be a better method.
examples of what I have tried and failed with are:
df['grouping'] = df['idx'] if not df['result'] else df['grouping'].shift(1)
df['grouping'] = df.apply(lambda x: x['idx'] if not x['result'] else x['grouping'].shift(1), axis=1)
Many Thanks for any assistance you can provide.
mask true values then forward fill
df['grouping'] = df['idx'].mask(df['result']).ffill(downcast='infer')
idx result grouping
0 1 False 1
1 2 True 1
2 3 True 1
3 4 False 4
4 5 True 4
5 6 True 4
6 7 True 4
7 8 False 8
8 9 True 8
9 10 True 8
10 11 True 8
11 12 True 8
I want to return a boolean index using separate columns. Where End is in Item, I want to return False.
I'm meeting those conditions but I want to account for all unique values in Seq. For each unique group in Seq, if any row matches the previous condition, then return False for all those unique groups.
df = pd.DataFrame({
'Item' : ['Start','A','B','B','G','Start','A','B','B','A','X','Start','A','H'],
})
End = ['X','Y','Z']
df['Seq'] = df['Item'].eq('Start').groupby(df['Item'].eq('Start').cumsum()).transform('idxmax')
m2 = df.Item.isin(End)
out:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 False
11 True
12 True
13 True
intended out:
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 False
8 False
9 False
10 False
11 True
12 True
13 True
Instead of idxmax, use max and then negate the result:
~df.Item.isin(End).groupby(df.Item.eq('Start').cumsum()).transform('max')
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
8 False
9 False
10 False
11 True
12 True
13 True
Name: Item, dtype: bool
To exclude row with Start:
~(df.Item.isin(End).groupby(df.Item.eq('Start').cumsum()).transform('max') & df.Item.ne('Start'))
Group the boolean mask m2 by Seq and transform with any then negate the output
~(m2.groupby(df['Seq']).transform('any'))
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
8 False
9 False
10 False
11 True
12 True
13 True
Name: Item, dtype: bool
i have a Dataframe with a random series of True, False in a column:
import pandas as pd
df = pd.DataFrame(data={'A':[True, False, True, True, False, False, True, False, True, False, False]})
df
A
0
True
1
False
2
True
3
True
4
False
5
False
6
True
7
False
8
True
9
False
10
False
and i want this: (Dont know how to explain it with easy words)
A
B
0
True
1
1
False
2
2
True
2
3
True
2
4
False
3
5
False
3
6
True
3
7
False
4
8
True
4
9
False
5
10
False
5
I've tried something with the following commands, but without success:
df[A].shift()
df[A].diff()
df[A].eq()
Many thanks for your help.
Matthias
IIUC, you can try:
df['B'] = (df.A.shift() & ~df.A).cumsum() + 1
# OR df['B'] = (df.A.shift() & ~df.A).cumsum().add(1)
OUTPUT:
A B
0 True 1
1 False 2
2 True 2
3 True 2
4 False 3
5 False 3
6 True 3
7 False 4
8 True 4
9 False 5
10 False 5
A little bit logic with diff
(~df.A.astype(int).diff().ne(-1)).cumsum()+1
Out[234]:
0 1
1 2
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 5
10 5
Name: A, dtype: int32
I have a dataframe like this:
Bool Hour
0 False 12
1 False 24
2 False 12
3 False 24
4 True 12
5 False 24
6 False 12
7 False 24
8 False 12
9 False 24
10 False 12
11 True 24
and I would like to backfill the True value in 'Bool' column to the point when 'Hour' first reaches '12'. The result would be something like this:
Bool Hour Result
0 False 12 False
1 False 24 False
2 False 12 True <- desired backfill
3 False 24 True <- desired backfill
4 True 12 True
5 False 24 False
6 False 12 False
7 False 24 False
8 False 12 False
9 False 24 False
10 False 12 True <- desired backfill
11 True 24 True
Any help is greatly appreciated! Thank you very much!
This is a little bit hard to achieve , here we can use groupby with idxmax
s=(~df.Bool&df.Hour.eq(12)).iloc[::-1].groupby(df.Bool.iloc[::-1].cumsum()).transform('idxmax')
df['result']=df.index>=s.iloc[::-1]
df
Out[375]:
Bool Hour result
0 False 12 False
1 False 24 False
2 False 12 True
3 False 24 True
4 True 12 True
5 False 24 False
6 False 12 False
7 False 24 False
8 False 12 False
9 False 24 False
10 False 12 True
11 True 24 True
IIUC, you can do:
s = df['Bool'].shift(-1)
df['Result'] = df['Bool'] | s.where(s).groupby(df['Hour'].eq(12).cumsum()).bfill()
Output:
Bool Hour Result
0 False 12 False
1 False 24 False
2 False 12 True
3 False 24 True
4 True 12 True
5 False 24 False
6 False 12 False
7 False 24 False
8 False 12 False
9 False 24 False
10 False 12 True
11 True 24 True
create a groupID s on consecutive False and separate True from them. Groupby on Hour equals 12 by using s. Use transform sum and cumsum to get the count of True on 12 from bottom-up on each group and return True on 0 and or with values of Bool
s = df.Bool.ne(df.Bool.shift()).cumsum()
s1 = df.where(df.Bool).Bool.bfill()
g = df.Hour.eq(12).groupby(s)
df['bfill_Bool'] = (g.transform('sum') - g.cumsum()).eq(0) & s1 | df.Bool
Out[905]:
Bool Hour bfill_Bool
0 False 12 False
1 False 24 False
2 False 12 True
3 False 24 True
4 True 12 True
5 False 24 False
6 False 12 False
7 False 24 False
8 False 12 False
9 False 24 False
10 False 12 True
11 True 24 True
This is a simple question but I have found "slicing" DataFrames in Pandas frustrating, coming from R.
I have a DataFrame df below with 7 columns:
df
Out[77]:
fld1 fld2 fld3 fld4 fld5 fld6 fld7
0 8 8 -1 2 1 7 4
1 6 6 1 7 5 -1 3
2 2 5 4 2 2 8 1
3 -1 -1 7 2 3 2 0
4 6 6 4 2 0 5 2
5 -1 5 7 1 5 8 2
6 7 1 -1 0 1 8 1
7 6 2 4 1 2 6 1
8 3 4 4 5 8 -1 4
9 4 4 3 7 7 4 5
How do I slice df in such a way that it produces a list of columns that contain at least one negative number?
You can select them by building an appropriate Series and then using it to index into df:
>>> df < 0
fld1 fld2 fld3 fld4 fld5 fld6 fld7
0 False False True False False False False
1 False False False False False True False
2 False False False False False False False
3 True True False False False False False
4 False False False False False False False
5 True False False False False False False
6 False False True False False False False
7 False False False False False False False
8 False False False False False True False
9 False False False False False False False
>>> (df < 0).any()
fld1 True
fld2 True
fld3 True
fld4 False
fld5 False
fld6 True
fld7 False
dtype: bool
and then
>>> df.columns[(df < 0).any()]
Index(['fld1', 'fld2', 'fld3', 'fld6'], dtype='object')
or
>>> df.columns[(df < 0).any()].tolist()
['fld1', 'fld2', 'fld3', 'fld6']
depending on what data structure you want. We can also use this io index into df directly:
>>> df.loc[:,(df < 0).any()]
fld1 fld2 fld3 fld6
0 8 8 -1 7
1 6 6 1 -1
2 2 5 4 8
3 -1 -1 7 2
4 6 6 4 5
5 -1 5 7 8
6 7 1 -1 8
7 6 2 4 6
8 3 4 4 -1
9 4 4 3 4