I would like to get a subset of a pandas dataframe with boolean indexing.
The condition I want to test is like (df[var_0] == value_0) & ... & (df[var_n] == value_n) where the number n of variables involved can change. As a result I am not able to write :
df = df[(df[var_0] == value_0) & ... & (df[var_n] == value_n)]
I could do something like :
for k in range(0,n+1) :
df = df[df[var_k] == value_k]
(with some try catch to make sure it works if the dataframe goes empty), but that does not seems very efficient.
Has anyone an idea on how to write that in a clean pandas formulation ?
The isin method should work for you here.
In [7]: df
Out[7]:
a b c d e
0 6 3 1 9 6
1 8 9 5 7 2
2 6 4 7 4 3
3 4 8 0 0 5
4 4 4 2 3 4
5 2 5 9 0 9
6 4 8 2 9 1
7 3 0 8 9 7
8 0 5 9 9 6
9 0 7 8 4 8
[10 rows x 5 columns]
In [8]: vals = {'a': [3], 'b': [0], 'c': [8], 'd': [9], 'e': [7]}
In [9]: df.isin(vals)
Out[9]:
a b c d e
0 False False False True False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
5 False False False False False
6 False False False True False
7 True True True True True
8 False False False True False
9 False False True False False
[10 rows x 5 columns]
In [10]: df[df.isin(vals).all(1)]
Out[10]:
a b c d e
7 3 0 8 9 7
[1 rows x 5 columns]
The values in the vals dict need to be a collection, so I put them into length 1 lists. It's possibly that query can do this too.
Related
I am using Python and have the following Pandas Dataframe:
idx
result
grouping
1
False
2
True
3
True
4
False
5
True
6
True
7
True
8
False
9
True
10
True
11
True
12
True
What I would like is to do the following logic...
if the result is False then I want grouping to be the idx value.
if the result is True then I want the grouping to be the previous grouping value
So the end result will be:
idx
result
grouping
1
False
1
2
True
1
3
True
1
4
False
4
5
True
4
6
True
4
7
True
4
8
False
8
9
True
8
10
True
8
11
True
8
12
True
8
I have tried all sorts to get this working from using the Pandas shift() command to using lambda, but I am just not getting it.
I know I could iterate through the dataframe and perform the calculation but there has to be a better method.
examples of what I have tried and failed with are:
df['grouping'] = df['idx'] if not df['result'] else df['grouping'].shift(1)
df['grouping'] = df.apply(lambda x: x['idx'] if not x['result'] else x['grouping'].shift(1), axis=1)
Many Thanks for any assistance you can provide.
mask true values then forward fill
df['grouping'] = df['idx'].mask(df['result']).ffill(downcast='infer')
idx result grouping
0 1 False 1
1 2 True 1
2 3 True 1
3 4 False 4
4 5 True 4
5 6 True 4
6 7 True 4
7 8 False 8
8 9 True 8
9 10 True 8
10 11 True 8
11 12 True 8
i have a Dataframe with a random series of True, False in a column:
import pandas as pd
df = pd.DataFrame(data={'A':[True, False, True, True, False, False, True, False, True, False, False]})
df
A
0
True
1
False
2
True
3
True
4
False
5
False
6
True
7
False
8
True
9
False
10
False
and i want this: (Dont know how to explain it with easy words)
A
B
0
True
1
1
False
2
2
True
2
3
True
2
4
False
3
5
False
3
6
True
3
7
False
4
8
True
4
9
False
5
10
False
5
I've tried something with the following commands, but without success:
df[A].shift()
df[A].diff()
df[A].eq()
Many thanks for your help.
Matthias
IIUC, you can try:
df['B'] = (df.A.shift() & ~df.A).cumsum() + 1
# OR df['B'] = (df.A.shift() & ~df.A).cumsum().add(1)
OUTPUT:
A B
0 True 1
1 False 2
2 True 2
3 True 2
4 False 3
5 False 3
6 True 3
7 False 4
8 True 4
9 False 5
10 False 5
A little bit logic with diff
(~df.A.astype(int).diff().ne(-1)).cumsum()+1
Out[234]:
0 1
1 2
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 5
10 5
Name: A, dtype: int32
My dataframe df:
SCHOOL CLASS GRADE
A Spanish nan
A Spanish nan
A Math 4000
A Math 7830
A Math 3893
B . nan
B . nan
B Biology 1929
B Biology 4839
B Biology 8195
C Spanish nan
C English 2003
C English 1000
C Biology 4839
C Biology 8191
If I do:
school_has_only_two_classes = df.groupby('SCHOOL').CLASS
.transform(lambda series: series.nunique()) == 2
I get
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 False
11 False
12 False
13 False
14 False
15 False
The transform works fine for the school C. BUT, if I do:
school_has_spanish = df.groupby('SCHOOL').CLASS.transform(lambda series: series.str.contains('^Spanish$',regex=True))
or
school_has_spanish = df.groupby('SCHOOL').CLASS.transform(lambda series: series=='Spanish')
I get the following result which is not what I was expecting:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 False
12 False
13 False
14 False
15 False
The transform just does not spread all True's to the other rows of the group. Result I was expecting:
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 True
13 True
14 True
15 True
Any help is appreciated.
Check any with contains
df.CLASS.str.contains('Spanish').groupby(df.SCHOOL).transform('any')
Out[230]:
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 True
13 True
14 True
Name: CLASS, dtype: bool
I want to check if any value of column 'c' is smaller than all previous column values.
In my current approach I am using pandas diff(), but it let's me only compare to the previous value.
import pandas as pd
df = pd.DataFrame({'c': [1, 4, 9, 7, 8, 36]})
df['diff'] = df['c'].diff() < 0
print(df)
Current result:
c diff
0 1 False
1 4 False
2 9 False
3 7 True
4 8 False
5 36 False
Wanted result:
c diff
0 1 False
1 4 False
2 9 False
3 7 True
4 8 True
5 36 False
So row 4 should also result in a True, as 8 is smaller than 9.
Thanks
This should work:
df['diff'] = df['c'] < df['c'].cummax()
Output is just as you mentioned:
c diff
0 1 False
1 4 False
2 9 False
3 7 True
4 8 True
5 36 False
This is a simple question but I have found "slicing" DataFrames in Pandas frustrating, coming from R.
I have a DataFrame df below with 7 columns:
df
Out[77]:
fld1 fld2 fld3 fld4 fld5 fld6 fld7
0 8 8 -1 2 1 7 4
1 6 6 1 7 5 -1 3
2 2 5 4 2 2 8 1
3 -1 -1 7 2 3 2 0
4 6 6 4 2 0 5 2
5 -1 5 7 1 5 8 2
6 7 1 -1 0 1 8 1
7 6 2 4 1 2 6 1
8 3 4 4 5 8 -1 4
9 4 4 3 7 7 4 5
How do I slice df in such a way that it produces a list of columns that contain at least one negative number?
You can select them by building an appropriate Series and then using it to index into df:
>>> df < 0
fld1 fld2 fld3 fld4 fld5 fld6 fld7
0 False False True False False False False
1 False False False False False True False
2 False False False False False False False
3 True True False False False False False
4 False False False False False False False
5 True False False False False False False
6 False False True False False False False
7 False False False False False False False
8 False False False False False True False
9 False False False False False False False
>>> (df < 0).any()
fld1 True
fld2 True
fld3 True
fld4 False
fld5 False
fld6 True
fld7 False
dtype: bool
and then
>>> df.columns[(df < 0).any()]
Index(['fld1', 'fld2', 'fld3', 'fld6'], dtype='object')
or
>>> df.columns[(df < 0).any()].tolist()
['fld1', 'fld2', 'fld3', 'fld6']
depending on what data structure you want. We can also use this io index into df directly:
>>> df.loc[:,(df < 0).any()]
fld1 fld2 fld3 fld6
0 8 8 -1 7
1 6 6 1 -1
2 2 5 4 8
3 -1 -1 7 2
4 6 6 4 5
5 -1 5 7 8
6 7 1 -1 8
7 6 2 4 6
8 3 4 4 -1
9 4 4 3 4