Example.
Let's say I have a dataframe with several columns and I want to select the rows that match all 4 conditions; I would write:
condition = (df['A'] < 10) & (df['B'] < 10) & (df['C'] < 10) & (df['D'] < 10)
df.loc[condition]
Conversely, if I want to select the rows that match any of the 4 conditions, I would write:
condition = (df['A'] < 10) | (df['B'] < 10) | (df['C'] < 10) | (df['D'] < 10)
df.loc[condition]
Now, what if I want to select rows that match any two of those 4 conditions? Those would be rows matching any of the column pairs (A and B), (A and C), (A and D), (B and C), (B and D) or (C and D). Obviously, I could write out a complex condition with all those combinations:
condition = ((df['A'] < 10) & (df['B'] < 10)) |\
            ((df['A'] < 10) & (df['C'] < 10)) |\
            ((df['A'] < 10) & (df['D'] < 10)) |\
            ((df['B'] < 10) & (df['C'] < 10)) |\
            ((df['B'] < 10) & (df['D'] < 10)) |\
            ((df['C'] < 10) & (df['D'] < 10))
df.loc[condition]
But if there are 50 columns and I want to match any 20 of those 50, it becomes impossible to list every combination in the condition. Is there a better way to do this?
Since True == 1 and False == 0, you can find the rows that satisfy at least N conditions by checking the sum. Series have most of the basic comparisons as methods, so you can build a single list of assorted checks and use getattr to keep it tidy.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 20, (5,4)), columns=list('ABCD'))
# can check `eq`, `lt`, `le`, `gt`, `ge`, `isin`
cond_list = [('A', 'lt', 10), ('B', 'ge', 10), ('D', 'eq', 4),
('C', 'isin', [2, 4, 6])]
df_c = pd.concat([getattr(df[col], attr)(val).astype(int)
for col,attr,val in cond_list], axis=1)
# A B D C
#0 0 0 0 1
#1 0 1 0 0
#2 1 1 0 0
#3 1 1 0 0
#4 0 1 0 1
# Each row satisfies this many conditions
df_c.sum(1)
#0 1
#1 1
#2 2
#3 2
#4 2
#dtype: int64
#Select those that satisfy at least 2.
df[df_c.sum(1).ge(2)]
# A B C D
#2 0 17 15 9
#3 0 14 0 15
#4 19 14 4 0
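As an aside, for the original question's uniform conditions (every column compared against the same threshold), the sum trick collapses to a one-liner; a sketch for the 50-column case:

# rows where at least 20 of the 50 columns are below 10
df[(df < 10).sum(axis=1).ge(20)]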
If you need more complicated comparisons that aren't possible with getattr, you can write them out yourself and concat that list of Series:
df_c = pd.concat([df['A'].lt(10), df['B'].ge(10), df['D'].eq(4),
df['C'].isin([2,4,6])],
axis=1).astype(int)
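If this pattern comes up often, it may be worth wrapping in a small helper (at_least is a made-up name, not a pandas function):

def at_least(df, conds, n):
    """Rows of df where at least n of the boolean Series in conds are True."""
    return df[pd.concat(conds, axis=1).sum(axis=1).ge(n)]

at_least(df, [df['A'].lt(10), df['B'].ge(10), df['D'].eq(4), df['C'].isin([2, 4, 6])], 2)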
Here's a method using itertools.combinations to build every pair of our conditions. A row that satisfies any two of the original conditions makes at least one pair simultaneously True, so we keep the rows where some pairwise AND holds:
# test dataframe
np.random.seed(10)
df = pd.DataFrame(np.random.randint(20, size=(10,5)), columns=list('ABCDE'))
print(df)
A B C D E
0 9 4 15 0 17
1 16 17 8 9 0
2 10 8 4 19 16
3 4 15 11 11 1
4 8 4 14 17 19
5 13 5 13 19 13
6 12 1 4 18 13
7 11 10 9 15 18
8 16 7 11 17 14
9 7 11 1 0 12
from itertools import combinations
conditions = [(df['A'] < 10), (df['B'] < 15), (df['C'] >= 5), (df['D'] <= 9)]
mask = pd.concat([x & y for x, y in combinations(conditions, 2)], axis=1).any(axis=1)
df[mask]
    A   B   C   D   E
0   9   4  15   0  17
1  16  17   8   9   0
3   4  15  11  11   1
4   8   4  14  17  19
5  13   5  13  19  13
7  11  10   9  15  18
8  16   7  11  17  14
9   7  11   1   0  12
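The same idea generalizes to "any N of M": take combinations of size N, AND each one, and keep the rows where at least one combination holds. A sketch (any_n is a made-up helper; for large M the sum-of-booleans approach from the first answer scales far better, since the number of combinations explodes):

from functools import reduce
from itertools import combinations
from operator import and_

def any_n(conds, n):
    # True where at least one size-n combination of conditions is fully satisfied
    combos = [reduce(and_, c) for c in combinations(conds, n)]
    return pd.concat(combos, axis=1).any(axis=1)

df[any_n(conditions, 2)]  # same rows as above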
For example, I have a simple DF:
import pandas as pd
from random import randint
df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
'B': [randint(1, 9)*10 for x in range(10)],
'C': [randint(1, 9)*100 for x in range(10)]})
Can I select the values from 'A' for which the corresponding values in 'B' are greater than 50 and the values in 'C' are not equal to 900, using the methods and idioms of pandas?
Sure! Setup:
>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
...                    'B': [randint(1, 9)*10 for x in range(10)],
...                    'C': [randint(1, 9)*100 for x in range(10)]})
>>> df
A B C
0 9 40 300
1 9 70 700
2 5 70 900
3 8 80 900
4 7 50 200
5 9 30 900
6 2 80 700
7 2 80 400
8 5 80 300
9 7 70 800
We can apply column operations and get boolean Series objects:
>>> df["B"] > 50
0 False
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 True
9 True
Name: B, dtype: bool
>>> (df["B"] > 50) & (df["C"] == 900)
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9    False
dtype: bool
[Update, to switch to new-style .loc]:
And then we can use these to index into the object. For read access, you can chain indices:
>>> df["A"][(df["B"] > 50) & (df["C"] == 900)]
2 5
3 8
Name: A, dtype: int64
but this can get you into trouble for write access, because of the difference between a view and a copy. Use .loc instead:
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"]
2 5
3 8
Name: A, dtype: int64
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"].values
array([5, 8], dtype=int64)
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"] *= 1000
>>> df
A B C
0 9 40 300
1 9 70 700
2 5000 70 900
3 8000 80 900
4 7 50 200
5 9 30 900
6 2 80 700
7 2 80 400
8 5 80 300
9 7 70 800
Note that I accidentally typed == 900 and not != 900, or ~(df["C"] == 900), but I'm too lazy to fix it. Exercise for the reader. :^)
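(For completeness, a sketch of the corrected selection with the negated condition; its output would of course differ from the tables above.)

>>> df.loc[(df["B"] > 50) & (df["C"] != 900), "A"]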
Another solution is to use the query method:
import pandas as pd
from random import randint
df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                   'B': [randint(1, 9) * 10 for x in range(10)],
                   'C': [randint(1, 9) * 100 for x in range(10)]})
print(df)
A B C
0 7 20 300
1 7 80 700
2 4 90 100
3 4 30 900
4 7 80 200
5 7 60 800
6 3 80 900
7 9 40 100
8 6 40 100
9 3 10 600
print(df.query('B > 50 and C != 900'))
A B C
1 7 80 700
2 4 90 100
4 7 80 200
5 7 60 800
Now if you want to change the returned values in column A you can save their index:
my_query_index = df.query('B > 50 & C != 900').index
...and use .loc to change them (the query result carries labels, so the label-based .loc is the right indexer), i.e.:
df.loc[my_query_index, 'A'] = 5000
print(df)
A B C
0 7 20 300
1 5000 80 700
2 5000 90 100
3 4 30 900
4 5000 80 200
5 5000 60 800
6 3 80 900
7 9 40 100
8 6 40 100
9 3 10 600
And remember to use parentheses!
Keep in mind that the & operator takes precedence over comparison operators such as > or <. That is why
4 < 5 & 6 > 4
evaluates to False: it is parsed as 4 < (5 & 6) > 4, i.e. the chained comparison 4 < 4 > 4. Therefore, when using df.loc you need to put brackets around your logical statements, otherwise you get an error. So write:
df.loc[(df['A'] > 10) & (df['B'] < 15)]
instead of
df.loc[df['A'] > 10 & df['B'] < 15]
which would result in
TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
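You can verify the precedence in plain Python:

>>> 5 & 6            # bitwise AND binds tighter than the comparisons
4
>>> 4 < 5 & 6 > 4    # hence parsed as 4 < 4 > 4
False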
You can also use pandas' built-in comparison methods. If you want to select the values of "A" that meet the conditions on "B" and "C" (assuming you want back a DataFrame pandas object):
df[['A']][df.B.gt(50) & df.C.ne(900)]
df[['A']] will give you back column A in DataFrame format.
The pandas gt method returns a boolean mask marking where column B is greater than 50, and ne marks where C is not equal to 900.
It may be more readable to assign each condition to a variable, especially if there are a lot of them (give them descriptive names), and chain them together using the bitwise operators (& or |). As a bonus, you don't need to worry about brackets, because each condition is evaluated independently.
m1 = df['B'] > 50
m2 = df['C'] != 900
m3 = df['C'].pow(2) > 1000
m4 = df['B'].mul(4).between(50, 500)
# filter rows where all of the conditions are True
df[m1 & m2 & m3 & m4]
# filter rows of column A where all of the conditions are True
df.loc[m1 & m2 & m3 & m4, 'A']
or put the conditions in a list and reduce them via bitwise_and from numpy (a ufunc wrapper for &):
import numpy as np

conditions = [
df['B'] > 50,
df['C'] != 900,
df['C'].pow(2) > 1000,
df['B'].mul(4).between(50, 500)
]
# filter rows of A where all of conditions are True
df.loc[np.bitwise_and.reduce(conditions), 'A']
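An equivalent pandas-only spelling, if you'd rather not reach for the numpy ufunc, concatenates the masks and requires all of them to hold:

df.loc[pd.concat(conditions, axis=1).all(axis=1), 'A']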
I have a simple question regarding the concept of filtering a dataframe. Suppose I have the below dataframe:
df = pd.DataFrame({'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]})
I would like to perform some modifications on it based on some conditions.
If I run the below code I get my desired row:
Method a
df[(df.AAA <= 5) & (df.BBB <= 10)]
I can also get that row with the below code:
Method b
df.loc[(df.AAA <= 5) & (df.BBB <= 10)]
Both Method a and Method b return a pandas DataFrame.
However, when I want to modify column "CCC" based on those conditions, I get an error with Method a:
Method a
df[(df.AAA <= 5) & (df.BBB <= 10), 'CCC'] = -1
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Method b
df.loc[(df.AAA <= 5) & (df.BBB <= 10), 'CCC'] = -1
If you need to set new values by a boolean mask and also select a column by name (here CCC), DataFrame.loc is necessary:
df.loc[(df.AAA <= 5) & (df.BBB <= 10), 'CCC'] = -1
print (df)
AAA BBB CCC
0 4 10 -1
1 5 20 50
2 6 30 -30
3 7 40 -50
If you need to set multiple columns, use loc with a list of column names:
df.loc[(df.AAA <= 5) & (df.BBB <= 10), ['CCC', 'AAA']] = -1
print (df)
AAA BBB CCC
0 -1 10 -1
1 5 20 50
2 6 30 -30
3 7 40 -50
But if you need to set all columns, remove loc and the column name:
df[(df.AAA <= 5) & (df.BBB <= 10)] = -1
print (df)
AAA BBB CCC
0 -1 -1 -1
1 5 20 50
2 6 30 -30
3 7 40 -50
EDIT:
The solution from the comment works:
df['CCC'][(df.AAA <= 5) & (df.BBB <= 10)] = -1
but it is not recommended, because this chained assignment can lead to a SettingWithCopyWarning.
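For what it's worth, under pandas' copy-on-write mode (an option since pandas 2.0 and slated to become the default in pandas 3), the chained form stops updating df entirely, one more reason to stick with .loc; a quick sketch:

pd.set_option('mode.copy_on_write', True)

df['CCC'][(df.AAA <= 5) & (df.BBB <= 10)] = -1      # writes to a temporary copy; df is unchanged
df.loc[(df.AAA <= 5) & (df.BBB <= 10), 'CCC'] = -1  # updates df as expected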
I have data which looks like this (I've set 'rule_id' as the index):
rule_id a b c d
50378 2 0 0 5
50402 12 9 6 0
52879 0 4 3 2
After using this code:
coeff = df.T
# compute the coefficients
for name, s in coeff.items():
top = 100 # start at 100
r = []
for i, v in enumerate(s):
if v == 0: # reset to 100 on a 0 value
top=100
else:
top = top/2 # else half the previous value
r.append(top)
coeff.loc[:, name] = r # set the whole column in one operation
# transpose back to have a companion dataframe for df
coeff = coeff.T
# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
conditions = [(df[col1] == 0) & (df[col2] == 0), (df[col1] != 0) & (df[col2] == 0), (df[col1] == df[col2]),
(df[col1] != 0) & (df[col2] != 0)]
choices = [np.nan , 100 , coeff[col1] , df[col2]/df[col1]*coeff[col1]+coeff[col1]]
df['comp{}'.format(i)] = np.select(conditions , choices)
old = df.columns[0] # store name of first column
#Ok, enumerate all the columns (except first one)
for i, col in enumerate(df.columns[1:], 1):
build_comp(old, col, i)
old = col # keep current column name for next iteration
# special processing for last comp column
df['comp{}'.format(i+1)] = np.where(df[col] == 0, np.nan, 100)
my data looks like this:
rule_id a b c d comp1 comp2 comp3 comp4
50378 2 0 0 5 100 NaN NaN 100
50402 12 9 6 0 87.5 41.66 100 NaN
52879 0 4 3 2 NaN 87.5 41.66 100
So 'df' here is the dataframe which stores my data which I have mentioned above.
Look at the first row. According to my code, when two columns are compared and the first has a non-zero value (2) while the second has 0, then 100 should be placed in the new column, which I am able to achieve. When the comparison involves two non-zero values (look at row 2), it goes like this:
9/12 *50 +50 = 87.5
then
6/9 * 25 + 25 = 41.66
which I am also able to achieve. But the third comparison, between columns 'c' and 'd', i.e. between the values 6 and 0, should be:
0/6 *12.5 + 12.5 = 12.5
which I am having trouble achieving. So instead of 100 in comp3 of row 2, the value should be 12.5. The same goes for the last row, where the values are 4, 3 and 2.
This is the result I want:
rule_id a b c d comp1 comp2 comp3 comp4
50378 2 0 0 5 100 NaN NaN 100
50402 12 9 6 0 87.5 41.66 12.5 NaN
52879 0 4 3 2 NaN 87.5 41.66 12.5
You say:
the third comparison between column 'c' and 'd' which is between value 6 and 0 should be:
0/6 *12.5 + 12.5 = 12.5
But your code says:
conditions = [(df[col1] == 0) & (df[col2] == 0), (df[col1] != 0) & (df[col2] == 0), (df[col1] == df[col2]),
(df[col1] != 0) & (df[col2] != 0)]
choices = [np.nan , 100 , coeff[col1] , df[col2]/df[col1]*coeff[col1]+coeff[col1]]
Clearly (6, 0) satisfies conditions[1] and therefore produces 100. You seem to think it should fall under conditions[3], that both values are non-zero, but (6, 0) does not satisfy that condition; and even if it did, it would not matter, because conditions[1] is matched first and np.select() picks the first match.
Perhaps you want something like this:
conditions = [(df[col1] == 0) & (df[col2] == 0), (df[col1] == df[col2])]
choices = [np.nan , coeff[col1]]
default = df[col2]/df[col1]*coeff[col1]+coeff[col1]
df['comp{}'.format(i)] = np.select(conditions , choices, default)
Just to contribute as well, here is an improvement to your code for the definition of the coeff matrix, where the computation is performed directly on whole columns.
Initialization:
>>> df = pd.DataFrame([[2, 0, 0, 5], [12, 9, 6, 0], [0, 4, 3, 2]],
... index=[50378, 50402, 52879],
... columns=['a', 'b', 'c', 'd'])
>>> df
a b c d
50378 2 0 0 5
50402 12 9 6 0
52879 0 4 3 2
Then computing the coefficients:
>>> # taking care of coefficients, using direct computation on columns
>>> coeff2 = pd.DataFrame(index=df.index, columns=df.columns)
>>> top = pd.Series([100]*len(df.index), index=df.index)
>>> for col_name, col in df.items():  # loop over columns (iteritems was removed in pandas 2.0)
...     eq0 = (col == 0)               # boolean Series identifying rows where the value is 0
...     top[eq0] = 100                 # where `eq0` is True, reset to 100...
...     top[~eq0] = top[~eq0] / 2      # ...and halve the others
...     coeff2[col_name] = top         # assign to the output
>>> coeff2
Which gives:
a b c d
50378 50 100 100 50
50402 50 25 12.5 100
52879 100 50 25 12.5
(As for the core of your question, John identified the issue with the conditions in the function, so there is no need for me to repeat that.)
I want to apply a lambda function to a DataFrame column using if...elif...else within the lambda function.
The df and the code are something like:
df=pd.DataFrame({"one":[1,2,3,4,5],"two":[6,7,8,9,10]})
df["one"].apply(lambda x: x*10 if x<2 elif x<4 x**2 else x+10)
Obviously, this doesn't work. Is there a way to apply if....elif....else to a lambda?
How can I get the same result with List Comprehension?
Nest the if ... else expressions:
lambda x: x*10 if x<2 else (x**2 if x<4 else x+10)
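Applied to the example frame, this gives the expected values:

df['one'].apply(lambda x: x*10 if x < 2 else (x**2 if x < 4 else x + 10))
#0    10
#1     4
#2     9
#3    14
#4    15
#Name: one, dtype: int64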
I do not recommend the use of apply here: it should be avoided if there are better alternatives.
For example, if you are performing the following operation on a Series:
if cond1:
exp1
elif cond2:
exp2
else:
exp3
This is usually a good use case for np.where or np.select.
numpy.where
The if else chain above can be written using
np.where(cond1, exp1, np.where(cond2, exp2, ...))
np.where allows nesting. With one level of nesting, your problem can be solved with,
df['three'] = (
    np.where(
        df['one'] < 2,
        df['one'] * 10,
        np.where(df['one'] < 4, df['one'] ** 2, df['one'] + 10)))
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
numpy.select
Allows for flexible syntax and is easily extensible. It follows the form,
np.select([cond1, cond2, ...], [exp1, exp2, ...])
Or, in this case,
np.select([cond1, cond2], [exp1, exp2], default=exp3)
df['three'] = (
    np.select(
        condlist=[df['one'] < 2, df['one'] < 4],
        choicelist=[df['one'] * 10, df['one'] ** 2],
        default=df['one'] + 10))
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
and/or (similar to the if/else)
Similar to the if/else chain, this requires the lambda and relies on Python's short-circuit evaluation. Beware that the trick breaks if an intermediate expression is falsy, e.g. evaluates to 0:
df['three'] = df["one"].apply(
lambda x: (x < 2 and x * 10) or (x < 4 and x ** 2) or x + 10)
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
List Comprehension
Loopy solution that is still faster than apply.
df['three'] = [x*10 if x<2 else (x**2 if x<4 else x+10) for x in df['one']]
# df['three'] = [
#     (x < 2 and x * 10) or (x < 4 and x ** 2) or x + 10 for x in df['one']
# ]
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
For readability I prefer to write a function, especially if you are dealing with many conditions. For the original question:
def parse_values(x):
if x < 2:
return x * 10
elif x < 4:
return x ** 2
else:
return x + 10
df['one'].apply(parse_values)
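and assign the result back, mirroring the earlier examples:

df['three'] = df['one'].apply(parse_values)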
You can do it using multiple loc assignments. Here a new column labelled 'new' is created with the conditional calculation applied. Note that the conditions must be mutually exclusive, otherwise a later assignment overwrites an earlier one (df['one'] < 4 also covers the rows where df['one'] < 2):
df.loc[(df['one'] < 2), 'new'] = df['one'] * 10
df.loc[(df['one'] >= 2) & (df['one'] < 4), 'new'] = df['one'] ** 2
df.loc[(df['one'] >= 4), 'new'] = df['one'] + 10
I have extracted some data in pandas format from a SQL server. The structure is like this:
df = pd.DataFrame({'Day':(1,2,3,4,1,2,3,4),'State':('A','A','A','A','B','B','B','B'),'Direction':('N','S','N','S','N','S','N','S'),'values':(12,34,22,37,14,16,23,43)})
>>> df
Day Direction State values
0 1 N A 12
1 2 S A 34
2 3 N A 22
3 4 S A 37
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
Now I want to replace each value that has State == 'A' with itself plus the value that has the same Day and the same Direction but State == 'B'. For example, like this:
df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'A'),'values'] = df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'A'),'values'].values + df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'B'),'values'].values
>>> df
Day Direction State values
0 1 N A 26
1 2 S A 34
2 3 N A 22
3 4 S A 37
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
Notice the first row's value has changed from 12 to 26 (12 + 14).
Since the values come from different rows, it seems difficult to use functions like combine_first.
At the moment I use two loops (over 'Day' and over 'Direction') with the assignment statement above, which gets extremely slow as the dataframe grows. Do you have a smart and efficient way of doing this?
You can first define a function that adds the B value to the A values within a group, then apply it to each (Day, Direction) group.
def f(x):
x.loc[x.State=='A','values']+=x.loc[x.State=='B','values'].iloc[0]
return x
df.groupby(['Day','Direction']).apply(f)
Out[94]:
Day Direction State values
0 1 N A 26
1 2 S A 50
2 3 N A 45
3 4 S A 80
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
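If groupby.apply proves slow on a large frame, a vectorized sketch (assuming each (Day, Direction) pair has exactly one State 'B' row) aligns the B values to the A rows through a MultiIndex lookup:

# B values keyed by (Day, Direction)
b = df[df.State == 'B'].set_index(['Day', 'Direction'])['values']

mask = df.State == 'A'
keys = pd.MultiIndex.from_frame(df.loc[mask, ['Day', 'Direction']])
df.loc[mask, 'values'] += b.reindex(keys).to_numpy()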