pandas.DataFrame: why must comparisons be wrapped in parentheses when combined with bitwise operators? - python

I have a DataFrame called c with a column called price, and I want to select the rows where price equals 2 or 3. This code works:
c[(c['price'] == 2) | (c['price'] == 3)]
But doesn't work here:
c[c['price'] == 2 | c['price'] == 3]
and raises an exception:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The only difference is that in the second line of code the comparisons are not wrapped in parentheses '()'. So why are the parentheses so important?
Thank you very much!

As per Pandas: Boolean indexing
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3)
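The precedence rule quoted above can be checked directly. The following is a minimal sketch, assuming a small made-up DataFrame c with an integer price column (the sample values are not from the question):

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
c = pd.DataFrame({"price": [1, 2, 3, 4]})

# With parentheses: each comparison yields a boolean Series, then | combines them.
matches = c[(c["price"] == 2) | (c["price"] == 3)]

# Without parentheses, | binds tighter than ==, so Python reads the expression
# as the chained comparison  c["price"] == (2 | c["price"]) == 3,
# which means (c["price"] == (2 | c["price"])) and ((2 | c["price"]) == 3).
# The implicit `and` asks the first boolean Series for a single truth value.
try:
    c[c["price"] == 2 | c["price"] == 3]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(matches)
print(error_message)
```

So the exception does not come from | itself but from the hidden `and` of the chained comparison.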

Related

The truth value of a Series is ambiguous when I used if/else in Python

I am doing a conditional calculation within a DataFrame using the following syntax:
if(df_merged1["region_id"]=="EMEA"):
    df_merged1["fcst_gr"] = df_merged1["plan_price_amount"]*(df_merged1["Enr"]-df_merged1["FM_f"])+df_merged1["OA_f"]-df_merged1["TX_f"]
else:
    df_merged1["fcst_gr"] = df_merged1["plan_price_amount"]*(df_merged1["Enr"]-df_merged1["FM_f"])+df_merged1["OA_f"]
I want tax to be subtracted only when the region is EMEA, but I am getting the following error:
ValueError: The truth value of a {type(self).__name__} is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I think there is some problem in how I am providing the if condition, but I have no idea how to resolve it.
There is no problem here - df_merged1["region_id"] == "EMEA" returns a pd.Series instance populated with boolean values, not a boolean that can be handled using conditional statements. Pandas is reluctant to automatically run a method that would convert a pd.Series instance to a boolean like pd.Series.any() or pd.Series.all(), hence the error.
To achieve what you meant for reasonably sized dataframes, use pd.DataFrame.apply with axis=1 and a lambda containing a conditional expression (Python's ternary operator). That way you populate the column "fcst_gr" based on the value in column "region_id" for each individual row:
df_merged1["fcst_gr"] = df_merged1.apply(
    lambda row: row["plan_price_amount"] * (row["Enr"] - row["FM_f"])
    + row["OA_f"]
    - row["TX_f"]
    if row["region_id"] == "EMEA"
    else row["plan_price_amount"] * (row["Enr"] - row["FM_f"]) + row["OA_f"],
    axis=1,
)
For bigger dataframes or more complex scenarios, consider more efficient solutions.
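One such more efficient option is a vectorized version with numpy.where. This is only a sketch, not the answerer's code: the column names come from the question, while the sample values and the intermediate name base are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the column names from the question.
df_merged1 = pd.DataFrame({
    "region_id": ["EMEA", "APAC"],
    "plan_price_amount": [10.0, 10.0],
    "Enr": [5.0, 5.0],
    "FM_f": [1.0, 1.0],
    "OA_f": [2.0, 2.0],
    "TX_f": [3.0, 3.0],
})

# Part of the formula shared by both branches, computed for all rows at once.
base = (
    df_merged1["plan_price_amount"] * (df_merged1["Enr"] - df_merged1["FM_f"])
    + df_merged1["OA_f"]
)

# Subtract tax only where the region is EMEA; keep the base value elsewhere.
df_merged1["fcst_gr"] = np.where(
    df_merged1["region_id"] == "EMEA",
    base - df_merged1["TX_f"],
    base,
)
print(df_merged1[["region_id", "fcst_gr"]])
```

Here the comparison is used as an element-wise mask rather than as an if condition, so no single truth value is ever requested.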

Multiple if conditions pandas

Looking to write an if statement which does a calculation based on if 3 conditions across other columns in a dataframe are true. I have tried the below code which seems to have worked for others on stackoverflow but kicks up an error for me. Note the 'check', 'sqm' and 'sqft' columns are in float64 format.
if ((merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0)):
    merge['checksqm'] == merge['sqft']/10.7639
#Error below:
ValueError Traceback (most recent call last)
<ipython-input-383-e84717fde2c0> in <module>
----> 1 if ((merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0)):
2 merge['checksqm'] == merge['sqft']/10.7639
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __nonzero__(self)
1327
1328 def __nonzero__(self):
-> 1329 raise ValueError(
1330 f"The truth value of a {type(self).__name__} is ambiguous. "
1331 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Each condition you wrote evaluates to a Series of boolean values, and the combined result of the 3 conditions is also a boolean Series. A Python if statement cannot take such a Series, evaluate each element, and run the following statement for the matching elements one by one; it needs a single True or False. Hence the error ValueError: The truth value of a Series is ambiguous.
To solve the problem, you have to code it using Pandas syntax, like the following:
mask = (merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0)
merge.loc[mask, 'checksqm'] = merge['sqft']/10.7639
or, combine in one statement, as follows:
merge.loc[(merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0), 'checksqm'] = merge['sqft']/10.7639
In this way, Pandas evaluates the boolean Series and operates only on the rows where the combined 3 conditions are True, taking the corresponding values from each of those rows. This kind of vectorized operation under the hood is not supported by an ordinary Python statement such as if.
You are trying to use a pd.Series as the condition inside the if clause. This Series is a mask of True and False values. If you really want a single yes/no decision for the whole Series, reduce it to one bool with series.any() or series.all().
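The two approaches answer different questions, which the following sketch tries to show. It assumes a tiny made-up version of the merge DataFrame; only the column names are taken from the question.

```python
import pandas as pd

# Hypothetical sample data with the float columns named in the question.
merge = pd.DataFrame({
    "check": [1.0, 1.0, 0.0],
    "sqft": [107.639, 0.0, 50.0],
    "sqm": [0.0, 20.0, 0.0],
})

mask = (merge["check"] == 1) & (merge["sqft"] > 0) & (merge["sqm"] == 0)

# any() / all() collapse the mask to ONE bool, which an if statement can use.
any_row_matches = bool(mask.any())
all_rows_match = bool(mask.all())

# .loc with the mask instead updates only the rows where the mask is True;
# rows that do not match are left as NaN in the new column.
merge.loc[mask, "checksqm"] = merge["sqft"] / 10.7639
print(merge)
```

For the calculation in the question the per-row mask is what is needed; any() or all() would only fit a decision about the DataFrame as a whole.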

Choosing rows from a dataframe based on multiple functions [duplicate]

This question already has answers here:
pandas: multiple conditions while indexing data frame - unexpected behavior
I have a dataframe with a column "A". How do I choose a subset of it based on multiple conditions? I am trying:
train.loc[(train["A"] != 2) or (train["A"] != 10)]
The or operator doesn't seem to be working. How can I fix this? I got the error:
ValueError Traceback (most recent call last)
<ipython-input-30-e949fa2bb478> in <module>
----> 1 sub_train.loc[(sub_train["primary_use"] != 2) or (sub_train["primary_use"] != 10), "year_built"]
/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in __nonzero__(self)
1553 "The truth value of a {0} is ambiguous. "
1554 "Use a.empty, a.bool(), a.item(), a.any() or a.all().".format(
-> 1555 self.__class__.__name__
1556 )
1557 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Use | for bitwise OR or & for bitwise AND, also loc is not necessary:
#filter 2 or 10
train[(train["A"] == 2) | (train["A"] == 10)]
#filter not 2 and not 10
train[(train["A"] != 2) & (train["A"] != 10)]
If you also want to select specific columns, then loc is necessary:
train.loc[(train["A"] == 2) | (train["A"] == 10), 'B']
You need | instead of or to do element-wise logic with Series:
train.loc[(train["A"] != 2) | (train["A"] != 10)]
To not worry about parentheses use Series.ne.
loc here in principle is not necessary if you do not want to select a specific column:
train[train["A"].ne(2) | train["A"].ne(10)]
But I think your logic is wrong, since this mask does not filter anything out.
If the value is 2 it will not be filtered because it is different from 10, and vice versa. I think you want Series.isin + ~:
train[~train["A"].isin([2,10])]
or, with &:
train[train["A"].ne(2) & train["A"].ne(10)]
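The point about the logic can be verified on a small example. This is a sketch with made-up values for column "A"; the names always_true, neither and via_isin are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical sample data for column "A".
train = pd.DataFrame({"A": [1, 2, 10, 3]})

# "not 2 OR not 10" is True for every value, so it filters nothing.
always_true = (train["A"] != 2) | (train["A"] != 10)

# "not 2 AND not 10" is the mask that actually excludes both values.
neither = (train["A"] != 2) & (train["A"] != 10)

# Equivalent mask built with isin and ~ (element-wise NOT).
via_isin = ~train["A"].isin([2, 10])

print(train[via_isin])
```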

How to include a string being equal to itself shifted as a condition in a function definition?

I'm defining a simple if xxxx return y - else return NaN function. If the record, ['Product'], equals ['Product'] offset by 8 then the if condition is true.
I've tried calling the record and setting it equal to itself offset by 8 using == and .shift(8). ['Product'] is a string and ['Sales'] is an integer.
def Growth (X):
    if X['Product'] == X['Product'].shift(8):
        return (1+ X['Sales'].shift(4)) / (1+ X['Sales'].shift(8) - 1)
    else:
        return 'NaN'
I expect the output to be NaN for the first 8 records, and then to have numbers at record 9, but I receive the error instead.
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Firstly a general comment from StackOverflow's Truth value of a Series is ambiguous...:
The or and and python statements require truth-values. For pandas these are considered ambiguous so you should use "bitwise" | (or) or & (and) operations.
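Secondly, you use == on Series objects. This returns a boolean Series with one value per row. The if statement then tries to convert that Series to a single truth value and fails, because this is ambiguous.
To test whether the two Series are equal as a whole, use X['Product'].equals(X['Product'].shift(8)), which returns a single bool. Note that this compares the whole column at once, not row by row.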
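If a row-by-row result is what is wanted (NaN for the first 8 records, numbers afterwards), one possible approach is to use the comparison as a mask with Series.where instead of an if statement. This is a sketch under that assumption, not the answerer's code: the function name growth and the sample data are hypothetical, and the formula is copied unchanged from the question.

```python
import pandas as pd

def growth(df):
    # True where the product matches the product 8 rows earlier
    # (the first 8 rows compare against missing values and are False).
    same_product = df["Product"] == df["Product"].shift(8)
    # Formula copied as written in the question.
    ratio = (1 + df["Sales"].shift(4)) / (1 + df["Sales"].shift(8) - 1)
    # Keep the ratio where the condition holds, NaN elsewhere.
    return ratio.where(same_product)

# Hypothetical sample data: one product, sales 1..10.
X = pd.DataFrame({"Product": ["a"] * 10, "Sales": list(range(1, 11))})
result = growth(X)
print(result)
```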

Creating a DataFrame in Pandas using logical and operator

I have a pandas DataFrame with some columns, and I first wanted to print only those rows whose value in a particular column is less than a certain value. So I did:
df[df.marks < 4.5]
It successfully created the dataframe. Now I want to keep only those rows whose values are in a certain range, so I tried this:
df[(df.marks < 4.5 and df.marks > 4)]
but it's giving me an error:
712 raise ValueError("The truth value of a {0} is ambiguous. "
713 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 714 .format(self.__class__.__name__))
715
716 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
How do I resolve this? And initially I also thought that it would iterate through all the rows and check the truth value and then add the row in the dataframe, but it seems like this isn't the case, if so how does it add the row in the dataframe?
Use
df[(df.marks < 4.5) & (df.marks > 4)]
Slightly more generally, array logical operations are combined using parentheses around the individual conditions:
(a < b) & (c > d)
Similar for OR-combinations, or more than 2 conditions.
This is how it's set up in NumPy, with boolean operators on arrays, and Pandas has copied that behaviour.
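The NumPy behaviour mentioned above can be reproduced without pandas. A minimal sketch, reusing the marks values that appear later in this thread:

```python
import numpy as np

marks = np.array([4.2, 4.0, 4.4, 3.0, 4.5])

# `and` needs one truth value per operand, which a multi-element array
# refuses to provide, so this raises ValueError.
try:
    marks < 4.5 and marks > 4
    message = None
except ValueError as exc:
    message = str(exc)

# & works element by element on the two boolean arrays.
in_range = (marks < 4.5) & (marks > 4)

print(message)
print(in_range)
```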
I've run into that problem before. I'm not 100% sure of the cause, but a DataFrame does not accept multiple conditions joined with and.
df[(df.marks < 4.5 and df.marks > 4)] -> will fail
Doing something like this usually will solve the problem.
df[(df.marks < 4.5)] [(df.marks > 4)]
That project is not at the top of my head right now, but I think quoting them separately works too.
First, Evert's solution is nice.
I add another 2 possible solutions:
... with between:
df = pd.DataFrame({'marks':[4.2,4,4.4,3,4.5]})
print (df)
marks
0 4.2
1 4.0
2 4.4
3 3.0
4 4.5
df = df[df.marks.between(4, 4.5, inclusive="neither")]  # pandas >= 1.3; older versions used inclusive=False
print (df)
marks
0 4.2
2 4.4
... with query:
df = df.query("marks < 4.5 & marks > 4")
print (df)
marks
0 4.2
2 4.4
