Choosing rows from a dataframe based on multiple functions [duplicate] - python

This question already has answers here:
pandas: multiple conditions while indexing data frame - unexpected behavior
(5 answers)
Closed 3 years ago.
I have a dataframe df with a column "A". How do I choose a subset of df based on multiple conditions? I am trying:
train.loc[(train["A"] != 2) or (train["A"] != 10)]
The or operator doesn't seem to be working. How can I fix this? I got the error:
ValueError Traceback (most recent call last)
<ipython-input-30-e949fa2bb478> in <module>
----> 1 sub_train.loc[(sub_train["primary_use"] != 2) or (sub_train["primary_use"] != 10), "year_built"]
/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in __nonzero__(self)
1553 "The truth value of a {0} is ambiguous. "
1554 "Use a.empty, a.bool(), a.item(), a.any() or a.all().".format(
-> 1555 self.__class__.__name__
1556 )
1557 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Use | for bitwise OR or & for bitwise AND. Also, loc is not necessary here:
#filter 2 or 10
train[(train["A"] == 2) | (train["A"] == 10)]
#filter not 2 and not 10
train[(train["A"] != 2) & (train["A"] != 10)]
If you also want to select specific columns, then loc is necessary:
train.loc[(train["A"] == 2) | (train["A"] == 10), 'B']

You need | instead of or to do logic with Series:
train.loc[(train["A"] != 2) | (train["A"] != 10)]
To avoid worrying about parentheses, use Series.ne.
In principle, loc is not necessary here if you do not want to select a specific column:
train[train["A"].ne(2) | train["A"].ne(10)]
But I think your logic is wrong, since this mask does not filter anything: a value of 2 is kept because it is different from 10, and vice versa. I think you want Series.isin with ~:
train[~train["A"].isin([2,10])]
or &:
train[train["A"].ne(2) & train["A"].ne(10)]
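To make the difference between the two masks concrete, here is a minimal sketch. The sample frame and its column "B" values are made up for illustration; only the column name "A" and the values 2 and 10 come from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's `train` frame.
train = pd.DataFrame({"A": [1, 2, 10, 5], "B": ["w", "x", "y", "z"]})

# OR of two "not equal" tests is True for every row,
# because no single value can equal both 2 and 10.
or_mask = (train["A"] != 2) | (train["A"] != 10)

# AND keeps only the rows that are neither 2 nor 10.
and_mask = (train["A"] != 2) & (train["A"] != 10)

# The same selection expressed as a negated membership test.
isin_mask = ~train["A"].isin([2, 10])

print(or_mask.tolist())
print(train[and_mask])
```

With this data the | mask keeps all four rows, while the & mask and the isin mask both keep only the rows where "A" is 1 or 5.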

Related

ValueError while iterating over dataframe: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() [duplicate]

This question already has answers here:
How to add value to column conditional on other column
(1 answer)
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 1 year ago.
I have a dataframe in which I am trying to convert the values in "LoginTime" to a 24HR format based on whether the "Timing" contains "am" or "pm".
data = """
LoginDate LoginTime Timing StudentId
2021-03-23 12 am 3574
2021-03-23 12 am 3574
2021-03-23 12 am 2512
2021-03-23 12 am 2692
2021-03-23 12 am 3064
"""
df = pd.read_csv(StringIO(data.strip()), sep=r'\s+')
I am using the following logic to convert the values:
for index in df.index:
    if (df.loc[index,"Timing"] == "pm"):
        df.loc[index, "LoginTime"] = df.loc[index, "LoginTime"] + 12
However, this gives me the following error:
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11688/1623466071.py in <module>
1 for index in df.index:
----> 2 if (df.loc[index,"Timing"] == "pm"):
3 df.loc[index, "LoginTime"] = df.loc[index, "LoginTime"] + 12
c:\users\admin\appdata\local\programs\python\python39\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1535 #final
1536 def __nonzero__(self):
-> 1537 raise ValueError(
1538 f"The truth value of a {type(self).__name__} is ambiguous. "
1539 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
It is worth noting that I have set the index of the DataFrame to "LoginDate", which is of datetime format. However, when I change the index to normal integer values (0,1,2,3,...) and keep "LoginDate" as a normal column, the above error disappears and the code executes properly.
How do I make the code work while keeping the index as "LoginDate" ?
Do not use a loop for this operation; use a vectorized approach:
df['LoginTime'] = df['LoginTime'].where(df['Timing'].ne('pm'), df['LoginTime']+12)
This is simpler to read and more efficient.
You could try this:
df["LoginTime"] = np.where(df["Timing"] == "pm", df["LoginTime"] + 12, df["LoginTime"])
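Neither answer spells out why the loop fails only with "LoginDate" as the index. A likely explanation is that the index labels repeat, so df.loc[index, "Timing"] returns a Series instead of a single value. The sketch below illustrates this under assumptions: the rows are invented (one "pm" row and one unique date added), and the index is kept as plain strings rather than datetimes.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical rows in the same layout as the question.
data = """
LoginDate LoginTime Timing StudentId
2021-03-23 12 am 3574
2021-03-23 3 pm 2512
2021-03-24 9 am 2692
"""
df = pd.read_csv(StringIO(data.strip()), sep=r"\s+").set_index("LoginDate")

# The label 2021-03-23 occurs twice, so .loc returns a Series for it,
# while the unique label 2021-03-24 returns a plain scalar.
repeated = df.loc["2021-03-23", "Timing"]
unique = df.loc["2021-03-24", "Timing"]
print(type(repeated).__name__, type(unique).__name__)

# The vectorized version never puts a Series inside an `if`,
# so duplicate index labels do not matter.
df["LoginTime"] = np.where(df["Timing"] == "pm", df["LoginTime"] + 12, df["LoginTime"])
print(df["LoginTime"].tolist())
```

This also matches the observation in the question that the error disappears with a unique integer index.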

Multiple if conditions pandas

I am looking to write an if statement that does a calculation when 3 conditions across other columns in a dataframe are true. I have tried the code below, which seems to have worked for others on Stack Overflow but raises an error for me. Note the 'check', 'sqm' and 'sqft' columns are in float64 format.
if ((merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0)):
    merge['checksqm'] == merge['sqft']/10.7639
#Error below:
ValueError Traceback (most recent call last)
<ipython-input-383-e84717fde2c0> in <module>
----> 1 if ((merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0)):
2 merge['checksqm'] == merge['sqft']/10.7639
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __nonzero__(self)
1327
1328 def __nonzero__(self):
-> 1329 raise ValueError(
1330 f"The truth value of a {type(self).__name__} is ambiguous. "
1331 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Each condition evaluates to a Series of boolean values, and combining the 3 conditions with & also yields a boolean Series. A Python if statement needs a single True or False; it does not evaluate each element of the Series and run the following statement once per element. Hence the error ValueError: The truth value of a Series is ambiguous.
To solve the problem, write it using pandas syntax, like the following:
mask = (merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0)
merge.loc[mask, 'checksqm'] = merge['sqft']/10.7639
or, combine in one statement, as follows:
merge.loc[(merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0), 'checksqm'] = merge['sqft']/10.7639
This way, pandas evaluates the boolean Series and assigns only to the rows where the combined condition is True, taking the value for each of those rows from the matching row of merge['sqft']/10.7639. An ordinary Python statement such as if does not perform this kind of vectorized, row-aligned operation.
You are trying to use a pd.Series as the condition inside the if clause. This Series is a mask of True and False values. To use it in an if, you need to reduce it to a single bool with series.any() or series.all().
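The two suggestions above do different things, which a small sketch may make clearer. The data below is hypothetical; only the column names and the 10.7639 factor come from the question.

```python
import pandas as pd

# Hypothetical data using the column names from the question.
merge = pd.DataFrame({
    "check": [1.0, 1.0, 0.0],
    "sqft": [107.639, 0.0, 50.0],
    "sqm": [0.0, 20.0, 0.0],
})

mask = (merge["check"] == 1) & (merge["sqft"] > 0) & (merge["sqm"] == 0)

# any()/all() collapse the whole mask into one bool: usable in an `if`,
# but the answer is about the entire column, not about each row.
print(mask.any(), mask.all())

# Row-wise assignment: only rows where the mask is True receive a value,
# the other rows are left as NaN in the new column.
merge.loc[mask, "checksqm"] = merge["sqft"] / 10.7639
print(merge)
```

So any() or all() is the right tool only when a single yes/no answer about the whole column is wanted; for a per-row calculation, the mask with loc is the one that applies.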

pandas.DataFrame why use parenthesis to wrap operations to make bitwise comparison

I have a DataFrame called c with a column called price, and I want the rows where price equals 2 or 3. This code works:
c[(c['price'] == 2) | (c['price'] == 3)]
But doesn't work here:
c[c['price'] == 2 | c['price'] == 3]
and raises an exception:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The only difference is that in the second line of code the comparisons are not wrapped in parentheses '()'. So why are the parentheses so important?
Thank you very much!
As per the pandas documentation on boolean indexing:
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3)
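The precedence rule quoted above can be checked directly. This is a minimal sketch on a made-up frame with an integer price column; the variable names ok and raised are only for illustration.

```python
import pandas as pd

# Hypothetical frame with an integer price column.
c = pd.DataFrame({"price": [1, 2, 3, 2]})

# With parentheses, each comparison yields a boolean Series before | combines them.
ok = c[(c["price"] == 2) | (c["price"] == 3)]
print(ok["price"].tolist())

# Without parentheses, | binds tighter than ==, so Python reads the expression
# as the chained comparison  c["price"] == (2 | c["price"]) == 3.
# A chained comparison implies `and`, which asks for the truth value of a Series.
try:
    c[c["price"] == 2 | c["price"] == 3]
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```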

How to fix "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" in Python Pandas? [duplicate]

This question already has answers here:
Conditional Replace Pandas
(7 answers)
Closed 3 years ago.
I have a dataset with two timestamp columns: one is the start time and the other is the end time. I have calculated the difference and stored it in another column of the dataset. Based on the difference column, I want to fill in a value in another column. I am using a for loop and if/else for this, but upon execution the error "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" appears.
Time_df = pd.read_excel('filepath')
print(Time_df.head(20))
for index, rows in Time_df.head().iterrows():
    if(Time_df["Total Time"] < 6.00 ):
        Time_df["Code"] = 1
print(Time_df.head(20))
Wherever a value less than 6 is encountered in "Total Time", it should put 1 in the "Code" column. However, I get the error stated in the question.
Try np.where():
Time_df["Code"] = np.where(Time_df["Total Time"] < 6.00, 1, Time_df["Code"])
Explanation:
# np.where(condition, value where the condition is met, value where it is not met)
# returns an array with one value chosen per row, as described above
To fix your code:
print(Time_df.head(20))
for index, rows in Time_df.head().iterrows():
    if(rows["Total Time"] < 6.00 ):
        Time_df.loc[index,"Code"] = 1
print(Time_df.head(20))
This happens to me a lot. In if (Time_df["Total Time"] < 6.00 ), the expression (Time_df["Total Time"] < 6.00 ) is a Series, and Python does not know how to evaluate a Series as a Boolean. It depends on what you want, but most likely you want to do:
Time_df.loc[Time_df["Total Time"] < 6.00, "Code"] = 1
which puts 1 in column "Code" wherever "Total Time" is < 6.
def myfn(row):
    # Return 1 for rows under the threshold; other rows get None (NaN in the column).
    if row["Total Time"] < 6:
        return 1
Time_df["Code"] = Time_df.apply(myfn, axis=1)
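As a rough comparison of the approaches suggested in these answers, the sketch below runs three of them on the same made-up data. It assumes a "Code" column that already exists and starts at 0; the question does not show the real data, and the names via_where, via_loc and via_loop are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "Code" starts at 0 so every approach has a default to keep.
Time_df = pd.DataFrame({"Total Time": [2.5, 6.0, 7.25, 5.99], "Code": 0})

# Approach 1: np.where builds the whole column in one pass.
via_where = np.where(Time_df["Total Time"] < 6.00, 1, Time_df["Code"])

# Approach 2: boolean mask with .loc, on a copy so the original stays intact.
via_loc = Time_df.copy()
via_loc.loc[via_loc["Total Time"] < 6.00, "Code"] = 1

# Approach 3: row-by-row loop, comparing the scalar taken from each row.
via_loop = Time_df.copy()
for index, row in via_loop.iterrows():
    if row["Total Time"] < 6.00:
        via_loop.loc[index, "Code"] = 1

print(via_where.tolist(), via_loc["Code"].tolist(), via_loop["Code"].tolist())
```

All three give the same codes here. The loop works because row["Total Time"] is a single number, not a Series, but the vectorized forms are usually shorter and faster on large frames.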

Creating a DataFrame in Pandas using logical and operator

I have a pandas DataFrame with some columns, and I first wanted to print only those rows whose value in a particular column is less than a certain value. So I did:
df[df.marks < 4.5]
It successfully created the dataframe. Now I want to keep only those rows whose values are in a certain range, so I tried this:
df[(df.marks < 4.5 and df.marks > 4)]
but it's giving me an error:
712 raise ValueError("The truth value of a {0} is ambiguous. "
713 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 714 .format(self.__class__.__name__))
715
716 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How do I resolve this? And initially I also thought that it would iterate through all the rows and check the truth value and then add the row in the dataframe, but it seems like this isn't the case, if so how does it add the row in the dataframe?
Use
df[(df.marks < 4.5) & (df.marks > 4)]
Slightly more generally, array logical operations are combined using parentheses around the individual conditions:
(a < b) & (c > d)
Similar for OR-combinations, or more than 2 conditions.
This is how it's set up in NumPy, with boolean operators on arrays, and Pandas has copied that behaviour.
I've run into that problem before. I'm not 100% sure of the cause, but the DataFrame object does not like multiple conditions combined this way.
df[(df.marks < 4.5 and df.marks > 4)] -> will fail
Doing something like this will usually solve the problem:
df[(df.marks < 4.5)] [(df.marks > 4)]
I don't have that project in front of me right now, but I think quoting them separately works too.
First, Evert's solution is nice.
I add another 2 possible solutions:
... with between:
df = pd.DataFrame({'marks':[4.2,4,4.4,3,4.5]})
print (df)
marks
0 4.2
1 4.0
2 4.4
3 3.0
4 4.5
df = df[df.marks.between(4, 4.5, inclusive="neither")]  # pandas >= 1.3; older versions used inclusive=False
print (df)
marks
0 4.2
2 4.4
... with query:
df = df.query("marks < 4.5 & marks > 4")
print (df)
marks
0 4.2
2 4.4
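The three spellings from these answers can be checked against each other in one short sketch. The frame is the one from the answer above; the variable names are made up, and inclusive="neither" assumes pandas 1.3 or newer.

```python
import pandas as pd

df = pd.DataFrame({"marks": [4.2, 4, 4.4, 3, 4.5]})

# Parenthesized conditions combined with &.
with_and = df[(df.marks < 4.5) & (df.marks > 4)]

# between() with both endpoints excluded (string form assumes pandas >= 1.3).
with_between = df[df.marks.between(4, 4.5, inclusive="neither")]

# query() parses the expression itself, so no extra parentheses are needed here.
with_query = df.query("marks < 4.5 & marks > 4")

print(with_and)
```

Each variant selects the rows with marks 4.2 and 4.4 and leaves the original df unchanged.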
