Creating a DataFrame in Pandas using logical and operator - python

I have a pandas DataFrame with some columns, and I first wanted to print only those rows whose value in a particular column is less than a certain value. So I did:
df[df.marks < 4.5]
It successfully created the DataFrame. Now I want to keep only those rows whose values are in a certain range, so I tried this:
df[(df.marks < 4.5 and df.marks > 4)]
but it's giving me an error:
712 raise ValueError("The truth value of a {0} is ambiguous. "
713 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 714 .format(self.__class__.__name__))
715
716 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
How do I resolve this? I initially thought that it would iterate through all the rows, check the truth value of each, and then add the row to the DataFrame, but that doesn't seem to be the case. If not, how does it select the rows?

Use
df[(df.marks < 4.5) & (df.marks > 4)]
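More generally, element-wise logical operations on arrays are combined with & rather than and, with parentheses around each individual condition (& binds more tightly than the comparison operators):
(a < b) & (c > d)
The same applies to OR-combinations, which use |, and to more than 2 conditions.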
This is how it's set up in NumPy, with boolean operators on arrays, and Pandas has copied that behaviour.
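As a minimal sketch of why and fails while & works (the sample marks are made up for illustration): Python's and first asks the left operand for a single truth value, which a Series refuses to give, whereas & is applied element by element and yields a boolean mask.

```python
import pandas as pd

# Hypothetical sample data; only the column name `marks` comes from the question.
df = pd.DataFrame({"marks": [4.2, 4.0, 4.4, 3.0, 4.5]})

# `and` asks the left Series for one truth value, which raises ValueError.
try:
    df[(df.marks < 4.5) and (df.marks > 4)]
    raised = False
except ValueError:
    raised = True

# `&` combines the two boolean Series element-wise into one mask.
mask = (df.marks < 4.5) & (df.marks > 4)
result = df[mask]
print(raised)
print(result)
```

The mask has one True/False per row, and indexing with it keeps only the rows marked True, so no explicit row loop is needed.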

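I've run into that problem before. The cause is that Python's and tries to reduce each Series to a single True or False, which pandas refuses to do.
df[(df.marks < 4.5 and df.marks > 4)] -> will fail
Applying the conditions one after the other usually solves the problem, although pandas may warn that the second boolean Series is being reindexed to match the already filtered DataFrame:
df[(df.marks < 4.5)][(df.marks > 4)]
I don't have that project at hand right now, but I think wrapping each condition in its own parentheses and joining them with & works too.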

First, Evert's solution is nice.
I add another 2 possible solutions:
... with between:
df = pd.DataFrame({'marks':[4.2,4,4.4,3,4.5]})
print (df)
marks
0 4.2
1 4.0
2 4.4
3 3.0
4 4.5
# pandas 1.3+ takes a string here; older versions used inclusive=False
df = df[df.marks.between(4, 4.5, inclusive="neither")]
print (df)
marks
0 4.2
2 4.4
... with query:
df = df.query("marks < 4.5 & marks > 4")
print (df)
marks
0 4.2
2 4.4
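To check that the three approaches select the same rows, here is a small sketch using the same sample marks. It assumes pandas 1.3 or newer for the string form of inclusive, and it uses and inside the query string, which query accepts even though plain Python and fails on Series.

```python
import pandas as pd

df = pd.DataFrame({"marks": [4.2, 4, 4.4, 3, 4.5]})

# Element-wise mask with & and parentheses.
by_mask = df[(df.marks > 4) & (df.marks < 4.5)]

# Assumes pandas 1.3+, where `inclusive` takes "both", "neither", "left" or "right".
by_between = df[df.marks.between(4, 4.5, inclusive="neither")]

# Inside a query string, `and` is parsed by pandas rather than by Python.
by_query = df.query("marks > 4 and marks < 4.5")

print(by_mask.equals(by_between) and by_mask.equals(by_query))
```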

Related

Python, comparing dataframe rows, adding new column - Truth Value Error?

I am quite new to Python, and somewhat stuck here.
I just want to compare floats with a previous or forward row in a dataframe, and mark a new column accordingly. FWIW, I have 6000 rows to compare. I need to output a string or int as the result.
My Python code:
for row in df_an:
    x = df_an['mid_h'].shift(1)
    y = df_an['mid_h']
    if y > x:
        df_an['bar_type'] = 1
I get the error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The x and y variables are generated, but apparently things go wrong with the if y > x: statement.
Any advice much appreciated.
A different approach...
I managed to implement the suggested .gt operator.
df_an.loc[df_an.mid_h.gt(df_an.mid_h.shift()) &\
df_an.mid_l.gt(df_an.mid_l.shift()), "bar_type"] = UP
Instead of looping over rows, shift the whole column and compare it with the original in a single vectorized step; try this:
df_an = pd.DataFrame({"mid_h":[1,3,5,7,7,5,3]})
df_an['bar_type'] = df_an.mid_h.gt(df_an.mid_h.shift())
# Output
mid_h bar_type
0 1 False
1 3 True
2 5 True
3 7 True
4 7 False
5 5 False
6 3 False
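The question asked for an int or string result rather than True/False. A possible sketch, reusing the sample mid_h values above; the labels "up" and "other" and the column name bar_label are placeholders chosen for this example, not names from the original post.

```python
import numpy as np
import pandas as pd

df_an = pd.DataFrame({"mid_h": [1, 3, 5, 7, 7, 5, 3]})

# True where the current value is greater than the previous row's value.
# The first row is compared against NaN, which evaluates to False.
is_higher = df_an["mid_h"].gt(df_an["mid_h"].shift())

# Integer flag: 1 where higher than the previous row, else 0.
df_an["bar_type"] = is_higher.astype(int)

# String label; "up" and "other" are placeholder names for illustration.
df_an["bar_label"] = np.where(is_higher, "up", "other")
print(df_an)
```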

Multiple if conditions pandas

Looking to write an if statement which does a calculation based on if 3 conditions across other columns in a dataframe are true. I have tried the below code which seems to have worked for others on stackoverflow but kicks up an error for me. Note the 'check', 'sqm' and 'sqft' columns are in float64 format.
if ((merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0)):
    merge['checksqm'] == merge['sqft']/10.7639
#Error below:
ValueError Traceback (most recent call last)
<ipython-input-383-e84717fde2c0> in <module>
----> 1 if ((merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0)):
2 merge['checksqm'] == merge['sqft']/10.7639
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __nonzero__(self)
1327
1328 def __nonzero__(self):
-> 1329 raise ValueError(
1330 f"The truth value of a {type(self).__name__} is ambiguous. "
1331 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Each condition you wrote evaluates to a Series of boolean values, and the combined result of the 3 conditions is also a boolean Series. A Python if statement cannot evaluate each element of such a Series and run the statement that follows element by element; it needs a single True or False. Hence the error ValueError: The truth value of a Series is ambiguous.
To solve the problem, you have to code it using Pandas syntax, like the following:
mask = (merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0)
merge.loc[mask, 'checksqm'] = merge['sqft']/10.7639
or, combine in one statement, as follows:
merge.loc[(merge['check'] == 1) & (merge['sqft'] > 0) & (merge['sqm'] == 0), 'checksqm'] = merge['sqft']/10.7639
This way, Pandas evaluates the boolean Series and performs the assignment only on the rows where the combined 3 conditions are True, taking the corresponding values from each of those rows. This kind of vectorized operation under the hood is not supported by an ordinary Python statement such as if.
You are trying to use a pd.Series as the condition inside the if clause. This Series is a mask of True/False values. An if needs a single bool, which you can get by reducing the Series with series.any() or series.all(); to act on each row separately, use the mask for indexing instead.
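A small sketch of the mask-based assignment on made-up data (the column names come from the question, the values are hypothetical). It also shows what any() and all() return, which is the only kind of value a plain if can test.

```python
import pandas as pd

# Hypothetical values; the column names come from the question.
merge = pd.DataFrame({
    "check": [1.0, 1.0, 0.0, 1.0],
    "sqft": [1076.39, 0.0, 500.0, 200.0],
    "sqm": [0.0, 0.0, 0.0, 20.0],
})

# One True/False per row.
mask = (merge["check"] == 1) & (merge["sqft"] > 0) & (merge["sqm"] == 0)

# Only rows where the mask is True get a value; the others are left as NaN.
merge.loc[mask, "checksqm"] = merge["sqft"] / 10.7639

# Reducing the mask gives a single bool for the whole column.
print(mask.any(), mask.all())
print(merge)
```

Note that the assignment uses a single =, whereas the line in the question used ==, which only compares and assigns nothing.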

Choosing rows from a dataframe based on multiple functions [duplicate]

This question already has answers here:
pandas: multiple conditions while indexing data frame - unexpected behavior
(5 answers)
Closed 3 years ago.
I have a dataframe df with a column "A". How do I choose a subset of df based on multiple conditions? I am trying:
train.loc[(train["A"] != 2) or (train["A"] != 10)]
The or operator doesn't seem to be working. How can I fix this? I got the error:
ValueError Traceback (most recent call last)
<ipython-input-30-e949fa2bb478> in <module>
----> 1 sub_train.loc[(sub_train["primary_use"] != 2) or (sub_train["primary_use"] != 10), "year_built"]
/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in __nonzero__(self)
1553 "The truth value of a {0} is ambiguous. "
1554 "Use a.empty, a.bool(), a.item(), a.any() or a.all().".format(
-> 1555 self.__class__.__name__
1556 )
1557 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Use | for bitwise OR or & for bitwise AND, also loc is not necessary:
#filter 2 or 10
train[(train["A"] == 2) | (train["A"] == 10)]
#filter not 2 and not 10
train[(train["A"] != 2) & (train["A"] != 10)]
If you also want to select specific columns, then loc is necessary:
train.loc[(train["A"] == 2) | (train["A"] == 10), 'B']
You need | instead of or to do logic with Series:
train.loc[(train["A"] != 2) | (train["A"] != 10)]
To not worry about parentheses use Series.ne.
loc here in principle is not necessary if you do not want to select a specific column:
train[train["A"].ne(2) | train["A"].ne(10)]
But I think your logic is wrong, since this mask does not filter anything: if the value is 2 it is kept because it is different from 10, and vice versa. I think you want Series.isin + ~:
train[~train["A"].isin([2,10])]
or use &:
train[train["A"].ne(2) & train["A"].ne(10)]
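The point about the logic can be checked on a few made-up values for column "A": the | version of the "not equal" conditions is True for every row, while the & version and the ~isin version both drop 2 and 10.

```python
import pandas as pd

# Hypothetical sample values for column "A".
train = pd.DataFrame({"A": [1, 2, 10, 3]})

# Every value differs from 2 or from 10, so this mask is True everywhere.
always_true = (train["A"] != 2) | (train["A"] != 10)

# Two equivalent ways to drop the rows where A is 2 or 10.
kept_and = train[(train["A"] != 2) & (train["A"] != 10)]
kept_isin = train[~train["A"].isin([2, 10])]

print(always_true.all())
print(kept_and.equals(kept_isin))
```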

How to fix "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" in Python Pandas? [duplicate]

This question already has answers here:
Conditional Replace Pandas
(7 answers)
Closed 3 years ago.
I have a dataset where I have two time stamp columns, one is the start time and the other is the end time. I have calculated the difference and also stored it in another column in the dataset. Based on the difference column of the dataset, I want to fill in a value in another column. I am using for loop and if else for the same but upon execution, the error "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" appears
Time_df = pd.read_excel('filepath')
print(Time_df.head(20))
for index, rows in Time_df.head().iterrows():
    if(Time_df["Total Time"] < 6.00 ):
        Time_df["Code"] = 1
print(Time_df.head(20))
In the "Total Time" column, wherever a value less than 6 is encountered, it should put 1 in the "Code" column. However, I get the error as stated in the question.
Try with np.where():
df["Code"]= np.where(df["Total Time"]<6.00,1,df["Code"])
Explanation:
#np.where(condition, choice if condition is met, choice if condition is not met)
#returns an array explained above
To fix your code
print(Time_df.head(20))
for index, rows in Time_df.head().iterrows():
    if(rows["Total Time"] < 6.00 ):
        Time_df.loc[index,"Code"] = 1
print(Time_df.head(20))
This happens to me a lot. In if (Time_df["Total Time"] < 6.00 ), (Time_df["Total Time"] < 6.00 ) is a series and Python does not know how to evaluate the series as a Boolean. Depending on what you want, but most likely you want to do:
Time_df.loc[Time_df["Total Time"] < 6.00, "Code"] = 1
which puts 1 in column "Code" wherever "Total Time" is < 6.
def myfn(row):
    # Return 1 for short durations; other rows get None (a missing value)
    if row['Total Time'] < 6:
        return 1
Time_df['Code'] = Time_df.apply(myfn, axis=1)
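For comparison, a sketch of the two vectorized suggestions side by side. The DataFrame below is a made-up stand-in for the Excel file, and it assumes a "Code" column already exists, which np.where needs for its fallback value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel data; column names come from the question.
Time_df = pd.DataFrame({"Total Time": [2.5, 6.0, 7.25, 5.9], "Code": [0, 0, 0, 0]})

# Boolean indexing: set "Code" to 1 only where the condition holds.
by_loc = Time_df.copy()
by_loc.loc[by_loc["Total Time"] < 6.00, "Code"] = 1

# np.where: pick 1 where the condition holds, otherwise keep the existing value.
by_where = Time_df.copy()
by_where["Code"] = np.where(by_where["Total Time"] < 6.00, 1, by_where["Code"])

print(by_loc["Code"].tolist())
print(by_where["Code"].tolist())
```

Both variants work on the whole column at once, so they cover every row rather than only the first five that head() returns in the loop version.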

How to update a Python Dataframe column dependent on the presence of a substring in another column

So I have a dataframe containing a float64 type column and an object type column containing a string.
If object column contains substring 'abc' I want to subtract 12 from the float column. If object column contains substring 'def' I want to subtract 24 from the float column. If object column contains neither 'abc' or 'def', I want to leave float column as is.
Example:
Nmbr Strng
52 abcghi
80 defghi
10 ghijkl
Expected output:
Nmbr Strng
40 abcghi
56 defghi
10 ghijkl
I have tried the following but keep getting an error:
if df.Strng.str.contains("abc"):
    df.Nmbr = (df.Nmbr - 12)
elif df.Strng.str.contains("def"):
    df.Nmbr = (df.Nmbr - 24)
else:
    df.Nmbr = df.Nmbr
The error I'm getting is as follows:
915 raise ValueError("The truth value of a {0} is ambiguous. "
916 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
917 .format(self.__class__.__name__))
918
919 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Note: Line 917 is the one that's highlighted as the error.
Your error occurs because you are testing whether a Boolean series is True or False. This is not possible. You could test if all or any values are True, to return a single Boolean, but this isn't what you are looking for.
It is good practice to vectorize your calculations rather than introduce loops. Below is how you can implement your logic via the .loc accessor.
df.loc[df['Strng'].str.contains('abc', regex=False, na=False), 'Nmbr'] -= 12
df.loc[df['Strng'].str.contains('def', regex=False, na=False), 'Nmbr'] -= 24
Result:
Nmbr Strng
0 40 abcghi
1 56 defghi
2 10 ghijkl
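One difference from the original if/elif is worth noting: with two separate .loc statements, a string containing both 'abc' and 'def' would have 12 and 24 subtracted. If first-match-wins behaviour is wanted, a possible alternative (not part of the answer above) is np.select, sketched here on the example data; the variable names has_abc, has_def and offset are chosen for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Nmbr": [52.0, 80.0, 10.0],
                   "Strng": ["abcghi", "defghi", "ghijkl"]})

has_abc = df["Strng"].str.contains("abc", regex=False, na=False)
has_def = df["Strng"].str.contains("def", regex=False, na=False)

# np.select uses the first condition that matches, like if/elif;
# rows matching neither condition get the default of 0.
offset = np.select([has_abc, has_def], [12, 24], default=0)
df["Nmbr"] = df["Nmbr"] - offset
print(df)
```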
