I am trying to understand the reason behind the following error message.
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This is the code I am running:
anz_analysis["Action"] = anz_analysis["Signal"]
for i in range(0, len(anz_analysis) + 1):
    if ((anz_analysis["Signal"].iloc[[i]] == "Buy") & (anz_analysis["Signal"].iloc[[i+1]] == "Buy")):
        anz_analysis["Action"] = anz_analysis["Signal"].iloc[[i]] = "Maintain"
    elif (anz_analysis["Signal"].iloc[[i]] = "Sell") & (anz_analysis["Signal"].iloc[[i+1]] = "Sell"):
        anz_analysis["Signal"].iloc[[i]] = "Maintain"
The dataframe looks like this:
  | Current | Wanted
--|---------|---------
1 | Buy     | Buy
2 | Buy     | Maintain
3 | Buy     | Maintain
4 | Sell    | Sell
5 | Sell    | Maintain
6 | Sell    | Maintain
Any help would be appreciated.
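Using .iloc[[i]] (a list inside the brackets) returns a Series, not a single value. Comparing that Series to a str gives a boolean Series, and the if statement cannot reduce a Series to a single True or False, so the ValueError is raised even though the Series holds only one value.
One way to deal with this is to use .iloc[i] (or .iloc[[i]].iloc[0]). That 'extracts' the string, so the comparison yields a plain bool that the if statement can evaluate. Note that .iloc[[i]][0] is a label lookup and only works when the row's index label happens to be 0.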
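A better way would be to select the row and column together with iloc. Note that iloc only accepts integer positions, so the column positions are looked up first. To produce the Wanted column, each row is compared with the one before it, which also keeps the index inside the frame. E.g.
anz_analysis["Action"] = anz_analysis["Signal"]
signal_col = anz_analysis.columns.get_loc("Signal")  # iloc needs integer positions, not labels
action_col = anz_analysis.columns.get_loc("Action")
for i in range(1, len(anz_analysis)):  # start at 1 so that row i - 1 exists
    if (anz_analysis.iloc[i, signal_col] == "Buy") and (anz_analysis.iloc[i - 1, signal_col] == "Buy"):
        anz_analysis.iloc[i, action_col] = "Maintain"
    elif (anz_analysis.iloc[i, signal_col] == "Sell") and (anz_analysis.iloc[i - 1, signal_col] == "Sell"):
        anz_analysis.iloc[i, action_col] = "Maintain"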
Edit
Updated the elif statement as per #6502's observation of the likely error in OP's code there. Also decided to remove the spurious start value of 0 in range.
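For comparison, the same result can probably be reached without a loop. The following is a minimal sketch, assuming the goal is exactly the Wanted column from the question: it compares each row with the previous one via shift() and replaces repeats with "Maintain". The sample frame is rebuilt here from the question's table.

```python
import pandas as pd

# Sample data taken from the "Current" column in the question.
anz_analysis = pd.DataFrame({"Signal": ["Buy", "Buy", "Buy", "Sell", "Sell", "Sell"]})

# True where a row repeats the signal of the row directly above it.
# The first row is compared with NaN, which is never equal, so it stays False.
repeated = anz_analysis["Signal"].eq(anz_analysis["Signal"].shift())

# Keep the original signal, except where it is a repeat.
anz_analysis["Action"] = anz_analysis["Signal"].mask(repeated, "Maintain")

print(anz_analysis)
```

Because the comparison produces a whole boolean Series that is used as a mask, no if statement is involved and the ambiguity error cannot occur.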
Related
I am trying to build a function that checks every column of the DataFrame and, if it finds a string in the column, encodes that string as an int. The function also ensures that the binary model label is rounded to 1 or 0.
However, this gets me a ValueError:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This error is caused by everything within an IF statement, which makes me wonder if there is a better way to approach this problem?
This is what I got so far:
credit_key = "consumer"
model_label = "sc"
def df_column_headers(df):
    headers = []
    for column in df.columns:
        headers.append(column)
    return headers
def control_df_modelling(path_df_modelling, credit_key, model_label):
    path_df_modelling = path_df_modelling + f"{file_name_modelling}_{credit_key}.{file_format_modelling}"
    df_modelling = pd.read_csv(path_df_modelling)
    print(f"Conducting Quality Checks: DF Modelling {credit_key}")
    count_string_values = 0
    df_modelling[model_label] = [round(x) for x in df_modelling[model_label]]
    columns_df_modelling = df_column_headers(df_modelling)
    for column in df_modelling[columns_df_modelling]:
        column_element = df_modelling[column]
        if column_element != np.number:
            print(f"String Value Found: {column_element}")
            print(f"String Value {column_element} is Converted into {hash(column_element)}")
            count_string_values += 1
            column_element = hash(column_element)
        else:
            print(f"No String Values in DF {credit_key} Found")
    print(f"Total Number of String Values Found in DF Modelling {credit_key}: {count_string_values}")
    return df_modelling
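The error in the code above comes from column_element != np.number, which compares a whole column element-wise and hands the resulting Series to if. One possible way around it is to test the column's dtype instead. The sketch below is only an illustration under stated assumptions: the sample frame and its column names are made up, and pd.factorize is swapped in for hash(), since hash() of a str is not stable between Python runs and a Series itself cannot be hashed.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical sample data standing in for the CSV read in the question.
df_modelling = pd.DataFrame({
    "sc": [0.2, 0.7, 0.9],
    "age": [31, 45, 52],
    "region": ["north", "south", "north"],
})
model_label = "sc"

# Round the binary label to 0/1 in one vectorized step.
df_modelling[model_label] = df_modelling[model_label].round().astype(int)

count_string_columns = 0
for column in df_modelling.columns:
    # The dtype check returns a plain bool, so it is safe inside `if`.
    if not is_numeric_dtype(df_modelling[column]):
        # factorize maps each distinct value to an integer code (0, 1, 2, ...).
        codes, uniques = pd.factorize(df_modelling[column])
        df_modelling[column] = codes
        count_string_columns += 1

print(f"Non-numeric columns encoded: {count_string_columns}")
```

Note that the encoded values have to be assigned back to the DataFrame; rebinding a local variable such as column_element leaves the frame unchanged.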
I am trying to create a for loop that fills in missing values across 50+ variables. The logic I have applied is that if a variable (cols) fulfils mode>median>mean or mode<median<mean (i.e. skewed) the missing values within the variable should be filled with the median of the variable. If the mode=median=mean (i.e. normal distribution) then the variable missing values should be filled with the mean of the variable. If the variable then does not fulfil the conditions, the missing values within the variable are filled with the median. I have been getting the following error:-
‘ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().’
I have a slight understanding of the error however am unsure of how to solve the problem. I began taking the approach using if condition statements for pandas but still got an error. I have pasted below my code. Many thanks for your help in advance!
Approach 1
#filling data based on the variable distribution
for cols in num_cols2:
    if ((df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())) | ((df[cols].mean() > df[cols].median()) & (df[cols].median() > df[cols].mode())):
        df[cols]=df[cols].fillna(df.median())
    elif ((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode())):
        df[cols]=df[cols].fillna(df.mean().iloc[0])
    else:
        df[cols]=df[cols].fillna(df.median())
Error message below
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Users/admin/Library/CloudStorage/OneDrive-Personal/DA Material/Data Science 6/EDAPipeDetectionleak.ipynb Cell 34 in <cell line: 3>()
1 #filling data based on distribution
3 for cols in num_cols2:
----> 4 if ((df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())) | ((df[cols].mean() > df[cols].median()) & (df[cols].median() > df[cols].mode())):
5 df[cols]=df[cols].fillna(df.median())
6 elif ((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode())):
File /opt/homebrew/lib/python3.10/site-packages/pandas/core/generic.py:1527, in NDFrame.__nonzero__(self)
1525 #final
1526 def __nonzero__(self):
-> 1527 raise ValueError(
1528 f"The truth value of a {type(self).__name__} is ambiguous. "
1529 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
1530 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I also tried the following approaches:-
Approach 2 outputted the same error as above
for cols in num_cols2:
    df[cols] = df[cols].apply(lambda cols:(df[cols].fillna(df.median()))) if (df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode()) else (df[cols].fillna(df.mean()))
Approach 3
for cols in num_cols2:
    df.loc[(df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())] = df[cols].fillna(df.median())
    df.loc[df[cols].mean() > df[cols].median() & (df[cols].median() > df[cols].mode())] = df[cols].fillna(df.median())
    df.loc[((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode()))] = df[cols].fillna(df.mean().iloc[0])
for cols in num_cols2:
    df[cols] = df.loc[(df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())] = df[cols].fillna(df.median())
    df[cols] = df.loc[df[cols].mean() > df[cols].median() & (df[cols].median() > df[cols].mode())] = df[cols].fillna(df.median())
    df[cols] = df.loc[((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode()))] = df[cols].fillna(df.mean().iloc[0])
Error output for approach 3 is shown below
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
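Because mean() and median() return scalars, use the Python keywords "and" / "or" instead of & / |. Series.mode() returns a Series (a column can have several modes), which is what triggers the error, so take its first value:
for col in num_cols2:
    avg = df[col].mean()
    med = df[col].median()
    mod = df[col].mode().iat[0]
    if (avg == med) and (med == mod):
        df[col]=df[col].fillna(avg)
    else:
        df[col]=df[col].fillna(med)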
But because the mean equals the median whenever the if condition above holds, both branches fill with the same value, so you can simplify the solution by replacing missing values with the median:
df[num_cols2] = df[num_cols2].fillna(df[num_cols2].median())
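To make the equivalence concrete, here is a small sketch on made-up data (the column names and values are assumptions for illustration only). It runs the loop on a copy and checks that the one-line median fill gives the same frame.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" is skewed, "b" has mean == median == mode.
df = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, np.nan, 10.0],
    "b": [3.0, np.nan, 3.0, 3.0, 3.0],
})
num_cols2 = ["a", "b"]

# Loop version: scalar comparisons combined with `and`.
looped = df.copy()
for col in num_cols2:
    avg = looped[col].mean()
    med = looped[col].median()
    mod = looped[col].mode().iat[0]  # mode() returns a Series; take the first mode
    if (avg == med) and (med == mod):
        looped[col] = looped[col].fillna(avg)
    else:
        looped[col] = looped[col].fillna(med)

# One-line version: fill every column with its own median.
filled = df[num_cols2].fillna(df[num_cols2].median())

print(looped.equals(filled))
```

Passing df[num_cols2].median() (a Series indexed by column name) to fillna fills each column with its own median, which is what the original attempts with df.median() inside the loop were aiming for.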
I am implementing my own function for calculating taxes. Below you can see the data
df = pd.DataFrame({"id_n":["1","2","3","4","5"],
"sales1":[0,115000,440000,500000,740000],
"sales2":[0,115000,460000,520000,760000],
"tax":[0,8050,57500,69500,69500]
})
Now I want to introduce a tax function that needs to give the same results as results in column tax. Below you can see an estimation of that function:
# Thresholds
min_threeshold = 500000
max_threeshold = 1020000
# Maximum taxes
max_cap = 69500
# Rates
rate_1 = 0.035
rate_2 = 0.1
# Total sales
total_sale = df['sales1'] + df['sales2']
tax = df['tax']
# Function for estimation
def tax_fun(total_sale,tax,min_threeshold,max_threeshold,max_cap,rate_1,rate_2):
    if (total_sale > 0 and tax == 0):
        calc_tax = 0
    elif (total_sale < min_threeshold):
        calc_tax = total_sale * rate_1
    elif (total_sale >= min_threeshold) & (total_sale <= max_threeshold):
        calc_tax = total_sale * rate_2
    elif (total_sale > max_threeshold):
        calc_tax = max_cap
    return calc_tax
So far so good. The next step is the execution of the above function. Below you can see the command :
df['new_tax']=tax_fun(total_sale,tax,min_threeshold,max_threeshold,max_cap,rate_1,rate_2)
After execution of this command, I received this error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So can anybody help me how to solve this problem?
You want to use df.apply(), you're almost there!
def someFunc(value):
    # do something here
    result = value  # placeholder so the function runs
    return result
def main():
    # setup df
    new_col = 'col_name'
    # axis=1 passes each row to the lambda; the default (axis=0) passes each column
    df[new_col] = df.apply(lambda x: someFunc(x['some_col']), axis=1)
This works because, with axis=1, x is each row in turn, so you can pass data from the row to your custom function. Note that apply with axis=1 still calls the function once per row in Python, so it is convenient rather than fast.
With axis=1, x is a Series indexed by the column names, so x['some_col'] returns that row's value. Also, x can be named anything.
Docs here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
Hope this helps!
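Applied to the tax question, the idea might look like the sketch below. It reuses the question's data and branch logic unchanged, so it only removes the error; it makes no claim that the resulting values match the existing tax column. The parameter names with the corrected spelling "threshold" and the use of default arguments are choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "id_n": ["1", "2", "3", "4", "5"],
    "sales1": [0, 115000, 440000, 500000, 740000],
    "sales2": [0, 115000, 460000, 520000, 760000],
    "tax": [0, 8050, 57500, 69500, 69500],
})

def tax_fun(total_sale, tax, min_threshold=500000, max_threshold=1020000,
            max_cap=69500, rate_1=0.035, rate_2=0.1):
    # Every argument is a single number here, so `and` and `if` are unambiguous.
    if total_sale > 0 and tax == 0:
        return 0
    if total_sale < min_threshold:
        return total_sale * rate_1
    if total_sale <= max_threshold:
        return total_sale * rate_2
    return max_cap

# axis=1 hands one row at a time to the lambda.
df["new_tax"] = df.apply(
    lambda row: tax_fun(row["sales1"] + row["sales2"], row["tax"]),
    axis=1,
)

print(df[["sales1", "sales2", "tax", "new_tax"]])
```

The original call failed because total_sale and tax were whole columns, so every comparison inside tax_fun produced a Series rather than a single True or False.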
I'm getting error
"ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()".
I have tried replacing and with '&' but still it didn't work.
roc_table = pd.DataFrame(columns = ['score','TP', 'FP','TN','FN'])
TP=0
FP=0
TN=0
FN=0
df_roc = pd.DataFrame([["score", "status"], [100, "accept"], [-80, "reject"]])
for score in range(-100,-80,5):
    for row in df_roc.iterrows():
        if (df_roc['score'] >= score) & (df_roc['status'] == 'reject'):
            TP=TP+1
        elif (df_roc['score'] >= score) & (df_roc['status'] == 'accept'):
            FP=FP+1
        elif (df_roc['score'] < score) & (df_roc['status'] == 'accept'):
            TN=TN+1
        elif (df_roc['score'] < score) & (df_roc['status'] == 'reject'):
            FN=FN+1
    dict = {'score':score, 'TP': TP, 'FP': FP, 'TN': TN,'FN':FN}
    roc_table = roc_table.append(dict, ignore_index = True)
sample of df_roc:
score  status
100    accept
-80    reject
df_roc['score'] >= score
The error is telling you that this comparison makes no sense.
df_roc['score'] is a column containing many values. Some may be less than score, and others may be greater.
Since this code is inside a for row in df_roc.iterrows() loop, I think you intended to compare just the score from the current row, but that's not what you actually did.
I replaced "df_roc['score']" with "row['score']" in the code and it worked. Thanks
(row['score'] >= score) & (row['status'] == 'reject')
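Put together, a working version of the loop could look roughly like this. It is a sketch with a few assumptions that are not in the original post: the sample frame is built with named columns, the counters are reset for every threshold, and the rows are collected in a list because DataFrame.append was removed in pandas 2.0. Note also that iterrows() yields (index, row) pairs, so the pair is unpacked in the for statement.

```python
import pandas as pd

# Sample data from the question, built with named columns.
df_roc = pd.DataFrame({"score": [100, -80], "status": ["accept", "reject"]})

records = []
for score in range(-100, -80, 5):
    # Assumption: the counters restart for every threshold.
    TP = FP = TN = FN = 0
    # iterrows() yields (index, row) pairs; row is a Series for one record.
    for _, row in df_roc.iterrows():
        # row["score"] and row["status"] are scalars, so `and` works here.
        if row["score"] >= score and row["status"] == "reject":
            TP += 1
        elif row["score"] >= score and row["status"] == "accept":
            FP += 1
        elif row["score"] < score and row["status"] == "accept":
            TN += 1
        elif row["score"] < score and row["status"] == "reject":
            FN += 1
    records.append({"score": score, "TP": TP, "FP": FP, "TN": TN, "FN": FN})

# Build the result table once, instead of appending row by row.
roc_table = pd.DataFrame(records, columns=["score", "TP", "FP", "TN", "FN"])
print(roc_table)
```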
What I am trying:
import re
new_df = census_df.loc[(census_df['REGION']==1 | census_df['REGION']== 2) & (census_df['CTYNAME'].str.contains('^Washington[a-z]*'))& (census_df['POPESTIMATE2015']>census_df['POPESTIMATE2014'])]
new_df
It returns this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
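You need to put parentheses around each comparison. The operators | and & bind more tightly than ==, so census_df['REGION']==1 | census_df['REGION']== 2 is parsed as the chained comparison census_df['REGION'] == (1 | census_df['REGION']) == 2, and Python then tries to turn a Series into a single True/False. With the parentheses in place, the first filter becomes: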
filt_1 = (census_df['REGION'] == 1) | (census_df['REGION'] == 2)
Note that my data for census_df is semi-fictitious but shows the functionality. Everything from the filt_1 assignment operation and downwards will still work for your entire census_df dataframe. This is the full program:
import pandas as pd
cols = ['REGION', 'CTYNAME', 'POPESTIMATE2014', 'POPESTIMATE2015']
data = [[1, "Washington", 4846411, 4858979],
[3, "Autauga County", 55290, 55347]]
census_df = pd.DataFrame(data, columns=cols)
filt_1 = (census_df['REGION'] == 1) | (census_df['REGION'] == 2)
filt_2 = census_df['CTYNAME'].str.contains("^Washington[a-z]*")
filt_3 = census_df['POPESTIMATE2015'] > census_df['POPESTIMATE2014']
filt = filt_1 & filt_2 & filt_3
new_df = census_df.loc[filt]
print(new_df)
Returns:
   REGION     CTYNAME  POPESTIMATE2014  POPESTIMATE2015
0       1  Washington          4846411          4858979
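As a possible alternative to the two parenthesized comparisons, Series.isin expresses "REGION is 1 or 2" in a single call and avoids the precedence pitfall altogether. The sketch below uses a small made-up frame (the second row is invented to exercise the filters) and checks that both forms give the same mask.

```python
import pandas as pd

# Semi-fictitious sample data; the "Washington County" row is invented.
census_df = pd.DataFrame(
    [[1, "Washington", 4846411, 4858979],
     [2, "Washington County", 100, 90],
     [3, "Autauga County", 55290, 55347]],
    columns=["REGION", "CTYNAME", "POPESTIMATE2014", "POPESTIMATE2015"],
)

# Two equivalent ways to select regions 1 and 2.
with_parentheses = (census_df["REGION"] == 1) | (census_df["REGION"] == 2)
with_isin = census_df["REGION"].isin([1, 2])

# Each comparison is wrapped in parentheses before combining with &.
mask = (
    with_isin
    & census_df["CTYNAME"].str.contains("^Washington[a-z]*")
    & (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
)
new_df = census_df.loc[mask]
print(new_df)
```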