Filling missing data based on the variable data distribution - python

I am trying to write a for loop that fills in missing values across 50+ variables. The logic is: if a variable (cols) satisfies mode > median > mean or mode < median < mean (i.e. it is skewed), its missing values should be filled with the median of the variable. If mode = median = mean (i.e. a normal distribution), the missing values should be filled with the mean of the variable. If the variable fulfils neither condition, its missing values are filled with the median. I keep getting the following error:
‘ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().’
I partly understand the error but am unsure how to solve the problem. I started with plain if statements on the pandas objects but still got an error. My code is pasted below. Many thanks in advance for your help!
Approach 1
#filling data based on the variable distribution
for cols in num_cols2:
    if ((df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())) | ((df[cols].mean() > df[cols].median()) & (df[cols].median() > df[cols].mode())):
        df[cols]=df[cols].fillna(df.median())
    elif ((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode())):
        df[cols]=df[cols].fillna(df.mean().iloc[0])
    else:
        df[cols]=df[cols].fillna(df.median())
Error message below
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Users/admin/Library/CloudStorage/OneDrive-Personal/DA Material/Data Science 6/EDAPipeDetectionleak.ipynb Cell 34 in <cell line: 3>()
1 #filling data based on distribution
3 for cols in num_cols2:
----> 4 if ((df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())) | ((df[cols].mean() > df[cols].median()) & (df[cols].median() > df[cols].mode())):
5 df[cols]=df[cols].fillna(df.median())
6 elif ((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode())):
File /opt/homebrew/lib/python3.10/site-packages/pandas/core/generic.py:1527, in NDFrame.__nonzero__(self)
1525 #final
1526 def __nonzero__(self):
-> 1527 raise ValueError(
1528 f"The truth value of a {type(self).__name__} is ambiguous. "
1529 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
1530 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I also tried the following approaches:-
Approach 2 produced the same error as above
for cols in num_cols2:
    df[cols] = df[cols].apply(lambda cols:(df[cols].fillna(df.median()))) if (df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode()) else (df[cols].fillna(df.mean()))
Approach 3
for cols in num_cols2:
    df.loc[(df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())] = df[cols].fillna(df.median())
    df.loc[df[cols].mean() > df[cols].median() & (df[cols].median() > df[cols].mode())] = df[cols].fillna(df.median())
    df.loc[((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode()))] = df[cols].fillna(df.mean().iloc[0])
for cols in num_cols2:
    df[cols] = df.loc[(df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())] = df[cols].fillna(df.median())
    df[cols] = df.loc[df[cols].mean() > df[cols].median() & (df[cols].median() > df[cols].mode())] = df[cols].fillna(df.median())
    df[cols] = df.loc[((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode()))] = df[cols].fillna(df.mean().iloc[0])
Error output for approach 3 is shown below
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

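Because mean() and median() return scalars, combine the comparisons with and / or instead of & / |. Series.mode() returns a Series (a column can have several modes), so take its first value: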
for col in num_cols2:
    avg = df[col].mean()
    med = df[col].median()
    mod = df[col].mode().iat[0]
    if (avg == med) and (med == mod):
        df[col]=df[col].fillna(avg)
    else:
        df[col]=df[col].fillna(med)
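But because the mean equals the median whenever the if condition above is true, both branches fill with the same value. You can therefore simplify the solution and replace the missing values in every column with that column's median: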
df[num_cols2] = df[num_cols2].fillna(df[num_cols2].median())
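The point above can be checked on a small example. This is only a sketch: the DataFrame and its column names "a" and "b" are made-up sample data, and num_cols2 is assumed to be a list of numeric column names as in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: "a" is symmetric (mean == median == mode),
# "b" is skewed. Each column has one missing value.
df = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, 3.0, np.nan],
    "b": [1.0, 1.0, 2.0, 10.0, np.nan],
})
num_cols2 = ["a", "b"]

# mode() returns a Series, so comparing it with a scalar gives a boolean
# Series, which an if statement cannot evaluate.
print(type(df["a"].mode()))
print(df["a"].median() < df["a"].mode())

# Loop version: every statistic is reduced to a scalar first.
filled = df.copy()
for col in num_cols2:
    avg = filled[col].mean()
    med = filled[col].median()
    mod = filled[col].mode().iat[0]  # first mode only
    fill_value = avg if (avg == med and med == mod) else med
    filled[col] = filled[col].fillna(fill_value)

# Simplified version: fill every column with its own median.
simplified = df.copy()
simplified[num_cols2] = simplified[num_cols2].fillna(simplified[num_cols2].median())

print(filled)
print(filled.equals(simplified))
```

Both versions fill the same values here, because the mean is only used when it equals the median.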

Related

Python: Trying to encode all DF strings into ints

I am trying to build a function that checks every column of the DataFrame and, if it finds a string in a column, encodes that string as an int. The function also ensures that the binary model label is rounded to 1 or 0.
However, this raises:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The error seems to come from the if statement, which makes me wonder if there is a better way to approach this problem.
This is what I have so far:
credit_key = "consumer"
model_label = "sc"
def df_column_headers(df):
    headers = []
    for column in df.columns:
        headers.append(column)
    return headers
def control_df_modelling(path_df_modelling, credit_key, model_label):
    path_df_modelling = path_df_modelling + f"{file_name_modelling}_{credit_key}.{file_format_modelling}"
    df_modelling = pd.read_csv(path_df_modelling)
    print(f"Conducting Quality Checks: DF Modelling {credit_key}")
    count_string_values = 0
    df_modelling[model_label] = [round(x) for x in df_modelling[model_label]]
    columns_df_modelling = df_column_headers(df_modelling)
    for column in df_modelling[columns_df_modelling]:
        column_element = df_modelling[column]
        if column_element != np.number:
            print(f"String Value Found: {column_element}")
            print(f"String Value {column_element} is Converted into {hash(column_element)}")
            count_string_values += 1
            column_element = hash(column_element)
        else:
            print(f"No String Values in DF {credit_key} Found")
    print(f"Total Number of String Values Found in DF Modelling {credit_key}: {count_string_values}")
    return df_modelling
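In the code above, column_element is a whole column (a Series), so column_element != np.number is an element-wise comparison and the if statement cannot reduce it to one True or False. A minimal sketch of one possible alternative follows. It checks the column's dtype with pandas.api.types.is_numeric_dtype and, as a swapped-in technique, uses pd.factorize instead of hash() (hash() cannot be applied to a Series, and string hashes can differ between interpreter runs). The function name encode_string_columns and the sample data are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def encode_string_columns(df, model_label):
    """Return a copy with the label rounded and non-numeric columns encoded.

    Assumes the label column has no missing values.
    """
    out = df.copy()
    out[model_label] = out[model_label].round().astype(int)
    encoded = []
    for column in out.columns:
        # Decide per column by dtype, not by comparing the values themselves.
        if not is_numeric_dtype(out[column]):
            codes, _ = pd.factorize(out[column])  # one integer per distinct value
            out[column] = codes
            encoded.append(column)
    return out, encoded


# Hypothetical sample data standing in for the CSV from the question.
df_modelling = pd.DataFrame({
    "sc": [0.2, 0.9, 1.0],
    "purpose": ["car", "house", "car"],
    "amount": [1000, 2500, 400],
})

result, encoded_columns = encode_string_columns(df_modelling, "sc")
print(encoded_columns)
print(result)
```

Note that pd.factorize assigns codes in order of first appearance, so the mapping depends on the row order of the input.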

The truth value of a Series is ambiguous. I have tried using the available answers but nothing worked

I'm getting the error
"ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()".
I have tried replacing and with '&' but it still didn't work.
roc_table = pd.DataFrame(columns = ['score','TP', 'FP','TN','FN'])
TP=0
FP=0
TN=0
FN=0
df_roc = pd.DataFrame([["score", "status"], [100, "accept"], [-80, "reject"]])
for score in range(-100,-80,5):
    for row in df_roc.iterrows():
        if (df_roc['score'] >= score) & (df_roc['status'] == 'reject'):
            TP=TP+1
        elif (df_roc['score'] >= score) & (df_roc['status'] == 'accept'):
            FP=FP+1
        elif (df_roc['score'] < score) & (df_roc['status'] == 'accept'):
            TN=TN+1
        elif (df_roc['score'] < score) & (df_roc['status'] == 'reject'):
            FN=FN+1
    dict = {'score':score, 'TP': TP, 'FP': FP, 'TN': TN,'FN':FN}
    roc_table = roc_table.append(dict, ignore_index = True)
sample of df_roc:
score  status
100    accept
-80    reject
df_roc['score'] >= score
The error is telling you that this comparison makes no sense.
df_roc['score'] is a column containing many values. Some may be less than score, and others may be greater.
Since this code is inside a for row in df_roc.iterrows() loop, I think you intended to compare just the score from the current row, but that's not what you actually did.
I replaced "df_roc['score']" with "row['score']" in the code and it worked. Thanks
(row['score'] >= score) & (row['status'] == 'reject')
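A self-contained sketch of the row-wise version is shown below, under a few assumptions that are not in the original post: df_roc is built with named columns, the counters are reset for each threshold, and the rows are collected in a list instead of using DataFrame.append (which was removed in pandas 2.0). Note that iterrows() yields (index, row) pairs, so the loop unpacks them.

```python
import pandas as pd

# Hypothetical sample matching the table in the question.
df_roc = pd.DataFrame({"score": [100, -80], "status": ["accept", "reject"]})

rows = []
for threshold in range(-100, -80, 5):
    tp = fp = tn = fn = 0  # assumption: counts are per threshold
    for _, row in df_roc.iterrows():
        # row["score"] and row["status"] are single values, so plain
        # and / or work and no Series ever reaches the if statement.
        above = row["score"] >= threshold
        if above and row["status"] == "reject":
            tp += 1
        elif above and row["status"] == "accept":
            fp += 1
        elif not above and row["status"] == "accept":
            tn += 1
        elif not above and row["status"] == "reject":
            fn += 1
    rows.append({"score": threshold, "TP": tp, "FP": fp, "TN": tn, "FN": fn})

roc_table = pd.DataFrame(rows)
print(roc_table)
```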

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()/ select subset from data [duplicate]

My input file has the following form:
gold,Program,MethodType,CallersT,CallersN,CallersU,CallersCallersT,CallersCallersN,CallersCallersU,CalleesT,CalleesN,CalleesU,CalleesCalleesT,CalleesCalleesN,CalleesCalleesU,CompleteCallersCallees,classGold
T,chess,Inner,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Medium,-1,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
T,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,Low,-1,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
....
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,Low,-1,Low,-1,Low,-1,-1,-1,0,Trace,
T,chess,Inner,Low,-1,-1,Medium,-1,-1,Low,-1,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
I would like to select the rows where (CallersU is either Low or -1) AND (CalleesU is either Low or -1).
Here is the code I am using below:
import pandas as pd
SeparateProjectLearning=False
CompleteCallersCallees=False
PartialTrainingSetCompleteCallersCallees=True
def main():
    dataset = pd.read_csv( 'InputData.txt', sep= ',', index_col=False)
    #convert strings into 1 and N into 0
    dataset['gold'] = dataset['gold'].astype('category').cat.codes
    dataset['Program'] = dataset['Program'].astype('category').cat.codes
    dataset['classGold'] = dataset['classGold'].astype('category').cat.codes
    dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes
    dataset['CallersT'] = dataset['CallersT'].astype('category').cat.codes
    dataset['CallersN'] = dataset['CallersN'].astype('category').cat.codes
    dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes
    dataset['CallersCallersT'] = dataset['CallersCallersT'].astype('category').cat.codes
    dataset['CallersCallersN'] = dataset['CallersCallersN'].astype('category').cat.codes
    dataset['CallersCallersU'] = dataset['CallersCallersU'].astype('category').cat.codes
    dataset['CalleesT'] = dataset['CalleesT'].astype('category').cat.codes
    dataset['CalleesN'] = dataset['CalleesN'].astype('category').cat.codes
    dataset['CalleesU'] = dataset['CalleesU'].astype('category').cat.codes
    dataset['CalleesCalleesT'] = dataset['CalleesCalleesT'].astype('category').cat.codes
    dataset['CalleesCalleesN'] = dataset['CalleesCalleesN'].astype('category').cat.codes
    dataset['CalleesCalleesU'] = dataset['CalleesCalleesU'].astype('category').cat.codes
    print(dataset)
    CompleteSet = dataset[(dataset['CallersU']==0 or dataset['CallersU']==2)
                          and (dataset['CalleesU']==0 or dataset['CalleesU']==2)]
    print(CompleteSet)
if __name__=="__main__":
    main()
I am using the line dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes to convert the string values of CallersU into integer codes, and the equivalent line for CalleesU. CallersU and CalleesU can take four values: -1, Low, Medium and High. The ...astype('category').cat.codes call automatically makes the following conversions: -1 becomes 0, High becomes 1, Low becomes 2 and Medium becomes 3. I am therefore using the line CompleteSet = dataset[(dataset['CallersU']==0 or dataset['CallersU']==2) and (dataset['CalleesU']==0 or dataset['CalleesU']==2)] to select only the rows with (CallersU==0 OR CallersU==2) AND (CalleesU==0 OR CalleesU==2). The problem is that this line raises ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). How can I fix that and get the selection I need?
Replace and with &, replace or with |, and wrap each comparison in its own parentheses:
CompleteSet = dataset[((dataset['CallersU'] == 0) | (dataset['CallersU'] == 2)) & ((dataset['CalleesU']==0) | (dataset['CalleesU']==2))]
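As a small runnable illustration of that fix, the sketch below applies the same mask to made-up data (the values are hypothetical codes, not taken from the input file) and shows Series.isin as an equivalent, shorter way to express "equal to 0 or 2".

```python
import pandas as pd

# Hypothetical subset of the question's columns, already converted to codes.
dataset = pd.DataFrame({
    "CallersU": [0, 2, 1, 3, 0],
    "CalleesU": [2, 1, 0, 2, 0],
})

# Element-wise operators: | for or, & for and, each comparison parenthesised.
mask = ((dataset["CallersU"] == 0) | (dataset["CallersU"] == 2)) & (
    (dataset["CalleesU"] == 0) | (dataset["CalleesU"] == 2)
)
complete_set = dataset[mask]

# Same selection written with isin.
wanted = [0, 2]
complete_set_isin = dataset[
    dataset["CallersU"].isin(wanted) & dataset["CalleesU"].isin(wanted)
]

print(complete_set)
```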

Print out a specific set of rows of a dataset based on conditions

What I am trying:
import re
new_df = census_df.loc[(census_df['REGION']==1 | census_df['REGION']== 2) & (census_df['CTYNAME'].str.contains('^Washington[a-z]*'))& (census_df['POPESTIMATE2015']>census_df['POPESTIMATE2014'])]
new_df
It returns this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You need to put parentheses around each comparison in the REGION condition, because | binds more tightly than == in Python:
filt_1 = (census_df['REGION'] == 1) | (census_df['REGION'] == 2)
Note that my data for census_df is semi-fictitious but shows the functionality. Everything from the filt_1 assignment operation and downwards will still work for your entire census_df dataframe. This is the full program:
import pandas as pd
cols = ['REGION', 'CTYNAME', 'POPESTIMATE2014', 'POPESTIMATE2015']
data = [[1, "Washington", 4846411, 4858979],
[3, "Autauga County", 55290, 55347]]
census_df = pd.DataFrame(data, columns=cols)
filt_1 = (census_df['REGION'] == 1) | (census_df['REGION'] == 2)
filt_2 = census_df['CTYNAME'].str.contains("^Washington[a-z]*")
filt_3 = census_df['POPESTIMATE2015'] > census_df['POPESTIMATE2014']
filt = filt_1 & filt_2 & filt_3
new_df = census_df.loc[filt]
print(new_df)
Returns:
   REGION     CTYNAME  POPESTIMATE2014  POPESTIMATE2015
0       1  Washington          4846411          4858979
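To see why the parentheses matter, here is a minimal sketch with a made-up Series named region. Without parentheses, Python evaluates 1 | region first and then treats the rest as a chained comparison, which needs a single truth value for a Series and therefore fails.

```python
import pandas as pd

region = pd.Series([1, 2, 3])

# Without parentheses this parses as the chained comparison
#   region == (1 | region) == 2
# and chaining requires the truth value of a whole Series.
try:
    region == 1 | region == 2
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# With parentheses each comparison is evaluated first, then combined
# element-wise with |.
mask = (region == 1) | (region == 2)

print(error_message)
print(mask)
```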

Ambiguous Truth Value - For Loop & If Statements

Chasing the reasoning behind the following error message.
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This is the code I am running:
anz_analysis["Action"] = anz_analysis["Signal"]
for i in range(0, len(anz_analysis) + 1):
    if ((anz_analysis["Signal"].iloc[[i]] == "Buy") & (anz_analysis["Signal"].iloc[[i+1]] == "Buy")):
        anz_analysis["Action"] = anz_analysis["Signal"].iloc[[i]] = "Maintain"
    elif (anz_analysis["Signal"].iloc[[i]] = "Sell") & (anz_analysis["Signal"].iloc[[i+1]] = "Sell"):
        anz_analysis["Signal"].iloc[[i]] = "Maintain"
The dataframe looks like this:
Current: Wanted:
_______|_________
1|Buy | Buy
2|Buy | Maintain
3|Buy | Maintain
4|Sell | Sell
5|Sell | Maintain
6|Sell | Maintain
Any help would be appreciated.
Using .iloc[[i]] (a list inside the brackets) returns a Series, not a single value. So even though there is only one value in that Series, the comparison produces a Series of booleans, and the ValueError arises because if needs a single True or False.
One way to deal with this is to use .iloc[i] with a plain integer. That extracts the string itself and allows the comparison to be made between values of the same type.
A better way would be to select the row and column with iloc. E.g.
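anz_analysis["Action"] = anz_analysis["Signal"]
# iloc only accepts integer positions, so look up the column positions first
sig = anz_analysis.columns.get_loc("Signal")
act = anz_analysis.columns.get_loc("Action")
# stop one row early so that i + 1 never runs past the last row
for i in range(len(anz_analysis) - 1):
    if (anz_analysis.iloc[i, sig] == "Buy") and (anz_analysis.iloc[i + 1, sig] == "Buy"):
        # a repeated signal becomes "Maintain" in the Action column only
        anz_analysis.iloc[i + 1, act] = "Maintain"
    elif (anz_analysis.iloc[i, sig] == "Sell") and (anz_analysis.iloc[i + 1, sig] == "Sell"):
        anz_analysis.iloc[i + 1, act] = "Maintain"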
Edit
Updated the elif statement as per #6502's observation of the likely error in OP's code there. Also decided to remove the spurious start value of 0 in range.
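As an alternative to looping, the wanted column can also be sketched without any if statement by comparing each row with the one above it. This is a swapped-in technique (Series.shift plus Series.mask), not the method from the answer, and it assumes Signal only contains "Buy" and "Sell". The sample frame reproduces the table from the question.

```python
import pandas as pd

# Sample data matching the "Current" column in the question.
anz_analysis = pd.DataFrame(
    {"Signal": ["Buy", "Buy", "Buy", "Sell", "Sell", "Sell"]}
)

# True where a row repeats the signal of the row directly above it.
# The first row is compared with a missing value and comes out False.
repeats_previous = anz_analysis["Signal"] == anz_analysis["Signal"].shift()

# mask() replaces the values where the condition is True.
anz_analysis["Action"] = anz_analysis["Signal"].mask(repeats_previous, "Maintain")

print(anz_analysis)
```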
