SMOTE in python - python

I am trying to use SMOTE in Python and am looking for a way to manually specify the number of minority samples.
Suppose we have 100 records of one class and 10 records of another class. If we use ratio = 1 we get 100:100; if we use ratio 1/2, we get 100:200. I am looking for a way to manually specify the number of instances to be generated for each of the classes.
Ndf_class_0_records = trainData[trainData['DIED'] == 0]
Ndf_class_1_records = trainData[trainData['DIED'] == 1]
Ndf_class_0_record_counts = Ndf_class_0_records.DIED.value_counts()
Ndf_class_1_record_counts = Ndf_class_1_records.DIED.value_counts()
X_smote = trainData.drop("DIED", axis=1)
y_smote = trainData["DIED"]
smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)
In the above code, I am trying to manually specify the number for each of the classes, but I am getting the following error at the last line of code
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

If I understand you and the documentation correctly, the values in your ratio dict are not numbers. value_counts() returns a pandas Series, so each dict value is a Series object rather than an integer count.
The accepted types for ratio are:
float, str, dict or callable, (default=’auto’)
Please try doing:
Ndf_class_0_records = trainData[trainData['DIED'] == 0]
Ndf_class_1_records = trainData[trainData['DIED'] == 1]
Ndf_class_0_record_counts = len(Ndf_class_0_records) ##### CHANGED THIS
Ndf_class_1_record_counts = len(Ndf_class_1_records) ##### CHANGED THIS
X_smote = trainData.drop("DIED", axis=1)
y_smote = trainData["DIED"]
smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)
This should now work, please try!
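As a sketch of the same fix, the per-class counts can also be read straight from the label column as plain integers. The toy trainData below is made up for illustration. The commented-out SMOTE call is an assumption about recent imbalanced-learn releases, where, to my knowledge, ratio was renamed to sampling_strategy and fit_sample to fit_resample; check it against your installed version.

```python
import pandas as pd

# Hypothetical stand-in for trainData: 100 rows of class 0, 10 rows of class 1.
trainData = pd.DataFrame({
    "feature": range(110),
    "DIED": [0] * 100 + [1] * 10,
})

# value_counts() returns a Series; to_dict() turns it into {label: count}.
class_counts = trainData["DIED"].value_counts().to_dict()

# Target sizes must be plain ints: keep class 0 as is, double class 1.
target_counts = {0: int(class_counts[0]), 1: int(class_counts[1]) * 2}
print(target_counts)

# Assumed usage with a recent imbalanced-learn release (not executed here):
# from imblearn.over_sampling import SMOTE
# smt = SMOTE(sampling_strategy=target_counts)
# X_res, y_res = smt.fit_resample(trainData.drop("DIED", axis=1), trainData["DIED"])
```

The dict values are ordinary Python ints, so no Series ever reaches a truth-value check.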

Related

Python: Trying to encode all DF strings into ints

I am trying to build a function that checks every column of the DF and, if it finds a string in the column, encodes that string into an int. The function also ensures that the binary model label is rounded to 1 or 0.
However, this raises a ValueError:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The error comes from the if statement, which makes me wonder if there is a better way to approach this problem.
This is what I have so far:
credit_key = "consumer"
model_label = "sc"
def df_column_headers(df):
    headers = []
    for column in df.columns:
        headers.append(column)
    return headers
def control_df_modelling(path_df_modelling, credit_key, model_label):
    path_df_modelling = path_df_modelling + f"{file_name_modelling}_{credit_key}.{file_format_modelling}"
    df_modelling = pd.read_csv(path_df_modelling)
    print(f"Conducting Quality Checks: DF Modelling {credit_key}")
    count_string_values = 0
    df_modelling[model_label] = [round(x) for x in df_modelling[model_label]]
    columns_df_modelling = df_column_headers(df_modelling)
    for column in df_modelling[columns_df_modelling]:
        column_element = df_modelling[column]
        if column_element != np.number:
            print(f"String Value Found: {column_element}")
            print(f"String Value {column_element} is Converted into {hash(column_element)}")
            count_string_values += 1
            column_element = hash(column_element)
        else:
            print(f"No String Values in DF {credit_key} Found")
    print(f"Total Number of String Values Found in DF Modelling {credit_key}: {count_string_values}")
    return df_modelling
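One possible way to approach the question above, as a sketch rather than a drop-in replacement: the test column_element != np.number compares a whole column against a type, which produces a Series and triggers the error. Checking the column's dtype gives a single True or False instead. The toy frame and the helper name encode_string_columns are made up for illustration, and pd.factorize is swapped in for hash() so that the codes are small, reproducible integers.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def encode_string_columns(df, model_label):
    """Return a copy with non-numeric columns replaced by integer codes."""
    out = df.copy()
    # Round the label column to whole numbers (0 or 1 for probabilities).
    out[model_label] = out[model_label].round().astype(int)
    encoded = []
    for column in out.columns:
        # The dtype check yields one bool per column, so `if` is unambiguous.
        if not is_numeric_dtype(out[column]):
            codes, _ = pd.factorize(out[column])
            out[column] = codes
            encoded.append(column)
    return out, encoded

# Hypothetical data: one label column, one string column, one numeric column.
toy = pd.DataFrame({
    "sc": [0.2, 0.9, 0.6],
    "purpose": ["car", "house", "car"],
    "amount": [1000, 2500, 400],
})
result, encoded = encode_string_columns(toy, "sc")
print(encoded)
print(result)
```

Unlike hash(), factorize assigns the same code to equal strings within one column and keeps the mapping available through its second return value.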

Implementing own function in pandas data frame

I am implementing my own function for calculating taxes. The data is shown below:
df = pd.DataFrame({"id_n":["1","2","3","4","5"],
"sales1":[0,115000,440000,500000,740000],
"sales2":[0,115000,460000,520000,760000],
"tax":[0,8050,57500,69500,69500]
})
Now I want to introduce a tax function that gives the same results as the tax column. Below is my estimate of that function:
# Thresholds
min_threeshold = 500000
max_threeshold = 1020000
# Maximum taxes
max_cap = 69500
# Rates
rate_1 = 0.035
rate_2 = 0.1
# Total sales
total_sale = df['sales1'] + df['sales2']
tax = df['tax']
# Function for estimation
def tax_fun(total_sale,tax,min_threeshold,max_threeshold,max_cap,rate_1,rate_2):
    if (total_sale > 0 and tax == 0):
        calc_tax = 0
    elif (total_sale < min_threeshold):
        calc_tax = total_sale * rate_1
    elif (total_sale >= min_threeshold) & (total_sale <= max_threeshold):
        calc_tax = total_sale * rate_2
    elif (total_sale > max_threeshold):
        calc_tax = max_cap
    return calc_tax
So far so good. The next step is to execute the function with this command:
df['new_tax']=tax_fun(total_sale,tax,min_threeshold,max_threeshold,max_cap,rate_1,rate_2)
After executing this command, I received this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Can anybody help me solve this problem?
You want to use df.apply() with axis=1, you're almost there!
def someFunc(value):
    # do something here
    result = value  # placeholder
    return result
def main():
    # setup df
    new_col = 'col_name'
    df[new_col] = df.apply(lambda x: someFunc(x['some_col']), axis=1)
With axis=1, x is one row at a time, so you can pass data from the row to your custom function. Note that apply calls the function once per row in Python, so it is convenient rather than fast.
In that case x is a pandas Series indexed by the column names, so x['some_col'] gives you that row's value. Also, x can be named anything.
Docs here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
Hope this helps!
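Applied to the tax question, a row-wise call might look like the sketch below. It reuses the asker's thresholds and rate logic unchanged, so it only shows how to get rid of the ValueError and does not try to reproduce the values in the tax column. Reading the constants from module level instead of passing them as arguments is a simplification made here.

```python
import pandas as pd

df = pd.DataFrame({"id_n": ["1", "2", "3", "4", "5"],
                   "sales1": [0, 115000, 440000, 500000, 740000],
                   "sales2": [0, 115000, 460000, 520000, 760000],
                   "tax": [0, 8050, 57500, 69500, 69500]})

min_threeshold = 500000
max_threeshold = 1020000
max_cap = 69500
rate_1 = 0.035
rate_2 = 0.1

def tax_fun(total_sale, tax):
    # Every argument is a scalar here, so plain `and` / `if` work as expected.
    if total_sale > 0 and tax == 0:
        return 0
    elif total_sale < min_threeshold:
        return total_sale * rate_1
    elif total_sale <= max_threeshold:
        return total_sale * rate_2
    return max_cap

# axis=1 passes each row to the lambda as a Series indexed by column name.
df["new_tax"] = df.apply(
    lambda row: tax_fun(row["sales1"] + row["sales2"], row["tax"]), axis=1
)
print(df[["sales1", "sales2", "tax", "new_tax"]])
```

The original call failed because total_sale and tax were whole columns, so each if had to decide on many values at once. Inside apply with axis=1 the function sees one row's numbers at a time.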

The truth value of a Series is ambiguous. I have tried using the available answers but nothing worked

I'm getting the error
"ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()".
I have tried replacing and with '&', but it still didn't work.
roc_table = pd.DataFrame(columns = ['score','TP', 'FP','TN','FN'])
TP=0
FP=0
TN=0
FN=0
df_roc = pd.DataFrame([["score", "status"], [100, "accept"], [-80, "reject"]])
for score in range(-100,-80,5):
    for row in df_roc.iterrows():
        if (df_roc['score'] >= score) & (df_roc['status'] == 'reject'):
            TP=TP+1
        elif (df_roc['score'] >= score) & (df_roc['status'] == 'accept'):
            FP=FP+1
        elif (df_roc['score'] < score) & (df_roc['status'] == 'accept'):
            TN=TN+1
        elif (df_roc['score'] < score) & (df_roc['status'] == 'reject'):
            FN=FN+1
    dict = {'score':score, 'TP': TP, 'FP': FP, 'TN': TN,'FN':FN}
    roc_table = roc_table.append(dict, ignore_index = True)
sample of df_roc:
score  status
100    accept
-80    reject
df_roc['score'] >= score
The error is telling you that this comparison makes no sense.
df_roc['score'] is a column containing many values. Some may be less than score, and others may be greater.
Since this code is inside a for row in df_roc.iterrows() loop, I think you intended to compare just the score from the current row, but that's not what you actually did.
I replaced "df_roc['score']" with "row['score']" in the code and it worked. Thanks
(row['score'] >= score) & (row['status'] == 'reject')
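As an alternative to looping over rows, the counts can be taken from boolean masks directly. This is a sketch under a few assumptions: df_roc is built with real column names, status only takes the values accept and reject, and the counts are computed per threshold rather than accumulated across thresholds as in the question. Rows are collected in a list because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Sample data from the question, built with proper column names.
df_roc = pd.DataFrame({"score": [100, -80], "status": ["accept", "reject"]})

rows = []
for threshold in range(-100, -80, 5):
    at_or_above = df_roc["score"] >= threshold
    is_reject = df_roc["status"] == "reject"
    # Each mask is a boolean Series; & and ~ combine them element-wise and
    # .sum() counts the True entries, so no `if` on a Series is needed.
    rows.append({
        "score": threshold,
        "TP": int((at_or_above & is_reject).sum()),
        "FP": int((at_or_above & ~is_reject).sum()),
        "TN": int((~at_or_above & ~is_reject).sum()),
        "FN": int((~at_or_above & is_reject).sum()),
    })

roc_table = pd.DataFrame(rows, columns=["score", "TP", "FP", "TN", "FN"])
print(roc_table)
```

If you keep the iterrows loop instead, note that it yields (index, row) pairs, so the loop header needs to unpack them, for example for _, row in df_roc.iterrows().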

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()/ select subset from data [duplicate]

This question already has answers here:
Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
(13 answers)
Closed 2 years ago.
My input file is under the form:
gold,Program,MethodType,CallersT,CallersN,CallersU,CallersCallersT,CallersCallersN,CallersCallersU,CalleesT,CalleesN,CalleesU,CalleesCalleesT,CalleesCalleesN,CalleesCalleesU,CompleteCallersCallees,classGold
T,chess,Inner,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Medium,-1,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
T,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,Low,-1,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
....
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,Low,-1,Low,-1,Low,-1,-1,-1,0,Trace,
T,chess,Inner,Low,-1,-1,Medium,-1,-1,Low,-1,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
and I would like to select the rows where (CallersU is equal to either Low OR -1) AND (CalleesU is equal to either Low OR -1).
Here is the code I am using below:
import pandas as pd
SeparateProjectLearning=False
CompleteCallersCallees=False
PartialTrainingSetCompleteCallersCallees=True
def main():
    dataset = pd.read_csv( 'InputData.txt', sep= ',', index_col=False)
    # convert string categories into integer codes (e.g. T into 1 and N into 0)
    dataset['gold'] = dataset['gold'].astype('category').cat.codes
    dataset['Program'] = dataset['Program'].astype('category').cat.codes
    dataset['classGold'] = dataset['classGold'].astype('category').cat.codes
    dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes
    dataset['CallersT'] = dataset['CallersT'].astype('category').cat.codes
    dataset['CallersN'] = dataset['CallersN'].astype('category').cat.codes
    dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes
    dataset['CallersCallersT'] = dataset['CallersCallersT'].astype('category').cat.codes
    dataset['CallersCallersN'] = dataset['CallersCallersN'].astype('category').cat.codes
    dataset['CallersCallersU'] = dataset['CallersCallersU'].astype('category').cat.codes
    dataset['CalleesT'] = dataset['CalleesT'].astype('category').cat.codes
    dataset['CalleesN'] = dataset['CalleesN'].astype('category').cat.codes
    dataset['CalleesU'] = dataset['CalleesU'].astype('category').cat.codes
    dataset['CalleesCalleesT'] = dataset['CalleesCalleesT'].astype('category').cat.codes
    dataset['CalleesCalleesN'] = dataset['CalleesCalleesN'].astype('category').cat.codes
    dataset['CalleesCalleesU'] = dataset['CalleesCalleesU'].astype('category').cat.codes
    print(dataset)
    CompleteSet = dataset[(dataset['CallersU']==0 or dataset['CallersU']==2)
                          and (dataset['CalleesU']==0 or dataset['CalleesU']==2)]
    print(CompleteSet)
if __name__=="__main__":
    main()
I am using the line dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes to convert the string values of CallersU into integer codes, and the equivalent line for CalleesU. The four values that CallersU/CalleesU can take are -1, Low, Medium and High. The call ...astype('category').cat.codes makes the following conversions automatically: -1 becomes 0, High becomes 1, Low becomes 2 and Medium becomes 3.
I therefore use the line CompleteSet = dataset[(dataset['CallersU']==0 or dataset['CallersU']==2) and (dataset['CalleesU']==0 or dataset['CalleesU']==2)] to select only the rows with (CallersU==0 OR CallersU==2) AND (CalleesU==0 OR CalleesU==2). The problem is that executing this line raises ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). How can I fix that and perform the selection I need?
Replace and with & and or with |, and wrap each comparison in its own parentheses, because & and | bind more tightly than ==:
CompleteSet = dataset[((dataset['CallersU'] == 0) | (dataset['CallersU'] == 2)) & ((dataset['CalleesU']==0) | (dataset['CalleesU']==2))]
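The same selection can be written a little more compactly with Series.isin, as in the sketch below. The miniature dataset is hypothetical and assumes the columns already hold the category codes described in the question.

```python
import pandas as pd

# Hypothetical miniature of the dataset, already converted to category codes
# (0 = "-1", 1 = "High", 2 = "Low", 3 = "Medium").
dataset = pd.DataFrame({
    "CallersU": [0, 2, 1, 3, 2],
    "CalleesU": [2, 1, 0, 2, 0],
})

wanted = [0, 2]
# isin() returns one boolean per row; & combines the two masks element-wise.
mask = dataset["CallersU"].isin(wanted) & dataset["CalleesU"].isin(wanted)
CompleteSet = dataset[mask]
print(CompleteSet)
```

Python's and / or ask each operand for a single truth value, which a Series cannot provide. The operators & and | work element by element and return a new boolean Series that can be used as a row filter.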

Grabbing characters after a certain value

Below is the dataframe. PIC_1 and Wgt are strings and p.lgth and p_lgth are integers. If p_lgth is not equal to 30, I want to find 42 in PIC_1 and grab 42 and the 15 digits that come after it.
PIC_1 Wgt p.lgth p_lgth
**PARTIAL-DECODE***P / 42011721930018984390078... 112 53 53
So the output from above should be 42011721930018984
My code that does not work follows:
def pic_mod(row):
    if row['p_lgth'] !=30:
        PIC_loc = row['PIC_1'].find('42')
        PIC_2 = row['PIC_1'].str[PIC_loc:PIC_loc + 15]
    elif row['p_lgth']==30:
        PIC_2=PIC_1
    return PIC_2
row_1 is just a row from the larger df that is identical to the example row given above
row_1 = df71[2:3]
pic_mod(row_1)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I did type() on the variables and got
type(df71['PIC_1']) = pandas.core.series.Series
type(df71['p_lgth']) = pandas.core.series.Series
type(df71['Wgt']) = pandas.core.series.Series
I'm fairly new to Python. Should these data types come back as int and str? df71 is a df.
According to the error message in your post, perhaps try with this one:
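def pic_mod(row):
    # row is a one-row DataFrame (e.g. df71[2:3]), so pull out scalars with .iloc[0]
    pic_1 = row['PIC_1'].iloc[0]
    if row['p_lgth'].iloc[0] != 30:
        pic_loc = pic_1.find('42')
        pic_2 = pic_1[pic_loc:pic_loc + 17]  # '42' plus the 15 characters after it
    else:
        pic_2 = pic_1
    return pic_2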
However, if your data is already structured in a pandas dataframe, you normally wouldn't write such an explicit function.
E.g. the initial filtering of all rows in the dataset where p_lgth is not equal to 30 would be a single line like:
df_fltrd = df[df['p_lgth']!=30]
Having done this, you can apply an arbitrary function to the entries in the PIC_1 column, e.g. in your case taking the substring of length 17 starting with '42':
df_fltrd['PIC_1'].apply(lambda x: x[x.find('42'):x.find('42')+17])
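A vectorized variant that keeps all rows is sketched below. It swaps find() for a regular expression, so it only matches '42' when 15 digits follow, which is slightly stricter than slicing. The two PIC_1 values are made up for illustration (the first reuses the visible start of the example row), and PIC_2 is used as the output column name.

```python
import pandas as pd

df = pd.DataFrame({
    "PIC_1": [
        "**PARTIAL-DECODE***P / 42011721930018984390078",  # hypothetical value
        "420123456789012345678901234567",  # hypothetical 30-character value
    ],
    "p_lgth": [53, 30],
})

# First occurrence of '42' followed by exactly 15 more digits (NaN if absent).
extracted = df["PIC_1"].str.extract(r"(42\d{15})", expand=False)

# where() keeps PIC_1 for rows with p_lgth == 30 and uses the extract elsewhere.
df["PIC_2"] = df["PIC_1"].where(df["p_lgth"] == 30, extracted)
print(df[["p_lgth", "PIC_2"]])
```

Rows without a match get NaN instead of a silently wrong slice, which is what x.find('42') returning -1 would produce in the lambda version.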
