Implementing own function in pandas data frame - python

I am implementing my own function for calculating taxes. Below you can see the data
df = pd.DataFrame({"id_n":["1","2","3","4","5"],
"sales1":[0,115000,440000,500000,740000],
"sales2":[0,115000,460000,520000,760000],
"tax":[0,8050,57500,69500,69500]
})
Now I want to introduce a tax function that needs to give the same results as results in column tax. Below you can see an estimation of that function:
# Thresholds
min_threeshold = 500000
max_threeshold = 1020000
# Maximum taxes
max_cap = 69500
# Rates
rate_1 = 0.035
rate_2 = 0.1
# Total sales
total_sale = df['sales1'] + df['sales2']
tax = df['tax']
# Function for estimation
def tax_fun(total_sale,tax,min_threeshold,max_threeshold,max_cap,rate_1,rate_2):
    if (total_sale > 0 and tax == 0):
        calc_tax = 0
    elif (total_sale < min_threeshold):
        calc_tax = total_sale * rate_1
    elif (total_sale >= min_threeshold) & (total_sale <= max_threeshold):
        calc_tax = total_sale * rate_2
    elif (total_sale > max_threeshold):
        calc_tax = max_cap
    return calc_tax
So far so good. The next step is the execution of the above function. Below you can see the command :
df['new_tax']=tax_fun(total_sale,tax,min_threeshold,max_threeshold,max_cap,rate_1,rate_2)
After execution of this command, I received this error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So can anybody help me how to solve this problem?

You want to use df.apply(), you're almost there!
def someFunc(value):
    # do something here with the value taken from the row
    result = value
    return result
def main():
    # setup df
    new_col = 'col_name'
    df[new_col] = df.apply(lambda x: someFunc(x['some_col']), axis=1)
This works because, with axis=1, x is one row at a time (a Series indexed by the column names), so you can pass data from the row to your custom function. Keep in mind that apply calls the function once per row in Python, so it is convenient rather than fast on large frames.
x can be named anything.
Docs here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
Hope this helps!
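As a minimal sketch of that advice applied to the question's data: the function below keeps the question's branches as written (with the thresholds and rates moved into default arguments, and the spelling `min_threshold`/`max_threshold`, which are my own choices) and is called once per row, so every comparison sees scalars. It only removes the ValueError; it does not check whether the formula reproduces the tax column.

```python
import pandas as pd

df = pd.DataFrame({
    "id_n": ["1", "2", "3", "4", "5"],
    "sales1": [0, 115000, 440000, 500000, 740000],
    "sales2": [0, 115000, 460000, 520000, 760000],
    "tax": [0, 8050, 57500, 69500, 69500],
})


def tax_fun(total_sale, tax, min_threshold=500000, max_threshold=1020000,
            max_cap=69500, rate_1=0.035, rate_2=0.1):
    # Same branches as in the question, but every argument is a scalar here,
    # so plain `and` / `if` are unambiguous.
    if total_sale > 0 and tax == 0:
        return 0
    elif total_sale < min_threshold:
        return total_sale * rate_1
    elif total_sale <= max_threshold:
        return total_sale * rate_2
    else:
        return max_cap


# axis=1 passes one row at a time; the row is a Series indexed by column name.
df["new_tax"] = df.apply(
    lambda row: tax_fun(row["sales1"] + row["sales2"], row["tax"]),
    axis=1,
)
print(df[["id_n", "tax", "new_tax"]])
```

For large frames a vectorized alternative such as numpy.select over the whole columns would avoid the per-row Python call.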

Related

The truth value of a Series is ambiguous. I have tried using the available answers but nothing worked

I'm getting error
"ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()".
I have tried replacing and with '&' but still it didn't work.
roc_table = pd.DataFrame(columns = ['score','TP', 'FP','TN','FN'])
TP=0
FP=0
TN=0
FN=0
df_roc = pd.DataFrame([["score", "status"], [100, "accept"], [-80, "reject"]])
for score in range(-100,-80,5):
    for row in df_roc.iterrows():
        if (df_roc['score'] >= score) & (df_roc['status'] == 'reject'):
            TP=TP+1
        elif (df_roc['score'] >= score) & (df_roc['status'] == 'accept'):
            FP=FP+1
        elif (df_roc['score'] < score) & (df_roc['status'] == 'accept'):
            TN=TN+1
        elif (df_roc['score'] < score) & (df_roc['status'] == 'reject'):
            FN=FN+1
    dict = {'score':score, 'TP': TP, 'FP': FP, 'TN': TN,'FN':FN}
    roc_table = roc_table.append(dict, ignore_index = True)
sample of df_roc:
score  status
100    accept
-80    reject
df_roc['score'] >= score
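The error is telling you that the result of this comparison cannot be used as a single True/False value.
df_roc['score'] is a whole column containing many values, so the comparison returns a boolean Series with one result per row. Some values may be less than score, and others may be greater, and an if statement cannot decide between them.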
Since this code is inside a for row in df_roc.iterrows() loop, I think you intended to compare just the score from the current row, but that's not what you actually did.
I replaced "df_roc['score']" with "row['score']" in the code and it worked. Thanks
(row['score'] >= score) & (row['status'] == 'reject')
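A runnable sketch of the corrected loop, under a few assumptions of mine: df_roc is built with real column names, the counters restart for each threshold (the question's code lets them accumulate), and the result rows are collected in a list because DataFrame.append was removed in pandas 2.0. Note that iterrows() yields (index, row) pairs, so the pair has to be unpacked before row['score'] can be read.

```python
import pandas as pd

# Assumed layout: "score" and "status" are column names, not a data row.
df_roc = pd.DataFrame({"score": [100, -80], "status": ["accept", "reject"]})

rows = []
for score in range(-100, -80, 5):
    tp = fp = tn = fn = 0  # assumption: counts restart for every threshold
    for _, row in df_roc.iterrows():  # unpack the (index, row) pair
        # row["score"] and row["status"] are scalars, so `and` is fine here.
        if row["score"] >= score and row["status"] == "reject":
            tp += 1
        elif row["score"] >= score and row["status"] == "accept":
            fp += 1
        elif row["score"] < score and row["status"] == "accept":
            tn += 1
        elif row["score"] < score and row["status"] == "reject":
            fn += 1
    rows.append({"score": score, "TP": tp, "FP": fp, "TN": tn, "FN": fn})

roc_table = pd.DataFrame(rows, columns=["score", "TP", "FP", "TN", "FN"])
print(roc_table)
```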

Apply row wise conditional function on dataframe python

I have a dataframe on which I want to run a function that checks whether the current value is a relative maximum, and whether the previous 'n' values are lower than the current value.
Having a dataframe 'df_data':
temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63, 131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99, 138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1]
df_data = pd.DataFrame(temp_list, columns=['high'])
First I create a function that will check the previous conditions:
def get_max(high, rolling_max, prev,post):
    if ((high > prev) & (high>post) & (high>rolling_max)):
        return 1
    else:
        return 0
df_data['rolling_max'] = df_data.high.rolling(n).max().shift()
Then I apply previous condition row wise:
df_data['ismax'] = df_data.apply(lambda x: get_max(df_data['high'], df_data['rolling_max'],df_data['high'].shift(1),df_data['high'].shift(-1)),axis = 1)
The problem is that I always get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
It comes from applying the boolean condition in the 'get_max' function to a whole Series.
I would love to have a vectorized function that does not use loops.
Try:
df_data['ismax'] = ((df_data['high'].gt(df_data.high.rolling(n).max().shift())) & (df_data['high'].gt(df_data['high'].shift(1))) & (df_data['high'].gt(df_data['high'].shift(-1)))).astype(int)
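The one-liner above can be tried end to end with the question's data. This is a sketch under two assumptions: the column is named high, and n = 2, since the question never states the window length.

```python
import pandas as pd

temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63,
             131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99,
             138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1]
df_data = pd.DataFrame(temp_list, columns=["high"])
n = 2  # assumed window length; not given in the question

high = df_data["high"]
# Maximum of the previous n values (shift() excludes the current row).
rolling_max = high.rolling(n).max().shift()

# Each comparison returns a boolean Series; & combines them element-wise.
# Comparisons against NaN (first and last rows) evaluate to False.
is_max = high.gt(rolling_max) & high.gt(high.shift(1)) & high.gt(high.shift(-1))
df_data["ismax"] = is_max.astype(int)

print(df_data.index[df_data["ismax"] == 1].tolist())  # [3, 6, 13, 17]
```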
The error occurs because you are sending the entire Series (the whole column) to your get_max function rather than one row at a time. Creating new columns for the shifted "prev" and "post" values and then using df.apply(func, axis = 1) will work fine here.
As you have hinted, though, the row-wise solution is quite inefficient, and looping through every row becomes much slower as your dataframe grows.
On my computer, the code below reports:
LIST_MULTIPLIER = 1, Vectorised code: 0.29s, Row-wise code: 0.38s
LIST_MULTIPLIER = 100, Vectorised code: 0.31s, Row-wise code: 13.27s
In general, therefore, it is best to avoid df.apply(..., axis = 1), as you can almost always get a better solution using logical operators on whole columns.
import pandas as pd
from datetime import datetime
LIST_MULTIPLIER = 100
ITERATIONS = 100
def get_dataframe():
    temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63,
                 131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99,
                 138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1] * LIST_MULTIPLIER
    df = pd.DataFrame(temp_list)
    df.columns = ['high']
    return df
df_original = get_dataframe()
t1 = datetime.now()
for i in range(ITERATIONS):
    df = df_original.copy()
    df['rolling_max'] = df.high.rolling(2).max().shift()
    df['high_prev'] = df['high'].shift(1)
    df['high_post'] = df['high'].shift(-1)
    mask_prev = df['high'] > df['high_prev']
    mask_post = df['high'] > df['high_post']
    mask_rolling = df['high'] > df['rolling_max']
    mask_max = mask_prev & mask_post & mask_rolling
    df['ismax'] = 0
    df.loc[mask_max, 'ismax'] = 1
t2 = datetime.now()
print(f"{t2 - t1}")
df_first_method = df.copy()
t3 = datetime.now()
def get_max_rowwise(row):
    if ((row.high > row.high_prev) &
        (row.high > row.high_post) &
        (row.high > row.rolling_max)):
        return 1
    else:
        return 0
for i in range(ITERATIONS):
    df = df_original.copy()
    df['rolling_max'] = df.high.rolling(2).max().shift()
    df['high_prev'] = df['high'].shift(1)
    df['high_post'] = df['high'].shift(-1)
    df['ismax'] = df.apply(get_max_rowwise, axis = 1)
t4 = datetime.now()
print(f"{t4 - t3}")
df_second_method = df.copy()

Ambiguous Truth Value - For Loop & If Statements

Chasing the reasoning behind the following error message.
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This is the code I am running:
anz_analysis["Action"] = anz_analysis["Signal"]
for i in range(0, len(anz_analysis) + 1):
    if ((anz_analysis["Signal"].iloc[[i]] == "Buy") & (anz_analysis["Signal"].iloc[[i+1]] == "Buy")):
        anz_analysis["Action"] = anz_analysis["Signal"].iloc[[i]] = "Maintain"
    elif (anz_analysis["Signal"].iloc[[i]] = "Sell") & (anz_analysis["Signal"].iloc[[i+1]] = "Sell"):
        anz_analysis["Signal"].iloc[[i]] = "Maintain"
The dataframe looks like this:
Current: Wanted:
_______|_________
1|Buy | Buy
2|Buy | Maintain
3|Buy | Maintain
4|Sell | Sell
5|Sell | Maintain
6|Sell | Maintain
Any help would be appreciated.
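Using .iloc[[i]] (with a list inside the brackets) returns a Series, not a single value. Comparing it with a str therefore gives a boolean Series, and using that in an if statement raises the ValueError, even when the Series holds only one value.
One way to deal with this is to use .iloc[i] with a plain integer. That returns the string itself, so the comparison gives a single True or False.
A better way would be to select the row and column together with iloc. Because iloc only accepts integer positions, look up the column positions first. The scalar comparisons can then be combined with and, and the loop stops one row early so that i + 1 stays in range. E.g.
anz_analysis["Action"] = anz_analysis["Signal"]
signal_col = anz_analysis.columns.get_loc("Signal")
action_col = anz_analysis.columns.get_loc("Action")
for i in range(len(anz_analysis) - 1):
    if (anz_analysis.iloc[i, signal_col] == "Buy") and (anz_analysis.iloc[i + 1, signal_col] == "Buy"):
        anz_analysis.iloc[i + 1, action_col] = "Maintain"
    elif (anz_analysis.iloc[i, signal_col] == "Sell") and (anz_analysis.iloc[i + 1, signal_col] == "Sell"):
        anz_analysis.iloc[i + 1, action_col] = "Maintain"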
Edit
Updated the elif statement as per @6502's observation of the likely error in OP's code there. Also decided to remove the spurious start value of 0 in range.
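The same result can probably be reached without a loop by comparing each row with the one before it. This is only a sketch: the six-row frame is rebuilt from the table in the question, and it assumes that any repeated signal, not just Buy or Sell, should become "Maintain".

```python
import pandas as pd

# Rebuilt from the "Current" column shown in the question.
anz_analysis = pd.DataFrame(
    {"Signal": ["Buy", "Buy", "Buy", "Sell", "Sell", "Sell"]}
)

# True where a row repeats the previous row's signal.
# The first row is compared with NaN, which is never equal, so it stays False.
same_as_previous = anz_analysis["Signal"] == anz_analysis["Signal"].shift()

# where() keeps the original value where the condition holds
# and substitutes "Maintain" everywhere else.
anz_analysis["Action"] = anz_analysis["Signal"].where(~same_as_previous, "Maintain")

print(anz_analysis)
```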

SMOTE in python

I am trying to use SMOTE in Python and would like to know whether there is a way to manually specify the number of minority samples.
Suppose we have 100 records of one class and 10 records of another class. If we use ratio = 1 we get 100:100; if we use ratio 1/2, we get 100:200. But I am looking for a way to manually specify the number of instances to be generated for both classes.
Ndf_class_0_records = trainData[trainData['DIED'] == 0]
Ndf_class_1_records = trainData[trainData['DIED'] == 1]
Ndf_class_0_record_counts = Ndf_class_0_records.DIED.value_counts()
Ndf_class_1_record_counts = Ndf_class_1_records.DIED.value_counts()
X_smote = trainData.drop("DIED", axis=1)
y_smote = trainData["DIED"]
smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)
In the above code, I am trying to manually specify the number for each of the classes, but I am getting the following error at the last line of code
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If I understand you and the documentation here correctly, you are not passing numbers in the ratio dict. value_counts() returns a Series, so each value in the dict is a Series object rather than an integer.
The accepted types for ratio are:
float, str, dict or callable, (default=’auto’)
Please try doing:
Ndf_class_0_records = trainData[trainData['DIED'] == 0]
Ndf_class_1_records = trainData[trainData['DIED'] == 1]
Ndf_class_0_record_counts = len(Ndf_class_0_records) ##### CHANGED THIS
Ndf_class_1_record_counts = len(Ndf_class_1_records) ##### CHANGED THIS
X_smote = trainData.drop("DIED", axis=1)
y_smote = trainData["DIED"]
smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)
This should now work, please try!
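To see the difference without running SMOTE at all, the sketch below uses a small made-up trainData frame (the age column and the row counts are hypothetical) and compares what value_counts() and len() return. Only pandas is used; the resulting dict is the kind of plain-integer mapping the answer passes as ratio. As far as I know, newer imbalanced-learn releases call this parameter sampling_strategy and the method fit_resample.

```python
import pandas as pd

# Hypothetical stand-in for trainData: six rows of class 0, two of class 1.
trainData = pd.DataFrame({"DIED": [0] * 6 + [1] * 2, "age": range(8)})

class_0 = trainData[trainData["DIED"] == 0]
class_1 = trainData[trainData["DIED"] == 1]

# value_counts() returns a Series (one entry per distinct value) ...
counts_series = class_1.DIED.value_counts()
print(type(counts_series))

# ... whereas len() returns a plain int, which is what the dict needs.
n_class_0 = len(class_0)
n_class_1 = len(class_1)

# Target number of samples per class after resampling.
target_counts = {0: n_class_0, 1: n_class_1 * 2}
print(target_counts)
```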

Defining a variable between two points with a datetime pandas series

I have a pandas dataframe, and I want to calculate a variable based on certain hours of the day. I already pulled the hours as integers out of the datetime series. When I write my conditional statements between two hours and execute my script, I get the warning "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
When I put in any() or all() in my script, the script runs but it doesn't calculate the value between the two hours. I just get back a value that is not in the conditions. Can anyone help me out?
Here is my code so far
METdata = pd.read_csv('C:\Schoolwork\GEOL 701s_HW1\MET_station\MET_Data_3.26_hourly.csv', infer_datetime_format = True, na_values = '', header = [1], skiprows = [2, 3], index_col = [0])
hour = METdata.index.hour
NET_rad_Wm2 = np.array(METdata['NR_Wm2_Avg'])
Nr = NET_rad_Wm2 * 0.0036
g_day = Nr * 0.1
g_night = Nr * 0.5
def func(hour):
    if ((hour > 8) and (hour < 17)):
        return g_night
    else:
        return g_day
g = func(hour)
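The condition in func is evaluated on the whole hour index at once, which is why its truth value is ambiguous. If you want one result per row, the function has to be called once per element instead of on the whole index. hour is an Index here, which has no apply method, so wrap it in a Series first:
pd.Series(hour).apply(func)
Note that func as written returns the entire g_night or g_day array on every call. To get a single value per row it needs to return the element for that row, or the selection can be done in one vectorized step with np.where.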
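A vectorized sketch of the same selection, assuming made-up hourly data in place of the CSV (the dates and NR_Wm2_Avg values below are invented). It keeps the question's branches exactly as written, so hours strictly between 8 and 17 take g_night and all other hours take g_day.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data standing in for MET_Data_3.26_hourly.csv.
index = pd.date_range("2021-03-26 00:00", "2021-03-26 23:00", periods=24)
METdata = pd.DataFrame({"NR_Wm2_Avg": np.linspace(-50.0, 400.0, 24)}, index=index)

hour = METdata.index.hour
Nr = METdata["NR_Wm2_Avg"].to_numpy() * 0.0036
g_day = Nr * 0.1
g_night = Nr * 0.5

# Element-wise condition: & combines the two boolean arrays,
# where `and` would try to reduce each array to a single True/False.
is_between = (hour > 8) & (hour < 17)

# Same branches as the question's func: g_night inside the window, g_day outside.
g = np.where(is_between, g_night, g_day)
METdata["g"] = g
print(METdata.head(12))
```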
