How to apply a function with several dataframe columns as arguments? - python

I'm trying to compute a new column in a pandas dataframe, based on other columns and a function I created. Instead of using a for loop, I would prefer to apply the function to entire dataframe columns.
My code looks like this:
df['po'] = vect.func1(df['gra'],
                      Se,
                      df['p_a'],
                      df['t'],
                      Tc)
where df['gra'], df['p_a'], and df['t'] are my dataframe columns (parameters), and Se and Tc are other (scalar) parameters. df['po'] is my new column.
func1 is a function defined in my vect package.
The function is:
def func1(g, surf_e, Pa, t, Tco):
    if t <= Tco:
        pos = g - (Pa * surf_e * g)
    else:
        pos = 0.0
    return pos
When implemented this way, I obtain an error message concerning the line if (t <= Tco):
The error is :
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I read the pandas documentation, but didn't find the solution. Can anybody explain to me what the problem is?
I tried to use apply, for example:
df['po'] = df['gra'].apply(vect.func1)
but I don't know how to use apply with multiple columns as parameters.
Thank you in advance.

Use np.where with the required condition, the value to use when the condition is True, and the default value:
df['po'] = np.where(
    df['t'] <= Tc,                             # condition
    df['gra'] - (df['p_a'] * Se * df['gra']),  # value if True
    0                                          # value if False
)
EDIT:
Don't forget to import numpy as np.
Also, you get the error because the comparison involves a whole Series, which yields a Series of boolean values rather than the single boolean value that an if condition needs.
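As a quick, self-contained check of this approach (the column names come from the question; the data and the Se and Tc values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data; Se and Tc are arbitrary scalars chosen for illustration only
df = pd.DataFrame({'gra': [1.0, 2.0, 3.0],
                   'p_a': [0.5, 0.2, 0.3],
                   't':   [5.0, 15.0, 25.0]})
Se, Tc = 1.0, 10.0

# Vectorized equivalent of func1: the condition is evaluated elementwise,
# so no Python-level `if` on a whole Series is needed
df['po'] = np.where(df['t'] <= Tc,
                    df['gra'] - (df['p_a'] * Se * df['gra']),
                    0.0)
print(df['po'].tolist())  # [0.5, 0.0, 0.0]
```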

Related

The truth value of a Series is ambiguous when I use ifelse in Python

I am using conditional multiplication within a data frame, with the following syntax:
if df_merged1["region_id"] == "EMEA":
    df_merged1["fcst_gr"] = df_merged1["plan_price_amount"]*(df_merged1["Enr"]-df_merged1["FM_f"])+df_merged1["OA_f"]-df_merged1["TX_f"]
else:
    df_merged1["fcst_gr"] = df_merged1["plan_price_amount"]*(df_merged1["Enr"]-df_merged1["FM_f"])+df_merged1["OA_f"]
I want tax to be subtracted only when the region is EMEA, but I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I think there is some problem in how the if condition is evaluated, but I have no idea how to resolve it.
There is no problem here - df_merged1["region_id"] == "EMEA" returns a pd.Series instance populated with boolean values, not a single boolean that can be used in a conditional statement. Pandas refuses to automatically pick a method that would convert a pd.Series instance to a boolean, like pd.Series.any() or pd.Series.all(), hence the error.
To achieve what you meant to do for reasonably sized dataframes, use pd.DataFrame.apply with axis=1, a lambda expression, and a ternary operator. That way you populate the ["fcst_gr"] column based on the value in ["region_id"] for each individual row:
df_merged1["fcst_gr"] = df_merged1.apply(
    lambda row: row["plan_price_amount"] * (row["Enr"] - row["FM_f"])
    + row["OA_f"]
    - row["TX_f"]
    if row["region_id"] == "EMEA"
    else row["plan_price_amount"] * (row["Enr"] - row["FM_f"]) + row["OA_f"],
    axis=1,
)
For bigger dataframes or more complex scenarios, consider more efficient solutions.
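For the vectorized route, here is a sketch of the same logic with np.where (the two rows below are invented purely to make the snippet runnable):

```python
import numpy as np
import pandas as pd

# Two hypothetical rows mirroring the question's columns
df_merged1 = pd.DataFrame({
    'region_id': ['EMEA', 'APAC'],
    'plan_price_amount': [10.0, 10.0],
    'Enr': [5.0, 5.0],
    'FM_f': [1.0, 1.0],
    'OA_f': [2.0, 2.0],
    'TX_f': [3.0, 3.0],
})

base = (df_merged1['plan_price_amount']
        * (df_merged1['Enr'] - df_merged1['FM_f'])
        + df_merged1['OA_f'])
# Subtract tax only where the region is EMEA
df_merged1['fcst_gr'] = np.where(df_merged1['region_id'] == 'EMEA',
                                 base - df_merged1['TX_f'],
                                 base)
print(df_merged1['fcst_gr'].tolist())  # [39.0, 42.0]
```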

Error: Series is ambiguous | Function with multiple arguments | DataFrame

I want a simple function that categorizes numeric values from an existing column into a new column. For some reason, when doing it with a function that has multiple arguments, "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." is raised...
DataFrame:
l1 = [1, 2, 3, 4]
df_ = pd.DataFrame(l1, columns=["Nums"])
Code that generates the error:
n1 = 2
n2 = 4
def func(x, y, z):
    if (x >= y) & (x <= z):
        return 'good'
    else:
        return 'bad'
df_['Nums_Cat'] = func(df_.Nums, n1, n2)
Please note that I'm trying to do this with a function approach, as it will be applied to multiple columns with many different conditions passed.
In this case I'm trying to convert the numeric values that fall under the condition into the string "good" and those that don't (else) into the string "bad", so the output should be 'bad, good, good, good' in a new column called Nums_Cat.
You're nearly there. However, a plain Python if doesn't work the way you want on a Series. To do what you want, you need to map each boolean in the result to either "good" or "bad".
def func(x, y, z):
    values = (y <= x) & (x <= z)  # boolean Series, one flag per row
    return values.map(lambda item: "good" if item else "bad")
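Putting it together with the question's sample data (the expected output comes from the question itself):

```python
import pandas as pd

l1 = [1, 2, 3, 4]
df_ = pd.DataFrame(l1, columns=['Nums'])
n1, n2 = 2, 4

def func(x, y, z):
    values = (y <= x) & (x <= z)  # boolean Series, one flag per row
    return values.map(lambda item: "good" if item else "bad")

df_['Nums_Cat'] = func(df_.Nums, n1, n2)
print(df_['Nums_Cat'].tolist())  # ['bad', 'good', 'good', 'good']
```

If the function wrapper isn't needed, an equivalent one-liner is np.where((n1 <= df_.Nums) & (df_.Nums <= n2), 'good', 'bad').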

How to create a dataframe on two conditions in a lambda function using apply after groupby()?

I am trying to create portfolios in dataframes depending on the variable 'scope': the rows with the highest 33% of the scope values go into the first portfolio, the middle 34% into the second, and the bottom 33% into the third, for each time period and industry.
So far, I grouped the data on date and industry
group_first = data_clean.groupby(['date','industry'])
and used a lambda function afterwards to get the rows of the first tercile of 'scope' for every date and industry; for instance:
port = group_first.apply(lambda x: x[x['scope'] <= x.scope.quantile(0.33)]).reset_index(drop=True)
This works for the first and third tercile, however not for the middle one, because I get
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
when putting two conditions in the lambda function, like this:
group_middle = data_clean.groupby(['date','industry'])
port_middle = group_middle.apply(lambda x: (x[x['scope'] > x.scope.quantile(0.67)]) and (x[x['scope'] < x.scope.quantile(0.33)])).reset_index(drop=True)
In other words, how can I get the rows of a dataframe containing the values in 'scope' between the 33rd and 67th percentile after grouping for date and industry?
Any idea how to solve this?
I will guess - I don't have data to test it.
You have < and > swapped: you check scope > the 67th percentile and scope < the 33rd, which selects 67...100 and 0...33 (and combining them may even give empty data), but you need scope > the 33rd percentile and scope < the 67th to get the 33...67 band.
You should also use a single selection x[(scope > ...) & (scope < ...)] instead of combining x[scope > ...] and x[scope < ...] with and:
port_middle = group_middle.apply(lambda x:
    x[
        (x['scope'] > x.scope.quantile(0.33)) & (x['scope'] < x.scope.quantile(0.67))
    ]
).reset_index(drop=True)
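The same middle-tercile filter can also be written with Series.between (pandas >= 1.3 for inclusive='neither'); a sketch with made-up data, since the original dataframe isn't shown:

```python
import pandas as pd

# Toy stand-in for data_clean: one date/industry group, scope values 0..9
data_clean = pd.DataFrame({
    'date': ['2020-01'] * 10,
    'industry': ['tech'] * 10,
    'scope': range(10),
})

group_middle = data_clean.groupby(['date', 'industry'])
# Keep rows strictly between the 33rd and 67th percentile of each group
port_middle = group_middle.apply(
    lambda x: x[x['scope'].between(x['scope'].quantile(0.33),
                                   x['scope'].quantile(0.67),
                                   inclusive='neither')]
).reset_index(drop=True)
print(sorted(port_middle['scope'].tolist()))  # [3, 4, 5, 6]
```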

How to loop through a pandas array?

I read several questions about for loops over a pandas DataFrame, but couldn't work it out for my case:
px = pd.read_sql()
for i, row in px.iterrows():
    if x == 1:
        if px['first'] <= S1:
            S1 = px['first']
        if px['second'] > S2:
            previous_value = last_value
            last_value = px['second']
            x = 0
    else:
        .....
This is, of course, only part of the code, to show the looping logic. I expected the rows to be read one by one, so that I could compare the values of each row with the previous row, but I get:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You're accessing the entire column px['first'] from inside a loop that's intended to access only one entry at a time.
To fix your current loop, it might be enough to change px['first'] to row['first'], and likewise px['second'] to row['second'].
Better would be to replace the manual looping with equivalent pandas expressions, which will be much faster and more readable. If you post the full code (edit it into the question, not as comments!), we might be able to help.
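To make the scalar-access fix concrete, here is a minimal, runnable version of the loop; the data, starting values, and surrounding logic are invented here, since the original code is only a fragment:

```python
import pandas as pd

# Stand-in for the result of pd.read_sql(); column names from the question
px = pd.DataFrame({'first': [3.0, 1.0, 2.0],
                   'second': [5.0, 7.0, 6.0]})

S1 = float('inf')  # hypothetical starting values, for illustration only
S2 = 4.0
previous_value = last_value = None

for i, row in px.iterrows():
    # row['first'] is a single scalar, so a plain `if` is unambiguous;
    # px['first'] would be the whole column (a Series), hence the ValueError
    if row['first'] <= S1:
        S1 = row['first']
    if row['second'] > S2:
        previous_value = last_value
        last_value = row['second']

print(S1, previous_value, last_value)  # 1.0 7.0 6.0
```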

Column in pandas conditional on average of other columns

I am trying to make a set of synthetic data. I am using Python 3.5.2 . I start by defining it as so:
#Make synthetic data (great again?)
#synth X
data = pd.DataFrame(np.random.randn(100, 5), columns=['x1', 'x2', 'x3', 'x4', 'x5'])
def div10(x):
    if x < 0:
        return -x/5
    else:
        return x/10
data = data.applymap(div10)
From here I want to define a new column that is the string 'dead' if the hyperbolic tangent of the mean of the Xs in the row is greater than .15, and 'alive' otherwise:
data['Y']=data.apply( lambda x:'dead' if np.tanh(data.mean(axis=1))>.15 else 'alive',axis=1)
I am told ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index 0') When I check np.tanh(data.mean(axis=1))>.15 I get a list of bools.
I have also tried map, but get AttributeError: 'DataFrame' object has no attribute 'map'
Learn to use where statements for better and faster code:
np.where(np.tanh(np.mean(np.where(data<0, -data / 5, data / 10), axis=1)) > .15, 'dead', 'alive')
Let's break this into chunks. where statements can operate on multidimensional data such as the dataframe you have. They take a conditional and return the first argument after the comma when True and the second when False:
step1 = np.where(data<0, -data / 5, data / 10)
There is no need to use apply, as numpy has a vectorized mean function that you can take along rows (axis=1):
step2 = np.mean(step1, axis=1)
Now you have one-dimensional data; take the hyperbolic tangent:
step3 = np.tanh(step2)
Finally, use another where statement to get the dead-or-alive condition:
np.where(step3 > .15, 'dead', 'alive')
You need to make sure that you are using the x specified in your lambda function. Also, since x is a Series (one row of the data frame), axis=1 won't work inside it; use x.mean() instead:
data['Y']=data.apply(lambda x:'dead' if np.tanh(x.mean())>.15 else 'alive',axis=1)
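The two answers agree; as a self-contained check, both paths are run on the same sample below (the transform step is vectorized with np.where, as in the first answer, to keep the sketch short):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = rng.standard_normal((100, 5))

# div10 from the question, vectorized: negatives -> -x/5, others -> x/10
data = pd.DataFrame(np.where(raw < 0, -raw / 5, raw / 10),
                    columns=['x1', 'x2', 'x3', 'x4', 'x5'])

# Row-wise apply, using the lambda's own argument x (second answer)
apply_y = data.apply(lambda x: 'dead' if np.tanh(x.mean()) > .15 else 'alive',
                     axis=1)

# Fully vectorized pipeline (first answer)
where_y = np.where(np.tanh(data.mean(axis=1)) > .15, 'dead', 'alive')

print((apply_y.to_numpy() == where_y).all())  # True
```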
