I am trying to make a set of synthetic data. I am using Python 3.5.2. I start by defining it like so:
import numpy as np
import pandas as pd
#Make Synthetic data (great again?)
#synth X
data=pd.DataFrame(np.random.randn(100,5), columns=['x1','x2','x3','x4','x5'])
def div10(x):
    if x<0:
        return -x/5
    else:
        return x/10
data=data.applymap(div10)
From here I want to define a new column that is the string 'dead' if the hyperbolic tangent of the mean of the Xs in the row is greater than .15 and 'alive' otherwise:
data['Y']=data.apply( lambda x:'dead' if np.tanh(data.mean(axis=1))>.15 else 'alive',axis=1)
I am told ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index 0') When I check np.tanh(data.mean(axis=1))>.15 I get a list of bools.
I have also tried map, but got AttributeError: 'DataFrame' object has no attribute 'map'
Learn to use np.where for better and faster code.
np.where(np.tanh(np.mean(np.where(data<0, -data / 5, data / 10), axis=1)) > .15, 'dead', 'alive')
Let's break this into chunks. np.where can operate on multidimensional data such as your dataframe. It takes a condition and returns the value from the second argument where the condition is true and from the third argument where it is false.
step1 = np.where(data<0, -data / 5, data / 10)
There is no need to use apply as numpy has a vectorized mean function that you can apply by row (axis=1)
step2 = np.mean(step1, axis=1)
Now you have one-dimensional data. Take the hyperbolic tangent:
step3 = np.tanh(step2)
Finally, use another where statement to get the dead-or-alive label:
np.where(step3 > .15, 'dead', 'alive')
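To see the four steps run end to end, here is a minimal sketch on a tiny frame. The three rows of values are made up for illustration (they are not from the question) and were picked so that both labels show up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen so both outcomes appear.
data = pd.DataFrame(
    {"x1": [2.0, -0.5, 4.0], "x2": [3.0, 0.2, -1.0], "x3": [1.0, -0.1, 5.0]}
)

# Step 1: element-wise rescaling (negative values become -x/5, others x/10).
step1 = np.where(data < 0, -data / 5, data / 10)
# Step 2: mean of each row.
step2 = np.mean(step1, axis=1)
# Step 3: hyperbolic tangent of each row mean.
step3 = np.tanh(step2)
# Step 4: label each row and store the result as a new column.
data["Y"] = np.where(step3 > 0.15, "dead", "alive")

print(data["Y"].tolist())
```

Because step1 is computed before the Y column is added, the string column never takes part in the mean.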
You need to make sure that you are using the x passed to your lambda function rather than data. Also, since x is a Series (one row of the data frame), mean(axis=1) won't work on it; call x.mean() instead.
data['Y']=data.apply(lambda x:'dead' if np.tanh(x.mean())>.15 else 'alive',axis=1)
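A small sketch of why the original line fails and this one works, using made-up values: comparing the whole frame's row means gives a Series of booleans, which `if` cannot reduce to one answer, whereas inside apply(axis=1) each x is a single row and x.mean() is a single number.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only.
data = pd.DataFrame({"x1": [0.2, 0.01], "x2": [0.3, 0.02]})

# A comparison on all row means yields a Series of booleans, not one boolean.
mask = np.tanh(data.mean(axis=1)) > 0.15

try:
    label = "dead" if mask else "alive"  # `if` needs bool(mask), which pandas refuses
    raised = False
except ValueError:
    raised = True

# Inside apply(axis=1), x is one row, so x.mean() is a scalar.
labels = data.apply(
    lambda x: "dead" if np.tanh(x.mean()) > 0.15 else "alive", axis=1
)
print(raised, labels.tolist())
```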
Related
I am doing a conditional multiplication within a data frame, using the following syntax:
if(df_merged1["region_id"]=="EMEA"):
    df_merged1["fcst_gr"] = df_merged1["plan_price_amount"]*(df_merged1["Enr"]-df_merged1["FM_f"])+df_merged1["OA_f"]-df_merged1["TX_f"]
else:
    df_merged1["fcst_gr"] = df_merged1["plan_price_amount"]*(df_merged1["Enr"]-df_merged1["FM_f"])+df_merged1["OA_f"]
I want tax to be subtracted only when the region is EMEA, but I am getting the following error:
ValueError: The truth value of a {type(self).__name__} is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I think there is a problem with how I wrote the if condition, but I have no idea how to resolve it.
There is no problem here - df_merged1["region_id"] == "EMEA" returns a pd.Series instance populated with boolean values, not a boolean that can be handled using conditional statements. Pandas is reluctant to automatically run a method that would convert a pd.Series instance to a boolean like pd.Series.any() or pd.Series.all(), hence the error.
To achieve what you have meant to do for reasonably sized dataframes, use pd.DataFrame.apply, axis=1 with a lambda expression and a ternary operator. That way you populate a column ["fcst_gr"] based on value in column ["region_id"] for each individual row:
df_merged1["fcst_gr"] = df_merged1.apply(
    lambda row: row["plan_price_amount"] * (row["Enr"] - row["FM_f"])
    + row["OA_f"]
    - row["TX_f"]
    if row["region_id"] == "EMEA"
    else row["plan_price_amount"] * (row["Enr"] - row["FM_f"]) + row["OA_f"],
    axis=1,
)
For bigger dataframes or more complex scenarios, consider more efficient solutions.
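One such option, sketched here under assumptions, is to compute the shared part of the formula for all rows at once and subtract the tax only where the region matches, using Series.where. The column names come from the question; the two sample rows are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
df_merged1 = pd.DataFrame({
    "region_id": ["EMEA", "APAC"],
    "plan_price_amount": [10.0, 10.0],
    "Enr": [5.0, 5.0],
    "FM_f": [1.0, 1.0],
    "OA_f": [2.0, 2.0],
    "TX_f": [3.0, 3.0],
})

# Part of the formula shared by every region.
base = (
    df_merged1["plan_price_amount"] * (df_merged1["Enr"] - df_merged1["FM_f"])
    + df_merged1["OA_f"]
)
# Tax to subtract: TX_f on EMEA rows, 0 on all other rows.
tax = df_merged1["TX_f"].where(df_merged1["region_id"] == "EMEA", 0)

df_merged1["fcst_gr"] = base - tax
print(df_merged1["fcst_gr"].tolist())
```

This avoids calling a Python lambda once per row, which is usually what makes apply with axis=1 slow on large frames.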
I'm trying to compute a new column in a pandas dataframe, based on other columns and a function I created. Instead of using a for loop, I prefer to apply the function to entire dataframe columns.
My code is like this:
df['po'] = vect.func1(df['gra'],
Se,
df['p_a'],
df['t'],
Tc)
where df['gra'], df['p_a'], and df['t'] are my dataframe columns (parameters), and Se and Tc are other (real) parameters. df['po'] is my new column.
func1 is a function described in my vect package.
This function is :
def func1(g, surf_e, Pa, t, Tco):
    if (t <= Tco):
        pos = (g-(Pa*surf_e*g))
    else:
        pos = 0.0
    return(pos)
When implemented this way, I obtain an error message, which concerns the line: if (t <= Tco):
The error is :
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I read the pandas documentation but didn't find the solution. Can anybody explain what the problem is?
I tried to use apply, for example:
df['po'] = df['gra'].apply(vect.func1)
but I don't know how to use apply with multiple columns as parameters.
Thank you in advance.
Use np.where with the required condition, value when the condition is True and the default value.
df['po'] = np.where(
    df['t'] <= Tc,                             # Condition
    df['gra'] - (df['p_a'] * Se * df['gra']),  # Value if True
    0                                          # Value if False
)
EDIT:
Don't forget to import numpy as np.
Also, you get the error because comparing a Series to a value produces a Series of boolean values, not the single boolean value that an if condition needs.
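A runnable sketch of the np.where approach on made-up numbers, followed by a row-wise alternative that answers the "apply with multiple columns" part of the question. The sample values for the columns, Se and Tc are assumptions for illustration, and func1 is condensed to one expression with the same logic as the original.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; Se and Tc stand in for the question's scalars.
df = pd.DataFrame({"gra": [100.0, 200.0], "p_a": [0.5, 0.5], "t": [10.0, 30.0]})
Se, Tc = 0.1, 20.0

# Vectorized: the condition and both branches are evaluated for whole columns.
df["po"] = np.where(df["t"] <= Tc, df["gra"] - df["p_a"] * Se * df["gra"], 0.0)


def func1(g, surf_e, Pa, t, Tco):
    # Scalar version: every argument is a single number here.
    return g - (Pa * surf_e * g) if t <= Tco else 0.0


# Row-wise alternative: apply over rows and pick the columns out of each row.
df["po_apply"] = df.apply(
    lambda row: func1(row["gra"], Se, row["p_a"], row["t"], Tc), axis=1
)
print(df)
```

Both columns hold the same values; the row-wise version is generally slower on large frames because the function runs once per row.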
I want to have a simple function that would categorize numeric values from existing column into a new column. For some reason when doing it with a function that has multiple arguments "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." is generated...
DataFrame:
l1=[1,2,3,4]
df_=pd.DataFrame(l1, columns=["Nums"])
Code that generates the error:
n1=2
n2=4
def func(x,y,z):
    if (x>=y) & (x<=z):
        return('good')
    else:
        return('bad')
df_['Nums_Cat']=func(df_.Nums, n1, n2)
Please note, that I'm trying to do this with a function approach as it will be applied to multiple columns with many different conditions passed.
In this case I'm trying to convert the numeric values that meet this condition into the string "good" and those that don't (else) into the string "bad". So the output should be 'bad, good, good, good' in a new column called Nums_Cat.
You're nearly there. However, an if statement needs a single boolean, and the comparison on a whole column produces a Series of booleans. To do what you want, you need to map each value from the result into either "good" or "bad".
def func(x, y, z):
    values = (y <= x) & (x <= z)
    return values.map(lambda item: "good" if item else "bad")
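Applied to the frame from the question, this can be checked as follows. The second form, using Series.between with np.where, is an alternative I am adding for comparison and is not part of the answer above.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame([1, 2, 3, 4], columns=["Nums"])
n1, n2 = 2, 4


def func(x, y, z):
    # Element-wise test y <= x <= z; the result is a boolean Series.
    values = (y <= x) & (x <= z)
    return values.map(lambda item: "good" if item else "bad")


df_["Nums_Cat"] = func(df_.Nums, n1, n2)

# Alternative: Series.between is inclusive on both ends by default.
alt = np.where(df_.Nums.between(n1, n2), "good", "bad")
print(df_["Nums_Cat"].tolist())
```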
I am trying to create portfolios in dataframes depending on the variable 'scope': the rows with the highest 33% of scope values go in the first portfolio dataframe, the middle 34% in the second, and the bottom 33% in the third, for each time period and industry.
So far, I grouped the data on date and industry
group_first = data_clean.groupby(['date','industry'])
and used a lambda function afterwards to get the rows of the first tercile of 'scope' for every date and industry; for instance:
port = group_first.apply(lambda x: x[x['scope'] <= x.scope.quantile(0.33)]).reset_index(drop=True)
This works for the first and third terciles but not for the middle one, because I get
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
when putting two conditions in the lambda function, like this:
group_middle = data_clean.groupby(['date','industry'])
port_middle = group_middle.apply(lambda x: (x[x['scope'] > x.scope.quantile(0.67)]) and (x[x['scope'] < x.scope.quantile(0.33)])).reset_index(drop=True)
In other words, how can I get the rows of a dataframe containing the values in 'scope' between the 33rd and 67th percentile after grouping for date and industry?
Any idea how to solve this?
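This is a guess, since I don't have data to test it.
Your < and > are the wrong way round: you check scope<33 and scope>67, which selects 0...33 and 67...100 (and combining them may give empty data), but you need scope>33 and scope<67 to get 33...67.
You should also combine the conditions in a single mask, x[ (scope>33) & (scope<67) ], instead of x[scope>33] and x[scope<67]. Python's and tries to convert each DataFrame to a single boolean, which raises this ValueError, and the parentheses are needed because & binds more tightly than the comparisons.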
port_middle = group_middle.apply(lambda x:
    x[
        (x['scope'] > x.scope.quantile(0.33)) & (x['scope'] < x.scope.quantile(0.67))
    ]
).reset_index(drop=True)
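Since the answer above is untested, here is a small sketch that can be run on made-up data. It swaps in a different technique from the apply call: the per-group quantiles are computed with groupby transform and then used in one boolean mask on the original frame. The data_clean values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one date, two industries, scope values 1..5 in each.
data_clean = pd.DataFrame({
    "date": ["2020-01"] * 10,
    "industry": ["A"] * 5 + ["B"] * 5,
    "scope": [1, 2, 3, 4, 5] * 2,
})

grouped = data_clean.groupby(["date", "industry"])["scope"]
# Per-group quantiles, broadcast back so they line up with the original rows.
q33 = grouped.transform(lambda s: s.quantile(0.33))
q67 = grouped.transform(lambda s: s.quantile(0.67))

# Parentheses are required because & binds tighter than > and <.
port_middle = data_clean[
    (data_clean["scope"] > q33) & (data_clean["scope"] < q67)
].reset_index(drop=True)
print(port_middle)
```

With five evenly spaced values per group, only the middle value of each group falls strictly between the two quantiles.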
I am researching/backtesting a trading system.
I have a Pandas dataframe containing OHLC data and have added several calculated columns which identify price patterns that I will use as signals to initiate positions.
I would now like to add a further column that will keep track of the current net position. I have tried using df.apply(), but passing the dataframe itself as the argument instead of the row object, as with the latter I seem to be unable to look back at previous rows to determine whether they resulted in any price patterns:
from collections import namedtuple
open_campaigns = []
Campaign = namedtuple('Campaign', 'open position stop')
def calc_position(df):
    # sum of current positions + any new positions
    if entered_long(df):
        open_campaigns.append(
            Campaign(
                calc_long_open(df.High.shift(1)),
                calc_position_size(df),
                calc_long_isl(df)
            )
        )
    return sum(campaign.position for campaign in open_campaigns)
def entered_long(df):
    return buy_pattern(df) & (df.High > df.High.shift(1))
df["Position"] = df.apply(lambda row: calc_position(df), axis=1)
However, this returns the following error:
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', u'occurred at index 1997-07-16 08:00:00')
Rolling window functions would seem to be the natural fit, but as I understand it, they only act on a single time series or column, so wouldn't work either as I need to access the values of multiple columns at multiple timepoints.
How should I in fact be doing this?
This problem has its roots in NumPy.
def entered_long(df):
    return buy_pattern(df) & (df.High > df.High.shift(1))
entered_long is returning an array-like object. NumPy refuses to guess if an array is True or False:
In [48]: x = np.array([ True, True, True], dtype=bool)
In [49]: bool(x)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
To fix this, use any or all to specify what you mean for an array to be True:
def calc_position(df):
    # sum of current positions + any new positions
    if entered_long(df).any():  # or .all()
        ...  # rest of the function unchanged
The any() method will return True if any of the items in entered_long(df) are True.
The all() method will return True if all the items in entered_long(df) are True.
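A minimal sketch of the difference, using a made-up boolean array in place of the result of entered_long(df):

```python
import numpy as np

# Hypothetical per-row signal values standing in for entered_long(df).
signals = np.array([True, False, True])

try:
    bool(signals)  # ambiguous: the array has more than one element
    raised = False
except ValueError:
    raised = True

has_any = signals.any()  # True: at least one element is True
has_all = signals.all()  # False: not every element is True
print(raised, has_any, has_all)
```

Which of the two is right depends on what the condition is meant to express; neither is a drop-in replacement for a per-row check.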