I have a dataframe containing some input values, and I am trying to evaluate a parameter conditioned on the values given in a column (example below).
What I would like to obtain is shown in the figure:
How can I solve the issue below?
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({
    'x': [0,1,2,3,4],
    'y': [100,100,100,100,100],
    'z': [100,100,100,100,100],
})
def evaluate(input):
    if input <=2:
        a=4
        b=6
    else:
        a=7
        b=8
    return df['x']*a+b*(df['y']+df['z'])
df['calc'] = evaluate(df['x'])
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
> ~\AppData\Local\Temp/ipykernel_38760/3329748611.py in <module>
> 15 b=8
> 16 return df['x']*a+b*(df['y']+df['z'])
> 17 df['calc'] = evaluate(df['x'])
>
> ~\AppData\Local\Temp/ipykernel_38760/3329748611.py in evaluate(input)
> 8 })
> 9 def evaluate(input):
> 10 if input <=2:
> 11 a=4
> 12 b=6
>
> ~\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
> 1535 #final
> 1536 def __nonzero__(self):
> 1537 raise ValueError(
> 1538 f"The truth value of a {type(self).__name__} is ambiguous. "
> 1539 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
>
> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The problem is that your function does not vectorize over the column df['x']. Specifically, the statement if input <=2: compares the whole column with 2, which produces a Series of booleans, and an if statement cannot reduce that to a single True or False. One way around this is a loop or df.apply(). Since df.apply() is trickier when multiple columns are involved, the easiest solution is probably to write a function with a simple for loop using df.itertuples().
def evaluate(df):
    calc_column = []
    for row in df.itertuples():
        if row.x <= 2:
            a = 4
            b = 6
        else:
            a = 7
            b = 8
        calc_column.append(row.x*a + b*(row.y + row.z))
    return calc_column

df['calc'] = evaluate(df)
df
   x    y    z  calc
0  0  100  100  1200
1  1  100  100  1204
2  2  100  100  1208
3  3  100  100  1621
4  4  100  100  1628
Here each row element can be accessed from the df.itertuples() method, allowing you to easily grab the values you need from each row.
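For comparison, the df.apply() route mentioned above might look like the sketch below. It assumes the same sample frame as the question; the helper name evaluate_row is made up for this example. With axis=1 the function receives one row at a time, so row["x"] is a single number and the if statement is unambiguous.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [0, 1, 2, 3, 4],
    "y": [100] * 5,
    "z": [100] * 5,
})

def evaluate_row(row):
    # hypothetical helper: row is a Series holding one row of df
    if row["x"] <= 2:
        a, b = 4, 6
    else:
        a, b = 7, 8
    return row["x"] * a + b * (row["y"] + row["z"])

# axis=1 calls the function once per row
df["calc"] = df.apply(evaluate_row, axis=1)
print(df)
```

Like the itertuples() loop, this still runs Python code once per row, so it is a readability choice rather than a speed-up.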
Use your conditions to create a mask. Use the mask to filter the rows you want to operate on. Return a Series.
def evaluate(df):
    S = pd.Series(0,index=df.index)
    mask = df['x'] <= 2
    a,b = 4,6
    S[mask] = df.loc[mask,'x']*a + b*(df.loc[mask,['y','z']].sum(1))
    mask = df['x'] > 2
    a,b = 7,8
    S[mask] = df.loc[mask,'x']*a + b*(df.loc[mask,['y','z']].sum(1))
    return S

df['calc'] = evaluate(df)
Or just...
def evaluate(df):
    S = pd.Series(0,index=df.index)
    mask = df['x'] <= 2
    a,b = 4,6
    S[mask] = df.loc[mask,'x']*a + b*(df.loc[mask,['y','z']].sum(1))
    # mask = df['x'] > 2
    a,b = 7,8
    S[~mask] = df.loc[~mask,'x']*a + b*(df.loc[~mask,['y','z']].sum(1))
    return S
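Since only the coefficients a and b depend on the condition, the two masked assignments could probably be collapsed with numpy.where. This is a sketch of that alternative, not the answerer's method; the variable name small is an assumption made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [0, 1, 2, 3, 4],
    "y": [100] * 5,
    "z": [100] * 5,
})

small = df["x"] <= 2           # boolean mask, one entry per row
a = np.where(small, 4, 7)      # per-row coefficient a
b = np.where(small, 6, 8)      # per-row coefficient b

# the formula from the question, evaluated for all rows at once
df["calc"] = df["x"] * a + (df["y"] + df["z"]) * b
print(df)
```

The formula is written once, and the condition only selects which coefficients each row receives.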
Related
I would like to create a brand new data frame by replacing values of a DF using a custom function.
I keep getting the following error "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
I tried some suggestions (Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()) but it didn't work.
I would appreciate it if somebody could shed light on this issue and help me out.
Thank you in advance for your time.
def convert2integer(x):
    if x <= -0.5:
        return -1
    elif (x > -0.5 & x <= 0.5):
        return 0
    elif x > 0.5:
        return 1

df = pd.DataFrame({'A':[1,0,-0.6,-1,0.7],
                   'B':[-1,1,-0.3,0.5,1]})
df.apply(convert2integer)
A few options:
The slower option but the most similar via applymap:
def convert2integer(x):
    if x <= -0.5:
        return -1
    elif x <= 0.5:
        return 0
    else:
        return 1

df = pd.DataFrame({'A': [1, 0, -0.6, -1, 0.7],
                   'B': [-1, 1, -0.3, 0.5, 1]})
new_df = df.applymap(convert2integer)
new_df:
   A  B
0  1 -1
1  0  1
2 -1  0
3 -1  0
4  1  1
applymap applies the function to each cell in the DataFrame, so x is a single float and should be treated as such. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which works the same way.)
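To see why the original df.apply(convert2integer) fails while the cell-wise version works, it may help to look at what each method passes to the function. The sketch below only records type names; the variables seen_by_apply, seen_by_map and elementwise are made up for illustration, and the getattr fallback is an assumption so that the snippet runs on both older and newer pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, -0.6, -1, 0.7],
                   "B": [-1, 1, -0.3, 0.5, 1]})

# DataFrame.apply hands each column to the function as a whole Series,
# so a comparison inside an `if` would be ambiguous.
seen_by_apply = df.apply(lambda col: type(col).__name__)

# Cell-wise mapping: DataFrame.map where it exists, applymap otherwise.
elementwise = getattr(df, "map", None) or df.applymap
seen_by_map = elementwise(lambda cell: type(cell).__name__)

print(seen_by_apply)
print(seen_by_map)
```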
The faster option via np.select:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 0, -0.6, -1, 0.7],
                   'B': [-1, 1, -0.3, 0.5, 1]})
new_df = pd.DataFrame(
    np.select([df.le(-0.5), df.le(0.5)],
              [-1, 0],
              default=1),
    index=df.index,
    columns=df.columns
)
new_df:
   A  B
0  1 -1
1  0  1
2 -1  0
3 -1  0
4  1  1
np.select takes a list of conditions, and a list of choices. When the condition is True it uses the values from the corresponding index in the choice list.
The last condition does not need to be checked: if a value did not match the first two conditions, it must be greater than 0.5. Likewise, the second condition does not need to also check that the value is greater than -0.5, because if it were not, the first condition would already have been met.
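The reasoning above relies on np.select using the first condition that matches. A small sketch on a plain array (the names values, result and reversed_result are made up for the example) shows that the order of the conditions matters:

```python
import numpy as np

values = np.array([-1.0, -0.6, -0.3, 0.5, 0.7])

# -1.0 and -0.6 satisfy both conditions; the first matching one wins
result = np.select([values <= -0.5, values <= 0.5], [-1, 0], default=1)

# with the conditions swapped, the broader test shadows the narrower one
reversed_result = np.select([values <= 0.5, values <= -0.5], [0, -1], default=1)

print(result)
print(reversed_result)
```

So the conditions should be listed from the most specific to the most general.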
I get an error when running this code.
What should I do?
The error is: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
unrealized = [0, 0.50, 0.90, 0.20, 3, 6, 7, 2]

def stoploss():
    df = pd.DataFrame({"price": unrealized})
    df['high'] = df.cummax()
    if df['high'] <= 0.10:
        df['trailingstop'] = -0.50
        df['signalstop'] = df['price'] < df['trailingstop']
    if df['high'] >= 0.10:
        df['trailingstop'] = df['high'] - 0.10
        df['signalstop'] = df['price'] < df['trailingstop']
    return df['signalstop'].iloc[-1]

print(stoploss())
Well, it is because the truth value of a series is ambiguous. But what does that mean? Just check the output of df['high']<=0.1 and you'll see a series of True/False values depending on if the condition is met or not. And you are asking the truth value of this series by using the if statement. And what should that be? That is exactly what the error is telling you. "Should I use any or all or what should I do with this series?"
But I assume you want to do something else: You want to set these two extra columns depending on the value in the high column. Use the .loc with a condition to set a value for all items matching the condition as described here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
So the code might look like this (if I guessed your intention correctly):
df['high'] = df.cummax()
df.loc[df['high']<=0.1,'trailingstop'] = -0.50
df.loc[df['high']<=0.1,'signalstop'] = df['price'] < df['trailingstop']
df.loc[df['high']>=0.1,'trailingstop'] = df['high']-0.10
df.loc[df['high']>=0.1,'signalstop'] = df['price'] < df['trailingstop']
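The ambiguity described in this answer can be reproduced on a tiny Series. The following is only an illustrative sketch; the names high, cond and outcome are made up for the example.

```python
import pandas as pd

high = pd.Series([0.0, 0.5, 0.9])
cond = high <= 0.1            # a Series of booleans, one per row

try:
    if cond:                  # pandas refuses to pick one truth value
        pass
    outcome = "no error"
except ValueError as exc:
    outcome = str(exc)

any_true = bool(cond.any())   # is the condition met for at least one row?
all_true = bool(cond.all())   # is the condition met for every row?

print(outcome)
print(any_true, all_true)
```

any() and all() answer two different questions about the whole column, which is why pandas asks you to choose instead of guessing.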
The problem is in the lines
if df['high'] <= 0.10:
and
if df['high'] >= 0.10:
because comparing a whole Series against a number gives a Series of booleans, which an if statement cannot reduce to a single True or False.
Not entirely sure what your goal is, but here is some working code:
import pandas as pd

unrealized = [0, 0.50, 0.90, 0.20, 3, 6, 7, 2]

def stoploss():
    df = pd.DataFrame({"price": unrealized})
    df['high'] = df.cummax()
    df['trailingstop'] = 0.0 # just to create the (float) series in the DF
    # assign through .loc with a boolean mask rather than chained indexing
    df.loc[df['high'] <= 0.10, 'trailingstop'] = -0.50
    df.loc[df['high'] >= 0.10, 'trailingstop'] = df['high'] - 0.10
    print(df['trailingstop'])
    df['signalstop'] = df['price'] < df['trailingstop']
    print(df['signalstop'])
    return df['signalstop'].iloc[-1]

print(stoploss())
Result:
0 -0.5
1 0.4
2 0.8
3 0.8
4 2.9
5 5.9
6 6.9
7 6.9
Name: trailingstop, dtype: float64
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 True
Name: signalstop, dtype: bool
True
I am a newcomer to python. I want to implement a "For" loop on the elements of a dataframe, with an embedded "if" statement.
Code:
import numpy as np
import pandas as pd

#Dataframes
x = pd.DataFrame([1,-2,3])
y = pd.DataFrame()
for i in x.iterrows():
    for j in x.iteritems():
        if x>0:
            y = x*2
        else:
            y = 0
With the previous loop, I want to go through each item in the x dataframe and generate a new dataframe y based on the condition in the "if" statement. When I run the code, I get the following error message.
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any help would be much appreciated.
In pandas it is best to avoid loops if a vectorized solution exists:
x = pd.DataFrame([1,-2,3], columns=['a'])
y = pd.DataFrame(np.where(x['a'] > 0, x['a'] * 2, 0), columns=['b'])
print (y)
   b
0  2
1  0
2  6
Explanation:
First, compare the column with the value to get a boolean mask:
print (x['a'] > 0)
0 True
1 False
2 True
Name: a, dtype: bool
Then use numpy.where to set the values according to the condition:
print (np.where(x['a'] > 0, x['a'] * 2, 0))
[2 0 6]
Finally, use the DataFrame constructor or create a new column:
x['new'] = np.where(x['a'] > 0, x['a'] * 2, 0)
print (x)
   a  new
0  1    2
1 -2    0
2  3    6
You can try this:
y = (x[(x > 0)]*2).fillna(0)
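Note that masking with x[x > 0] first turns the non-matching cells into NaN, so the result above ends up with a float dtype. A possible variant, offered here only as a sketch under the same sample data, uses DataFrame.where to substitute 0 directly:

```python
import pandas as pd

x = pd.DataFrame([1, -2, 3])

# keep values where the condition holds, put 0 elsewhere, then double
y = x.where(x > 0, 0) * 2
print(y)
```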
I'm trying to create a new feature using
df_transactions['emome'] = df_transactions['emome'].apply(lambda x: 1 if df_transactions['plan_list_price'] ==0 & df_transactions['actual_amount_paid'] > 0 else 0).astype(int)
But it raises an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
How can I create a new column that returns 1 when plan_list_price is 0 and actual_amount_paid is >0 else 0?
I would like to still use pandas apply.
You are really close, but a vectorized solution without apply is much better: build a boolean mask and convert it to int:
mask = (df_transactions['plan_list_price'] == 0) & \
       (df_transactions['actual_amount_paid'] > 0)
df_transactions['emome'] = mask.astype(int)
If you really want the slower apply:
f = lambda x: 1 if x['plan_list_price'] ==0 and x['actual_amount_paid'] > 0 else 0
df_transactions['emome'] = df_transactions.apply(f, axis=1)
Sample:
df_transactions = pd.DataFrame({'A':list('abcdef'),
                                'plan_list_price':[0,0,0,5,5,0],
                                'actual_amount_paid':[-1,0,9,4,2,3]})
mask = (df_transactions['plan_list_price'] == 0) & \
       (df_transactions['actual_amount_paid'] > 0)
df_transactions['emome1'] = mask.astype(int)
f = lambda x: 1 if x['plan_list_price'] ==0 and x['actual_amount_paid'] > 0 else 0
df_transactions['emome2'] = df_transactions.apply(f, axis=1)
print (df_transactions)
A actual_amount_paid plan_list_price emome1 emome2
0 a -1 0 0 0
1 b 0 0 0 0
2 c 9 0 1 1
3 d 4 5 0 0
4 e 2 5 0 0
5 f 3 0 1 1
Timings:
#[60000 rows]
df_transactions = pd.concat([df_transactions] * 10000, ignore_index=True)
In [201]: %timeit df_transactions['emome1'] = ((df_transactions['plan_list_price'] == 0) & (df_transactions['actual_amount_paid'] > 0)).astype(int)
1000 loops, best of 3: 971 µs per loop
In [202]: %timeit df_transactions['emome2'] = df_transactions.apply(lambda x: 1 if x['plan_list_price'] ==0 and x['actual_amount_paid'] > 0 else 0, axis=1)
1 loop, best of 3: 1.15 s per loop
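The parentheses around each comparison in the mask matter. In Python, & binds more tightly than == and >, so the unparenthesized expression from the question is parsed as a chained comparison around (0 & paid). The sketch below illustrates this on a small made-up frame; the names price, paid, raised and good are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({"plan_list_price": [0, 0, 5],
                   "actual_amount_paid": [-1, 9, 4]})
price, paid = df["plan_list_price"], df["actual_amount_paid"]

try:
    # parsed as: price == (0 & paid) > 0, a chained comparison that
    # needs the truth value of a whole Series
    bad = price == 0 & paid > 0
    raised = False
except ValueError:
    raised = True

# each comparison parenthesized: element-wise AND of two boolean Series
good = ((price == 0) & (paid > 0)).astype(int)

print(raised)
print(good.tolist())
```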
A few issues:
On the right side of the assignment, the new field (emome) has not been created yet.
The lambda function should work on its argument x, not on df_transactions as a whole.
You need to specify axis since you are applying to each row (default is to each column).
From Doc:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Axis along which the function is applied:
0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
I have table:
   tel  size      1      2     3     4
0  123     1   Baby   Baby  None  None
1  234     1  Shave  Shave  None  None
2  222     1   Baby   Baby  None  None
3  333     1  Shave  Shave  None  None
I want to check whether the values in columns 1, 2, 3, 4, ... are partly equal, using 2 loops:
x = df_map.iloc[i,2:]
y = df_map.iloc[j,2:]
so df_map.iloc[0,2:] should be equal to df_map.iloc[2,2:],
and df_map.iloc[1,2:], is equal to df_map.iloc[3,2:],
I tried:
x == y
and
y.eq(x)
but it returns error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If I use (x==y).all() or (x==y).any(), it returns the wrong result.
I need something like:
if x== y:
    counter += 1
Update:
The problem was in the None values. I used fillna('') and (x == y).all()
fillna('') because in a pandas comparison None == None evaluates to False
use numpy broadcasting to evaluate == for every pair of rows
all(-1) to make sure the whole row matches
np.fill_diagonal because we don't need self matches
np.where to find where the matches are
v = df.fillna('').values[:, 2:]
match = ((v[None, :] == v[:, None]).all(-1))
np.fill_diagonal(match, False)
i, j = np.where(match)
pd.Series(i, j)
2 0
3 1
0 2
1 3
dtype: int64
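For readers who want to run the steps above end to end, here is a self-contained sketch. It rebuilds the table from the question under the name df_map (the column labels "1" to "4" are assumed to be strings) and collects the matching row pairs in a list called pairs, a name made up for this example.

```python
import numpy as np
import pandas as pd

df_map = pd.DataFrame({
    "tel": [123, 234, 222, 333],
    "size": [1, 1, 1, 1],
    "1": ["Baby", "Shave", "Baby", "Shave"],
    "2": ["Baby", "Shave", "Baby", "Shave"],
    "3": [None, None, None, None],
    "4": [None, None, None, None],
})

# replace missing values, then keep only the columns to compare
labels = df_map.iloc[:, 2:].fillna("").to_numpy()

# broadcasting compares every row with every other row, cell by cell
match = (labels[None, :, :] == labels[:, None, :]).all(-1)
np.fill_diagonal(match, False)      # a row trivially matches itself

i, j = np.where(match)
pairs = sorted(zip(i.tolist(), j.tolist()))
print(pairs)
```

Each match appears twice, once as (i, j) and once as (j, i), because the comparison is symmetric.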
The values underlying a pandas Series (whether it is a row or a column) are a NumPy array, so a comparison is made element by element and the result is itself another array; to get a single answer you have to loop over, or reduce, that result.
import numpy as np
x = df_map.iloc[i,2:]
y = df_map.iloc[j,2:]
np.equal(y.values,x.values)
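If a single True/False per pair of rows is needed for the counter from the question, one option is to reduce the element-wise result, for instance with np.array_equal. The loop below is a sketch under that assumption; it fills the None values first, as the asker's update suggests, and counts each pair once (j > i).

```python
import numpy as np
import pandas as pd

df_map = pd.DataFrame({
    "tel": [123, 234, 222, 333],
    "size": [1, 1, 1, 1],
    "1": ["Baby", "Shave", "Baby", "Shave"],
    "2": ["Baby", "Shave", "Baby", "Shave"],
    "3": [None, None, None, None],
    "4": [None, None, None, None],
})

filled = df_map.fillna("")
counter = 0
for i in range(len(filled)):
    for j in range(i + 1, len(filled)):
        x = filled.iloc[i, 2:]
        y = filled.iloc[j, 2:]
        # array_equal collapses the element-wise comparison to one bool
        if np.array_equal(x.to_numpy(), y.to_numpy()):
            counter += 1

print(counter)
```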