How to convert a set of probabilities to 0 and 1?

I have been given a dataset that contains two columns, 'y' and 'proba'. 'y' holds the two class labels 0 and 1, and 'proba' is the probability.
I have to create a list 'y_hat': if 'proba' < 0.5 I append 0, otherwise 1. I have written this code:
y_hat = [0 if (df_5a['proba']<0.5) else 1]
However, I'm getting the error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

You can use:
(df['proba'] >= .5).astype(int)
or
df['proba'].ge(.5).astype(int)
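To make the vectorized comparison concrete, here is a minimal sketch. The sample values are assumed for illustration only, and `df_5a` is simply the DataFrame name used in the question.

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only
df_5a = pd.DataFrame({"y": [0, 0, 1, 0, 1, 0],
                      "proba": [0.2, 0.3, 0.25, 0.8, 0.9, 0.15]})

# The comparison is element-wise and yields a boolean Series, not a single bool
mask = df_5a["proba"] >= 0.5

# Casting the booleans to int gives 1 where proba >= 0.5 and 0 elsewhere
y_hat = mask.astype(int).tolist()
print(y_hat)  # [0, 0, 0, 1, 1, 0]
```

Calling `.tolist()` is only needed if a plain Python list is required; the integer Series can usually be used directly.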

y_hat = list(map(lambda x: 0 if x else 1, (df_5a['proba']<0.5).values))

Assuming your dataset looks like this:
import pandas as pd
data = {'y': [0, 0, 1, 0, 1, 0],
        'proba': [0.2, 0.3, 0.25, 0.8, 0.9, 0.15]}
df_5a = pd.DataFrame(data)
df_5a
Output:
   y  proba
0  0   0.20
1  0   0.30
2  1   0.25
3  0   0.80
4  1   0.90
5  0   0.15
Your code does not work because, as the error indicates, the condition is ambiguous: df_5a['proba'] < 0.5 compares the whole column at once and returns a Series of booleans, which cannot be reduced to a single True or False.
For instance, if you only wanted to apply your condition to the first row (df_5a['proba'][0]), then selecting that single value makes your code work:
y_hat = [0 if (df_5a['proba'][0] <0.5) else 1]
y_hat
Output:
[0]
Hence, although I find the answer of @Gopal Gautam correct, below is another approach using pandas.DataFrame.itertuples, which lets you iterate over each row:
y_hat = []
for row in df_5a.itertuples(index=True, name='Pandas'):
    if row.proba < 0.5:
        y_hat.append(0)
    else:
        y_hat.append(1)
print(y_hat)
Output:
[0, 0, 0, 1, 1, 0]
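If a Python-level loop is acceptable, the expression from the question can also be repaired as a list comprehension. This is only a sketch of that idea, reusing the assumed sample data from above:

```python
import pandas as pd

# Same assumed sample data as in the answer above
df_5a = pd.DataFrame({"y": [0, 0, 1, 0, 1, 0],
                      "proba": [0.2, 0.3, 0.25, 0.8, 0.9, 0.15]})

# Iterating over the column yields one float at a time,
# so each comparison produces a plain bool that `if` can handle
y_hat = [0 if p < 0.5 else 1 for p in df_5a["proba"]]
print(y_hat)  # [0, 0, 0, 1, 1, 0]
```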

Related

Pandas apply custom function to DF

I would like to create a brand new data frame by replacing values of a DF using a custom function.
I keep getting the following error "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
I tried some suggestions (Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()) but it didn't work.
I would appreciate if somebody could shed light on this issue and help me out.
Thank you in advance for your time.
def convert2integer(x):
    if x <= -0.5:
        return -1
    elif (x > -0.5 & x <= 0.5):
        return 0
    elif x > 0.5:
        return 1
df = pd.DataFrame({'A':[1,0,-0.6,-1,0.7],
                   'B':[-1,1,-0.3,0.5,1]})
df.apply(convert2integer)

A few options:
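The slower option, but the one most similar to your code, uses applymap:
def convert2integer(x):
    if x <= -0.5:
        return -1
    elif x <= 0.5:
        return 0
    else:
        return 1
df = pd.DataFrame({'A': [1, 0, -0.6, -1, 0.7],
                   'B': [-1, 1, -0.3, 0.5, 1]})
new_df = df.applymap(convert2integer)
new_df:
   A  B
0  1 -1
1  0  1
2 -1  0
3 -1  0
4  1  1
applymap applies the function to each cell in the DataFrame, so x is a float and should be treated as such. df.apply, by contrast, passes a whole column (a Series) to the function, which is why the if comparisons raised the error. Note that pandas 2.1 deprecated DataFrame.applymap in favor of DataFrame.map, which behaves the same way element-wise.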
The faster option via np.select:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 0, -0.6, -1, 0.7],
                   'B': [-1, 1, -0.3, 0.5, 1]})
new_df = pd.DataFrame(
    np.select([df.le(-0.5), df.le(0.5)],
              [-1, 0],
              default=1),
    index=df.index,
    columns=df.columns
)
new_df:
   A  B
0  1 -1
1  0  1
2 -1  0
3 -1  0
4  1  1
np.select takes a list of conditions and a list of choices. Where a condition is True, it uses the value at the corresponding index in the choice list; if several conditions are True for an element, the first one wins.
The last condition does not need to be checked: a value that matched neither of the first two conditions must be greater than 0.5. Likewise, the second condition does not need to check that the value is greater than -0.5, because otherwise the first condition would already have matched.
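The ordering rule is easier to see on a small array. The following sketch uses made-up values chosen to sit on and around the two thresholds:

```python
import numpy as np

# Assumed sample values, including both boundary cases
x = np.array([-1.0, -0.5, -0.3, 0.5, 0.7])

# Conditions are checked in order; the first True one wins for each element
conditions = [x <= -0.5, x <= 0.5]
choices = [-1, 0]

# Elements matching no condition fall through to the default
result = np.select(conditions, choices, default=1)
print(result)  # [-1 -1  0  0  1]
```

Note that -1.0 satisfies both conditions, yet it is mapped to -1 because the first condition takes priority.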

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). IF Statements [duplicate]

I get an error while running this code. What should I do?
The error is: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
unrealized = [0, 0.50, 0.90, 0.20, 3, 6, 7, 2]
def stoploss():
    df = pd.DataFrame({"price": unrealized})
    df['high'] = df.cummax()
    if df['high'] <= 0.10:
        df['trailingstop'] = -0.50
        df['signalstop'] = df['price'] < df['trailingstop']
    if df['high'] >= 0.10:
        df['trailingstop'] = df['high'] - 0.10
        df['signalstop'] = df['price'] < df['trailingstop']
    return df['signalstop'].iloc[-1]
print(stoploss())

Well, it is because the truth value of a series is ambiguous. But what does that mean? Just check the output of df['high']<=0.1 and you'll see a series of True/False values depending on if the condition is met or not. And you are asking the truth value of this series by using the if statement. And what should that be? That is exactly what the error is telling you. "Should I use any or all or what should I do with this series?"
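A short sketch of that ambiguity, with values assumed purely for illustration: the `if` fails on the boolean Series, while an explicit `any()` or `all()` gives a single answer.

```python
import pandas as pd

# Assumed sample values for illustration
high = pd.Series([0.0, 0.5, 0.9])
mask = high <= 0.1          # element-wise result: [True, False, False]

try:
    if mask:                # asks for one truth value of three booleans
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# An explicit reduction removes the ambiguity
print(mask.any())   # True: at least one element satisfies the condition
print(mask.all())   # False: not every element satisfies it
```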
But I assume you want to do something else: you want to set these two extra columns depending on the value in the high column. Use .loc with a boolean condition to set a value for all rows matching that condition, as described here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
So the code might look like this (if I guessed your intention correctly):
df['high'] = df.cummax()
df.loc[df['high']<=0.1,'trailingstop'] = -0.50
df.loc[df['high']<=0.1,'signalstop'] = df['price'] < df['trailingstop']
df.loc[df['high']>=0.1,'trailingstop'] = df['high']-0.10
df.loc[df['high']>=0.1,'signalstop'] = df['price'] < df['trailingstop']
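As an alternative sketch (not the answerer's code), the two branches can be folded into a single `np.where` call. It assumes, like the code above, that rows where high is exactly 0.10 should end up in the second branch, since that assignment runs last.

```python
import numpy as np
import pandas as pd

unrealized = [0, 0.50, 0.90, 0.20, 3, 6, 7, 2]
df = pd.DataFrame({"price": unrealized})
df["high"] = df["price"].cummax()

# Use -0.50 while the running high is below 0.10, else trail 0.10 under the high
df["trailingstop"] = np.where(df["high"] < 0.10, -0.50, df["high"] - 0.10)

# The stop triggers wherever the price drops below the trailing stop
df["signalstop"] = df["price"] < df["trailingstop"]
print(df)
```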

The problem is in the lines
if df['high'] <= 0.10:
and
if df['high'] >= 0.10:
because each one compares a whole Series against a number and gets back a Series of booleans, which if cannot reduce to a single True or False.
I am not entirely sure what your goal is, but here is some working code:
unrealized = [0, 0.50, 0.90, 0.20, 3, 6, 7, 2]
def stoploss():
    df = pd.DataFrame({"price": unrealized})
    df['high'] = df['price'].cummax()
    df['trailingstop'] = 0.0  # just to create the float column in the DF
    # .loc avoids chained indexing, so the assignment reliably reaches df
    df.loc[df['high'] <= 0.10, 'trailingstop'] = -0.50
    df.loc[df['high'] >= 0.10, 'trailingstop'] = df['high'] - 0.10
    print(df['trailingstop'])
    df['signalstop'] = df['price'] < df['trailingstop']
    print(df['signalstop'])
    return df['signalstop'].iloc[-1]
print(stoploss())
Result:
0   -0.5
1    0.4
2    0.8
3    0.8
4    2.9
5    5.9
6    6.9
7    6.9
Name: trailingstop, dtype: float64
0    False
1    False
2    False
3     True
4    False
5    False
6    False
7     True
Name: signalstop, dtype: bool
True

compare value of the current index to the value of next index in Pandas df

I'm trying to compare the value at the current index to the value at the next index in my pandas data frame. I'm able to access the value with iloc, but when I write an if condition to check the value, it gives me an error.
Code I tried:
df = pd.DataFrame({'Col1': [2.5, 1.5, 3 , 3 ,4.8 , 4 ]})
trend = list()
for k in range(len(df)):
    if df.iloc[k+1] > df.iloc[k]:
        trend.append('up')
    if df.iloc[k+1] < df.iloc[k]:
        trend.append('down')
    if df.iloc[k+1] == df.iloc[k]:
        trend.append('nochange')
dftrend = pd.DataFrame(trend)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I tried assigning the iloc[k] value to a variable "current" with astype=int. Still, I am unable to use the variable "current" in my if condition. I would appreciate it if somebody could help with info on how to resolve it.

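You are getting the error because df.iloc[k] gives you a pd.Series (the whole row), so the comparison returns a Series rather than a single boolean.
You can use, say, df.iloc[k, 0] to get the scalar Col1 value instead.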
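A minimal sketch of that difference, using the data from the question. The loop bound `len(df) - 1` is an addition here, so that `k + 1` never runs past the last row.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [2.5, 1.5, 3, 3, 4.8, 4]})

row = df.iloc[0]        # a Series holding every column of row 0
value = df.iloc[0, 0]   # a scalar: row 0, column 0

print(type(row).__name__)
print(type(value).__name__, value)

trend = []
for k in range(len(df) - 1):   # stop one short so k + 1 stays in range
    nxt, cur = df.iloc[k + 1, 0], df.iloc[k, 0]
    trend.append("up" if nxt > cur else "down" if nxt < cur else "nochange")
print(trend)
```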

What I have done is convert that particular column into a list. Instead of working directly with the Series object returned by the DataFrame, I prefer converting it to a list or NumPy array first and then performing basic operations on it. The loop also stops at len(df)-1 so that temp[k+1] stays in range.
The corrected, working code is below.
import pandas as pd
df = pd.DataFrame({'Col1': [2.5, 1.5, 3 , 3 ,4.8 , 4 ]})
trend = list()
temp=df['Col1'].tolist()
print(temp)
for k in range(len(df)-1):
    if temp[k+1] > temp[k]:
        trend.append('up')
    if temp[k+1] < temp[k]:
        trend.append('down')
    if temp[k+1] == temp[k]:
        trend.append('nochange')
dftrend = pd.DataFrame(trend)
print(trend)

Here is a more pandas-like approach. We can get the difference of two consecutive elements of a series easily via pandas.DataFrame.diff:
import pandas as pd
df = pd.DataFrame({'Col1': [2.5, 1.5, 3 , 3 ,4.8 , 4 ]})
df_diff = df.diff()
   Col1
0   NaN
1  -1.0
2   1.5
3   0.0
4   1.8
5  -0.8
Now you can apply a function element-wise that only distinguishes >0, <0 and ==0, using pandas.DataFrame.applymap:
def direction(x):
    if x > 0:
        return 'up'
    elif x < 0:
        return 'down'
    elif x == 0:
        return 'nochange'
df_diff.applymap(direction)
       Col1
0      None
1      down
2        up
3  nochange
4        up
5      down
Finally, it's a design decision what should happen to the first value of the series. Here the NaN value doesn't fit any case, so direction returns None for it. You can also treat it separately in direction, or omit it from your result by slicing.
Edit: The same as a one-liner:
df.diff().applymap(lambda x: 'up' if x > 0 else ('down' if x < 0 else ('nochange' if x == 0 else None)))
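A variation on the same idea that avoids a Python-level function call per cell is to reduce each difference to its sign and map the sign to a label. This is only a sketch; the label dictionary is an assumption introduced here, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [2.5, 1.5, 3, 3, 4.8, 4]})

# np.sign collapses each difference to -1.0, 0.0 or 1.0 (NaN stays NaN)
signs = np.sign(df["Col1"].diff())

# Map the three possible signs to labels; the leading NaN has no entry and stays NaN
labels = signs.map({1.0: "up", -1.0: "down", 0.0: "nochange"})
print(labels.tolist())
```

The first element remains missing, which leaves the same design decision about the first row open.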

You can use this:
df['trend'] = np.where(df.Col1.shift().isnull(), "N/A", np.where(df.Col1 == df.Col1.shift(), "nochange", np.where(df.Col1 < df.Col1.shift(), "down", "up")))
   Col1     trend
0   2.5       N/A
1   1.5      down
2   3.0        up
3   3.0  nochange
4   4.8        up
5   4.0      down

A For loop with embedded if statement to update a dataframe

I am a newcomer to python. I want to implement a "For" loop on the elements of a dataframe, with an embedded "if" statement.
Code:
import numpy as np
import pandas as pd
#Dataframes
x = pd.DataFrame([1,-2,3])
y = pd.DataFrame()
for i in x.iterrows():
    for j in x.iteritems():
        if x>0:
            y = x*2
        else:
            y = 0
With the previous loop, I want to go through each item in the x dataframe and generate a new dataframe y based on the condition in the "if" statement. When I run the code, I get the following error message.
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any help would be much appreciated.

In pandas it is best to avoid loops when a vectorized solution exists:
x = pd.DataFrame([1,-2,3], columns=['a'])
y = pd.DataFrame(np.where(x['a'] > 0, x['a'] * 2, 0), columns=['b'])
print (y)
   b
0  2
1  0
2  6
Explanation:
First, compare the column with the value to get a boolean mask:
print (x['a'] > 0)
0     True
1    False
2     True
Name: a, dtype: bool
Then use numpy.where to set values according to the condition:
print (np.where(x['a'] > 0, x['a'] * 2, 0))
[2 0 6]
Finally, use the DataFrame constructor or create a new column:
x['new'] = np.where(x['a'] > 0, x['a'] * 2, 0)
print (x)
   a  new
0  1    2
1 -2    0
2  3    6

You can try this:
y = (x[(x > 0)]*2).fillna(0)
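To unpack what that one-liner does, here is a step-by-step sketch on the DataFrame from the question. The `DataFrame.where` variant at the end is an alternative added here, not part of the answer.

```python
import pandas as pd

x = pd.DataFrame([1, -2, 3])

# x > 0 is a boolean DataFrame; x[mask] keeps matching cells and puts NaN elsewhere
masked = x[x > 0]

# Doubling leaves NaN untouched, and fillna(0) supplies the "else" value
y = (masked * 2).fillna(0)
print(y)

# DataFrame.where expresses the same idea in one call and keeps integer dtype
y_alt = (x * 2).where(x > 0, 0)
print(y_alt)
```

Because the masking step introduces NaN, `y` ends up with float values, whereas `y_alt` never passes through NaN.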

Pandas. Check that rows are partly equals

I have a table:
   tel  size      1      2     3     4
0  123     1   Baby   Baby  None  None
1  234     1  Shave  Shave  None  None
2  222     1   Baby   Baby  None  None
3  333     1  Shave  Shave  None  None
I want to check whether the values in columns 1, 2, 3, 4, ... are equal between rows, using two loops:
x = df_map.iloc[i,2:]
y = df_map.iloc[j,2:]
so df_map.iloc[0,2:] should be equal to df_map.iloc[2,2:],
and df_map.iloc[1,2:] should be equal to df_map.iloc[3,2:].
I tried:
x == y
and
y.eq(x)
but using the result in an if statement raises this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If I use (x==y).all() or (x==y).any(), it returns the wrong result.
I need something like:
if x== y:
    counter += 1
Update:
The problem was the None values. I used fillna('') and (x == y).all().

fillna('') because pandas treats None as missing, so an element-wise None == None comparison is False
use numpy broadcasting to evaluate == between every pair of rows
all(-1) to make sure the whole row matches
np.fill_diagonal because we don't need self matches
np.where to find where the matches are
v = df.fillna('').values[:, 2:]
match = ((v[None, :] == v[:, None]).all(-1))
np.fill_diagonal(match, False)
i, j = np.where(match)
pd.Series(i, j)
2 0
3 1
0 2
1 3
dtype: int64
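The steps above can be run end to end as follows. This is a sketch under the assumption that the table from the question is held in `df_map` with string column names; the list of index pairs at the end is just one convenient way to present the matches.

```python
import numpy as np
import pandas as pd

# Table from the question, rebuilt by hand (column names assumed to be strings)
df_map = pd.DataFrame({
    "tel": [123, 234, 222, 333],
    "size": [1, 1, 1, 1],
    "1": ["Baby", "Shave", "Baby", "Shave"],
    "2": ["Baby", "Shave", "Baby", "Shave"],
    "3": [None, None, None, None],
    "4": [None, None, None, None],
})

# Replace missing values so they compare equal, keeping only the value columns
v = df_map.iloc[:, 2:].fillna("").to_numpy()

# Shape (n, 1, k) against (1, n, k) broadcasts to (n, n, k): every row vs every row
match = (v[:, None, :] == v[None, :, :]).all(axis=-1)
np.fill_diagonal(match, False)   # a row trivially equals itself

# Each (i, j) pair says row i matches row j
pairs = [(int(i), int(j)) for i, j in zip(*np.where(match))]
print(pairs)  # [(0, 2), (1, 3), (2, 0), (3, 1)]
```

Every match appears twice, once in each direction; keeping only pairs with i < j would count each match once.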

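A row slice of a DataFrame is a pandas Series backed by a NumPy array, so comparing two of them gives an element-by-element result, one boolean per column, rather than a single True or False. To get one answer per pair of rows you have to reduce that result yourself, for example with .all().
import numpy as np
x = df_map.iloc[i, 2:]
y = df_map.iloc[j, 2:]
np.equal(y.values, x.values)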
