performance of a multi-class classifier using the 3 highest probabilities - python

I have a pandas dataframe as below
predictions.head()
Out[22]:
A B C D E G H L N \
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892
P R S U V LABEL
0 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.064734 0.822834 0.769277 0.512239 0.5 U
Each row holds the predicted probabilities for the different classes (columns).
The last column is the label (correct class).
I would like to evaluate the performance of the classifiers while allowing 2 errors.
That is, if the correct label is among the 3 highest probabilities, I consider the prediction correct.
Is there a smart way to do it in scikit-learn?

Try this approach:
In [57]: x = df.drop('LABEL',1).T.apply(lambda x: x.nlargest(3).index).T
In [58]: x
Out[58]:
0 1 2
0 L A N
1 L U A
2 P A S
3 L N B
4 R S A
In [59]: x.eq(df.LABEL, axis=0).any(1)
Out[59]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
A similar solution, which uses one fewer transpose:
In [66]: x = df.drop('LABEL',1).T.apply(lambda x: x.nlargest(3).index)
In [67]: x
Out[67]:
0 1 2 3 4
0 L L P L R
1 A U A N S
2 N A S B A
In [68]: x.eq(df.LABEL).any()
Out[68]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
Source DF:
In [70]: df
Out[70]:
A B C D E G H L N P R S U V LABEL
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892 0.064734 0.822834 0.769277 0.512239 0.5 U
UPDATE: trying to reproduce the error (from comments):
In [81]: df
Out[81]:
a b c d e LABEL
0 1 2 3 4 5 c
1 3 4 5 6 7 d
In [82]: x = df.drop('LABEL',1).T.apply(lambda x: x.nlargest(3).index)
In [83]: x
Out[83]:
0 1
0 e e
1 d d
2 c c
In [84]: x.eq(df.LABEL).any()
Out[84]:
0 True
1 True
dtype: bool
PS I'm using Pandas 0.23.0
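On newer pandas releases the positional axis argument in `drop('LABEL', 1)` and `any(1)` may be rejected, so a keyword-based variant can be handy. The following is a minimal sketch of the same top-k idea. The helper name `top_k_hits`, its parameters, and the small sample frame are made up for illustration and are not part of the answer above.

```python
import pandas as pd

def top_k_hits(df, k=3, label_col="LABEL"):
    """Return a boolean Series: True where the label is among the k largest scores."""
    probs = df.drop(columns=label_col)  # keyword form instead of a positional axis
    hits = [
        label in probs.loc[idx].nlargest(k).index
        for idx, label in df[label_col].items()
    ]
    return pd.Series(hits, index=df.index)

# Hypothetical sample data: four classes, three predictions.
df = pd.DataFrame({
    "A": [0.90, 0.10, 0.20],
    "B": [0.05, 0.60, 0.30],
    "C": [0.03, 0.20, 0.40],
    "D": [0.02, 0.10, 0.10],
    "LABEL": ["A", "D", "C"],
})

hits = top_k_hits(df, k=2)
accuracy = hits.mean()  # share of rows counted as correct
```

Taking the mean of the boolean Series gives the top-k accuracy directly, which is usually the number wanted when comparing classifiers.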

If performance is important, use numpy.argsort after removing the last column with iloc:
print (np.argsort(-df.iloc[:, :-1].values, axis=1)[:,:3])
[[ 7 0 8]
[ 7 12 0]
[ 9 0 11]
[ 7 8 1]
[10 11 0]]
v = df.columns[np.argsort(-df.iloc[:, :-1].values, axis=1)[:,:3]]
print (v)
Index([['L', 'A', 'N'], ['L', 'U', 'A'], ['P', 'A', 'S'], ['L', 'N', 'B'],
['R', 'S', 'A']],
dtype='object')
a = pd.DataFrame(v).eq(df['LABEL'], axis=0).any(axis=1)
print (a)
0 True
1 True
2 False
3 False
4 False
dtype: bool
Thanks, @Maxu, for another similar solution with numpy.argpartition:
v = df.columns[np.argpartition(-df.iloc[:, :-1].values, 3, axis=1)[:,:3]]
Sample data:
df = pd.DataFrame({'A': [0.718363, 0.677776, 0.6820189999999999, 0.48298100000000005, 0.700207], 'B': [0.5, 0.5, 0.5, 0.5, 0.5], 'C': [0.403466, 0.366128, 0.074347, 0.06543600000000001, 0.515825], 'D': [0.5, 0.5, 0.5, 0.5, 0.5], 'E': [0.5, 0.5, 0.5, 0.5, 0.5], 'G': [0.45898900000000004, 0.042405, 0.562217, 0.112383, 0.07808899999999999], 'H': [0.5, 0.5, 0.5, 0.5, 0.5], 'L': [0.85019, 0.8942, 0.417786, 0.7436590000000001, 0.43783900000000003], 'N': [0.6208779999999999, 0.510644, 0.539949, 0.604382, 0.249892], 'P': [0.182169, 0.015788999999999997, 0.7992229999999999, 0.24676599999999999, 0.064734], 'R': [0.48363100000000003, 0.523462, 0.603212, 0.39907, 0.8228340000000001], 'S': [0.43291499999999994, 0.547838, 0.6208060000000001, 0.34108099999999997, 0.769277], 'U': [0.328495, 0.691239, 0.335204, 0.22940700000000003, 0.512239], 'V': [0.5, 0.5, 0.5, 0.5, 0.5], 'LABEL': ['A', 'L', 'G', 'P', 'U']})
print (df)
A B C D E G H L N \
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892
P R S U V LABEL
0 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.064734 0.822834 0.769277 0.512239 0.5 U
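The argpartition idea can be wrapped into a small accuracy function. This is only a sketch under a few assumptions: the function name `top_k_accuracy` is hypothetical, the true classes are given as integer column positions rather than column names, and the sample scores are invented.

```python
import numpy as np

def top_k_accuracy(scores, label_idx, k=3):
    """Fraction of rows whose true class is among the k highest scores."""
    scores = np.asarray(scores, dtype=float)
    label_idx = np.asarray(label_idx)
    # argpartition puts the k smallest entries of -scores (that is, the k
    # largest scores) into the first k positions of each row, unordered.
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = (top_k == label_idx[:, None]).any(axis=1)
    return hits.mean()

# Hypothetical sample data: three predictions over four classes.
scores = [[0.90, 0.05, 0.03, 0.02],
          [0.10, 0.60, 0.20, 0.10],
          [0.20, 0.30, 0.40, 0.10]]
labels = [0, 3, 2]  # column positions of the true classes

acc = top_k_accuracy(scores, labels, k=2)
```

For the sklearn-native route asked about in the question, recent scikit-learn versions also provide `sklearn.metrics.top_k_accuracy_score`, which may be worth checking against the installed version.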

I can't think of a solution in sklearn so here's one in pandas
# Data
predictions
Out[]:
A B C D E G H L N P R S U V LABEL
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892 0.064734 0.822834 0.769277 0.512239 0.5 U
# Check if the label is in the top 3 (one line solution)
predictions.apply(lambda row: row['LABEL'] in list(row.drop('LABEL').sort_values().tail(3).index), axis=1)
Out[]:
0 True
1 True
2 False
3 False
4 False
Here is what is happening:
# List the top 3 results:
predictions.apply(lambda row: list(row.drop('LABEL').sort_values().tail(3).index), axis=1)
Out[]:
0 [N, A, L]
1 [A, U, L]
2 [S, A, P]
3 [V, N, L]
4 [A, S, R]
# Then check if the 'LABEL' is inside this list
You could ask this question on Cross Validated as they will use sklearn extensively

Related

How do you calculate the sum based on certain numbers in the dataframe?

I have variables like this
a = pd.DataFrame(np.array([[1, 1, 2, 3, 2], [2, 2, 3, 3, 2], [1, 2, 3, 2, 3]]))
b = np.array([0.1, 0.3, 0.5, 0.6, 0.2])
Display a
0 1 2 3 4
0 1 1 2 3 2
1 2 2 3 3 2
2 1 2 3 2 3
Display b
[0.1 0.3 0.5 0.6 0.2]
The result I want is the sum of the values in b based on the values of a where the indices of a serve as the indices for the values in b .
The final result that I want is like this.
0.4 0.7 0.6
0 0.6 1.1
0.1 0.9 0.7
How to obtain the first row in detail
0.4 0.7 0.6
so 0.4 is obtained from 0.1 + 0.3, based on the number 1 in the first row of a, i.e. since the indices are 0 and 1, we add b[0] and b[1]
0.7 is obtained from 0.5 + 0.2, based on the number 2 where the indices are 2 and 4, so we added b[2] + b[4]
0.6 based on the number 3 which is just b[3] because the index is 3
You can create one-hot encoded matrices to use in a dot product:
from pandas.api.types import CategoricalDtype
n = a.max().max()
cat = CategoricalDtype(categories=np.arange(1, n + 1))
dummies = pd.get_dummies(a.T.astype(cat))
b.dot(dummies).reshape(len(a), n)  # one row per row of a, one column per category
yields
array([[0.4, 0.7, 0.6],
[0. , 0.6, 1.1],
[0.1, 0.9, 0.7]])
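The same one-hot idea can be sketched with plain NumPy broadcasting, which avoids the categorical conversion. This is an illustrative alternative rather than the answer's method. The names `cats`, `one_hot`, and `result` are chosen here, and `a` is taken as a NumPy array instead of a DataFrame.

```python
import numpy as np

a = np.array([[1, 1, 2, 3, 2],
              [2, 2, 3, 3, 2],
              [1, 2, 3, 2, 3]])
b = np.array([0.1, 0.3, 0.5, 0.6, 0.2])

cats = np.arange(1, a.max() + 1)   # categories 1..n
one_hot = a[:, :, None] == cats    # shape (rows, columns, n)
# Weight each column position by b, then sum over the column axis.
result = (one_hot * b[None, :, None]).sum(axis=1)   # shape (rows, n)
```

Here `result[r, j]` is the sum of `b` over the positions where row `r` of `a` equals `cats[j]`, and it is zero when the category does not occur in that row.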
This is one way to do it. It is not optimized, but I think it follows your logic clearly:
df = pd.DataFrame(columns=range(1, a.max().max()+1))
for i, r in a.iterrows():
    for c in list(df):
        df.loc[i, c] = np.sum(b[r[r == c].index.values])
df
1 2 3
0 0.4 0.7 0.6
1 0 0.6 1.1
2 0.1 0.9 0.7

Pandas conditional map/fill/replace

d1=pd.DataFrame({'x':['a','b','c','c'],'y':[-1,-2,-3,0]})
d2=pd.DataFrame({'x':['d','c','a','b'],'y':[0.1,0.2,0.3,0.4]})
I want to replace d1.y where y<0 with the correspondent y in d2. It's something like vlookup in Excel. The core problem is replace y according to x rather than just simply manipulate y. What I want is
Out[40]:
x y
0 a 0.3
1 b 0.4
2 c 0.2
3 c 0.0
Use Series.map with condition:
s = d2.set_index('x')['y']
d1.loc[d1.y < 0, 'y'] = d1['x'].map(s)
print (d1)
x y
0 a 0.3
1 b 0.4
2 c 0.2
3 c 0.0
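You can try this, although it aligns d1 and d2 on the row index rather than on x, so it only gives the desired result when both frames list x in the same order (which is not the case for the sample data):
d1.loc[d1.y < 0, 'y'] = d2.loc[d1.y < 0, 'y']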
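A lookup keyed on x can also be written with `Series.mask`, which builds the new column in one expression instead of assigning floats into an integer column through `.loc` (an assignment that may emit a dtype warning on recent pandas versions). This is a minimal sketch that assumes the x values in d2 are unique. The names `lookup` and `replacement` are introduced here for illustration.

```python
import pandas as pd

d1 = pd.DataFrame({"x": ["a", "b", "c", "c"], "y": [-1, -2, -3, 0]})
d2 = pd.DataFrame({"x": ["d", "c", "a", "b"], "y": [0.1, 0.2, 0.3, 0.4]})

lookup = d2.set_index("x")["y"]      # maps each x to its replacement value
replacement = d1["x"].map(lookup)    # aligned with d1's rows; NaN if x is missing in d2
# Replace y only where it is negative; other rows keep their value.
d1["y"] = d1["y"].mask(d1["y"] < 0, replacement)
```

Because the replacement is looked up by x, the row order of d2 does not matter.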

How do I calculate moving average with customized weight in pandas?

I have a dataframe that contains two columns, a: [1,2,3,4,5]; b: [1,0.4,0.3,0.5,0.2]. How can I make a column c such that:
c[0] = 1
c[i] = c[i-1]*b[i]+a[i]*(1-b[i])
so that c:[1,1.6,2.58,3.29,4.658]
Calculation:
1 = 1
1*0.4+2*0.6 = 1.6
1.6*0.3+3*0.7 = 2.58
2.58*0.5+4*0.5 = 3.29
3.29*0.2+5*0.8 = 4.658
?
I can't see a way to vectorise your recursive algorithm. However, you can use numba to optimize your current logic. This should be preferable to a regular loop.
import numpy as np
import pandas as pd
from numba import jit
df = pd.DataFrame({'a': [1,2,3,4,5],
                   'b': [1,0.4,0.3,0.5,0.2]})
@jit(nopython=True)
def foo(a, b):
    c = np.zeros(a.shape)
    c[0] = 1
    for i in range(1, c.shape[0]):
        c[i] = c[i-1] * b[i] + a[i] * (1-b[i])
    return c
df['c'] = foo(df['a'].values, df['b'].values)
print(df)
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
There could be a smarter way, but here's my attempt:
import pandas as pd
a = [1,2,3,4,5]
b = [1,0.4,0.3,0.5,0.2]
df = pd.DataFrame({'a':a , 'b': b})
for i in range(len(df)):
    if i == 0:
        df.loc[i,'c'] = 1
    else:
        df.loc[i,'c'] = df.loc[i-1,'c'] * df.loc[i,'b'] + df.loc[i,'a'] * (1 - df.loc[i,'b'])
Output:
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
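Since each value of c depends on the previous one, the recurrence is a running fold, and the standard library's `itertools.accumulate` expresses it without an explicit index loop. This is a sketch only: the helper name `step` is made up, and the `initial` argument requires Python 3.8 or later.

```python
from itertools import accumulate

a = [1, 2, 3, 4, 5]
b = [1, 0.4, 0.3, 0.5, 0.2]

def step(prev_c, pair):
    """One step of c[i] = c[i-1] * b[i] + a[i] * (1 - b[i])."""
    a_i, b_i = pair
    return prev_c * b_i + a_i * (1 - b_i)

# c[0] is fixed at 1; every later value is folded from the previous one.
c = list(accumulate(zip(a[1:], b[1:]), step, initial=1.0))
```

The resulting list has the same length as `a` and can be assigned to a DataFrame column as `df['c'] = c`.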

Get Trend/Streak in Each Row of Pandas DataFrame

I have a Pandas DataFrame:
df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
['B', -0.3, -0.4, 0.1, 0.2, -1.0],
['C', 0.1, -1.0, 4.0, -3.3, 1.0],
['D', -0.1, -1.0, -4.0, -3.3, -1.0],
['E', np.nan, np.nan, np.nan, np.nan, np.nan],
['F', 4.0, np.nan, np.nan, np.nan, np.nan]
], columns=['Group', '1', '2', '3', '4', '5'])
Group 1 2 3 4 5
0 A 0.1 2.0 1.0 0.5 0.3
1 B -0.3 -0.4 0.1 0.2 -1.0
2 C 0.1 -1.0 4.0 -3.3 1.0
3 D -0.1 -1.0 -4.0 -3.3 -1.0
4 E NaN NaN NaN NaN NaN
5 F 4.0 NaN NaN NaN NaN
For each row, I'd like to return the trend/streak of either consecutive positive/negative values going from left to right. So, the final DataFrame should be:
Group 1 2 3 4 5 Streak
0 A 0.1 2.0 1.0 0.5 0.3 5
1 B -0.3 -0.4 0.1 0.2 -1.0 -2
2 C 0.1 -1.0 4.0 -3.3 1.0 1
3 D -0.1 -1.0 -4.0 -3.3 -1.0 -5
4 E NaN NaN NaN NaN NaN 0
5 F 4.0 NaN NaN NaN NaN 1
The first row has a streak of +5 because the values are all positive going from left to right. The second row has a streak of -2 because the first two columns have negative values and the streak ends with a positive value in column 3. The third row has a streak of +1 because the second column has the opposite sign from the first column. The fourth row has a streak of -5 because all of its values are negative. The fifth row is all NaN, so its streak is zero.
This is a bit long-winded, but it seems to do everything you need:
def streak(row):
    cols = row.keys()
    n_cols = len(cols)
    neg_streak = 0
    pos_streak = 0
    i_neg_streak = n_cols
    i_pos_streak = n_cols
    for icol_1 in range(n_cols - 1):
        for icol_2 in range(icol_1, n_cols):
            # .iloc replaces the removed .ix indexer (positional slicing)
            if (row.iloc[icol_1: icol_2 + 1] < 0).all():
                streak = icol_1 - icol_2 - 1
                if streak < neg_streak:
                    neg_streak = streak
                    i_neg_streak = icol_1
            elif (row.iloc[icol_1: icol_2 + 1] > 0).all():
                streak = 1 + icol_2 - icol_1
                if streak > pos_streak:
                    pos_streak = streak
                    i_pos_streak = icol_1
    if pos_streak == abs(neg_streak):
        if i_pos_streak < i_neg_streak:
            return pos_streak
        else:
            return neg_streak
    elif pos_streak > abs(neg_streak):
        return pos_streak
    else:
        return neg_streak
df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
['B', -0.3, -0.4, 0.1, 0.2, -1.0],
['C', 0.1, -1.0, 4.0, -3.3, 1.0]
], columns=['Group', '1', '2', '3', '4', '5'])
df = df.set_index('Group')
df['Streak'] = df.apply(lambda row: streak(row), axis = 1)
df = df.reset_index()
print(df)
This did the trick and is more intuitive/vectorized
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:] # Compare values from neighboring columns
So diff looks like this:
[[ True True True True]
[ True False True False]
[False False False False]
[ True True True True]]
Then,
false_col = np.zeros((a.shape[0], 1), dtype=bool) # Create a column of False
diff = np.concatenate((diff, false_col), axis=1) # Add False column to end of diff
[[ True True True True False]
[ True False True False False]
[False False False False False]
[ True True True True False]]
Next, we look for streaks of True by looking for the first occurrence of False:
df['Streak'] = np.argmin(diff, axis=1) + 1 # Add 1 to the index get the streak
Finally, we adjust the sign of the streak value according to the sign of the first column:
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)
The final DataFrame looks like this:
Group 1 2 3 4 5 Streak
0 A 0.1 2.0 1.0 0.5 0.3 5
1 B -0.3 -0.4 0.1 0.2 -1.0 -2
2 C 0.1 -1.0 4.0 -3.3 1.0 1
3 D -0.1 -1.0 -4.0 -3.3 -1.0 -5
4 E NaN NaN NaN NaN NaN 0
5 F 4.0 NaN NaN NaN NaN 1
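For a single row, the "leading streak" rule can be sketched as a small NumPy function. This is an illustration and not the code above: the name `leading_streak` is made up, and unlike the `>= 0` test in the answer it treats NaN and exact zeros as having no sign, so either one ends a streak.

```python
import numpy as np

def leading_streak(values):
    """Signed length of the run of same-sign values at the start of the row."""
    # NaN becomes 0, so it has no sign and cannot extend a streak.
    signs = np.sign(np.nan_to_num(np.asarray(values, dtype=float)))
    first = signs[0]
    if first == 0:
        return 0
    differs = signs != first
    # Position of the first entry whose sign differs, or the full length.
    length = int(differs.argmax()) if differs.any() else len(signs)
    return int(first) * length

rows = {
    "A": [0.1, 2.0, 1.0, 0.5, 0.3],
    "B": [-0.3, -0.4, 0.1, 0.2, -1.0],
    "C": [0.1, -1.0, 4.0, -3.3, 1.0],
    "D": [-0.1, -1.0, -4.0, -3.3, -1.0],
    "E": [np.nan] * 5,
    "F": [4.0, np.nan, np.nan, np.nan, np.nan],
}
streaks = {group: leading_streak(vals) for group, vals in rows.items()}
```

Applied row by row (for example through `df.apply(..., axis=1)` on the numeric columns), this reproduces the Streak column from the question for the sample data.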
I'm assuming you want the longest streak. Can't make any promises about ties... This answer uses itertools.groupby. First, under the hood so you can see what groupby is doing:
In [4]: from itertools import groupby
b = [-0.3, -0.4, 0.1, 0.2, -1.0]
for k, g in groupby(b, key=lambda x: x > 0.0):
    print(k, list(g))
False [-0.3, -0.4]
True [0.1, 0.2]
False [-1.0]
Now wrap that in a function, taking advantage of the groupings:
def streak(dfrow):
    longest = 0
    for k, g in groupby(dfrow, key=lambda x: False if x < 0 else True if x > 0 else np.nan):
        cur_streak = len(list(g))
        if np.isnan(k):
            continue
        if k:  # group is positive
            if abs(longest) < cur_streak:
                longest = cur_streak
        else:  # group is negative
            if abs(longest) < cur_streak:
                longest = -1 * cur_streak  # multiply by -1
    return longest
Use df.apply to apply function to each row:
In [6]: df.set_index('Group',inplace=True)
df['LongestStreak'] = df.apply(streak, axis=1)
Result:
In [281]: df
Out[281]: 1 2 3 4 5 LongestStreak
Group
A 0.1 2.0 1.0 0.5 0.3 5
B -0.3 -0.4 0.1 0.2 -1.0 -2
C 0.1 -1.0 4.0 -3.3 1.0 1
EDIT
Updated to address your new DataFrame and added a benchmark. Yours probably scales better, but I don't know how to modify your code to generate the results.
Results:
%%timeit
df['LongestStreak'] = df.apply(streak, axis=1)
1000 loops, best of 3: 473 µs per loop
%%timeit
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:]
false_col = np.zeros((a.shape[0], 1), dtype=bool) # Create a column of False
diff = np.concatenate((diff, false_col), axis=1)
df['Streak'] = np.argmin(diff, axis=1) + 1
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)
100 loops, best of 3: 2.94 ms per loop

Calculating a new column in pandas

I have a dataframe of historical election results and want to calculate an additional column that applies a basic math formula for records for winning candidates and copies a value over for the rest of them.
Here is the code I tried:
va2 = va1[['contest_id', 'year', 'district', 'office', 'party_code',
'pct_vote', 'winner']].drop_duplicates()
va2['vote_waste'] = va2['winner'].map(lambda x: (-.5) + va2['pct_vote']
if x == 'w' else va2['pct_vote'])
This gave me a new column where each row contained the calculation for every row in every row.
You can use numpy.where() to achieve what you want:
import pandas as pd
import numpy as np
data = {
'winner': pd.Series(['w', 'l', 'l', 'w', 'l']),
'pct_vote': pd.Series([0.4, 0.9, 0.9, 0.4, 0.9]),
'party_code': pd.Series([10, 20, 30, 40, 50])
}
df = pd.DataFrame(data)
print(df)
party_code pct_vote winner
0 10 0.4 w
1 20 0.9 l
2 30 0.9 l
3 40 0.4 w
4 50 0.9 l
df['vote_waste'] = np.where(
df['winner'] == 'w',
df['pct_vote'] - 0.5, #if condition is true, use this value
df['pct_vote'] #if condition is false, use this value
)
print(df)
party_code pct_vote winner vote_waste
0 10 0.4 w -0.1
1 20 0.9 l 0.9
2 30 0.9 l 0.9
3 40 0.4 w -0.1
4 50 0.9 l 0.9
This is because the lambda tests a single element x but then returns the whole Series va2['pct_vote']. What you need is an element-wise operation on va2['winner'] and va2['pct_vote']. You could use apply to achieve that.
consider a as winner and b as pct_vote
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
df
Out[23]:
a b c
0 1 2 3
1 4 5 6
df['new'] = df[['a','b']].apply(lambda x: (-0.5) + x['b'] if x['a'] == 1 else x['b'], axis=1)
df
Out[42]:
a b c new
0 1 2 3 1.5
1 4 5 6 5.0
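To see why the original `map` call filled every cell with a whole column, and how an element-wise version differs, here is a small sketch. The three-row frame and the names `per_element` and `returns_whole_column` are invented for illustration. The boolean arithmetic at the end is one possible alternative to `np.where`, not the answers' method.

```python
import pandas as pd

df = pd.DataFrame({"winner": ["w", "l", "w"], "pct_vote": [0.4, 0.9, 0.6]})

# What the question's lambda does for one element: either branch returns
# the entire pct_vote column, not a single number.
per_element = (lambda x: df["pct_vote"] - 0.5 if x == "w" else df["pct_vote"])("w")
returns_whole_column = isinstance(per_element, pd.Series) and len(per_element) == len(df)

# Element-wise alternative: the comparison yields True/False per row,
# so 0.5 is subtracted only where winner == 'w'.
df["vote_waste"] = df["pct_vote"] - 0.5 * df["winner"].eq("w")
```

The subtraction works because booleans act as 1 and 0 in arithmetic, so losing rows simply keep their `pct_vote`.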
