Replication of sklearn.svm.SVR.predict(X) - Python

I'm trying to replicate scikit-learn's svm.SVR.predict(X) and don't know how to do it correctly.
The reason is that after training the SVM with an RBF kernel I would like to implement the prediction in another programming language (Java), so I need to be able to export the model's parameters and use them to predict unknown cases.
On scikit's documentation page I see that there are support_ and support_vectors_ attributes, but I don't understand how to replicate the .predict(X) method from them.
A solution of the form y_pred = f(X, svm.svr.support_, svm.svr.support_vectors_, etc.) is what I am looking for.
Thank you in advance!
Edit:
It's SVM for REGRESSION, not CLASSIFICATION!
Edit:
This is the code I am trying now, based on Calculating decision function of SVM manually, with no success:
from sklearn import svm
import numpy as np

X = [[0, 0], [1, 1], [1, 2], [1, 2]]
y = [0, 1, 1, 1]
clf = svm.SVR(gamma=1e-3)
clf.fit(X, y)

Xtest = [[0, 0]]
print('clf.decision_function:')
print(clf.decision_function(Xtest))

sup_vecs = clf.support_vectors_
dual_coefs = clf.dual_coef_
gamma = clf.gamma
intercept = clf.intercept_
diff = sup_vecs - Xtest

# Vectorized method
norm2 = np.array([np.linalg.norm(diff[n, :]) for n in range(np.shape(sup_vecs)[0])])
dec_func_vec = -1 * (dual_coefs.dot(np.exp(-gamma*(norm2**2))) - intercept)
print('decision_function replication:')
print(dec_func_vec)
The results I get from the two methods are different. Why?
clf.decision_function:
[[ 0.89500898]]
decision_function replication:
[ 0.89900498]

Thanks to the contribution of B#rmaley.exe, I found a way to replicate the SVM prediction manually. I had to replace
dec_func_vec = -1 * (dual_coefs.dot(np.exp(-gamma*(norm2**2))) - intercept)
with
dec_func_vec = (dual_coefs.dot(np.exp(-gamma*(norm2**2))) + intercept)
So, the full vectorized method is:
# Vectorized method
norm2 = np.array([np.linalg.norm(diff[n, :]) for n in range(np.shape(sup_vecs)[0])])
dec_func_vec = (dual_coefs.dot(np.exp(-gamma*(norm2**2))) + intercept)
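For completeness, here is a minimal, self-contained sketch (the toy data and parameter values are made up, not from the original post) showing how the fitted attributes support_vectors_, dual_coef_, intercept_ and gamma reproduce SVR.predict for an RBF kernel; these arrays are what you would export to Java. gamma is set to a number explicitly so clf.gamma can be used directly (with the default gamma='scale' you would first have to compute the numeric value, 1 / (n_features * X.var())).

import numpy as np
from sklearn.svm import SVR

X = np.array([[0, 0], [1, 1], [1, 2], [2, 2]], dtype=float)
y = np.array([0.0, 1.0, 1.5, 2.0])

clf = SVR(kernel='rbf', gamma=0.1, C=1.0).fit(X, y)

def manual_predict(x_new, clf):
    # RBF kernel between the query point and every support vector
    sq_dists = np.sum((clf.support_vectors_ - x_new) ** 2, axis=1)
    k = np.exp(-clf.gamma * sq_dists)
    # Weighted sum of kernel values plus the bias term
    return clf.dual_coef_[0].dot(k) + clf.intercept_[0]

x_new = np.array([0.5, 0.5])
print(clf.predict([x_new])[0])      # sklearn's prediction
print(manual_predict(x_new, clf))   # manual replication, should match closely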

Related

How to interpret the coefficients returned from a multivariate cubic regression (polynomial degree 3) when using LinearRegression().coef_?

I am trying to fit a hypersurface to a dataset which includes 2 features and 1 target variable. I processed the features with PolynomialFeatures(degree=3).fit_transform(), and then fitted those features and the target variable with a LinearRegression() model. When I use LinearRegression().coef_ to get the coefficients in order to write out the fitted function explicitly (I want the written-out function itself), 10 coefficients are returned and I don't know how to interpret them as a function. I know that for a PolynomialFeatures(degree=2) model, 6 coefficients are returned and the function looks like m[0] + x1*m[1] + x2*m[2] + (x1**2)*m[3] + (x2**2)*m[4] + x1*x2*m[5], where m is the list of coefficients returned in that order. How would I interpret the cubic one?
Here is what my code for the cubic model looks like:
from sklearn.preprocessing import PolynomialFeatures as polyF
from sklearn.linear_model import LinearRegression as linR

poly = polyF(degree=3)
x_poly = poly.fit_transform(x)
model = linR()
model.fit(x_poly, y)
model.coef_
(returns):
array([ 0.00000000e+00, -1.50603348e+01, 2.33283686e+00, 6.73172519e-01,
-1.93686431e-01, -7.30930307e-02, -9.31687047e-03, 3.48729458e-03,
1.63718406e-04, 2.26682333e-03])
So if (X1, X2) with degree 2 transforms to (1, X1, X2, X1^2, X1*X2, X2^2),
then (X1, X2) with degree 3 transforms to
(1,
X1, X2,
X1^2, X1*X2, X2^2,
X1^3, X1^2*X2, X1*X2^2, X2^3),
which is exactly the 10 coefficients you get back, in that order. You can confirm the ordering directly, as in the sketch below.
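A quick way to check the exact column order (and therefore which coefficient belongs to which term) is to ask PolynomialFeatures itself; this is a small sketch with made-up data and feature names, not the asker's dataset:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.random.rand(5, 2)              # two features, as in the question
poly = PolynomialFeatures(degree=3)
poly.fit(x)
print(poly.get_feature_names_out(['x1', 'x2']))
# ['1' 'x1' 'x2' 'x1^2' 'x1 x2' 'x2^2' 'x1^3' 'x1^2 x2' 'x1 x2^2' 'x2^3']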
I was facing the same question and developed the following code block to print the fit equation. To do so, it was necessary to set include_bias=True in PolynomialFeatures and fit_intercept=False in LinearRegression, as opposed to conventional use:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

def join_txt(text, delim='*'):
    # Join a coefficient and its term name into a single "coef*term" string
    return np.asarray(delim.join(text), dtype=object)

def polyReg():
    seed = 12341
    df = pd.read_csv("input.txt", delimiter=', ', engine='python')
    X = df[["x1", "x2", "x3"]]
    y = df["y"]
    poly = PolynomialFeatures(degree=2, include_bias=True)
    poly_X = poly.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(poly_X, y, test_size=0.5, random_state=seed)
    regression = linear_model.LinearRegression(fit_intercept=False)
    fit = regression.fit(X_train, y_train)
    variable_names = poly.get_feature_names_out(X.columns)
    variable_names = np.char.replace(variable_names.astype(str), ' ', '*')
    fit_coeffs = ["{:0.5g}".format(x) for x in fit.coef_]
    arr_list = [fit_coeffs, variable_names]
    fit_equation = np.apply_along_axis(join_txt, 0, arr_list)
    fit_equation = '+'.join(fit_equation)
    fit_equation = fit_equation.replace("*1+", "+")
    fit_equation = fit_equation.replace("+-", "-")
    print("Fit equation:")
    print(fit_equation)

Why do my PCA and sklearn's PCA give different results?

I tried to use the PCA from "Machine Learning in Action", but I found that the results it gives are not the same as those from sklearn's PCA. I don't quite understand what is going on.
Below is my code:
import numpy as np
from sklearn.decomposition import PCA

x = np.array([
    [1, 2, 3, 4, 5, 0],
    [0.6, 0.7, 0.8, 0.9, 0.10, 0],
    [110, 120, 130, 140, 150, 0]
])

def my_pca(data, dim):
    remove_mean = data - data.mean(axis=0)
    cov_data = np.cov(remove_mean, rowvar=0)
    eig_val, eig_vec = np.linalg.eig(np.mat(cov_data))
    sorted_eig_val = np.argsort(eig_val)
    eig_index = sorted_eig_val[:-(dim+1):-1]
    transfer = eig_vec[:, eig_index]
    low_dim = remove_mean * transfer
    return np.array(low_dim, dtype=float)

pca = PCA(n_components=3)
pca.fit(x)
new_x = pca.transform(x)
print("sklearn")
print(new_x)

new_x = my_pca(x, 3)
print("my")
print(new_x)
Output:
sklearn
[[-9.32494230e+01 1.46120285e+00 2.37676120e-15]
[-9.89004904e+01 -1.43283197e+00 2.98143675e-14]
[ 1.92149913e+02 -2.83708789e-02 2.81307176e-15]]
my
[[ 9.32494230e+01 -1.46120285e+00 7.39333927e-14]
[ 9.89004904e+01 1.43283197e+00 -7.01760428e-14]
[-1.92149913e+02 2.83708789e-02 1.84375626e-14]]
The issue relates to your function, in particular the part where you calculate your eigenvector and eigenvalues:
eig_val, eig_vec = np.linalg.eig(np.mat(cov_data))
It appears that scikit-learn ends up with the sign convention you get from "eigh" rather than "eig": the covariance matrix is symmetric, so np.linalg.eigh is the appropriate routine. If you change np.linalg.eig to np.linalg.eigh you should get the same results. In any case, note that your two outputs above differ only in the sign of each component, and eigenvectors (and therefore PCA projections) are only defined up to sign.
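A quick numerical check (a sketch assembled from the question's data, not part of the original answer): compute the projection with np.linalg.eigh and compare it to sklearn's, ignoring the sign of each component.

import numpy as np
from sklearn.decomposition import PCA

x = np.array([[1, 2, 3, 4, 5, 0],
              [0.6, 0.7, 0.8, 0.9, 0.10, 0],
              [110, 120, 130, 140, 150, 0]])

centered = x - x.mean(axis=0)
eig_val, eig_vec = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eig_val)[::-1][:3]      # eigh returns ascending order; take the 3 largest
my_proj = centered @ eig_vec[:, order]

skl_proj = PCA(n_components=3).fit_transform(x)

# The projections should agree component by component up to a possible sign flip
print(np.allclose(np.abs(my_proj), np.abs(skl_proj), atol=1e-6))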

roc_auc_score - Only one class present in y_true

I am doing k-fold cross-validation on an existing dataframe, and I need to get the AUC score.
The problem is that sometimes the test data contains only 0s and no 1s!
I tried using this example, but with different numbers:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 0, 0])
y_scores = np.array([1, 0, 0, 0])
roc_auc_score(y_true, y_scores)
And I get this exception:
ValueError: Only one class present in y_true. ROC AUC score is not
defined in that case.
Is there any workaround that can make it work in such cases?
You could use try-except to prevent the error:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 0, 0])
y_scores = np.array([1, 0, 0, 0])
try:
    roc_auc_score(y_true, y_scores)
except ValueError:
    pass
You could also set the roc_auc_score to zero when only one class is present, but I wouldn't do that. Your test data is presumably highly unbalanced, so I would suggest using stratified k-fold instead, so that both classes are present in every fold; a minimal sketch follows.
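A sketch of that suggestion (the dataset and model here are made up): StratifiedKFold keeps the class ratio in every fold, so each test fold contains both classes as long as each class has at least n_splits members.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    # Both classes are present in y[test_idx], so the score is always defined
    print(roc_auc_score(y[test_idx], scores))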
As the error notes, if a class is not present in the ground truth of a batch,
ROC AUC score is not defined in that case.
I'm against either throwing an exception (about what? This is the expected behaviour) or returning another metric (e.g. accuracy): the metric is not broken per se.
I wouldn't try to solve a data-imbalance "issue" with a metric "fix". It would probably be better to use another sampling strategy, if possible, or simply to join multiple batches so that the class-population requirement is satisfied.
I am facing the same problem now, and using try-except does not solve my issue. I developed the code below in order to deal with that.
import pandas as pd
import numpy as np

class KFold(object):
    def __init__(self, folds, random_state=None):
        self.folds = folds
        self.random_state = random_state

    def split(self, x, y):
        assert len(x) == len(y), 'x and y should have the same length'
        x_, y_ = pd.DataFrame(x), pd.DataFrame(y)
        y_ = y_.sample(frac=1, random_state=self.random_state)
        x_ = x_.loc[y_.index]
        # Separate the (shuffled) indices by class so both classes end up in every fold
        event_index = list(y_[y_.iloc[:, 0] == 1].index)
        non_event_index = list(y_[y_.iloc[:, 0] == 0].index)
        assert len(event_index) >= self.folds, 'number of folds should not exceed the number of event (1) rows'
        assert len(non_event_index) >= self.folds, 'number of folds should not exceed the number of non-event (0) rows'
        indexes = []
        # Distribute the non-event (class 0) indices across the folds
        step = int(np.ceil(len(non_event_index) / self.folds))
        start, end = 0, step
        while start < len(non_event_index):
            train_fold = set(non_event_index[start:end])
            valid_fold = set([k for k in non_event_index if k not in train_fold])
            indexes.append([train_fold, valid_fold])
            start, end = end, min(step + end, len(non_event_index))
        # Distribute the event (class 1) indices across the same folds
        step = int(np.ceil(len(event_index) / self.folds))
        start, end, i = 0, step, 0
        while start < len(event_index):
            train_fold = set(event_index[start:end])
            valid_fold = set([k for k in event_index if k not in train_fold])
            indexes[i][0] = list(indexes[i][0].union(train_fold))
            indexes[i][1] = list(indexes[i][1].union(valid_fold))
            indexes[i] = tuple(indexes[i])
            start, end, i = end, min(step + end, len(event_index)), i + 1
        return indexes
I just wrote this code and have not tested it exhaustively; it has only been tried with binary categories. I hope it is still useful.
You can increase the batch size, e.g. from 32 to 64, or use StratifiedKFold or StratifiedShuffleSplit. If the error still occurs, try shuffling your data, e.g. in your DataLoader.
Simply changing one of the 0s in y_true to a 1 makes it work:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 1, 0, 0])
y_scores = np.array([1, 0, 0, 0])
roc_auc_score(y_true, y_scores)
As the error message suggests, y_true contains only one class (all zeros); you need both classes to be present in y_true.

Canonical Discriminant Function in Python sklearn

I am learning about Linear Discriminant Analysis and am using the scikit-learn module. I am confused by the "coef_" attribute from the LinearDiscriminantAnalysis class. As far as I understand, these are the discriminant function coefficients (sklearn calls them weight vectors). Since there should be (n_classes-1) discriminant functions, I would expect the coef_ attribute to be an array with shape (n_components, n_features), but instead it prints an (n_classes, n_features) array. Below is an example of this using the Iris dataset example from sklearn. Since there are 3 classes and 2 components, I would expect print(lda.coef_) to give me a 2x4 array instead of a 3x4 array...
Maybe I'm misinterpreting what the weight vectors are; perhaps they are the coefficients of the classification function?
And how do I get the coefficients for each variable in each discriminant/canonical function?
Code here:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

lda = LinearDiscriminantAnalysis(n_components=2, store_covariance=True)
X_r = lda.fit(X, y).transform(X)

plt.figure()
colors = ['navy', 'turquoise', 'darkorange']  # not defined in the original snippet; any three colors will do
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], alpha=.8, color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('Function 1 (%.2f%%)' % (lda.explained_variance_ratio_[0]*100))
plt.ylabel('Function 2 (%.2f%%)' % (lda.explained_variance_ratio_[1]*100))
plt.title('LDA of IRIS dataset')
print(lda.coef_)
#output -> [[ 6.24621637 12.24610757 -16.83743427 -21.13723331]
# [ -1.51666857 -4.36791652 4.64982565 3.18640594]
# [ -4.72954779 -7.87819105 12.18760862 17.95082737]]
You can calculate the coefficients with the following code:
import numpy as np
import pandas as pd

def LDA_coefficients(X, lda):
    nb_col = X.shape[1]
    matrix = np.zeros((nb_col+1, nb_col), dtype=int)
    Z = pd.DataFrame(data=matrix, columns=X.columns)
    # One row per feature with a 1 in that feature's position, plus an all-zero row for the constant
    for j in range(0, nb_col):
        Z.iloc[j, j] = 1
    LD = lda.transform(Z)
    results = pd.DataFrame()
    index = ['const']
    for j in range(0, LD.shape[0]-1):
        index = np.append(index, 'C'+str(j+1))
    for i in range(0, LD.shape[1]):
        coef = [LD[-1][i]]
        for j in range(0, LD.shape[0]-1):
            coef = np.append(coef, LD[j][i]-LD[-1][i])
        result = pd.Series(coef)
        result.index = index
        column_name = 'LD' + str(i+1)
        results[column_name] = result
    return results
Before calling this function you need to complete the linear discriminant analysis:
lda = LinearDiscriminantAnalysis()
lda.fit(X,y)
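For example, on the Iris data from the question (a hypothetical usage sketch; the helper expects a pandas DataFrame because it reads X.columns):

import pandas as pd
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# One column per discriminant function (LD1, LD2); rows are the constant plus
# one coefficient per feature (C1..C4), as built by LDA_coefficients above.
print(LDA_coefficients(X, lda))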

Ridge Regression: Scikit-learn vs. direct calculation does not match for alpha > 0

In Ridge Regression, we are solving Ax=b with L2 Regularization. The direct calculation is given by:
x = (A^T A + alpha * I)^(-1) A^T b
I have looked at the scikit-learn code and they do implement the same calculation. But I can't seem to get the same results for alpha > 0.
Here is minimal code to reproduce this:
import numpy as np
A = np.asmatrix(np.c_[np.ones((10,1)),np.random.rand(10,3)])
b = np.asmatrix(np.random.rand(10,1))
I = np.identity(A.shape[1])
alpha = 1
x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print(x.T)
>>> [[ 0.37371021 0.19558433 0.06065241 0.17030177]]
from sklearn.linear_model import Ridge
model = Ridge(alpha = alpha).fit(A[:,1:],b)
print(np.c_[model.intercept_, model.coef_])
>>> [[ 0.61241566 0.02727579 -0.06363385 0.05303027]]
Any suggestions on what I can do to resolve this discrepancy?
This modification seems to yield the same result for the direct calculation and the scikit-learn version:
import numpy as np
from sklearn.linear_model import Ridge

A = np.asmatrix(np.random.rand(10, 3))
b = np.asmatrix(np.random.rand(10, 1))
I = np.identity(A.shape[1])
alpha = 1

x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print(x.T)

model = Ridge(alpha=alpha, tol=0.1, fit_intercept=False).fit(A, b)
print(model.coef_)
print(model.intercept_)
It seems the main reason for the difference is that the Ridge class has the parameter fit_intercept=True by default (inherited from _BaseRidge).
This applies a data-centering procedure before passing the matrices to the _solve_cholesky function.
Here are the lines in ridge.py that do it:
X, y, X_mean, y_mean, X_std = self._center_data(
    X, y, self.fit_intercept, self.normalize, self.copy_X,
    sample_weight=sample_weight)
Also, it seems you were trying to implicitly account for the intercept by adding the column of 1's. As you can see, this is not necessary if you specify fit_intercept=False.
Appendix: Does the Ridge class actually implement the direct formula?
It depends on the choice of the solver parameter.
Effectively, if you do not specify the solver parameter in Ridge, it defaults to solver='auto' (which internally resorts to solver='cholesky'). This should be equivalent to the direct computation.
Rigorously, _solve_cholesky uses numpy.linalg.solve instead of numpy.linalg.inv. But it can easily be checked that
np.linalg.solve(A.T*A + alpha * I, A.T*b)
yields the same as
np.linalg.inv(A.T*A + alpha * I)*A.T*b
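A quick numerical check of that claim (toy data with no special meaning): solving the regularized normal equations with np.linalg.solve gives the same coefficients as forming the inverse explicitly, up to floating-point error.

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((10, 3))
b = rng.random(10)
alpha = 1.0
I = np.identity(A.shape[1])

x_solve = np.linalg.solve(A.T @ A + alpha * I, A.T @ b)
x_inv = np.linalg.inv(A.T @ A + alpha * I) @ A.T @ b
print(np.allclose(x_solve, x_inv))  # expected: True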
