I'm porting some matlab/octave scripts for support vector machines (SVMs) to python, but I'm getting poor accuracy in one of the two scripts when using sklearn.
ex6_spam.py loads some data and trains a spam-detecting model.
In matlab, the SVM code provided, svmTrain.m (see below for snippets), gives me ~99% accuracy on both the training and the test sets.
In python, sklearn.svm.SVC().fit() gives me ~56% if I just use the linear kernel, and ~44% if I precompute the Gram matrix for a linear kernel. (The data and code - ex6_spam.py - are here.)
The odd thing, too, is that the exact same piece of code used in ex6.py gives me proper classification of 2D data points. Its behavior there is almost identical to the matlab/octave script.
I'm not doing much in ex6_spam.py - I load a training set:
mat = scipy.io.loadmat('spamTrain.mat')
X = mat["X"]
y = mat["y"]
I feed it to sklearn.svm.SVC().fit():
C = 0.1
model = svmt.svmTrain(X, y, C, "linear")
# this results in
# clf = svm.SVC(C = C, kernel=kernelFunction, tol=tol, max_iter=max_passes, verbose=2)
# return clf.fit(X, y)
and then I make a prediction:
p = model.predict(X)
The matlab/octave equivalent is
load('spamTrain.mat');
C = 0.1;
model = svmTrain(X, y, C, @linearKernel); % see the link to svmTrain.m above
p = svmPredict(model, X);
However, the results are wildly different. Any ideas why? I haven't had the chance to run it on a different computer yet, but could that be a possible reason?
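For reference, the sklearn flow I'm describing boils down to roughly this minimal sketch (assuming spamTrain.mat is in the working directory; scipy.io.loadmat returns y as an (m, 1) column vector, so I flatten it before fitting):
import scipy.io
import numpy as np
from sklearn import svm

mat = scipy.io.loadmat('spamTrain.mat')
X = mat["X"]
y = mat["y"].ravel()   # flatten the (m, 1) column vector to shape (m,)

clf = svm.SVC(C=0.1, kernel="linear")
clf.fit(X, y)

p = clf.predict(X)
print("training accuracy: {:.1f}%".format(np.mean(p == y) * 100))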
I'm working with an SVM model to classify 5 different classes (N1, N2, N3, W, R).
Feature extraction -> Data normalization -> train SVM
When I tested the model (the usual 80/20 train-test split), it showed high accuracy.
But when I tried testing on a completely new dataset, with the same pipeline of
Feature extraction -> Data normalization -> test on the trained SVM model
it came out really badly.
Let's say the original dataset used in training is A, and the new test dataset is B.
When I trained the model only with A and tested on B, it came out really badly.
At first I thought the model was overfitting, so I included both A and B in training and tested with B. It came out badly again...
I think the problem is the normalization process. It eventually worked when I tried a new dataset C, but this time I took the training data A, concatenated A+C before normalizing, and then cut only the C part back out. When I compared that with C normalized on its own, the values were different.
I used MinMaxScaler from sklearn.
Mathematically speaking, of course it's different, because every dataset has different minimum and maximum values, so the normalized data will differ when it is mixed with other data.
My question is: when you test on a new dataset, is it normal to bring in the training dataset, normalize them together, and then take out only the test part? E.g. mixing A (112x12) and B (15x12) -> normalizing the combined (127x12) -> taking out the (15x12) part.
Or should I instead fix the code, starting from the feature extraction and SVM training?
(I attached the code; each feature vector has shape 12x1, which means each stage is a 12xN matrix.)
import pandas as pd
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
# Load training data
N1_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N1_features")
N2_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N2_features")
N3_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N3_features")
W_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_W_features")
R_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_R_features")
# Load test data
N1_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N1_features")
N2_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N2_features")
N3_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N3_features")
W_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_W_features")
R_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_R_features")
# normalize with original raw features and take only test out
N1_scaled_test = features.normalize_together(N1_test, N1_train, "N1")
N2_scaled_test = features.normalize_together(N2_test, N2_train, "N2")
N3_scaled_test = features.normalize_together(N3_test, N3_train, "N3")
W_scaled_test = features.normalize_together(W_test, W_train, "W")
R_scaled_test = features.normalize_together(R_test, R_train, "R")
def normalize_together(test, raw, stage_no):
    # concatenate the test and raw training features, fit MinMaxScaler on the
    # combined data, then keep only the (rescaled) test rows
    together = pd.concat([test, raw], ignore_index=True)
    scaled_test = pd.DataFrame(scaler.fit_transform(together.iloc[:, :-1]))
    scaled_test['label'] = "{}".format(stage_no)
    scaled_test = scaled_test.iloc[0:test.shape[0], :]
    return scaled_test
Test data should remain unseen during training (this includes preprocessing) - don't use both the test and train data to compute a common normalisation factor. Fit the normalisation on the training set, then transform the test set with that same fitted scaler.
Why? It's vital to use an unseen test partition to evaluate your trained model. Otherwise you have not tested the ability of your model to generalise - imagine playing a game of cards where you already have prior knowledge of the cards or the order of the deck.
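As an illustrative sketch (not the asker's exact pipeline - the DataFrame layout with the label in the last column is assumed), one common way to do this with MinMaxScaler is to fit it on the training features only and reuse the fitted scaler for the test features:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def normalize_train_test(train_df, test_df):
    # fit the scaler on the training features only (all columns except the label)
    scaler = MinMaxScaler()
    train_scaled = pd.DataFrame(scaler.fit_transform(train_df.iloc[:, :-1]))
    # reuse the min/max learned from training to transform the test features
    test_scaled = pd.DataFrame(scaler.transform(test_df.iloc[:, :-1]))
    train_scaled['label'] = train_df.iloc[:, -1].values
    test_scaled['label'] = test_df.iloc[:, -1].values
    return train_scaled, test_scaled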
I'm trying to write my own logistic regressor (using batch/mini-batch gradient descent) for practice purposes.
I generated a random dataset (see below) with normally distributed inputs, and the output is binary (0, 1). I manually chose coefficients for the inputs and was hoping to be able to recover them (see below for the code snippet). However, to my surprise, neither my own code nor sklearn LogisticRegression was able to reproduce the actual numbers (although the sign and order of magnitude are in line). Moreover, the coefficients my algorithm produced are different from the ones produced by sklearn.
Am I misinterpreting what the coefficients for a logistic regression are?
I will appreciate any insight into this discrepancy.
Thank you!
edit: I tried using statsmodels Logit and got yet a third set of slightly different values for the coefficients
Some more info that might be relevant:
I wrote a linear regressor using an almost identical code and it worked perfectly, so I am fairly confident this is not a problem in the code. Also my regressor actually outperformed the sklearn one on the training set, and they have the exact same accuracy on the test set, so I have no reason to believe the regressors are wrong.
Code snippets for the generation of the dataset:
import numpy as np

# size, x (an array with at least 3 columns) and sigmoid are defined elsewhere in the script
o1 = 2
o2 = -3
x[:,1] = np.random.rand(size)*2
x[:,2] = np.random.rand(size)*3
y = np.vectorize(sigmoid)(x[:,1]*o1 + x[:,2]*o2 + np.random.normal(size=size))
so as can be seen, input coefficients are +2 and -3 (intercept 0);
sklearn coefficients were ~2.8 and ~-4.8;
my coefficients were ~1.7 and ~-2.6
and here are the most relevant parts of the regressor:
# mini-batch update; dc accumulates the gradient over the batch
# (dc is reset to zero and the sample index i is advanced elsewhere, not shown)
for j in range(bin_size):
    xs = x[i]
    y_real = y[i]
    z = np.dot(self.coeff, xs)
    h = sigmoid(z)
    dc += (h - y_real) * xs
self.coeff -= dc * (learning_rate / n)
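For comparison, here is a minimal, self-contained sketch of full-batch gradient descent for logistic regression (not my regressor class; the labels here are generated as 0/1 Bernoulli draws from sigmoid(x·w)):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 100_000
X = np.column_stack([np.ones(n),              # intercept column
                     rng.random(n) * 2,       # feature 1
                     rng.random(n) * 3])      # feature 2
w_true = np.array([0.0, 2.0, -3.0])
y = rng.binomial(1, sigmoid(X @ w_true))      # 0/1 labels drawn from the model

w = np.zeros(3)
lr = 0.5
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ w) - y) / n     # gradient of the mean log-loss
    w -= lr * grad

print(w)    # should be close to [0, 2, -3]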
What intercept was learned? It really should not be a surprise, as your y is a polynomial of 3rd degree, while your model has only two coefficients; three coefficients plus a y-intercept would be needed to model the response variable from the predictors.
Furthermore, the values may differ simply because of SGD, for example.
Not really sure, but different coefficients could still return the correct y for a finite set of points. What are the metrics on each model? Do they differ?
I'm trying to evaluate the influence of the number of features and of the SVM regularization parameter C on the prediction time. I am using a modified version of the code proposed on the scikit-learn website.
Here are some key lines of code:
input
'n_train': int(2000),
'n_test': int( 500),
'n_features': np.arange(10,100,10)
Functions
SVC(kernel='linear', C=0.001)
SVC(kernel='linear', C=0.01)
SVC(kernel='linear', C=1)
SVC(kernel='linear', C=100)
predictions
estimator.fit(X_train, y_train)
....
start = time.time()
estimator.predict(X_test)
runtimes[i] = time.time() - start
Output: Evolution of Prediction Time (linked plot)
I don't understand why the trend in prediction time is the opposite of what I expected. According to many resources (3 and others), the latency should increase with the C parameter of the SVM.
Having a larger C will lead to smaller values for the slack variables. This means that the number of support vectors will decrease. When you run a prediction, the kernel function has to be evaluated for each support vector.
Thus: smaller C -> more support vectors -> more kernel evaluations -> slower predictions.
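A quick way to check this is to count the support vectors each fitted model keeps and time the predictions; a rough sketch on synthetic data (not the asker's exact benchmark):
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2500, n_features=50, random_state=0)
X_train, y_train = X[:2000], y[:2000]
X_test = X[2000:]

for C in [0.001, 0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C).fit(X_train, y_train)
    start = time.time()
    clf.predict(X_test)
    elapsed = time.time() - start
    # n_support_ holds the number of support vectors per class
    print(f"C={C}: support vectors={clf.n_support_.sum()}, predict time={elapsed:.4f}s")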
Hi, I have a dataframe test, and I am trying to make predictions using a Gaussian HMM with hmmlearn.
When I do this:
y = model.predict(test)
y
the HMM works fine and produces an array of states.
However, if I do this:
for i in range(0, len(test)):
    y = model.predict(test[:i])
all I get is y being set to 1.
Can anyone help?
UPDATE
Here is the code that does work when iterating through.
The training set was 0-249:
for i in range(251, len(X)):
    test = X[:i]
    y = model.predict(test)
    print(y[-1])
An HMM models sequences of observations. If you feed a single observation into predict (which does Viterbi decoding by default), the prediction essentially reduces to
(model.startprob_ * model.predict_proba(test[i:i + 1])).argmax()
which can be dominated by startprob_, e.g. if startprob = [10**-8, 1 - 10**-8]. This could explain the all-ones behaviour you're seeing.
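As a rough sketch of the difference on synthetic data (not your dataframe), compare decoding a longer prefix with decoding a single frame:
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.RandomState(0)
# two well-separated clusters of 1-D observations
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 1, 200)]).reshape(-1, 1)

model = GaussianHMM(n_components=2, n_iter=50, random_state=0).fit(X)

print(model.predict(X[:100]))    # decoding a whole prefix uses the transition structure
print(model.predict(X[50:51]))   # a single frame is decoded mostly from startprob_ and the emission
print(model.startprob_)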
I apply the random forest algorithm in three different programming languages to the same pseudo sample dataset (1000 obs, binary 1/0 dependent variable, 10 numeric explanatory variables):
Matlab 2015a (same for 2012a) using the "Treebagger" command (part of the Statistics and Machine Learning Toolbox)
R using the "randomForest" package: https://cran.r-project.org/web/packages/randomForest/index.html
Python using the "RandomForestClassifier" from sklearn.ensemble: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
I also try to keep all model parameters identical across the programming languages (number of trees, bootstrap sampling of the whole sample, number of variables randomly sampled as candidates at each split, criterion to measure the quality of a split).
While Matlab and Python produce basically the same results (i.e. probabilities), the R results are very different.
What could be the possible reason for the difference between the results produced by R on the one hand, and by Matlab & Python on the other?
I guess there's some default model parameter that differs in R which I'm not aware of or which is hard-coded in the underlying randomForest package.
The exact code I ran looks as follows:
Matlab:
b = TreeBagger(1000,X,Y, 'FBoot',1, 'NVarToSample',4, 'MinLeaf',1, 'Method', 'classification','Splitcriterion', 'gdi')
[~,scores,~] = predict(b,X);
Python:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1000, max_features=4, bootstrap=True)
scores_fit = clf.fit(X, Y)
scores = pd.DataFrame(clf.predict_proba(X))
R:
library(randomForest)
results.rf <- randomForest(X, Y, ntree=1000, type = "classification", sampsize = length(Y), replace=TRUE, mtry=4)
scores <- predict(results.rf, type="prob",
norm.votes=FALSE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
When you call predict on a randomForest object in R without providing a dataset, it returns the out-of-bag predictions. In your other methods, you are passing in the training data again. I suspect that if you do this in the R version, your probabilities will be similar:
scores <- predict(results.rf, X, type="prob",
norm.votes=FALSE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
Also note that if you want unbiased probabilities, the R approach of returning OOB predictions is the better approach when predicting on the training data.
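For what it's worth, scikit-learn can produce an OOB analogue of the R behaviour via oob_score=True; a small sketch reusing the X and Y from the question:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1000, max_features=4,
                             bootstrap=True, oob_score=True)
clf.fit(X, Y)

# out-of-bag class probabilities, analogous to R's default predict(results.rf, type="prob")
oob_scores = pd.DataFrame(clf.oob_decision_function_, columns=clf.classes_)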