Sklearn SVM vs Matlab SVM - python

Problem: I need to train a classifier (in matlab) to classify multiple levels of signal noise.
So i trained a multi class SVM in matlab using the fitcecoc and obtained an accuracy of 92%.
Then i trained a multiclass SVM using sklearn.svm.svc in python, but it seems that however i fiddle with the parameters, i cannot achieve more than 69% accuracy.
30% of the data was held back and used to verify the training. the confusion matrixes can be seen below.
Matlab confusion matrix
Python confusion matrix
So if anyone has some experience or suggestions with svm.svc multiclass training and can see a problem in my code, or has a suggestion it would be greatly appreciated.
Python code:
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
#from sklearn import preprocessing
#### SET fitting parameters here
C = 100
gamma = 1e-8
#### SET WEIGHTS HERE
C0_Weight = 1*C
C1_weight = 1*C
C2_weight = 1*C
C3_weight = 1*C
C4_weight = 1*C
#####
X = np.genfromtxt('data/features.csv', delimiter=',')
Y = np.genfromtxt('data/targets.csv', delimiter=',')
print 'feature data is of size: ' + str(X.shape)
print 'target data is of size: ' + str(Y.shape)
# SPLIT X AND Y INTO TRAINING AND TEST SET
test_size = 0.3
X_train, x_test, Y_train, y_test = train_test_split(X, Y,
... test_size=test_size, random_state=0)
svc = svm.SVC(C=C,kernel='rbf', gamma=gamma, class_weight = {0:C0_Weight,
... 1:C1_weight, 2:C2_weight, 3:C3_weight, 4:C4_weight},cache_size = 1000)
svc.fit(X_train, Y_train)
scores = cross_val_score(svc, X_train, Y_train, cv=10)
print scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Out = svc.predict(x_test)
np.savetxt("data/testPredictions.csv", Out, delimiter=",")
np.savetxt("data/testTargets.csv", y_test, delimiter=",")
# calculate accuracy in test data
Hits = 0
HitsOverlap = 0
for idx, val in enumerate(Out):
Hits += int(y_test[idx]==Out[idx])
HitsOverlap += int(y_test[idx]==Out[idx]) + int(y_test[idx]==
... (Out[idx]-1)) + int(y_test[idx]==(Out[idx]+1))
print "Accuracy in testset: ", Hits*100/(11595*test_size)
print "Accuracy in testset w. overlap: ", HitsOverlap*100/(11595*test_size)
to those curious how i got the parameters, they were found with GridSearchCV (and increased the accuracy from 40% to 69)
Any help or suggestions is greatly appreciated.

After much pulling my hair, the answer was found here: http://neerajkumar.org/writings/svm/
when the inputs were scaled with StandardScaler(), svm.svc now produces superior results to matlab!!

Related

cross validation for split test and train datasets

Unlike standart data, I have dataset contain separetly as train, test1 and test2. I implemented ML algorithms and got performance metrics. But when i apply cross validation, it's getting complicated.. May be someone help me.. Thank you..
It's my code..
train = pd.read_csv('train-alldata.csv',sep=";")
test = pd.read_csv('test1-alldata.csv',sep=";")
test2 = pd.read_csv('test2-alldata.csv',sep=";")
X_train = train_pca_son.drop('churn_yn',axis=1)
y_train = train_pca_son['churn_yn']
X_test = test_pca_son.drop('churn_yn',axis=1)
y_test = test_pca_son['churn_yn']
X_test_2 = test2_pca_son.drop('churn_yn',axis=1)
y_test_2 = test2_pca_son['churn_yn']
For example, KNN Classifier.
knn_classifier = KNeighborsClassifier(n_neighbors =7,metric='euclidean')
knn_classifier.fit(X_train, y_train)
For K-Fold.
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score
dtc = DecisionTreeClassifier(random_state=42)
k_folds = KFold(n_splits = 5)
scores = cross_val_score(dtc, X, y, cv = k_folds)
print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
This is a variation on the "holdout test data" pattern (see also: Wikipedia: Training, Validation, Test / Confusion in terminology). For churn prediction: this may arise if you have two types of customers, or are evaluating on two time frames.
X_train, y_train ← perform training and hyperparameter tuning with this
X_test1, y_test1 ← test on this
X_test2, y_test2 ← test on this as well
Cross validation estimates holdout error using the training data—it may come up if you estimate hyperparameters with GridSearchCV. Final evaluation involves estimating performance on two test sets, separately or averaged over the two:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
X_test1, X_test2, y_test1, y_test2 = train_test_split(X_test, y_test, test_size=.5)
print(y_train.shape, y_test1.shape, y_test2.shape)
# (600,) (200,) (200,)
clf = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
print(f1_score(y_test1, clf.predict(X_test1)))
print(f1_score(y_test2, clf.predict(X_test2)))
# 0.819
# 0.805

I could not get the idea about difference between cross_val_score and accuracy_score

I am trying to understand cross validation score and accuracy score. I got accuracy score = 0.79 and cross validation score = 0.73. As I know, these scores should have been very close to each other. What can I say about my model by just looking at these scores ?
sonar_x = df_2.iloc[:,0:61].values.astype(int)
sonar_y = df_2.iloc[:,62:].values.ravel().astype(int)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split,KFold,cross_val_score
from sklearn.ensemble import RandomForestClassifier
x_train,x_test,y_train,y_test=train_test_split(sonar_x,sonar_y,test_size=0.33,random_state=0)
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
folds = KFold(n_splits = 10, shuffle = False, random_state = 0)
scores = []
for n_fold, (train_index, valid_index) in enumerate(folds.split(sonar_x,sonar_y)):
print('\n Fold '+ str(n_fold+1 ) +
' \n\n train ids :' + str(train_index) +
' \n\n validation ids :' + str(valid_index))
x_train, x_valid = sonar_x[train_index], sonar_x[valid_index]
y_train, y_valid = sonar_y[train_index], sonar_y[valid_index]
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
acc_score = accuracy_score(y_test, y_pred)
scores.append(acc_score)
print('\n Accuracy score for Fold ' +str(n_fold+1) + ' --> ' + str(acc_score)+'\n')
print(scores)
print('Avg. accuracy score :' + str(np.mean(scores)))
##Cross validation score
scores = cross_val_score(rf, sonar_x, sonar_y, cv=10)
print(scores.mean())
You have a bug in your code that accounts for the gap.
You are training over a folded set of trains, but evaluating against a fixed test.
These two lines in the for loop:
y_pred = rf.predict(x_test)
acc_score = accuracy_score(y_test, y_pred)
Should be:
y_pred = rf.predict(x_valid)
acc_score = accuracy_score(y_pred , y_valid)
Since in your hand-written Cross-validation you are evaluating against a fixed x_test and y_test, for some folds there is a leakage that accounts for the overly optimistic result in the overall average.
If you correct this, the values should come closer, because you are doing the same, conceptually speaking, as cross_val_score does.
They might not match exactly though, due to randomness and the size of your dataset, which is quite small.
Finally, if you just wanted to the get one test score, then the KFold part is not needed and you can do:
x_train,x_test,y_train,y_test=train_test_split(sonar_x,sonar_y,test_size=0.33,random_state=0)
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
acc_score = accuracy_score(y_test, y_pred)
This result is less robust than the cross-validated results, since you are splitting the dataset just once and therefore you can get better or worse results by chance, this is, depending on the difficulty of the train-test split that the random seed generated.

How can accuracy differs between one_hot_encode and count_vectorizer for the same dataset?

onehot_enc, BernoulliNB:
Here, I have used two different files for reviews and labels and I've used "train_test_split" to randomly split the data into 80% train data and 20% test data.
reviews.txt:
Colors & clarity is superb
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung
The picture is clear and beautiful
Picture is not clear
labels.txt:
positive
negative
positive
negative
My Code:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
with open("/Users/abc/reviews.txt") as f:
reviews = f.read().split("\n")
with open("/Users/abc/labels.txt") as f:
labels = f.read().split("\n")
reviews_tokens = [review.split() for review in reviews]
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)
X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=1)
bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)
score = bnbc.score(onehot_enc.transform(X_test), y_test)
print("score of Naive Bayes algo is :" , score) // 90%
predicted_y = bnbc.predict(onehot_enc.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
precision_score = tp / (tp + fp)
recall_score = tp / (tp + fn)
print("precision_score :" , precision_score) //92%
print("recall_score :" , recall_score) //97%
CountVectorizer, MultinomialNB:
Here, I've manually split the same data into train (80%) and test(20%).And I'm supplying these two csv files to the algorithm.
But, this is giving less accuracy compared to the above method. Can anyone help me out regarding the same ...
train_data.csv:
review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative
test_data.csv:
review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
My Code:
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
def load_data(filename):
reviews = list()
labels = list()
with open(filename) as file:
file.readline()
for line in file:
line = line.strip().split(',')
labels.append(line[1])
reviews.append(line[0])
return reviews, labels
X_train, y_train = load_data('/Users/abc/Sep_10/train_data.csv')
X_test, y_test = load_data('/Users/abc/Sep_10/test_data.csv')
vec = CountVectorizer()
X_train_transformed = vec.fit_transform(X_train)
X_test_transformed = vec.transform(X_test)
clf= MultinomialNB()
clf.fit(X_train_transformed, y_train)
score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is :" , score) // 46%
y_pred = clf.predict(X_test_transformed)
print(confusion_matrix(y_test,y_pred))
print("Precision Score : ",precision_score(y_test, y_pred,average='micro'))//46%
print("Precision Score : ",recall_score(y_test, y_pred,average='micro')) // 46%
The issue here is that you are using MultiLabelBinarizer to:
onehot_enc.fit(reviews_tokens)`
before splitting into train and test, and test data is leaked to the model and hence higher accuracy.
On the other hand, when you use CountVectorizer is only seeing the trained data and then ignoring the words that dont appear in the trained data, which may be valuable to model for classification.
So depending on the quantity of your data, this can make a huge difference. Anyways, your second technique (using CountVectorizer) is correct and should be used in case of text data. MultiLabelBinarizer and one-hot encoding in general should be used only for categorical data, not text data.
Can you share your complete data?

How to run through loop to use non-scaled and scaled data in python for loop

I have the following code running through and fitting a model on the iris data using different modeling techniques. How can I add a second step in this process so I can demonstrate the improvement between using scaled and non-scaled data?
I don't need to run the scale transform outside of the loop, i was just having a lot of issues with transforming the data type from pandas dataframe to np array and back again.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
sc = StandardScaler()
X_train_scale = sc.fit_transform(X_train)
X_test_scale = sc.transform(X_test)
numFolds = 10
kf = KFold(len(y_train), numFolds, shuffle=True)
# These are "Class objects". For each Class, find the AUC through
# 10 fold cross validation.
Models = [LogisticRegression, svm.SVC]
params = [{},{}]
for param, Model in zip(params, Models):
total = 0
for train_indices, test_indices in kf:
train_X = X_train[train_indices]; train_Y = y_train[train_indices]
test_X = X_train[test_indices]; test_Y = y_train[test_indices]
reg = Model(**param)
reg.fit(train_X, train_Y)
predictions = reg.predict(test_X)
total += accuracy_score(test_Y, predictions)
accuracy = total / numFolds
print ("CV accuracy score of {0}: {1}".format(Model.__name__, round(accuracy, 6)))
So ideally my output would be:
CV standard accuracy score of LogisticRegression: 0.683333
CV scaled accuracy score of LogisticRegression: 0.766667
CV standard accuracy score of SVC: 0.766667
CV scaled accuracy score of SVC: 0.783333
It seems like this is unclear, I am trying to loop through scaled and unscaled data, similar to how I am looping through the different ML algorithms.
I wanted to follow up with this. I was able to do this by creating a pipeline and using gridsearchCV
pipe = Pipeline([('scale', StandardScaler()),
('clf', LogisticRegression())])
param_grid = [{
'scale':[None,StandardScaler()],
'clf':[SVC(),LogisticRegression()]}]
grid_search = GridSearchCV(pipe, param_grid=param_grid,n_jobs=-1, verbose=1 )
In the end this got me the results I wanted and was able to test easily how to work between scaling and not scaling.
try this:
from __future__ import division
for param, Model in zip(params, Models):
total = 0
for train_indices, test_indices in kf:
train_X = X_train[train_indices]; train_Y = y_train[train_indices]
test_X = X_train[test_indices]; test_Y = y_train[test_indices]
reg = Model(**param)
reg.fit(train_X, train_Y)
predictions = reg.predict(test_X)
total += accuracy_score(test_Y, predictions)
accuracy = total / numFolds
print ("CV accuracy score of {0}: {1}".format(Model.__name__, round(accuracy, 6)))
# added to your code
if previous_accuracy:
improvement = 1 - (accuracy / previous_accuracy)
print "CV accuracy score improved by", improvement
else:
previous_accuracy = accuracy

Large mean squared error in sklearn regressors

I'm a beginner in machine learning and I want to build a model to predict the price of houses. I prepared a dataset by crawling a local housing website and it consists 1000 samples and only 4 features (latitude, longitude, area and number of rooms).
I tried RandomForestRegressor and LinearSVR models in sklearn, but I can't train the model properly and the MSE is super high.
MSE almost equals 90,000,000 (the true values of prices' range are between 5,000,000 - 900,000,000)
Here is my code:
import numpy as np
from sklearn.svm import LinearSVR
import pandas as pd
import csv
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
df = pd.read_csv('dataset.csv', index_col=False)
X = df.drop('price', axis=1)
X_data = X.values
Y_data = df.price.values
X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=0.2, random_state=5)
rgr = RandomForestRegressor(n_estimators=100)
svr = LinearSVR()
rgr.fit(X_train, Y_train)
svr.fit(X_train, Y_train)
MSEs = cross_val_score(estimator=rgr,
X=X_train,
y=Y_train,
scoring='mean_squared_error',
cv=5)
MSEsSVR = cross_val_score(estimator=svr,
X=X_train,
y=Y_train,
scoring='mean_squared_error',
cv=5)
MSEs *= -1
RMSEs = np.sqrt(MSEs)
print("Root mean squared error with 95% confidence interval:")
print("{:.3f} (+/- {:.3f})".format(RMSEs.mean(), RMSEs.std()*2))
print("")
Is the problem with my dataset and count of features? How can I build a prediction model with this type of dataset?

Categories

Resources