I am trying to classify images using sklearn's svm.SVC classifier, but it's not learning: after training I get 0.1 accuracy (there are 10 classes, so 0.1 accuracy is the same as random guessing).
I am using the CIFAR-10 dataset: 10000 images, each represented as 3072 uint8 values. The first 1024 are the red pixels, the second 1024 are the green pixels, and the third 1024 are the blue pixels.
Each image also has a label, which is a number from 0 to 9.
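(For reference, that layout means each 3072-value row unpacks into three 32x32 colour planes. A minimal sketch of viewing a batch as images, with the array shape assumed from the description above:)
import numpy as np
# stand-in for one batch: 10000 rows of 1024 red + 1024 green + 1024 blue values
X = np.zeros((10000, 3072), dtype=np.uint8)
# (N, 3072) -> (N, 3, 32, 32) -> (N, 32, 32, 3), i.e. one HxWxC image per row
images = X.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
print(images.shape)  # (10000, 32, 32, 3)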
Here is my code:
import numpy as np
from sklearn import svm
import pickle
import joblib  # sklearn.externals.joblib is deprecated; use the joblib package directly
# Load one CIFAR-10 training batch and the test batch (pickled dicts)
train_data = pickle.load(open('data_batch_1', 'rb'), encoding='latin1')
test_data = pickle.load(open('test_batch', 'rb'), encoding='latin1')
X_train = np.array(train_data['data'])
y_train = np.array(train_data['labels'])
X_test = np.array(test_data['data'])
y_test = np.array(test_data['labels'])
# Default SVC: RBF kernel on raw 0-255 pixel values
clf = svm.SVC(verbose=True)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
# Save the trained model
joblib.dump(clf, 'Cifar-10-clf.pickle')
print(accuracy)
Does anyone know what my problem could be, or can anyone point me to resources for solving this?
I'm not sure, but I think you need to tune the parameters of SVC.
I tested a few parameter settings and got 0.318 accuracy.
Here is the code:
# coding: utf-8
import numpy as np
from sklearn import svm
import pickle
train_data = pickle.load(open('data/data_batch_1', 'rb'), encoding='latin1')
test_data = pickle.load(open('data/test_batch', 'rb'), encoding='latin1')
X_train = np.array(train_data['data'])
y_train = np.array(train_data['labels'])
# score on the first 1000 test images to keep evaluation fast
X_test = np.array(test_data['data'][:1000])
y_test = np.array(test_data['labels'][:1000])
clf = svm.SVC(kernel='linear', C=10, gamma=0.01)  # gamma is ignored by the linear kernel
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print("Accuracy: ", accuracy)
I also recommend grid search for tuning the hyper-parameters automatically.
The scikit-learn documentation has a public guide on tuning the hyper-parameters of an estimator.
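A minimal sketch of that, reusing the X_train and y_train arrays from the snippet above (the grid values here are illustrative starting points, not tuned results):
from sklearn.model_selection import GridSearchCV
from sklearn import svm
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.01, 0.001],
}
# 3-fold cross-validated search over the grid, using all CPU cores
search = GridSearchCV(svm.SVC(), param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)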
Related
I'm trying to use the following code to get feature importances for feature selection in XGBoost, but I keep getting an error saying "Feature shape mismatch, expected: 91, got 78".
"select_X_train" has 78 features (columns), as does "select_X_test", so I think the issue is that the thresholds array must still contain the 91 original variables. But I can't figure out a way to fix it. Can you please advise?
Here is the code:
from numpy import sort
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectFromModel
# load data
dataset = df_train
# split data into X and y
X_train = df[df.columns.difference(['Date','IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_train = df['IsDeceased'].values
X_test = df_test[df_test.columns.difference(['Date','IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_test = df_test['IsDeceased'].values
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model on the reduced feature set
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    print(thresh)
    # eval model: note this predicts with `model`, which was fitted on all 91 features
    select_X_test = selection.transform(X_test)
    y_pred = model.predict(select_X_test)
    report = classification_report(y_test, y_pred)
    print("Thresh= {} , n= {}\n {}".format(thresh, select_X_train.shape[1], report))
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
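(As the follow-up below shows, the shape mismatch comes from the prediction line: model was fitted on all 91 features, so it cannot score the 78-column select_X_test; the model fitted on the reduced matrix is the one that should predict. A minimal sketch of the corrected pairing inside the loop:)
    # transform the test set, then predict with the model trained on the same reduced features
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)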
My model uses feature importance for feature selection with XGBoost. At the end it outputs all the confusion matrices/results and how many features each model includes. That now works successfully, but I also need to output the feature names that were used in each model.
I get a warning that says "X has feature names, but SelectFromModel was fitted without feature names", so I know something needs to change for the names to make it into the model before I can output them, but I'm not sure how to handle either step. I found several old questions about this, but I wasn't able to apply any of them to my particular code. I'd really appreciate any ideas. Thank you!
from numpy import sort
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectFromModel
# load data
dataset = df_train
# split data into X and y
X_train = df[df.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_train = df['IsDeceased'].values
X_test = df_test[df_test.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_test = df_test['IsDeceased'].values
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    print(thresh)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    report = classification_report(y_test, y_pred)
    print("Thresh= {} , n= {}\n {}".format(thresh, select_X_train.shape[1], report))
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
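(One way to get the names, sketched under the assumption that X_train is the pandas DataFrame built above: SelectFromModel.get_support() returns a boolean mask over the input columns, and that mask can index the DataFrame's column labels. The placement inside the loop is an assumption, not code from the question:)
    # inside the loop, after `selection` is created:
    selected_names = X_train.columns[selection.get_support()]
    print("Features used at thresh {}: {}".format(thresh, list(selected_names)))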
I'm implementing Naive Bayes with sklearn on imbalanced data.
My data has more than 16k records and 6 output categories.
I tried to fit the model with a sample_weight calculated by sklearn.utils.class_weight.
The computed weights looked something like this:
sample_weight = [11.77540107 1.82284768 0.64688602 2.47138047 0.38577435 1.21389195]
import numpy as np
data_set = np.loadtxt("./data/_vector21.csv", delimiter=",")
inp_vec = data_set[:, 1:22]
out_vec = data_set[:, 22:]

# Split dataset into training set and test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
X_train, X_test, y_train, y_test = train_test_split(inp_vec, out_vec, test_size=0.2)  # 80% training and 20% test

# class weight
from keras.utils.np_utils import to_categorical
output_vec_categorical = to_categorical(y_train)
from sklearn.utils import class_weight
y_ints = [y.argmax() for y in output_vec_categorical]
c_w = class_weight.compute_class_weight('balanced', classes=np.unique(y_ints), y=y_ints)
cw = {}
for i in set(y_ints):
    cw[i] = c_w[i]

# Create a Gaussian Classifier
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
# Train the model using the training sets
print(c_w)
model.fit(X_train, y_train, c_w)  # c_w has one weight per class (6), not one per sample
# Predict the response for test dataset
y_pred = model.predict(X_test)
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("\nClassification Report: \n", (metrics.classification_report(y_test, y_pred)))
print("\nAccuracy: %.3f%%" % (metrics.accuracy_score(y_test, y_pred)*100))
I got this message:
ValueError: Found input variables with inconsistent numbers of samples: [13212, 6]
Can anyone tell me what I did wrong and how I can fix it?
Thanks a lot.
sample_weight and class_weight are two different things.
As their names suggest:
sample_weight applies to individual samples (rows in your data), so the length of sample_weight must match the number of samples in your X.
class_weight makes the classifier give more importance and attention to particular classes, so the length of class_weight must match the number of classes in your targets.
You are calculating class_weight, not sample_weight, with sklearn.utils.class_weight, but then you try to pass it as the sample_weight; hence the dimension mismatch error.
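A minimal sketch of the difference, using made-up labels purely for illustration:
import numpy as np
from sklearn.utils import class_weight
y = np.array([0, 0, 0, 1, 2, 2])  # 6 samples, 3 classes
c_w = class_weight.compute_class_weight('balanced', classes=np.unique(y), y=y)
s_w = class_weight.compute_sample_weight('balanced', y)
print(c_w.shape)  # (3,) -> one weight per class
print(s_w.shape)  # (6,) -> one weight per sample; this is what fit(..., sample_weight=...) expects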
Please see the following questions for more understanding of how these two weights interact internally:
What is the difference between sample weight and class weight options in scikit learn?
https://stats.stackexchange.com/questions/244630/difference-between-sample-weight-and-class-weight-randomforest-classifier
This is how I was able to compute the weights to deal with the class imbalance:
from sklearn import naive_bayes
from sklearn.utils import class_weight

# one weight per training sample, balanced by class frequency
sample = class_weight.compute_sample_weight('balanced', y_train)

# Classifier: Multinomial Naive Bayes
naive = naive_bayes.MultinomialNB()
naive.fit(X_train, y_train, sample_weight=sample)
predictions_NB = naive.predict(X_test)
I am trying to use the Random Forest Classifier from scikit-learn in Python to predict stock movements. My dataset has 8 features and 1201 records. But after fitting the model and using it to predict, I get 100% accuracy and 100% OOB error. I changed n_estimators from 100 down to a small value, but the OOB error only dropped by a few percent. Here is my code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
# File reading
df = pd.read_csv('700.csv')
df.drop(columns=df.columns[0], inplace=True)
target = df.iloc[:, 8]
print(target)
# train test split (note: df, passed as the features here, still contains the target column)
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.3)
# model fit
clf = RandomForestClassifier(n_estimators=100, criterion='gini', oob_score=True)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print(clf.oob_score_)
print(accuracy)
How can I modify the code to make the OOB error drop? Thanks.
If you want to check the error, note that clf.oob_score_ is the out-of-bag accuracy, so take its complement:
oob_error = 1 - clf.oob_score_
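A minimal sketch of using it alongside the test-set accuracy from the code above:
oob_error = 1 - clf.oob_score_  # error estimated from out-of-bag samples
test_error = 1 - accuracy_score(y_test, pred)  # error on the held-out test split
print("OOB error: %.3f, test error: %.3f" % (oob_error, test_error))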
I'm trying to implement a GRNN on the MNIST handwritten digit dataset using Python.
Here is my code; I'm getting the predicted values as NaN.
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.model_selection import train_test_split
from neupy import algorithms
#import sys
print('\nLoading...')
traindata = np.genfromtxt('./MNIST_Dataset_Loader/dataset/mnist_train.csv', skip_header=55000,delimiter=',')
#testdata=np.genfromtxt('./MNIST_Dataset_Loader/dataset/mnist_test.csv',skip_header=9000, delimiter=',')
# Load MNIST Data
print('\nLoading MNIST Data...')
x_train = traindata[:,1:]
y_train = traindata[:,0]
print('\nLoading Testing Data...')
#x_test = testdata[:,1:]
#y_test = testdata[:,0]
x_train, x_test, y_train, y_test = train_test_split(preprocessing.minmax_scale(x_train), preprocessing.minmax_scale(y_train), test_size=0.3)  # note: this min-max scales the labels as well as the pixels
print("training")
nw = algorithms.GRNN(std=0.1)
nw.train(x_train, y_train)
#nw.fit(x_train, y_train)
print("Predicting")
y_predicted = nw.predict(x_test)
print(y_predicted)
mse = np.mean((y_predicted - y_test) ** 2)
#print(mse)