Cross validation inconsistent numbers of samples error (Python)

I am trying to do classification using cross validation and an SVM classifier. In my data file, the last column contains my classes (which are 0, 1, 2, 3, 4, 5) and the rest (except the first column) is the numeric data that I want to use to predict these classes.
from sklearn import svm
from sklearn import metrics
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
filename = "Features.csv"
dataset = np.loadtxt(filename, delimiter=',', skiprows=1, usecols=range(1, 39))
x = dataset[:, 0:36]
y = dataset[:, 36]
print("len(x): " + str(len(x)))
print("len(y): " + str(len(x)))
skf = StratifiedKFold(n_splits=10, shuffle=False, random_state=42)
modelsvm = svm.SVC()
expected = y
print("len(expected): " + str(len(expected)))
predictedsvm = cross_val_score(modelsvm, x, y, cv=skf)
print("len(predictedsvm): " + str(len(predictedsvm)))
svm_results = metrics.classification_report(expected, predictedsvm)
print(svm_results)
And I am getting this error:
len(x): 2069
len(y): 2069
len(expected): 2069
C:\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
FutureWarning
len(predictedsvm): 10
Traceback (most recent call last):
File "C:/Users/MyComp/PycharmProjects/GG/AR.py", line 54, in <module>
svm_results = metrics.classification_report(expected, predictedsvm)
File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
return f(**kwargs)
File "C:\Python\Python37\lib\site-packages\sklearn\metrics\_classification.py", line 1929, in classification_report
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "C:\Python\Python37\lib\site-packages\sklearn\metrics\_classification.py", line 81, in _check_targets
check_consistent_length(y_true, y_pred)
File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 257, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [2069, 10]
Process finished with exit code 1
I don't understand how the number of predictions drops to 10 when I am trying to predict y using CV.
Can anyone help me on this please?

You are misunderstanding the output from cross_val_score. As per the documentation it returns "array of scores of the estimator for each run of the cross validation," not actual predictions. Because you have 10 folds, you get 10 values.
classification_report expects the true values and the predicted values, so you first need a fitted model that can produce a prediction for every sample. If you're happy with the scores from cross_val_score, you can train that model on the data and then predict with it. Or, you can use GridSearchCV to do this all in one sweep.
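A minimal sketch of how that looks in code, reusing x, y and skf from the question (the cross_val_predict option is an addition, not part of the original answer):

from sklearn import svm, metrics
from sklearn.model_selection import cross_val_score, cross_val_predict

# x, y and skf are assumed to be defined exactly as in the question
# Option 1: keep cross_val_score for evaluation, then fit on the data to get per-sample predictions
modelsvm = svm.SVC()
scores = cross_val_score(modelsvm, x, y, cv=skf)        # 10 fold scores, one per split
modelsvm.fit(x, y)
predictedsvm = modelsvm.predict(x)                      # one prediction per sample
print(metrics.classification_report(y, predictedsvm))

# Option 2: cross_val_predict returns one out-of-fold prediction per sample,
# which classification_report can consume directly
oof_predictions = cross_val_predict(svm.SVC(), x, y, cv=skf)
print(metrics.classification_report(y, oof_predictions))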

Related

I want to use GridSearchCV to find the best parameters for two methods and predict with them

My code is below:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_moons

n = 300
n_tr = 200
X, y = make_moons(n, shuffle=True, noise=0.2, random_state=112)
y[y == 0] = -1

# standardise the data
trmean = np.mean(X[:n_tr, :], axis=0)
trvar = np.var(X[:n_tr, :], axis=0)
X = (X - trmean[np.newaxis, :]) / np.sqrt(trvar)[np.newaxis, :]

# take first n_tr as training, others as test
Xtr = X[:n_tr, :]
ytr = y[:n_tr]
Xt = X[n_tr:, :]
yt = y[n_tr:]

# the parameters to consider in cross-validation
# order lasso params from largest to smallest so that if ties occur CV will select the one with more sparsity
params_lasso = [1, 1e-1, 1e-2, 1e-3, 1e-4]
params_rbf = [1e-3, 1e-2, 1e-1, 1, 10, 100]
params_svm = [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]  # smaller C: more regularisation

# the noisy features to be considered; for n noisy features, add the first n columns
n_tot_noisy_feats = 50
np.random.seed(92)
X_noise = 2 * np.random.randn(X.shape[0], n_tot_noisy_feats)
Xtr_noise = X_noise[:n_tr, :]
Xt_noise = X_noise[n_tr:, :]

svm_accs = []
lasso_accs = []
lasso_n_feats = []
amount_of_noise_feature = np.arange(n_tot_noisy_feats + 1)
k = 5

for n_noise_feats in amount_of_noise_feature:
    print(n_noise_feats)
    Xtr_noisy = np.copy(Xtr)
    Xt_noisy = np.copy(Xt)
    if n_noise_feats > 0:
        Xtr_noisy = np.hstack((Xtr_noisy, Xtr_noise[:, :n_noise_feats]))
        Xt_noisy = np.hstack((Xt_noisy, Xt_noise[:, :n_noise_feats]))

    lasso_valid_accs = np.zeros((k, len(params_lasso)))
    svm_valid_accs = np.zeros((k, len(params_rbf), len(params_svm)))

    from sklearn.model_selection import GridSearchCV
    cscv_lasso = GridSearchCV(Lasso(), {"alpha": params_lasso}, scoring='accuracy', cv=KFold(5))
    cscv_lasso.fit(Xtr_noisy, ytr)
    cscv_lassopreds = np.sign(cscv_lasso.best_estimator_.predict(Xt_noisy))

    cscv_svm = GridSearchCV(SVC(), {"C": params_svm, "gamma": params_rbf}, scoring='accuracy', cv=KFold(5))
    cscv_svm.fit(Xtr_noisy, ytr)
Which results in this error message:
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:700: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py", line 687, in _score
scores = scorer(estimator, X_test, y_test)
File "/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_scorer.py", line 200, in __call__
sample_weight=sample_weight)
File "/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_scorer.py", line 243, in _score
**self._kwargs)
File "/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py", line 202, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py", line 93, in _check_targets
"and {1} targets".format(type_true, type_pred))
ValueError: Classification metrics can't handle a mix of binary and continuous targets
What is the lowest number of noisy features added to the original data at which Lasso accuracy on the test set becomes better than SVM accuracy?
It seems that the prediction step is what goes wrong.
best lasso parameters: {'alpha': 0.1}
best svm params: {'C': 100, 'gamma': 0.001}
I want to find the best parameters for Lasso and SVC, and the corresponding accuracy_score, as noisy features are added to the original data.
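A minimal sketch (not an answer from the original page) of what the error message itself means: Lasso is a regressor, so best_estimator_.predict returns continuous values, and an accuracy scorer refuses to compare those with the -1/+1 class labels unless they are thresholded first, as the question's own np.sign call does for the test predictions.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_moons

X, y = make_moons(300, noise=0.2, random_state=112)
y[y == 0] = -1

preds = Lasso(alpha=0.1).fit(X, y).predict(X)    # continuous values, not -1/+1 labels
# accuracy_score(y, preds) would raise:
#   ValueError: Classification metrics can't handle a mix of binary and continuous targets
print(accuracy_score(y, np.sign(preds)))         # works once thresholded to -1/+1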

KNN : Found input variables with inconsistent numbers of samples: [20, 499]

Full replit here: https://repl.it/#JacksonEnnis/KNNPercentage
I am trying to use the KNN tool from scikit-learn to make some predictions.
I have two functions, recurse() and predict(). recurse() is intended to iterate through every single possible combo of features, while predict() is supposed to do the actual prediction.
def predict(self, data, answers):
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split as tts
    import numpy as np
    if len(data) > 1:
        print("length before transposition {}".format(len(data)))
        #n_data = np.transpose(data)
        #print("length after transposition {}".format(len(n_data)))
        knn = KNeighborsClassifier(n_neighbors=1)
        xTrain, xTest, yTrain, yTest = tts(data, answers)
        print("xTrain data: {}".format(len(xTrain)))
        knn.fit(xTrain, yTrain)
        print(knn.score(xTest, yTest))

def recurse(self, data):
    self.predict(data, self.y)
    if len(data) > 0:
        self.recurse(self.rLeft(data))
    if len(data) > 1:
        self.recurse(self.rMid(data))
    if len(data) > 2:
        self.recurse(self.rRight(data))
However, when I run the program, it states that it has a problem with the train/test line. I have checked the samples in each feature, as well as the answers, and found that they are all the same length, so I am unsure why this is happening.
Traceback (most recent call last):
File "main.py", line 12, in <module>
best = Config(apple)
File "/home/runner/Config.py", line 13, in __init__
self.predict(self.features, self.y)
File "/home/runner/Config.py", line 45, in predict
xTrain, xTest, yTrain, yTest = tts(data, answers)
File "/home/runner/.local/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 2096, in train_test_split
arrays = indexable(*arrays)
File "/home/runner/.local/lib/python3.6/site-packages/sklearn/utils/validation.py", line 230, in indexable
check_consistent_length(*result)
File "/home/runner/.local/lib/python3.6/site-packages/sklearn/utils/validation.py", line 205, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [20, 499]
You have your axes reversed. train_test_split requires that array.shape[0] (the number of samples) be the same for every array you pass, and your data is laid out as features-by-samples, so transpose it first. I recommend you check out the scikit-learn docs for more examples.
tts(np.array(data).T, answers)
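A minimal, self-contained sketch of the shape requirement; the array sizes are made up to match the numbers in the error:

import numpy as np
from sklearn.model_selection import train_test_split as tts

data = np.random.rand(20, 499)                # 20 feature rows, 499 sample columns (hypothetical stand-in)
answers = np.random.randint(0, 2, size=499)   # one label per sample

# tts(data, answers) would fail: data.shape[0] == 20 but len(answers) == 499
xTrain, xTest, yTrain, yTest = tts(data.T, answers)   # data.T has shape (499, 20)
print(xTrain.shape, yTrain.shape)                     # sample counts now agree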

ValueError: "metrics can't handle a mix of binary and continuous targets" with no source

I'm a beginner in machine learning and I'm trying to learn by working through Kaggle's Titanic problem. From what I know, I've made sure that the metrics are in sync with one another, but of course I blame myself for this problem and not Python. However, I still couldn't find the source of the error, and the Spyder IDE is no help.
This is my code:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
"""Assigning the train & test datasets' adresses to variables"""
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
test_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\test.csv"
"""Using pandas' read_csv() function to read the datasets
and then assigning them to their own variables"""
train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)
"""Using pandas' factorize() function to represent genders (male/female)
with binary values (0/1)"""
train_data['Sex'] = pd.factorize(train_data.Sex)[0]
test_data['Sex'] = pd.factorize(test_data.Sex)[0]
"""Replacing missing values in the training and test dataset with 0"""
train_data.fillna(0.0, inplace = True)
test_data.fillna(0.0, inplace = True)
"""Selecting features for training"""
columns_of_interest = ['Pclass', 'Sex', 'Age']
"""Dropping missing/NaN values from the training dataset"""
filtered_titanic_data = train_data.dropna(axis=0)
"""Using the predictory features in the data to handle the x axis"""
x = filtered_titanic_data[columns_of_interest]
"""The survival (what we're trying to find) is the y axis"""
y = filtered_titanic_data.Survived
"""Splitting the train data with test"""
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
"""Assigning the DecisionTreeRegressor model to a variable"""
titanic_model = DecisionTreeRegressor()
"""Fitting the x and y values with the model"""
titanic_model.fit(train_x, train_y)
"""Predicting the x-axis"""
val_predictions = titanic_model.predict(val_x)
"""Assigning the feature columns from the test to a variable"""
test_x = test_data[columns_of_interest]
"""Predicting the test by feeding its x axis into the model"""
test_predictions = titanic_model.predict(test_x)
"""Printing the prediction"""
print(val_predictions)
"""Checking for the accuracy"""
print(accuracy_score(val_y, val_predictions))
"""Printing the test prediction"""
print(test_predictions)
and this is the stacktrace:
Traceback (most recent call last):
File "<ipython-input-3-73797c87986e>", line 1, in <module>
runfile('C:/Users/Omar/Downloads/Kaggle Competition/Titanic.py', wdir='C:/Users/Omar/Downloads/Kaggle Competition')
File "C:\Users\Omar\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\Users\Omar\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Omar/Downloads/Kaggle Competition/Titanic.py", line 58, in <module>
print(accuracy_score(val_y, val_predictions))
File "C:\Users\Omar\Anaconda3\lib\site-packages\sklearn\metrics\classification.py", line 176, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "C:\Users\Omar\Anaconda3\lib\site-packages\sklearn\metrics\classification.py", line 81, in _check_targets
"and {1} targets".format(type_true, type_pred))
ValueError: Classification metrics can't handle a mix of binary and continuous targets
You are trying to use a regression algorithm (DecisionTreeRegressor) for a binary classification problem. The regression model, as expected, gives continuous outputs, but accuracy_score, where the error actually happens:
File "C:/Users/Omar/Downloads/Kaggle Competition/Titanic.py", line 58, in <module>
print(accuracy_score(val_y, val_predictions))
expects binary ones, hence the error.
For starters, change your model to
from sklearn.tree import DecisionTreeClassifier
titanic_model = DecisionTreeClassifier()
You are using a DecisionTreeRegressor, which, as the name says, is a regression model. The Kaggle Titanic problem is a classification problem, so you should use a DecisionTreeClassifier.
As for why your code is throwing an error: val_y has binary values (0/1), whereas val_predictions has continuous values, because you used a regressor model.
Classification needs discrete labels, as it predicts a class (one of the labels), while regression works with continuous data. As your output is a class label, you need to perform classification.
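A minimal sketch of the suggested fix, reusing train_x, val_x, train_y and val_y from the question's train_test_split:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

titanic_model = DecisionTreeClassifier(random_state=0)   # classifier instead of regressor
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)           # discrete 0/1 labels now
print(accuracy_score(val_y, val_predictions))            # binary vs binary: no error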

What does ValueError: Found input variables with inconsistent numbers of samples: [75, 1] signify in Python?

I was going through the "Writing Our First Classifier - Machine Learning Recipes #5" machine learning video on YouTube. I followed along with the example, but am not sure why I am not able to get the code running.
Note: This isn't the final code for the KNN Classifier. It is the initial testing phase.
#implementing KNN Classifier without using import statement
import random

class ScrappyKNN():
    def fit(self, X_train, Y_train):
        self.X_train = X_train
        self.Y_train = Y_train

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            label = random.choice(self.Y_train)
            predictions.append(label)
            return predictions

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
Y = iris.target

from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)

clf = ScrappyKNN()
clf.fit(X_train, Y_train)
predictions_result = clf.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test, predictions_result))
I am getting the error "ValueError: Found input variables with inconsistent numbers of samples: [75, 1]". I believe there is some size inconsistency in the lists, as the training and testing data sets are split out of the 150 samples into 75 each (I used test_size=0.5). I am really stuck on this one. Could you kindly tell me what this error means? I searched through similar answers on Stack Overflow but unfortunately can't make out what causes it. I'm new to using Python for machine learning. Can someone kindly help me out?
Here is the full Stacktrace
/Users/joyjitchatterjee/anaconda3/envs/machinelearning/bin/python /Users/joyjitchatterjee/PycharmProjects/untitled1/ml_5.py
Traceback (most recent call last):
File "/Users/joyjitchatterjee/PycharmProjects/untitled1/ml_5.py", line 36, in <module>
print(accuracy_score(Y_test,predictions_result))
File "/Users/joyjitchatterjee/anaconda3/envs/machinelearning/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 176, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/Users/joyjitchatterjee/anaconda3/envs/machinelearning/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 71, in _check_targets
check_consistent_length(y_true, y_pred)
File "/Users/joyjitchatterjee/anaconda3/envs/machinelearning/lib/python3.6/site-packages/sklearn/utils/validation.py", line 173, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [75, 4]
Process finished with exit code 1
Screenshot of code
The indentation of the final return statement is wrong in your code: because it sits inside the for loop, predict returns early, so the list of predictions is shorter than Y_test and accuracy_score reports inconsistent sample counts. It should be
def predict(self, X_test):
    predictions = []
    for row in X_test:
        label = random.choice(self.Y_train)
        predictions.append(label)
    return predictions

scikit-learn ValueError: dimension mismatch

This is my first time posting here. For the past couple of days I have been trying to teach myself scikit-learn. But recently I have encountered an error that has been nagging me for quite some time.
My goal is simply to train an NB classifier clf so that I can feed it an arbitrary list of strings called new_doc and it will predict what class each string is likely to belong to.
This is what my program looks like:
#Importing stuff
import numpy as np
import pylab
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
from sklearn import metrics
#Opening the csv file
df = pd.read_csv('data.csv', sep=',')
#Randomising the rows in the file
df = df.reindex(np.random.permutation(df.index))
#Extracting features from text, define target y and data X
vect = CountVectorizer()
X = vect.fit_transform(df['Features'])
y = df['Target']
#Partitioning the data into test and training set
SPLIT_PERC = 0.75
split_size = int(len(y)*SPLIT_PERC)
X_train = X[:split_size]
X_test = X[split_size:]
y_train = y[:split_size]
y_test = y[split_size:]
#Training the model
clf = MultinomialNB()
clf.fit(X_train, y_train)
#Evaluating the results
print "Accuracy on training set:"
print clf.score(X_train, y_train)
print "Accuracy on testing set:"
print clf.score(X_test, y_test)
y_pred = clf.predict(X_test)
print "Classification Report:"
print metrics.classification_report(y_test, y_pred)
#Predicting new data
new_doc = ["MacDonalds", "Walmart", "Target", "Starbucks"]
trans_doc = vect.transform(new_doc) #extracting features
y_pred = clf.predict(trans_doc) #predicting
But when I run the program I get the following error on the last row:
y_pred = clf.predict(trans_doc)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 62, in predict
jll = self._joint_log_likelihood(X)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 441, in _joint_log_likelihood
return (safe_sparse_dot(X, self.feature_log_prob_.T)
File "/Library/Python/2.7/site-packages/sklearn/utils/extmath.py", line 175, in safe_sparse_dot
ret = a * b
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/base.py", line 334, in __mul__
raise ValueError('dimension mismatch')
ValueError: dimension mismatch
So apparently it has something to do with the dimensions of the term-document matrices.
When I check the dimensions of trans_doc, X_train and X_test I get:
>>> trans_doc.shape
(4, 4)
>>> X_train.shape
(145314, 28750)
>>> X_test.shape
(48439, 28750)
In order for y_pred = clf.predict(trans_doc) to work, I need (from what I understand) to transform new_doc into a term-document matrix with the dimensions (4, 28750). But I don't know of any method within CountVectorizer that lets me do this.
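A minimal, self-contained sketch (the training strings are made up) of the point being reasoned about here: calling transform on the vectorizer that was fitted on the training data keeps the training vocabulary width, whereas fitting a fresh vectorizer on new_doc yields only as many columns as new_doc has distinct tokens, which is how a (4, 4) matrix can arise.

from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["walmart store opening", "starbucks coffee shop", "target store sale"]  # hypothetical stand-in for df['Features']
new_doc = ["MacDonalds", "Walmart", "Target", "Starbucks"]

vect = CountVectorizer()
X = vect.fit_transform(train_docs)            # columns = vocabulary learned from the training text
print(vect.transform(new_doc).shape)          # (4, training vocabulary size) -- matches the classifier

print(CountVectorizer().fit_transform(new_doc).shape)  # (4, 4): a fresh fit only knows new_doc's 4 tokens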
