X_train, X_test, y_train, y_test = train_test_split(features, df['Label'], test_size=0.2, random_state=111)
print (X_train.shape) # (540, 4196)
print (X_test.shape) # (136, 4196)
print (y_train.shape) # (540,)
print (y_test.shape) # (136,)
When fitting, I get an error:
from sklearn.svm import SVC
classifier = SVC(random_state = 0)
classifier.fit(features,y_train)
y_pred = classifier.predict(features)
Error:
ValueError: Found input variables with inconsistent numbers of samples: [676, 540]
This is what I tried.
You want to call the fit function with your X_train, not with features. The error occurs because features and y_train don't have the same number of samples.
X_train, X_test, y_train, y_test = train_test_split(features, df['Label'], test_size=0.2, random_state=111)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
from sklearn.svm import SVC
classifier = SVC(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
You'll likely also want to call predict with X_test or X_train. You may want to learn a bit more about train/test splits and why they are used.
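For completeness, here is a minimal sketch of scoring those predictions against the held-out labels, assuming the split and fit above have already run:
from sklearn.metrics import accuracy_score
# Compare predictions on the held-out 20% against their true labels
print(accuracy_score(y_test, y_pred))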
Why are you using features alongside y_train in .fit()? You are supposed to use X_train instead.
Instead of
classifier.fit(features, y_train)
Use:
classifier.fit(X_train, y_train)
You are trying to use two arrays with different shapes: since you did the split earlier, features has more samples than y_train.
Also, your predict line should be:
.predict(X_test)
I have a task that requires me to analyse a model but I need the output predictions for each cross validation step- and the data that the cross validation used in that step.
Here is my code:
results= cross_validate(MLPClassifier, X_train, y_train, cv=5,return_estimator = True)
Which did not work. Also,
results= cross_val_predict(MLPClassifier, X_train, y_train, cv=5)
Neither worked as expected. The second method did give me a set of predictions the same shape as y_train (the labels), but I expected a smaller array to be returned, say 10% the size of y_train.
Also I'm unsure how to obtain the data used for each cross validation step.
How about using one of the Cross Validation iterators?
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=0)

datasets = {}  # holds the (X, y) pair used in each train/test split
results = {}
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for idx, (train_index, test_index) in enumerate(ss.split(X)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    datasets[f"train_{idx}"] = X_train, y_train
    datasets[f"test_{idx}"] = X_test, y_test
    model = MLPClassifier(random_state=0).fit(X_train, y_train)
    results[f"accuracy_{idx}"] = model.score(X_test, y_test)
results
Output:
{'accuracy_0': 0.968,
'accuracy_1': 0.924,
'accuracy_2': 0.94,
'accuracy_3': 0.944,
'accuracy_4': 0.964}
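As an aside, the cross_validate and cross_val_predict calls in the question likely failed because MLPClassifier was passed as a class rather than an instance; scikit-learn expects an estimator object. A minimal sketch with that fixed, using the X_train and y_train from the question:
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
# Note the parentheses: pass an instance, not the class itself
results = cross_validate(MLPClassifier(random_state=0), X_train, y_train, cv=5, return_estimator=True)
print(results['test_score'])  # one score per fold
Note also that cross_val_predict is documented to return one out-of-fold prediction per sample, so an output the same length as y_train is the expected behaviour, not a bug.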
I have a problem using sklearn for a 70-30 train/test split. I receive an error on this line:
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, stratify=y)
The error is:
Found input variables with inconsistent numbers of samples
Context
from imblearn.over_sampling import SMOTE
sm = SMOTE(k_neighbors = 1)
X = data.drop('cluster',axis=1)
y = data['cluster']
X_smote, y_smote = sm.fit_sample(X, y)
data_bal = pd.DataFrame(columns=X.columns.values, data=X_smote)
data_bal['cluster'] = y_smote
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, stratify=y)
y_train.value_counts().plot(kind='bar')
Edit
I solved the error; I just had to change stratify=y to stratify=y_smote.
Just an observation in your line of code:
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, stratify=y)
This error is typically thrown when one of the input values does not have the dimension or length that is consistent with the other inputs; train_test_split requires all of its array arguments to contain the same number of samples.
Check the length and/or dimensions of X_smote, y_smote and y to see if they are all as expected.
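For example, a quick sketch of that check, using the variables from the snippet above:
# X_smote and y_smote are resampled and longer than the original y,
# which is why stratify=y fails while stratify=y_smote works
print(len(X_smote), len(y_smote))  # equal to each other
print(len(y))                      # smaller: the pre-SMOTE labels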
I got the same issue, but when I changed
x_train,y_train,x_test,y_test = train_test_split(x,y,test_size=0.25,random_state=42)
to
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=42)
the error went away. (train_test_split returns the splits in the order X_train, X_test, y_train, y_test, so the variables must be unpacked in that order.)
Load popular digits dataset from sklearn.datasets module and assign it to variable digits.
Split digits.data into two sets named X_train and X_test. Also, split digits.target into two sets Y_train and Y_test.
Hint: Use train_test_split() method from sklearn.model_selection; set random_state to 30; and perform stratified sampling.
Build an SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf.
Evaluate the model accuracy on the testing data set and print its score.
I used the following code:
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30)
print(X_train.shape)
print(X_test.shape)
from sklearn.svm import SVC
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
I got the below output.
(1347, 64)
(450, 64)
0.4088888888888889
But I am not able to pass the test. Can someone help with what is wrong?
You are missing the stratified sampling requirement; modify your split to include it:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
Check the documentation.
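For instance, a quick sketch to verify that the stratified split preserves the class balance (np.bincount counts how many samples of each digit label land in each set):
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
# The per-class counts should be in roughly a 3:1 ratio between the two sets
print(np.bincount(y_train))
print(np.bincount(y_test))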
Is it possible (and if so, how) to dynamically train a sklearn MultinomialNB classifier?
I would like to train (update) my spam classifier every time I feed an email into it.
I want this (does not work):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    clf.fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
to have similar result as this (works OK):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
clf.fit(x_train, y_train)
preds = clf.predict(x_test)
Scikit-learn supports incremental learning for multiple algorithms, including MultinomialNB; see the scikit-learn documentation on incremental learning.
You'll need to use the method partial_fit() instead of fit(), so your example code would look like:
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    if i == 0:
        # classes must be supplied on the first call to partial_fit
        clf.partial_fit([x_train[i]], [y_train[i]], classes=numpy.unique(y_train))
    else:
        clf.partial_fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
Edit: added the classes argument to partial_fit, as suggested by @BobWazowski.
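A slightly cleaner variant is to compute the classes once up front and pass them on every call, which removes the branch inside the loop; a sketch, assuming numpy is imported and the same tts split as above:
import numpy
clf = MultinomialNB()
classes = numpy.unique(y_train)
for i in range(len(x_train)):
    # Supplying the same classes on every call is accepted;
    # it is only required on the first call
    clf.partial_fit([x_train[i]], [y_train[i]], classes=classes)
preds = clf.predict(x_test)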
I want to run several regression types (Lasso, Ridge, ElasticNet, SVR and plain linear regression) on a dataset with around 5,000 rows and 6 features, using GridSearchCV for cross-validation. The code is extensive, but here are some critical parts:
def splitTrainTestAdv(df):
    y = df.iloc[:, -5:]  # last 5 columns
    X = df.iloc[:, :-5]  # all except the last 5 columns
    # Scaling and sampling
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)
    return X_train, X_test, y_train, y_test

def performSVR(x_train, y_train, X_test, parameter):
    C = parameter[0]
    epsilon = parameter[1]
    kernel = parameter[2]
    model = svm.SVR(C=C, epsilon=epsilon, kernel=kernel)
    model.fit(x_train, y_train)
    return model.predict(X_test)  # prediction for the test set

def performRidge(X_train, y_train, X_test, parameter):
    alpha = parameter[0]
    model = linear_model.Ridge(alpha=alpha, normalize=True)
    model.fit(X_train, y_train)
    return model.predict(X_test)  # prediction for the test set
MODELS = {
    'lasso': (
        linear_model.Lasso(),
        {'alpha': [0.95]}
    ),
    'ridge': (
        linear_model.Ridge(),
        {'alpha': [0.01]}
    ),
}
def performParameterSelection(model_name, feature, X_test, y_test, X_train, y_train):
    print("# Tuning hyper-parameters for %s" % feature)
    print()
    model, param_grid = MODELS[model_name]
    gs = GridSearchCV(model, param_grid, n_jobs=1, cv=5, verbose=1, scoring='%s_weighted' % feature)
    gs.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print(gs.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in gs.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() * 2, params))
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    y_true, y_pred = y_test, gs.predict(X_test)
    print(classification_report(y_true, y_pred))
soil = pd.read_csv('C:/training.csv', index_col=0)
soil = getDummiedSoilDepth(soil)
np.random.seed(2015)
soil = shuffleData(soil)
soil = soil.drop('Depth', 1)
X_train, X_test, y_train, y_test = splitTrainTestAdv(soil)

scores = ['precision', 'recall']
for score in scores:
    for model in MODELS.keys():
        print '####################'
        print model, score
        print '####################'
        performParameterSelection(model, score, X_test, y_test, X_train, y_train)
You can assume that all required imports are done
I am getting this error and do not know why:
ValueError Traceback (most recent call last)
in ()
18 print model, score
19 print '####################'
---> 20 performParameterSelection(model, score, X_test, y_test, X_train, y_train)
21
<ipython-input-27-304555776e21> in performParameterSelection(model_name, feature, X_test, y_test, X_train, y_train)
12 # cv=5 - constant; verbose - keep writing
13
---> 14 gs.fit(X_train, y_train) # Will get grid scores with outputs from ALL models described above
15
16 #pprint(sorted(gs.grid_scores_, key=lambda x: -x.mean_validation_score))
C:\Users\Tony\Anaconda\lib\site-packages\sklearn\grid_search.pyc in fit(self, X, y)
C:\Users\Tony\Anaconda\lib\site-packages\sklearn\metrics\classification.pyc in _check_targets(y_true, y_pred)
90 if (y_type not in ["binary", "multiclass", "multilabel-indicator",
91 "multilabel-sequences"]):
---> 92 raise ValueError("{0} is not supported".format(y_type))
93
94 if y_type in ["binary", "multiclass"]:
ValueError: continuous-multioutput is not supported
I am still very new to Python and this error puzzles me. It should not happen, because I have 6 features. I tried to follow the standard built-in functions.
Please help.
First, let's replicate the problem.
Import the needed libraries:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
from sklearn import linear_model
from sklearn.grid_search import GridSearchCV
Then create some data:
df = pd.DataFrame(np.random.rand(5000,11))
y = df.iloc[:,-5:] # last 5 columns
X = df.iloc[:,:-5] # Except for last 5 columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)
Now we can replicate the error and also see options which do not replicate the error:
This runs OK
gs = GridSearchCV(linear_model.Lasso(), {'alpha': [0.95]}, n_jobs= 1, cv=5, verbose=1)
print gs.fit(X_train, y_train)
This does not
gs = GridSearchCV(linear_model.Lasso(), {'alpha': [0.95]}, n_jobs= 1, cv=5, verbose=1, scoring='recall')
gs.fit(X_train, y_train)
and indeed the error is exactly the one you have above: 'continuous-multioutput is not supported'.
If you think about the recall measure, it applies to binary or categorical data, for which we can define things like false positives. In my replication of your data I used continuous targets, and recall simply is not defined for them. If you use the default score, it works, as you can see above.
So you probably need to look at your predictions and understand why they are continuous (i.e. use a classifier instead of regression). Or use a different score.
As an aside, if you run the regression with only one set of (column of) y values, you still get an error. This time it says more simply 'continuous output is not supported', i.e. the issue is using recall (or precision) on continuous data (whether or not it is multi-output).
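If you do want to keep the regression models, one option is to switch to a regression metric; a sketch using R² (note: in recent scikit-learn versions GridSearchCV lives in sklearn.model_selection rather than sklearn.grid_search):
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
# 'r2' is defined for continuous (and multi-output) targets
gs = GridSearchCV(linear_model.Lasso(), {'alpha': [0.95]}, n_jobs=1, cv=5, verbose=1, scoring='r2')
gs.fit(X_train, y_train)
print(gs.best_params_)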
If the end goal is to evaluate the performance of the model, you can use the model.evaluate method (from the Keras API):
_, accuracy = model.evaluate(our_data_feat, new_label2, verbose=0)
print('Accuracy: %.2f' % (accuracy * 100))
This will give you the accuracy value.
Make sure you have a single series for the dependent variable, and split your data properly with train_test_split.