GridSearch for Multilabel OneVsRestClassifier?

I'm doing a grid search over multilabel data as follows:
#imports
import numpy as np
from sklearn.svm import SVC as classifier
from sklearn.pipeline import Pipeline
from sklearn.decomposition import RandomizedPCA
from sklearn.multiclass import OneVsRestClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV

#classifier pipeline
clf_pipeline = OneVsRestClassifier(
    Pipeline([('reduce_dim', RandomizedPCA()),
              ('clf', classifier())]))

C_range = 10.0 ** np.arange(-2, 9)
gamma_range = 10.0 ** np.arange(-5, 4)
n_components_range = (10, 100, 200)
degree_range = (1, 2, 3, 4)

param_grid = dict(estimator__clf__gamma=gamma_range,
                  estimator__clf__C=C_range,
                  estimator__clf__degree=degree_range,
                  estimator__reduce_dim__n_components=n_components_range)

grid = GridSearchCV(clf_pipeline, param_grid,
                    cv=StratifiedKFold(y=Y, n_folds=3), n_jobs=1,
                    verbose=2)
grid.fit(X, Y)
I'm seeing the following traceback:
/Users/andrewwinterman/Documents/sparks-honey/classifier/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit_grid_point(X, y, base_clf, clf_params, train, test, loss_func, score_func, verbose, **fit_params)
107
108 if y is not None:
--> 109 y_test = y[safe_mask(y, test)]
110 y_train = y[safe_mask(y, train)]
111 clf.fit(X_train, y_train, **fit_params)
TypeError: only integer arrays with one element can be converted to an index
Looks like GridSearchCV objects to multiple labels. How should I work around this? Do I need to explicitly iterate through the unique classes with label_binarizer, running grid search on each sub-estimator?

I think there is a bug in grid_search.py
Have you tried passing y as a numpy array?
import numpy as np
Y = np.asarray(Y)
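If Y is currently a list of label lists, it may also help to binarize it into a 0/1 indicator matrix and cross-validate with a plain KFold, since StratifiedKFold stratifies on a single label vector and cannot index a multilabel target. A sketch against the old sklearn API used above (untested against this exact version; MultiLabelBinarizer is an assumption, very old releases used LabelBinarizer instead):
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.cross_validation import KFold

# Turn e.g. [['a', 'b'], ['b'], ...] into an (n_samples, n_classes)
# indicator matrix, which OneVsRestClassifier accepts for multilabel data.
Y = MultiLabelBinarizer().fit_transform(Y)

# A plain KFold avoids stratifying on a multilabel target.
grid = GridSearchCV(clf_pipeline, param_grid,
                    cv=KFold(len(Y), n_folds=3), n_jobs=1, verbose=2)
grid.fit(X, Y)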

Related

How to apply KNN from large dataset to small dataset or to just one test sample

I have trained and tested a KNN model on a supervised dataset of about 180 samples (6 classes of 30 samples each) in Python. I would like to apply these results to a small unsupervised dataset of 21 samples (3 classes of 7 samples).
The problem is that the datasets have different numbers of rows, so I either get an error about inconsistent numbers of samples, or I match the target to the new dataset and get an unrepresentative result.
I want to see which classes of the large dataset the samples from the new small dataset correspond to. Is there a way to do that?
Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import utils

data, y = utils.load_data()  # utils contains the large dataset
Y = pd.get_dummies(y).values
n_classes = Y.shape[1]

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

clf = KNeighborsClassifier()
for key in data:
    scores = cross_val_score(clf, data[key], y, cv=5)
    print("Accuracy for {:5s} : {:0.2f} (+/- {:0.2f})".format(
        key, scores.mean(), scores.std() * 2))

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('small dataset')
X = df.drop(columns=['subject', 'sessionIndex', 'rep'])
y = df['subject']
Y = pd.get_dummies(y).values
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=1, stratify=y)

n_neighbors = [2, 3, 4, 5, 6]
parameters = dict(n_neighbors=n_neighbors)
clf = KNeighborsClassifier()
grid = GridSearchCV(clf, parameters, cv=5)
grid.fit(X_train, Y_train)

results = grid.cv_results_
for i in range(1, 4):
    candidates = np.flatnonzero(results['rank_test_score'] == i)
    for candidate in candidates:
        print("Model with rank: {}".format(i))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
            results['mean_test_score'][candidate],
            results['std_test_score'][candidate]))
        print("Parameters: {}".format(results['params'][candidate]))
        print()

from sklearn.metrics import accuracy_score, roc_curve, auc

Y_pred = grid.predict(X[1:2])
print(Y_pred)
So I'm getting an array [[0 0 1]], which is correct, but it doesn't refer to any of the 6 classes of the large dataset. It only does that if I take X and Y from the large dataset instead of the small one:
data, y = utils.load_data()  # utils contains the large dataset
Y = pd.get_dummies(y).values
n_classes = Y.shape[1]
X = data['large dataset']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=1, stratify=y)
Y_pred = grid.predict(X[1:2])
print(Y_pred)
This way the result is an array of 6 numbers, like [[0 0 0 0 0 1]], and I want to see the same when testing the new small dataset.
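One way to get predictions in terms of the six classes of the large dataset is to fit the classifier on the large dataset only and then call predict on the small dataset's feature matrix. This is only a sketch of that idea: it assumes both datasets share the same feature columns, and the 'large dataset' key and file name are placeholders taken from the code above.
import pandas as pd
import utils
from sklearn.neighbors import KNeighborsClassifier

data, y_large = utils.load_data()   # large, labelled 6-class dataset
X_large = data['large dataset']     # placeholder key, as above

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_large, y_large)           # train on the large dataset only

df = pd.read_csv('small dataset')   # placeholder path from the question
X_small = df.drop(columns=['subject', 'sessionIndex', 'rep'])

# Each new sample is mapped to one of the 6 classes of the large dataset.
print(clf.predict(X_small))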

Using StandardScaler as Preprocessor in Mlens Pipeline generates Classification Warning

I am trying to scale my data within the cross-validation folds of an mlens SuperLearner pipeline. When I use StandardScaler in the pipeline (as demonstrated below), I receive the following warning:
/miniconda3/envs/r_env/lib/python3.7/site-packages/mlens/parallel/_base_functions.py:226: MetricWarning: [pipeline-1.mlpclassifier.0.2] Could not score pipeline-1.mlpclassifier. Details:
ValueError("Classification metrics can't handle a mix of binary and continuous-multioutput targets")
(name, inst_name, exc), MetricWarning)
Of note, when I omit the StandardScaler() the warning disappears, but the data is not scaled.
from sklearn.datasets import load_breast_cancer

breast_cancer_data = load_breast_cancer()
X = breast_cancer_data['data']
y = breast_cancer_data['target']

from sklearn.model_selection import train_test_split
X, X_val, y, y_val = train_test_split(X, y, test_size=.3, random_state=0)

from sklearn.base import BaseEstimator

class RFBasedFeatureSelector(BaseEstimator):
    def __init__(self, n_estimators):
        self.n_estimators = n_estimators
        self.selector = None

    def fit(self, X, y):
        clf = RandomForestClassifier(n_estimators=self.n_estimators,
                                     random_state=RANDOM_STATE,
                                     class_weight='balanced')
        clf = clf.fit(X, y)
        self.selector = SelectFromModel(clf, prefit=True, threshold=0.001)
        return self  # sklearn convention: fit returns the estimator

    def transform(self, X):
        if self.selector is None:
            raise AttributeError('The selector attribute has not been assigned. '
                                 'You cannot call transform before first calling '
                                 'fit or fit_transform.')
        return self.selector.transform(X)

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

N_FOLDS = 5
RF_ESTIMATORS = 1000
N_ESTIMATORS = 1000
RANDOM_STATE = 42

from mlens.metrics import make_scorer
from sklearn.metrics import roc_auc_score, balanced_accuracy_score
accuracy_scorer = make_scorer(balanced_accuracy_score, average='micro',
                              greater_is_better=True)

from mlens.ensemble.super_learner import SuperLearner
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel

ensemble = SuperLearner(folds=N_FOLDS, shuffle=True, random_state=RANDOM_STATE,
                        n_jobs=10, scorer=balanced_accuracy_score,
                        backend="multiprocessing")

preprocessing1 = {'pipeline-1': [StandardScaler()]}
preprocessing2 = {'pipeline-1': [RFBasedFeatureSelector(N_ESTIMATORS)]}

estimators = {'pipeline-1': [RandomForestClassifier(RF_ESTIMATORS,
                                                    random_state=RANDOM_STATE,
                                                    class_weight='balanced'),
                             MLPClassifier(hidden_layer_sizes=(10, 10, 10),
                                           activation='relu', solver='sgd',
                                           max_iter=5000)]}

ensemble.add(estimators, preprocessing2, preprocessing1)
ensemble.add_meta(LogisticRegression(solver='liblinear', class_weight='balanced'))

ensemble.fit(X, y)
yhat = ensemble.predict(X_val)
balanced_accuracy_score(y_val, yhat)
You are currently passing your preprocessing steps as two separate arguments when calling the add method.
You can instead combine them as follows:
preprocessing = {'pipeline-1': [RFBasedFeatureSelector(N_ESTIMATORS), StandardScaler()]}
Please refer to the documentation for the add method found here:
https://mlens.readthedocs.io/en/0.1.x/source/mlens.ensemble.super_learner/
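With the variables defined in the question, the combined call would look something like this (a sketch):
# One preprocessing pipeline per case: feature selection first, then scaling,
# both refit inside each cross-validation fold.
preprocessing = {'pipeline-1': [RFBasedFeatureSelector(N_ESTIMATORS),
                                StandardScaler()]}

ensemble.add(estimators, preprocessing)
ensemble.add_meta(LogisticRegression(solver='liblinear', class_weight='balanced'))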

TypeError: Cannot clone object '<>' (type <class ''>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods

I want to use the VotingClassifier or EnsembleVoteClassifier voting method with 3 different models, but I get this error.
I need your help to solve this problem!
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.plotting import plot_decision_regions

# Initializing Classifiers
clf1 = modelvgg16
clf2 = AlexNetModel
clf3 = InceptionV3Model
for model in [clf1, clf2, clf3]:
    model._estimator_type = "classifier"
    #print(model._estimator_type)

eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], weights=[2, 1, 1], voting='soft')

X, Y = training_set.next()
Y = np.zeros(X.shape[0])  # number of classes is 38
print("X.shape =", X.shape)  # X.shape = (128, 224, 224, 3)
print("Y.shape =", Y.shape)  # Y.shape = (38,)

######################### Split train+test #######################################
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=2)

# Whole Wine Classifier
eclf.fit(x_train, y_train)
y_pred = eclf.predict(x_test)

from sklearn.metrics import accuracy_score
print("accuracy : ", accuracy_score(y_test, y_pred))
For more information, see my project at this link:
my project
I got the same error when running this code:

Why am I getting this error: Found input variables with inconsistent numbers of samples: [1, 15]

I am trying to solve the following problem but I am getting an error.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics.regression import r2_score
import numpy as np

degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10

for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.r2_score(X_train, y_train)
    r2_test = linreg.r2_train(X_test, y_test)
Found input variables with inconsistent numbers of samples: [1, 15]
Any idea why I am getting this error?
There are three errors in the code:
1. You need to reshape x into a 2D numpy array by using x.reshape(-1,1).
2. linreg.r2_score is invalid. Also, there is no need to use r2_score; just use linreg.score, which returns the coefficient of determination R^2 of the prediction (reference).
3. degrees starts at 0, so the first iteration fits a degree-0 polynomial; use PolynomialFeatures(i+1) inside the loop unless you really intend to use a 0-degree polynomial expansion. Keep in mind that if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
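For example, the degree-2 expansion of the single sample [2, 3] gives exactly those columns:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Columns are [1, a, b, a^2, a*b, b^2] for an input sample [a, b].
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(np.array([[2, 3]])))
# [[1. 2. 3. 4. 6. 9.]]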
Full working example:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics.regression import r2_score
import numpy as np
from sklearn.model_selection import train_test_split

degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10

for i in degrees:
    poly = PolynomialFeatures(i+1)
    x_poly = poly.fit_transform(x.reshape(-1,1))
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.score(X_train, y_train)
    r2_test = linreg.score(X_test, y_test)
You have not reshaped x; x should be of shape (n_samples, n_features). And linreg.r2_score does not exist; use linreg.score instead. I modified the code as follows:
degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
x = x.reshape(-1, 1)

for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.score(X_train, y_train)
    r2_test = linreg.score(X_test, y_test)
Your code has lots of mistakes and typos. It would be useful to first practice on some well-known solved problems, like iris classification or house-price regression.
Corrected code:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics.regression import r2_score
from sklearn.model_selection import train_test_split
import numpy as np

degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10

#### convert x into a 2D matrix ####
x = x.reshape(-1,1)

for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = r2_score(y_train, linreg.predict(X_train))
    r2_test = r2_score(y_test, linreg.predict(X_test))
    #### linreg.score(X_train, y_train) can also be used to calculate r2_score

Regression using Python

I have the following variables:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

def part1_scatter():
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);
And the following question:
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.
This is my code, but it doesn't work:
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
results = []
pred_data = np.linspace(0,10,100)
degree = [1,3,6,9]
y_train1 = y_train.reshape(-1,1)
    for i in degree:
        poly = PolynomialFeatures(degree=i)
        pred_poly1 = poly.fit_transform(pred_data[:,np.newaxis])
        X_F1_poly = poly.fit_transform(X_train[:,np.newaxis])
        linreg = LinearRegression().fit(X_F1_poly, y_train1)
        pred = linreg.predict(pred_poly1)
        results.append(pred)
        dataArray = np.array(results).reshape(4, 100)
        return dataArray
I receive this error:
line 58
    for i in degree:
    ^
IndentationError: unexpected indent
Could you tell me where the problem is?
The return statement should execute after the for loop is done, so it should be indented at the same level as the for, not further in.
Starting at the line
n = 15
you stopped indenting, so from that point on the code isn't recognized as part of the function. This can be solved by indenting all lines from n = 15 onwards by 4 spaces.
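Putting both fixes together, a corrected version of the function might look like this (a sketch that keeps the variable names from the question and assumes numpy and train_test_split are imported as in the first snippet):
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    np.random.seed(0)
    n = 15
    x = np.linspace(0, 10, n) + np.random.randn(n) / 5
    y = np.sin(x) + x / 6 + np.random.randn(n) / 10
    X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

    results = []
    pred_data = np.linspace(0, 10, 100)
    degree = [1, 3, 6, 9]
    y_train1 = y_train.reshape(-1, 1)

    for i in degree:
        # Expand the training inputs and the prediction grid to
        # polynomial features of the current degree, then fit.
        poly = PolynomialFeatures(degree=i)
        pred_poly1 = poly.fit_transform(pred_data[:, np.newaxis])
        X_F1_poly = poly.fit_transform(X_train[:, np.newaxis])
        linreg = LinearRegression().fit(X_F1_poly, y_train1)
        results.append(linreg.predict(pred_poly1))

    # Build the (4, 100) array only after all four models have run.
    return np.array(results).reshape(4, 100)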
