GridSearchCV/RandomizedSearchCV with LSTM - python

I am stuck trying to tune hyperparameters for an LSTM via RandomizedSearchCV.
My code is below:
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
from imblearn.pipeline import Pipeline
from keras.initializers import RandomNormal
def create_model(activation_1='relu', activation_2='relu',
                 neurons_input=1, neurons_hidden_1=1,
                 optimizer='Adam',
                 # input_shape=(X_train.shape[1], X_train.shape[2])
                 # input_shape=(X_train.shape[0], X_train.shape[1])  # input shape should be timesteps, features
                 ):
    model = Sequential()
    model.add(LSTM(neurons_input, activation=activation_1, input_shape=(X_train.shape[1], X_train.shape[2]),
                   kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=42),
                   bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=42)))
    model.add(Dense(2, activation='sigmoid'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer)
    return model
clf=KerasClassifier(build_fn=create_model, epochs=10, verbose=0)
param_grid = {
'clf__neurons_input': [20, 25, 30, 35],
'clf__batch_size': [40,60,80,100],
'clf__optimizer': ['Adam', 'Adadelta']}
pipe = Pipeline([
('oversample', SMOTE(random_state=12)),
('clf', clf)
])
my_cv = TimeSeriesSplit(n_splits=5).split(X_train)
rs_keras = RandomizedSearchCV(pipe, param_grid, cv=my_cv, scoring='f1_macro',
refit='f1_macro', verbose=3,n_jobs=1, random_state=42)
rs_keras.fit(X_train, y_train)
I keep having an error:
Found array with dim 3. Estimator expected <= 2.
which makes sense, as both GridSearch and RandomizedSearch need an array of shape [n_samples, n_features]. Does anyone have experience with this, or a suggestion on how to deal with this limitation?
Thank you.
Here is the full traceback of the error:
Traceback (most recent call last):
File "<ipython-input-2-b0be4634c98a>", line 1, in <module>
runfile('Scratch/prediction_lstm.py', wdir='/Simulations/2017-2018/Scratch')
File "\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "Scratch/prediction_lstm.py", line 204, in <module>
rs_keras.fit(X_train, y_train)
File "Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 722, in fit
self._run_search(evaluate_candidates)
File "\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 1515, in _run_search
random_state=self.random_state))
File "\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 711, in evaluate_candidates
cv.split(X, y, groups)))
File "\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 917, in __call__
if self.dispatch_one_batch(iterator):
File "\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 182, in apply_async
result = ImmediateResult(func)
File "\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 549, in __init__
self.results = batch()
File "\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in __call__
for func, args, kwargs in self.items]
File "\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in <listcomp>
for func, args, kwargs in self.items]
File "\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 528, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 237, in fit
Xt, yt, fit_params = self._fit(X, y, **fit_params)
File "\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 200, in _fit
cloned_transformer, Xt, yt, **fit_params_steps[name])
File "\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 342, in __call__
return self.func(*args, **kwargs)
File "\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 576, in _fit_resample_one
X_res, y_res = sampler.fit_resample(X, y, **fit_params)
File "\Anaconda3\lib\site-packages\imblearn\base.py", line 80, in fit_resample
X, y, binarize_y = self._check_X_y(X, y)
File "\Anaconda3\lib\site-packages\imblearn\base.py", line 138, in _check_X_y
X, y = check_X_y(X, y, accept_sparse=['csr', 'csc'])
File "\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 756, in check_X_y
estimator=estimator)
File "\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 570, in check_array
% (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.

This problem is not caused by scikit-learn. RandomizedSearchCV does not check the shape of its input; each individual transformer or estimator is responsible for validating the shape it receives. As the stack trace shows, the error is raised by imblearn, because SMOTE requires 2-D data.
To avoid it, keep the data 2-D for SMOTE and reshape it to 3-D afterwards, before it reaches the LSTM. There are multiple ways to achieve this.
1) You pass 2-D data (without explicitly reshaping as you are doing currently in the following lines):
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
to your pipeline and after the SMOTE step, before your clf, reshape the data into 3-D and then pass it to clf.
2) You pass your current 3-D data to the pipeline, transform it into 2-D to be used with SMOTE. SMOTE will then output new oversampled 2-D data which you then again reshape into 3-D.
I think the better option will be 1. Even in that, you can either:
use your custom class to transform the data from 2-D to 3-D like the following:
pipe = Pipeline([
    ('oversample', SMOTE(random_state=12)),
    # Check out custom scikit-learn transformers
    # You need to implement your reshape logic in the "transform()" method
    ('reshaper', CustomReshaper()),
    ('clf', clf)
])
or use the already available Reshape class. I am using Reshape.
So the modified code would be (see the comments):
# Remove the following two lines, so the data is 2-D while going to "RandomizedSearchCV".
# X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
# X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
from keras.layers import Reshape
def create_model(activation_1='relu', activation_2='relu',
                 neurons_input=1, neurons_hidden_1=1,
                 optimizer='Adam'):
    model = Sequential()
    # Add this before LSTM. The tuple denotes the last two dimensions of input
    model.add(Reshape((1, X_train.shape[1])))
    model.add(LSTM(neurons_input,
                   activation=activation_1,
                   # Since the data is 2-D, the following needs to be changed from "X_train.shape[1], X_train.shape[2]"
                   input_shape=(1, X_train.shape[1]),
                   kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=42),
                   bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=42)))
    model.add(Dense(2, activation='sigmoid'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer)
    return model
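For the custom-class alternative, the answer leaves CustomReshaper undefined. A minimal sketch of what such a transformer might look like is given below; the class body and the `timesteps` parameter are assumptions for illustration, not code from the answer, and it has not been tested together with SMOTE or Keras here.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CustomReshaper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: (n_samples, n_features) -> (n_samples, timesteps, features_per_step)."""

    def __init__(self, timesteps=1):
        self.timesteps = timesteps

    def fit(self, X, y=None):
        # Nothing to learn: reshaping is stateless.
        return self

    def transform(self, X):
        X = np.asarray(X)
        # -1 lets NumPy infer the number of features per timestep.
        return X.reshape(X.shape[0], self.timesteps, -1)


X_2d = np.arange(12, dtype=float).reshape(4, 3)
X_3d = CustomReshaper().fit_transform(X_2d)
print(X_3d.shape)  # (4, 1, 3)
```

Placed between the SMOTE step and the classifier, a transformer like this would receive the oversampled 2-D array and hand a 3-D array to the LSTM.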

Related

Data augmentation using SMOTE for images

I have tried two ways to apply SMOTE to my dataset. However, I can't figure out how to proceed with the SMOTE function.
1st method: I have applied data augmentation and then tried to apply SMOTE
train_data_gen = ImageDataGenerator(
rescale=1./255,
zoom_range=0.1,
horizontal_flip=True)
train_g = train_data_gen.flow_from_directory(
data_train,
target_size=(img_height, img_width),
color_mode = "grayscale",
batch_size=batch_size,
class_mode = "sparse"
)
for data, labels in train_g:
    label = labels
sm = SMOTE(random_state=42)
train_smote,train_labels = sm.fit_resample(train_g,label)
I have tried the above code, but it takes far too long and doesn't give any output.
Second method:
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
data_train,
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
for data, labels in train_ds:
    label = labels
sm = SMOTE(random_state=42)
train_smote,train_labels = sm.fit_resample(train_ds,label)
This is the error I get for the second method:
Traceback (most recent call last):
File "trainmvlp.py", line 92, in <module>
train_smote,train_labels = sm.fit_resample(train_ds,label)
File "C:\Users\User\Anaconda3\envs\gait\lib\site-packages\imblearn\base.py", line 77, in fit_resample
X, y, binarize_y = self._check_X_y(X, y)
File "C:\Users\User\Anaconda3\envs\gait\lib\site-packages\imblearn\base.py", line 130, in _check_X_y
X, y = self._validate_data(X, y, reset=True, accept_sparse=accept_sparse)
File "C:\Users\User\Anaconda3\envs\gait\lib\site-packages\sklearn\base.py", line 433, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "C:\Users\User\Anaconda3\envs\gait\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\User\Anaconda3\envs\gait\lib\site-packages\sklearn\utils\validation.py", line 871, in check_X_y
X = check_array(X, accept_sparse=accept_sparse,
File "C:\Users\User\Anaconda3\envs\gait\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\User\Anaconda3\envs\gait\lib\site-packages\sklearn\utils\validation.py", line 687, in check_array
raise ValueError(
ValueError: Expected 2D array, got scalar array instead:
array=<PrefetchDataset shapes: ((None, 64, 64, 3), (None,)), types: (tf.float32, tf.int32)>.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I have tried to reshape it, but it still shows the same error.
Could anyone tell me what I am doing wrong?
Thank you in advance.
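The traceback above shows that the dataset object itself was passed to fit_resample, while the validator expects a 2-D array of shape (n_samples, n_features). As a rough sketch of the reshaping involved, the snippet below uses random NumPy data as a stand-in for images already collected into a single array (the shapes and label counts are assumptions); the SMOTE call is left as a comment so the example runs without imblearn.

```python
import numpy as np

# Stand-in for images gathered from the generator/dataset (assumed shape).
rng = np.random.default_rng(0)
images = rng.random((20, 64, 64, 3), dtype=np.float32)
labels = np.array([0] * 12 + [1] * 8)

# Flatten every image to one row: (n_samples, height * width * channels).
flat = images.reshape(images.shape[0], -1)
print(flat.shape)  # (20, 12288)

# A resampler such as SMOTE would be applied to the 2-D array here, e.g.
#   flat_res, labels_res = SMOTE(random_state=42).fit_resample(flat, labels)

# Restore the image layout afterwards (here on the unchanged array).
restored = flat.reshape(-1, 64, 64, 3)
```

Note that synthetic samples produced this way are pixel-wise interpolations between images, which may or may not be meaningful for a given dataset.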

Incompatible dimension for X and Y matrices

I was wondering what I have wrong here. I get the error:
Traceback (most recent call last):
File "main.py", line 37, in <module>
y_pred = knn.predict(X_test)
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/sklearn/neighbors/classification.py", line 149, in predict
neigh_dist, neigh_ind = self.kneighbors(X)
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/sklearn/neighbors/base.py", line 434, in kneighbors
**kwds))
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 1448, in pairwise_distances_chunked
n_jobs=n_jobs, **kwds)
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 1588, in pairwise_distances
return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 1206, in _parallel_pairwise
return func(X, Y, **kwds)
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 232, in euclidean_distances
X, Y = check_pairwise_arrays(X, Y)
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 125, in check_pairwise_arrays
X.shape[1], Y.shape[1]))
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 38 while Y.shape[1] == 43
I'm new to AI and can't find anything on the internet that really solves this problem; any comment is appreciated. This is my code:
from sklearn.preprocessing import OneHotEncoder
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
fileName = "breast-cancer-fixed.csv";
df = pd.read_csv(fileName)
X = df[df.columns[:-1]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
X_train = OneHotEncoder().fit_transform(X_train)
X_test = OneHotEncoder().fit_transform(X_test)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))
My CSV is massive and I can't upload it here, so I have included a small snippet:
age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,Class
40-49,premeno,15-19,0-2,yes,3,right,left_up,no,recurrence-events
50-59,ge40,15-19,0-2,no,1,right,central,no,no-recurrence-events
50-59,ge40,35-39,0-2,no,2,left,left_low,no,recurrence-events
40-49,premeno,35-39,0-2,yes,3,right,left_low,yes,no-recurrence-events
40-49,premeno,30-34,3-5,yes,2,left,right_up,no,recurrence-events
50-59,premeno,25-29,3-5,no,2,right,left_up,yes,no-recurrence-events
50-59,ge40,40-44,0-2,no,3,left,left_up,no,no-recurrence-events
40-49,premeno,10-14,0-2,no,2,left,left_up,no,no-recurrence-events
40-49,premeno,0-4,0-2,no,2,right,right_low,no,no-recurrence-events
40-49,ge40,40-44,15-17,yes,2,right,left_up,yes,no-recurrence-events
50-59,premeno,25-29,0-2,no,2,left,left_low,no,no-recurrence-events
60-69,ge40,15-19,0-2,no,2,right,left_up,no,no-recurrence-events
Also, if I get rid of the last two lines of code (the prediction code), it runs fine with no errors.
Try adding this line anywhere above the transforms:
enc = OneHotEncoder(handle_unknown='ignore')
Then change the transform lines to the following:
enc = enc.fit(X_train)
X_train = enc.transform(X_train)
X_test = enc.transform(X_test)
I get this error
```Traceback (most recent call last):
File "main.py", line 25, in <module>
X_test = OneHotEncoder().transform(X_test)
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 726, in transform
check_is_fitted(self, 'categories_')
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 914, in check_is_fitted
raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: This OneHotEncoder instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.```
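The last traceback comes from calling transform on a freshly created OneHotEncoder() rather than on the instance that was fitted. A small self-contained sketch of the intended pattern follows; the toy rows are made up for illustration and only loosely mirror the CSV snippet.

```python
from sklearn.preprocessing import OneHotEncoder

X_train = [["40-49", "premeno"], ["50-59", "ge40"], ["60-69", "ge40"]]
X_test = [["40-49", "ge40"], ["30-39", "premeno"]]  # "30-39" never appears in training

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train)                      # learn the categories from the training split only
X_train_enc = enc.transform(X_train)  # reuse the same fitted instance for both splits
X_test_enc = enc.transform(X_test)

print(X_train_enc.shape, X_test_enc.shape)  # (3, 5) (2, 5)
```

Because both splits go through one fitted encoder, they end up with the same number of columns, and the unseen category is encoded as all zeros instead of creating a new column.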

make_pipeline with StandardScalar and KerasRegressors

I'm trying to GridSearchCV epochs and batch_size with the following code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle=False)
X_train2 = X_train.values.reshape((X_train.shape[0], 1, X_train.shape[1]))
y_train2 = np.ravel(y_train.values)
X_test2 = X_test.values.reshape((X_test.shape[0], 1, X_test.shape[1]))
y_test2 = np.ravel(y_test.values)
def build_model():
    model = Sequential()
    model.add(LSTM(500, input_shape=(1, X_train.shape[1])))
    model.add(Dense(1))
    model.compile(loss="mse", optimizer="adam")
    return model
new_model = KerasRegressor(build_fn=build_model, verbose=0)
pipe = Pipeline([('s', StandardScaler()), ('reg', new_model)])
param_gridd = {'reg__epochs': [5, 6], 'reg__batch_size': [71, 72]}
model = GridSearchCV(estimator=pipe, param_grid=param_gridd)
# ------------------ if the following two lines are uncommented the code works -> problem with Pipeline?
# param_gridd = {'epochs':[5,6], 'batch_size': [71, 72]}
# model = GridSearchCV(estimator=new_model, param_grid=param_gridd)
fitted = model.fit(X_train2, y_train2, validation_data=(X_test2, y_test2), verbose=2, shuffle=False)
and get the following error:
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 722, in fit
self._run_search(evaluate_candidates)
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 1191, in _run_search
evaluate_candidates(ParameterGrid(self.param_grid))
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 711, in evaluate_candidates
cv.split(X, y, groups)))
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 917, in __call__
if self.dispatch_one_batch(iterator):
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
result = ImmediateResult(func)
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 549, in __init__
self.results = batch()
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in __call__
for func, args, kwargs in self.items]
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in <listcomp>
for func, args, kwargs in self.items]
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 528, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py", line 265, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "/home/geo/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py", line 202, in _fit
step, param = pname.split('__', 1)
ValueError: not enough values to unpack (expected 2, got 1)
I suspect that this has something to do with the naming in param_gridd but not really sure what is going on. Note that the code works fine when I eliminate make_pipeline from the code, and GridSearchCV directly on new_model.
I think the problem is the way the fit parameters for KerasRegressor are passed.
validation_data and shuffle are not parameters of GridSearchCV but of the reg step, so they need the reg__ prefix.
Try this:
fitted = model.fit(X_train2, y_train2,**{'reg__validation_data':(X_test2, y_test2),'reg__verbose':2, 'reg__shuffle':False} )
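The prefixing rule can be tried out without Keras. The sketch below is an illustration under assumed data: it swaps in LinearRegression and its sample_weight fit parameter for the KerasRegressor arguments, and relies on scikit-learn's default (non-metadata-routing) behaviour of splitting fit parameter names on "__".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
weights = rng.uniform(0.5, 1.5, size=50)

pipe = Pipeline([("s", StandardScaler()), ("reg", LinearRegression())])

# "reg__sample_weight" is routed to LinearRegression.fit as sample_weight.
pipe.fit(X, y, reg__sample_weight=weights)

# Reference: the same two steps done by hand.
X_scaled = StandardScaler().fit_transform(X)
reference = LinearRegression().fit(X_scaled, y, sample_weight=weights)
print(np.allclose(pipe.named_steps["reg"].coef_, reference.coef_))
```

An un-prefixed name has no "__" to split on, which is where the pname.split('__', 1) line in the traceback above fails.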
EDIT:
Based on the findings of @Vivek Kumar, I have written a wrapper for your preprocessing.
from sklearn.preprocessing import StandardScaler
class custom_StandardScaler():
    def __init__(self):
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self

    def transform(self, X, y=None):
        X_new = self.scaler.transform(X)
        X_new = X_new.reshape((X.shape[0], 1, X.shape[1]))
        return X_new
This lets you apply the standard scaler and add the new dimension in one step. Remember that the evaluation dataset has to be transformed before it is passed in as a fit parameter, hence a separate scaler (offline_scaler) is used for that.
from sklearn.datasets import load_boston
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from keras.layers import LSTM
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import numpy as np
seed = 1
boston = load_boston()
X, y = boston['data'], boston['target']
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=42)
def build_model():
    model = Sequential()
    model.add(LSTM(5, input_shape=(1, X_train.shape[1])))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='Adam', metrics=['mae'])
    return model
new_model = KerasRegressor(build_fn=build_model, verbose=0)
param_gridd = {'reg__epochs':[2,3], 'reg__batch_size':[16,32]}
pipe = Pipeline([('s', custom_StandardScaler()),('reg', new_model)])
offline_scaler = custom_StandardScaler()
offline_scaler.fit(X_train)
X_eval2 = offline_scaler.transform(X_eval)
model = GridSearchCV(estimator=pipe, param_grid=param_gridd,cv=3)
fitted = model.fit(X_train, y_train,**{'reg__validation_data':(X_eval2, y_eval),'reg__verbose':2, 'reg__shuffle':False} )
As @AI_Learning said, this line should work:
fitted = model.fit(X_train2, y_train2,
reg__validation_data=(X_test2, y_test2),
reg__verbose=2, reg__shuffle=False)
Pipeline requires parameters to be named as "component__parameter", so prepending reg__ to the parameter names works.
However, this still won't work, because the StandardScaler will complain about the data dimensions. You see, when you did:
X_train2 = X_train.values.reshape((X_train.shape[0], 1, X_train.shape[1]))
...
X_test2 = X_test.values.reshape((X_test.shape[0], 1, X_test.shape[1]))
you made X_train2 and X_test2 3-D. That is what the LSTM needs, but it won't work with StandardScaler, which requires 2-D data of shape (n_samples, n_features).
If you remove the StandardScaler from your pipe like this:
pipe = Pipeline([('reg', new_model)])
and try the code that @AI_Learning and I suggested, it will work. This shows that the problem has nothing to do with the pipeline itself, but with combining incompatible transformers.
You can take the StandardScaler out of the pipeline and do this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle=False)
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
# After scaling, X_train is already a NumPy array, so there is no .values attribute
X_train2 = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
y_train2 = np.ravel(y_train.values)
...
...

python pipeline does not execute imputer

I am analysing the gapminder dataset [1] using a pipeline in Python, but for some reason the imputer does not replace the NaN values. According to the documentation ("For missing values encoded as np.nan, use the string value “NaN”.") I should do it as shown below, but the code crashes with "ValueError: Input contains NaN" in the line "gm_cv.fit(X_train, y_train)". However, gm_cv was created from the pipeline, and the pipeline contains the imputation step, which should remove the NaNs. Why does this not work?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
fn = 'gapminder.csv'
df = pd.read_csv(fn, delimiter=',')
# replace empty strings with numpy nans
df.replace('', np.nan, inplace=True)
df.replace(' ', np.nan, inplace=True)
targetVariable = 'lifeexpectancy'
X = df.drop([targetVariable, 'country'], axis=1).values
y = df[targetVariable]
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio': np.linspace(0,1,30)}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.4, random_state=98)
# Create the GridSearchCrossValidation object
gm_cv = GridSearchCV(pipeline, parameters, cv=3)
# Fit to the training set
gm_cv.fit(X_train, y_train)
# results:
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet Alpha: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
stack trace of full error:
python.exe pipline_and_classification_II.py
Traceback (most recent call last):
File "pipline_and_classification_II.py", line 55, in <module>
gm_cv.fit(X_train, y_train)
File "lib\site-packages\sklearn\model_selection\_search.py", line 639, in fit
cv.split(X, y, groups)))
File "lib\site-packages\sklearn\externals\joblib\parallel.py", line 779, in __call__
while self.dispatch_one_batch(iterator):
File "lib\site-packages\sklearn\externals\joblib\parallel.py", line 625, in dispatch_one_batch
self._dispatch(tasks)
File "lib\site-packages\sklearn\externals\joblib\parallel.py", line 588, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 111, in apply_async
result = ImmediateResult(func)
File "lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 332, in __init__
self.results = batch()
File "lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "lib\site-packages\sklearn\model_selection\_validation.py", line 458, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "lib\site-packages\sklearn\pipeline.py", line 250, in fit
self._final_estimator.fit(Xt, y, **fit_params)
File "lib\site-packages\sklearn\linear_model\coordinate_descent.py", line 709, in fit
ensure_2d=False)
File "lib\site-packages\sklearn\utils\validation.py", line 453, in check_array
_assert_all_finite(array)
File "lib\site-packages\sklearn\utils\validation.py", line 44, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Process finished with exit code 1
Update:
Debugging it slowly (without the pipeline) shows that the Imputer does not like 1d arrays (like y in my code above). When doing the nan-removing manually before with the code below it works.
y = np.array(y)
idx = np.argwhere(np.isnan(y))
y[idx] = np.nanmean(y)
But this defeats the purpose of the pipeline. Any ideas how to get this running without manual tinkering?
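The update above points at the target: the traceback ends in ElasticNet's own input check, and pipeline steps only ever transform X, never y. One possible approach, sketched below on synthetic data, is to drop the rows with a missing target before fitting and let the pipeline impute the features. The data and the alpha value are assumptions, and SimpleImputer from sklearn.impute is used because the older Imputer class is no longer available in current scikit-learn releases.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=40)
X[::7, 1] = np.nan   # missing feature values: handled inside the pipeline
y[::10] = np.nan     # missing targets: the pipeline never touches these

# Drop rows whose target is missing before fitting.
keep = ~np.isnan(y)
X_clean, y_clean = X[keep], y[keep]

pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("elasticnet", ElasticNet(alpha=0.1)),
])
pipeline.fit(X_clean, y_clean)
print(pipeline.predict(X_clean[:3]).shape)  # (3,)
```

Dropping such rows is usually preferable to filling the target with its mean, since an imputed target adds no real information to the fit.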
[1] http://makemeanalyst.com/download-and-learn-about-gapminder-dataset/

MLP classification fitting

I'm new to machine learning and I'm working on a Python application that classifies poker hands using a dataset, snippets of which I will post. It does not seem to work well, and I am getting the following error:
Traceback (most recent call last):
File "C:\Users\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-62-0d21cd839ce4>", line 1, in <module>
mlp.fit(X_test, y_train.values.reshape(len(y_train), 1))
File "C:\Users\Anaconda3\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 618, in fit
return self._fit(X, y, incremental=False)
File "C:\Users\Anaconda3\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 330, in _fit
X, y = self._validate_input(X, y, incremental)
File "C:\Users\Anaconda3\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 902, in _validate_input
multi_output=True)
File "C:\Users\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 531, in check_X_y
check_consistent_length(X, y)
File "C:\Users\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 181, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [6253, 18757]
here is the code I am trying to produce:
import pandas as pnd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
training_data = pnd.read_csv("train.csv")
training_data['id'] = range(1, len(training_data) + 1) # For 1-base index
training_datafile = training_data
target = training_datafile['hand']
data = training_datafile.drop(['id', 'hand'], axis=1)
X = data
y = target
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.shape
y_train.shape
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
mlp = MLPClassifier(hidden_layer_sizes=(100, 100, 100))
mlp.fit(X_test, y_train.values.reshape(len(y_train), 1))
predictions = mlp.predict(X_test)
len(mlp.coefs_)
len(mlp.coefs_[0])
len(mlp.intercepts_[0])
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
The shape of X_train.shape is (18757, 10) and the shape of y_train.shape is (18757,)
Following a previous post, I have tried using:
y_train.values.reshape(len(y_train), 1)
But I still get the same error. Some guidance would be of much help since I am not sure of what the shape has wrong.
Data snippet:
You are fitting on X_test instead of X_train:
mlp.fit(X_train, y_train.values.reshape(len(y_train), 1))
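The two numbers in the error message (6253 and 18757) are the row counts of X_test and y_train, so a quick shape check before fitting would have caught the mix-up. A minimal sketch on synthetic data follows; the data, layer size, and iteration count are assumptions chosen to keep it fast.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# fit() needs X and y from the same split, with one label per row.
assert X_train.shape[0] == y_train.shape[0]

mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=300, random_state=0)
mlp.fit(X_train, y_train)  # a 1-D y is accepted directly
predictions = mlp.predict(X_test)
print(predictions.shape == y_test.shape)
```

Since a 1-D label array is accepted as-is, the reshape to a column vector is not needed for the fit to work.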
