kfold cross validation wont terminate, stuck at cross_val_score

kfold cross validation wont terminate, stuck at cross_val_score - python

I am trying to run kfold cross validation. but for some reason, it gets stuck here, it wont terminate from here accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1)
i cant understand whats the problem. and how do i fix it.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:,1:]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
import keras
from keras.models import Sequential #Required to initialize the ANN
from keras.layers import Dense #Build layers of ANN
from keras.layers import Dropout
# Evaluating the ANN
import keras
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential #Required to initialize the ANN
from keras.layers import Dense #Build layers of ANN
def build_classifier(): # Builds the architecture, or the classifier
classifier = Sequential()
classifier.add(Dense(activation = 'relu', input_dim = 11, units = 6, kernel_initializer = 'uniform'))# add layers
classifier.add(Dense(activation = 'relu', units = 6, kernel_initializer = 'uniform'))# add layers
classifier.add(Dense(activation = 'sigmoid', units = 1, kernel_initializer = 'uniform'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
return classifier
classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, nb_epoch = 100)
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1)
mean = accuracies.mean()
variance = accuracies.std()
Edit
Im on windows 10 using Anaconda with python 3.6.
Dataset : Drive Link for dataset
It works perfectly when i set n_jobs = 1 but not when n_jobs = -1

Since you have set the n_jobs = -1, then all the CPUs are being utlised as per the documentation mentioned here. However, you must understand that utilising all the CPUs does not necessarily may lead to reduction in execution time because:
There is an overhead invovled with creation and allocation of reasources to new threads.
Also, there might be other bottlenecks like data being to large to be broadcasted to all threads at the same time, thread pre-emption over RAM (or other resouces,etc.), how data is pushed into each thread, etc.
Also multithreading in Python has various shortcomings, see here and here.
You can check out a similar issue with GridSearchCV and parallization here in this answer.
Also, as mentioned by #ncfith, there is no current solution for this problem.
References
Why do I sometime get a crash/freeze with n_jobs > 1 under OSX or Linux?
Similar issue with numpy on MacOS

Related

how to predict multiple dependent columns from 1 independent column

is it possible to predict multiple dependent columns from independent columns?
Problem Statement: I have to predict 5 factors(cEXT, cNEU,cAGR, cCON, cOPN) on the basis of STATUS column, so input variable will be STATUS column only and target variables are (cEXT, cNEU,cAGR, cCON, cOPN).
here in the above data STATUS is an independent column and cEXT, cNEU,cAGR, cCON, cOPN are the dependent columns, how can I predict those?
# independent and dependent variable split
X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
right now I am predicting only one column so repeating the same thing 5 times so I am creating 5 models for 5 target variables.
Code:
X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
ct = ColumnTransformer([
('step1', TfidfVectorizer(), 'STATUS')
],remainder='drop')
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, classification_report, cohen_kappa_score
from sklearn import metrics
from sklearn.pipeline import Pipeline
# ##########
# RandomForest
# ##########
model = Pipeline([
('column_transformers', ct),
('model', RandomForestClassifier(criterion = 'gini', n_estimators=100, n_jobs = -1, class_weight = 'balanced', max_features = 'auto')),
])
# creating 5 models, can I create 1 model?
model_cEXT = model.fit(X_train, y_train['cEXT'])
model_cNEU = model.fit(X_train, y_train['cNEU'])
model_cAGR = model.fit(X_train, y_train['cAGR'])
model_cCON = model.fit(X_train, y_train['cCON'])
model_cOPN = model.fit(X_train, y_train['cOPN'])

You can use multioutput classifier from scikit-learn.
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
clf = MultiOutputClassifier(RandomForestClassifier()).fit(X_train, y_train)
clf.predict(X_test)
Reference:
Official document of MultiOutputClassifier

There is a library scikit-multilearn which is very good for these tasks. There are several ways to do multi-label classification such as PowerSet, ClassifierChain etc. These are very well covered in this library.
Below is a sample of how it will replace your current code.
X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)
# Rest of your code
==========================
# The new code
from skmultilearn.problem_transform import BinaryRelevance
from scipy.sparse import csr_matrix
classifier = BinaryRelevance(
classifier = RandomForestClassifier(criterion = 'gini', n_estimators=100, n_jobs = -1, class_weight = 'balanced', max_features = 'auto'),
require_dense = [False, True]
)
model = Pipeline([
('column_transformers', ct),
('classifier', classifier),
])
model.fit(X_train, y_train.values)
res = model.predict(X_test)
res = csr_matrix(res)
res.todense()
You can explore other methods here.
In TensorFlow you can do this using sigmoid activation and binaryCE loss on all the units. As below:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
tfidf_calculator = TextVectorization(
standardize = 'lower_and_strip_punctuation',
split = 'whitespace',
max_tokens = 100,
output_mode ='tf-idf',
pad_to_max_tokens=False)
tfidf_calculator.adapt(df['Status'].values)
tfids = tfidf_calculator(df['Status'])
X = tfids.numpy()
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)
model = tf.keras.Sequential([
tf.keras.layers.InputLayer(input_shape=(100,)),
tf.keras.layers.Dense(10, activation='relu'),
tf.keras.layers.Dense(5, activation='sigmoid')
])
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy())
model.fit(X_train, y_train, epochs=20, batch_size=32)
The thing to take note of in TensorFlow is that you need a dense matrix as input. There might be a way to use sparse but I didn't find any.

Predicting the square root of a number using Machine Learning

I am trying to create a program in python that uses machine learning to predict the square root of a number. I am listing what all I have done in my program:-
created a csv file with numbers and their squares
extracted the data from csv into suitable variables (X stores squares, y stores numbers)
scaled the data using sklearn's, StandardScaler
built the ANN with two hidden layers each of 6 units (no activation functions)
compiled the ANN using SGD as the optimizer and mean squared error as the loss function
trained the model. Loss was around 0.063
tried predicting but the result is something else.
My actual code:-
import numpy as np
import tensorflow as tf
import pandas as pd
df = pd.read_csv('CSV/SQUARE-ROOT.csv')
X = df.iloc[:, 1].values
X = X.reshape(-1, 1)
y = df.iloc[:, 0].values
y = y.reshape(-1, 1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_test_sc = sc.fit_transform(X_test)
X_train_sc = sc.fit_transform(X_train)
sc1 = StandardScaler()
y_test_sc1 = sc1.fit_transform(y_test)
y_train_sc1 = sc1.fit_transform(y_train)
ann = tf.keras.models.Sequential()
ann.add(tf.keras.layers.Dense(units=6))
ann.add(tf.keras.layers.Dense(units=6))
ann.add(tf.keras.layers.Dense(units=1))
ann.compile(optimizer='SGD', loss=tf.keras.losses.MeanSquaredError())
ann.fit(x = X_train_sc, y = y_train_sc1, batch_size=5, epochs = 100)
print(sc.inverse_transform(ann.predict(sc.fit_transform([[144]]))))
OUTPUT:- array([[143.99747]], dtype=float32)
Shouldn't the output be 12? Why is it giving me the wrong result?
I am attaching the csv file I used to train my model as well: SQUARE-ROOT.csv

TL;DR: You really need those nonlinearities.
The reason behind it not working could be one (or a combination) of several causes, like bad input data range, flaws in your data, over/underfitting, etc.
However, in this specific case the model you build literally can't learn the function you're trying to approximate, because not having nonlinearities makes this a purely linear model, which can't accurately approximate nonlinear functions.
A Dense layer is implemented as follows:
x_res = activ_func(w*x + b)
where x is the layer input, w the weights, b the bias vector and activ_func the activation function (if one is defined).
Your model, then, mathematically becomes (I'm using indices 1 to 3 for the three Dense layers):
pred = w3 * (w2 * ( w1 * x + b1 ) + b2 ) + b3
= w3*w2*w1*x + w3*w2*b1 + w3*b2 + b3
As you see, the resulting model is still linear.
Add activation functions and your mode becomes capable of learning nonlinear functions too. From there, experiment with the hyperparameters and see how the performance of your model changes.

The reason your code does not work is because you apply fit_transform to your test set, which is wrong. You can fix it by replacing fit_transform(test) to transform(test). Although I don't think StandardScaler is neccessary, please try this:
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
N = 10000
X = np.arange(1, N).reshape(-1, 1)
y = np.sqrt(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
#X_test_sc = sc.fit_transform(X_test) # wrong!!!
X_test_sc = sc.transform(X_test)
sc1 = StandardScaler()
y_train_sc1 = sc1.fit_transform(y_train)
#y_test_sc1 = sc1.fit_transform(y_test) # wrong!!!
y_test_sc1 = sc1.transform(y_test)
ann = tf.keras.models.Sequential()
ann.add(tf.keras.layers.Dense(units=32, activation='relu')) # you have 10000 data, maybe you need a little deeper network
ann.add(tf.keras.layers.Dense(units=32, activation='relu'))
ann.add(tf.keras.layers.Dense(units=32, activation='relu'))
ann.add(tf.keras.layers.Dense(units=1))
ann.compile(optimizer='SGD', loss='MSE')
ann.fit(x=X_train_sc, y=y_train_sc1, batch_size=32, epochs=100, validation_data=(X_test_sc, y_test_sc1))
#print(sc.inverse_transform(ann.predict(sc.fit_transform([[144]])))) # wrong!!!
print(sc1.inverse_transform(ann.predict(sc.transform([[144]]))))

How to scale target values of a Keras autoencoder model using a sklearn pipeline?

I'm using sklearn pipelines to build a Keras autoencoder model and use gridsearch to find the best hyperparameters. This works fine if I use a Multilayer Perceptron model for classification; however, in the autoencoder I need the output values to be the same as input. In other words, I am using a StandardScalar instance in the pipeline to scale the input values and therefore this leads to my question: how can I make the StandardScalar instance inside the pipeline to work on both the input data as well as target data, so that they end up to be the same?
I'm providing a code snippet as an example.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop, Adam
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
X, y = make_classification (n_features = 50, n_redundant = 0, random_state = 0,
scale = 100, n_clusters_per_class = 1)
# Define wrapper
def create_model (learn_rate = 0.01, input_shape, metrics = ['mse']):
model = Sequential ()
model.add (Dense (units = 64, activation = 'relu',
input_shape = (input_shape, )))
model.add (Dense (32, activation = 'relu'))
model.add (Dense (8, activation = 'relu'))
model.add (Dense (32, activation = 'relu'))
model.add (Dense (input_shape, activation = None))
model.compile (loss = 'mean_squared_error',
optimizer = Adam (lr = learn_rate),
metrics = metrics)
return model
# Create scaler
my_scaler = StandardScaler ()
steps = list ()
steps.append (('scaler', my_scaler))
standard_scaler_transformer = Pipeline (steps)
# Create classifier
clf = KerasRegressor (build_fn = create_model, verbose = 2)
# Assemble pipeline
# How to scale input and output??
clf = Pipeline (steps = [('scaler', my_scaler),
('classifier', clf)],
verbose = True)
# Run grid search
param_grid = {'classifier__input_shape' : [X.shape [1]],
'classifier__batch_size' : [50],
'classifier__learn_rate' : [0.001],
'classifier__epochs' : [5, 10]}
cv = KFold (n_splits = 5, shuffle = False)
grid = GridSearchCV (estimator = clf, param_grid = param_grid,
scoring = 'neg_mean_squared_error', verbose = 1, cv = cv)
grid_result = grid.fit (X, X)
print ('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))

You can use TransformedTargetRegressor to apply arbitrary transformations on the target values (i.e. y) by providing either a function (i.e. using func argument) or a transformer (i.e. transformer argument).
In this case (i.e. fitting an auto-encoder model), since you want to apply the same StandardScalar instance on the target values as well, you can use transformer argument. And it could be done in one of the following ways:
You can use it as one of the pipeline steps, wrapping the regressor:
scaler = StandardScaler()
regressor = KerasRegressor(...)
pipe = Pipeline(steps=[
('scaler', scaler),
('ttregressor', TransformedTargetRegressor(regressor, transformer=scaler))
])
# Use `__regressor` to access the regressor hyperparameters
param_grid = {'ttregressor__regressor__hyperparam_name' : ...}
gridcv = GridSearchCV(estimator=pipe, param_grid=param_grid, ...)
gridcv.fit(X, X)
Alternatively, you can wrap it around the GridSearchCV like this:
ttgridcv = TransformedTargetRegressor(GridSearchCV(...), transformer=scalar)
ttgridcv.fit(X, X)
# Use `regressor_` attribute to access the fitted regressor (i.e. `GridSearchCV` instance)
print(ttgridcv.regressor_.best_score_, ttgridcv.regressor_.best_params_))

Trying to implement XGBoost into my Artificial Neural Network

I'm completely unaware as to why i'm receiving this error. I am trying to implement XGBoost but it returns with error "ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric." Even after i've One Hot Encoded my categorical data. If anyone knows what is causing this and a possible solution i'd greatly appreciate it. Here is my code written in Python:
# Artificial Neural Networks - With XGBoost
# PRE PROCESS
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
# Encoding Categorical Data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1, 2])],
remainder = 'passthrough')
X = np.array(ct.fit_transform(X), dtype = np.float)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 0)
# Fitting XGBoost to the training set
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(x_train, y_train)
# Predicting the Test set Results
y_pred = classifier.predict(x_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()

Keras : GridSearchCV for Hyperparameter Tuning

I'm currently training a CNN for classifying waves. While the code works perfectly, the GridSearchCV for hyperparameter tuning does not work as intended. I was confused because I used similar code for tuning hyperparameters in MLP and it works like a charm. This is the full code, and by the way, I'm using TF as backend.
import pandas as pd
import numpy as np
#Import training set
training_set = pd.read_csv("training_set.csv", delimiter=";")
X_train = training_set.iloc[:,1:].values
y_train = training_set.iloc[:,0:1].values
#Import test set
test_set = pd.read_csv("test_set_v2.csv", delimiter=";")
X_test = test_set.iloc[:,1:].values
y_test = test_set.iloc[:,0:1].values
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.fit_transform(X_test)
#Convert X into 3D tensor
X_train = np.reshape(X_train,(X_train.shape[0],X_train.shape[1],1))
X_test = np.reshape(X_test,(X_test.shape[0],X_test.shape[1],1))
#Importing the CNN libraries
from keras.models import Sequential
from keras.layers import Conv1D,MaxPooling1D,Flatten
from keras.layers import Dropout,Dense
from keras.layers.normalization import BatchNormalization
#Parameter tuning
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
def build_classifier(optimizer, dropout1, dropout2):
classifier = Sequential()
classifier.add(Conv1D(filters=4,kernel_size=4,activation='relu',input_shape=(X_train.shape[1],1)))
classifier.add(MaxPooling1D(strides=4))
classifier.add(BatchNormalization())
classifier.add(Flatten())
classifier.add(Dropout(0.25))
classifier.add(Dense(8, activation='relu'))
classifier.add(Dropout(0.25))
classifier.add(Dense(1,activation='sigmoid'))
classifier.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
return classifier
classifier = KerasClassifier(build_fn=build_classifier)
parameters = {'batch_size': [25,32],
'epochs': [5,10],
'optimizer': ['adam', 'rmsprop'],
'dropout1' : [0.2,0.25,3],
'dropout2' : [0.2,0.25,3],
}
grid_search = GridSearchCV(estimator=classifier,
param_grid = parameters,
scoring = 'accuracy',
cv = 10)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_
The strange thing is, it was running perfectly for an epoch then it raises the following error.
File "C:\Program Files\Anaconda3\lib\site-> >packages\keras\wrappers\scikit_learn.py", line 220, in predict
return self.classes_[classes]
IndexError: index 1 is out of bounds for axis 0 with size 1
Can ayone help me? Any kind of help is greatly appreciated! Thanks a lot guys!

SOLVED
Update via github master branch

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

kfold cross validation wont terminate, stuck at cross_val_score - python

Related

how to predict multiple dependent columns from 1 independent column

Predicting the square root of a number using Machine Learning

How to scale target values of a Keras autoencoder model using a sklearn pipeline?

Trying to implement XGBoost into my Artificial Neural Network

Keras : GridSearchCV for Hyperparameter Tuning

Categories

Resources