I'm trying to tune the hyperparameters of an MLPClassifier using GridSearchCV, but I'm getting the following warnings:
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan.
Details:
ValueError: learning rate 0.01 is not supported.
FitFailedWarning)
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan.
Details:
ValueError: learning rate 0.02 is not supported
........
Code:
clf = MLPClassifier()
params = {
    'hidden_layer_sizes': hidden_layers_generator(X,np.arange(1,17,1)),
    'solver': ['sgd'],
    'momentum': np.arange(0.1,1.1,0.1),
    'learning_rate': np.arange(0.01,1.01,0.01),
    'max_iter': np.arange(100,2100,100)}
grid = GridSearchCV(clf, params, cv=10, scoring='accuracy')
grid.fit(X, y)
grid_mean_scores = grid.cv_results_['mean_test_score']
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
The code of hidden_layers_generator is as follows:
from itertools import combinations_with_replacement
def hidden_layers_generator(df,hidden_layers):
    hd_sizes = []
    for l in range(1, len(hidden_layers)):
        comb = combinations_with_replacement(np.arange(1,len(df.columns),10), l)
        hd_sizes.append(list(comb))
    return hd_sizes
Here's a small snippet of X and y dataframes:
X.head()
sl sw pl pw
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
y.head()
0 0
1 1
2 1
3 0
4 0
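If you look at the documentation of MLPClassifier, you will see that the learning_rate parameter is not a numeric step size. It selects the learning-rate schedule and only accepts the strings 'constant', 'invscaling', or 'adaptive', which is why 0.01 is "not supported". The numeric value you want to tune is learning_rate_init. So change this line in the parameter grid: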
'learning_rate': np.arange(0.01,1.01,0.01),
to
'learning_rate_init': np.arange(0.01,1.01,0.01),
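To make the difference between the two parameters concrete, here is a minimal sketch. The tiny grid, the single hidden layer, and the use of the built-in iris data are my own choices for illustration, not your setup:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Stand-in data (an assumption); use your own X and y instead.
X, y = load_iris(return_X_y=True)

params = {
    "solver": ["sgd"],
    "learning_rate": ["constant", "adaptive"],  # schedule type: a string
    "learning_rate_init": [0.01, 0.1],          # numeric initial step size
}

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=300, random_state=0)
grid = GridSearchCV(clf, params, cv=3, scoring="accuracy")
grid.fit(X, y)

print(grid.best_params_)
# True here would mean that at least one fit failed and was scored as nan.
print(np.isnan(grid.cv_results_["mean_test_score"]).any())
```

You may still see convergence warnings with so few iterations, but those are unrelated to the ValueError above.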
I am testing an SVM with a sigmoid kernel on the iris data using sklearn and SVC. Its performance is extremely poor with an accuracy of 25 %. I'm using exactly the same code and normalizing the features as https://towardsdatascience.com/a-guide-to-svm-parameter-tuning-8bfe6b8a452c (sigmoid section) which should increase performance substantially. However, I am not able to reproduce his results and the accuracy only increases to 33 %.
Using other kernels (e.g. a linear kernel) produces good results (accuracy of 82 %).
Could there be an issue within the SVC(kernel = 'sigmoid') function?
Python code to reproduce problem:
##sigmoid iris example
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.metrics import classification_report
from sklearn.svm import SVC
iris = datasets.load_iris()
sepal_length = iris.data[:,0]
sepal_width = iris.data[:,1]
#assessing performance of sigmoid SVM
clf = SVC(kernel='sigmoid')
clf.fit(np.c_[sepal_length, sepal_width], iris.target)
pr=clf.predict(np.c_[sepal_length, sepal_width])
pd.DataFrame(classification_report(iris.target, pr, output_dict=True))
from sklearn.metrics.pairwise import sigmoid_kernel
sigmoid_kernel(np.c_[sepal_length, sepal_width])
#normalizing features
from sklearn.preprocessing import normalize
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
sigmoid_kernel(np.c_[sepal_length_norm, sepal_width_norm])
#assessing performance of sigmoid SVM with normalized features
pr_norm=clf.predict(np.c_[sepal_length_norm, sepal_width_norm])
pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True))
I see what's happening. In sklearn releases pre 0.22 the default gamma parameter passed to the SVC was "auto", and in subsequent releases this was changed to "scale". The author of the article seems to have been using a previous version and therefore implicitly passing gamma="auto" (he mentions that the "current default setting for gamma is ‘auto’"). So if you're on the latest release of sklearn (0.23.2), you'll want to explicitly pass gamma='auto' when instantiating the SVC:
clf = SVC(kernel='sigmoid',gamma='auto')
#normalizing features
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
So now when you print the classification report:
pr_norm=clf.predict(np.c_[sepal_length_norm, sepal_width_norm])
print(pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True)))
# 0 1 2 accuracy macro avg weighted avg
# precision 0.907407 0.650000 0.750000 0.766667 0.769136 0.769136
# recall 0.980000 0.780000 0.540000 0.766667 0.766667 0.766667
# f1-score 0.942308 0.709091 0.627907 0.766667 0.759769 0.759769
# support 50.000000 50.000000 50.000000 0.766667 150.000000 150.000000
What would explain the 33% accuracy you were seeing is the fact that the default gamma is "scale", which then places all predictions in a single region of the decision plane, and as the targets are split into thirds you get a maximum accuracy of 33.3%:
clf = SVC(kernel='sigmoid')
#normalizing features
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
X = np.c_[sepal_length_norm, sepal_width_norm]
pr_norm=clf.predict(np.c_[sepal_length_norm, sepal_width_norm])
print(pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True)))
# 0 1 2 accuracy macro avg weighted avg
# precision 0.0 0.0 0.333333 0.333333 0.111111 0.111111
# recall 0.0 0.0 1.000000 0.333333 0.333333 0.333333
# f1-score 0.0 0.0 0.500000 0.333333 0.166667 0.166667
# support 50.0 50.0 50.000000 0.333333 150.000000 150.000000
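To see why the two settings behave so differently on these features, it may help to compute both gamma values by hand. The sketch below assumes the documented definitions (gamma='auto' is 1 / n_features, gamma='scale' is 1 / (n_features * X.var())) and the default coef0=0 of SVC. It evaluates the sigmoid kernel directly with NumPy rather than going through SVC:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import normalize

iris = datasets.load_iris()
# Same preprocessing as above: each feature vector is scaled to unit L2 norm,
# so every individual value becomes very small.
sepal_length_norm = normalize(iris.data[:, 0].reshape(1, -1))[0]
sepal_width_norm = normalize(iris.data[:, 1].reshape(1, -1))[0]
X = np.c_[sepal_length_norm, sepal_width_norm]

n_features = X.shape[1]
gamma_auto = 1.0 / n_features                # gamma='auto'
gamma_scale = 1.0 / (n_features * X.var())   # gamma='scale'
print("gamma auto: ", gamma_auto)
print("gamma scale:", gamma_scale)

# Sigmoid kernel with coef0=0: tanh(gamma * <x, x'>)
gram = X @ X.T
k_auto = np.tanh(gamma_auto * gram)
k_scale = np.tanh(gamma_scale * gram)
print("kernel range, auto: ", k_auto.min(), k_auto.max())
print("kernel range, scale:", k_scale.min(), k_scale.max())
```

Because the normalized features have a tiny variance, gamma='scale' becomes very large and tanh saturates, so every pair of samples gets almost the same kernel value. That is consistent with all predictions landing in a single class.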
I'm trying to get a confusion matrix for each fold of a 10-fold cross-validation, for any model (random forest, decision tree, naive Bayes, etc.).
I am able to get a single confusion matrix when I fit a model on a plain train/test split, as shown below:
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
# implementing train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=66)
# random forest model creation
rfc = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
rfc.fit(X_train,y_train)
# predictions
rfc_predict = rfc.predict(X_test)
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict))
Out[1]:
=== Confusion Matrix ===
[[16243 1011]
[ 827 16457]]
=== Classification Report ===
              precision    recall  f1-score   support
           0       0.95      0.94      0.95     17254
           1       0.94      0.95      0.95     17284
    accuracy                           0.95     34538
   macro avg       0.95      0.95      0.95     34538
weighted avg       0.95      0.95      0.95     34538
But now I want a confusion matrix for each fold of a 10-fold cross-validation. How should I approach this? I tried the following, but it does not work:
# from sklearn import cross_validation
from sklearn.model_selection import KFold, cross_validate
kfold = KFold(n_splits=10)
conf_matrix_list_of_arrays = []
kf = cross_validate(rfc, X, y, cv=kfold)
print(kf)
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    rfc.fit(X_train, y_train)
    conf_matrix = confusion_matrix(y_test, rfc.predict(X_test))
    conf_matrix_list_of_arrays.append(conf_matrix)
The dataset consists of this dataframe, dp:
Temperature  Series  Parallel  Shading  Number of cells  Voltage(V)  Current(I)  I/V     Solar Panel  Cell  Shade Percentage  IsShade
30           10      1         2        10               1.11        2.19        1.97    1985         1     20.0              1
27           5       2         10       10               2.33        4.16        1.79    1517         3     100.0             1
30           5       2         7        10               2.01        4.34        2.16    3532         1     70.0              1
40           2       4         3        8                1.13        -20.87      -18.47  6180         1     37.5              1
45           5       2         4        10               1.13        6.52        5.77    8812         3     40.0              1
According to its help page, cross_validate does not return the indices used for cross-validation; it returns a dictionary of scores and timings. You need to get the indices from the (Stratified)KFold object instead. Using an example dataset:
import numpy as np
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.ensemble import RandomForestClassifier
data = datasets.load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=66)
skf = StratifiedKFold(n_splits=10,random_state=111,shuffle=True)
rfc = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
We apply cross_val_predict to the training data to get the out-of-fold prediction for every sample:
y_pred = cross_val_predict(rfc, X_train, y_train, cv=skf)
Then use the test indices of each fold to slice y_pred and build one confusion matrix per fold:
mats = []
for train_index, test_index in skf.split(X_train,y_train):
    mats.append(confusion_matrix(y_train[test_index],y_pred[test_index]))
Looks like this:
mats[:3]
[array([[13, 2],
[ 0, 23]]),
array([[14, 1],
[ 1, 22]]),
array([[14, 1],
[ 0, 23]])]
Check that the per-fold matrices add up to the confusion matrix computed on all predictions at once:
np.add.reduce(mats)
array([[130, 14],
[ 6, 225]])
confusion_matrix(y_train,y_pred)
array([[130, 14],
[ 6, 225]])
The problem here is the incorrect unpacking of kf. cross_validate() returns a dictionary of arrays (test scores and fit/score times by default), not pairs of train/test indices.
Instead, use the split() method of your KFold instance, which generates the indices that split the data into training and test (validation) sets. Changing the loop header to
for train_index, test_index in kfold.split(X_train, y_train):
should give you what you are looking for.
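Put together, the loop could look like the sketch below. The breast cancer data is only a stand-in for your dataframe, and I reduced n_estimators to keep it quick. Note the use of .iloc: split() yields positional indices, so plain X[train_index] only works for NumPy arrays, not for a pandas DataFrame.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

# Stand-in data (an assumption); replace with your own X and y.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

rfc = RandomForestClassifier(n_estimators=50, random_state=39, max_depth=4)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
labels = np.unique(y)  # fixed label order so every matrix has the same shape

conf_matrix_list_of_arrays = []
for train_index, test_index in kfold.split(X, y):
    # .iloc selects rows by position, which is what split() returns
    X_tr, X_te = X.iloc[train_index], X.iloc[test_index]
    y_tr, y_te = y.iloc[train_index], y.iloc[test_index]
    model = clone(rfc).fit(X_tr, y_tr)  # fresh, unfitted copy for each fold
    conf_matrix_list_of_arrays.append(
        confusion_matrix(y_te, model.predict(X_te), labels=labels)
    )

# Summing the per-fold matrices gives the overall cross-validated matrix.
total = np.sum(conf_matrix_list_of_arrays, axis=0)
print(total)
```

Passing labels explicitly is a small design choice: if a fold happens to miss a class, the matrix still keeps its full shape and the sum stays well defined.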
I am going to fit a negative binomial regression model on the dataset and examine the model scores and the features' weights and significance using K-fold cross-validation. Here is the dataframe after applying the MinMax scaler. w4 is a categorical variable.
data.head()
w1 w2 w3 w4 Y
0 0.17 0.44 0.00 2004 1
1 0.17 0.83 0.22 2004 0
2 0.00 1.00 0.34 2005 0
3 1.00 0.00 1.00 2005 1
4 1.00 0.22 0.12 2006 3
I used the following code to get the score of the trained model on the test dataset, but there seems to be a problem in how I pass the train and test data to the model. I would appreciate any help.
scores = []
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for train, test in kfold.split(data):
    model = smf.glm(formula = "Y ~ w1 + w2 + w3 + C(w4)", data=X.iloc[train,:], family=sm.families.NegativeBinomial()).fit()
    scores = scores.append(model.get_prediction(X.iloc[test,:]))
print(scores)
Have you defined X and Y? You pass the data DataFrame to kfold.split, yet you later index X, which is never created. Because the formula refers to Y and the predictors by column name, the simplest fix is to slice data itself with .iloc, so that each fold is still a DataFrame containing all the columns.
Also, scores = scores.append(...) replaces your list with None, because list.append modifies the list in place and returns None. Call scores.append(...) without the assignment.
For instance:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import KFold
preds, scores = [], []
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for train_idx, test_idx in kfold.split(data):
    # The formula interface needs DataFrames with named columns
    train_df, test_df = data.iloc[train_idx], data.iloc[test_idx]
    model = smf.glm(formula = "Y ~ w1 + w2 + w3 + C(w4)",
                    data=train_df,
                    family=sm.families.NegativeBinomial()).fit()
    fold_pred = model.predict(test_df)
    preds.append(fold_pred)
    # Held-out RMSE, used here as one possible per-fold score
    scores.append(np.sqrt(np.mean((test_df['Y'] - fold_pred) ** 2)))
print(scores)
How can I handle the error ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous-multioutput' instead?
I tried something with from sklearn.utils.multiclass import type_of_target, and with x[0], y[0], but without success.
[Image: visualization of X]
[Image: visualization of Y]
X.shape, Y.shape
((336, 10), (336, 5))
Deep learning model:
for train, test in kfold.split(X, Y):
    model = Sequential()
    model.add(Dense(10, input_dim=20,
                    kernel_regularizer=l2(0.001),
                    kernel_initializer=VarianceScaling(),
                    activation='sigmoid'))
    model.add(Dense(5,
                    kernel_regularizer=l2(0.01),
                    kernel_initializer=VarianceScaling(),
                    activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['acc'])
    model.fit(X[train], Y[train], epochs=50, batch_size=25, verbose = 0,
              validation_data=(X[test], Y[test]))
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[2], scores[2]*100))
    cvscores.append(scores[2] * 100)
---------------------------------------------------------------------------
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous-multioutput' instead.
StratifiedKFold is not meant to be used for multilabel targets as already pointed out here. It needs a 1D-array to determine how to split the indices.
I suppose you want to split your target based on the label with the highest probability. One way to achieve this goal would be to create a 1D-array indicating the target with the highest probability and pass this one to StratifiedKFold instead of the multilabel target.
Let's say you have your sample data in a pandas DataFrame y and it looks like this:
0 1 2 3 4
0 0.966 0.000 0.0 0.2 0.0
1 0.966 0.000 0.0 0.0 0.2
2 0.000 0.966 0.5 0.0 0.0
3 0.000 0.966 0.0 0.0 0.0
4 0.966 0.000 0.0 0.0 0.0
Then, create a new object with idxmax to find the target with highest probability:
y_max = y.idxmax(axis=1)
This gives you an output like this:
0 0
1 0
2 1
3 1
4 0
dtype: int64
Now you can pass this array to StratifiedKFold and obtain the indices you need:
for train, test in kfold.split(X, y_max):
    ...
    model.fit(X[train], Y[train], epochs=50, batch_size=25, verbose = 0,
              validation_data=(X[test], Y[test]))
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[2], scores[2]*100))
    cvscores.append(scores[2] * 100)
This way, you can obtain the indices from a 1D-array and still use the original data for training and testing. If your data happens to be in a numpy array, the same can be achieved with numpy's argmax function.
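For the NumPy case, a minimal sketch could look like this. The random arrays only mimic the shapes from the question (336 samples, 10 features, 5 outputs), and the number of splits is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Random stand-in data with the shapes from the question (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(336, 10))
Y = rng.random(size=(336, 5))   # continuous multi-output target

# 1D array holding, for each row, the index of the largest output
y_max = Y.argmax(axis=1)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for train, test in kfold.split(X, y_max):
    # Stratify on y_max, but train and evaluate on the original Y
    X_train, Y_train = X[train], Y[train]
    X_test, Y_test = X[test], Y[test]
    fold_sizes.append(len(test))

print(fold_sizes)
```

Only the split uses y_max, so the model still sees the full multi-output target.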
My question is about preprocessing CSV files before feeding them into a neural network.
I want to build a deep neural network for the famous iris dataset using tflearn in python 3.
Dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
I'm using tflearn to load the csv file. However, the classes column of my data set has words such as iris-setosa, iris-versicolor, iris-virginica.
Neural networks work only with numbers, so I have to find a way to change the classes from words to numbers. Since it is a very small dataset, I can do it manually using Excel or a text editor, and I manually assigned numbers to the different classes.
But I can't possibly do that for every dataset I work with, so I tried using pandas to perform one-hot encoding.
preprocess_data = pd.read_csv(r"F:\Gautam\.....\Dataset\iris_data.csv")
preprocess_data = pd.get_dummies(preprocess_data)
But now, I can't use this piece of code:
data, labels = load_csv('filepath', categorical_labels=True,
n_classes=3)
'filepath' must be a path to the CSV file on disk; it cannot be a variable such as preprocess_data.
Original Dataset:
     Sepal Length  Sepal Width  Petal Length  Petal Width            Class
89            5.5          2.5           4.0          1.3  iris-versicolor
85            6.0          3.4           4.5          1.6  iris-versicolor
31            5.4          3.4           1.5          0.4      iris-setosa
52            6.9          3.1           4.9          1.5  iris-versicolor
111           6.4          2.7           5.3          1.9   iris-virginica
Manually modified dataset:
     Sepal Length  Sepal Width  Petal Length  Petal Width  Class
89            5.5          2.5           4.0          1.3      1
85            6.0          3.4           4.5          1.6      1
31            5.4          3.4           1.5          0.4      0
52            6.9          3.1           4.9          1.5      1
111           6.4          2.7           5.3          1.9      2
Here's my code, which runs perfectly, but only with the manually modified dataset.
import numpy as np
import pandas as pd
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
from tflearn.data_utils import load_csv
data_source = r'F:\Gautam\.....\Dataset\iris_data.csv'
data, labels = load_csv(data_source, categorical_labels=True,
n_classes=3)
network = input_data(shape=[None, 4], name='InputLayer')
network = fully_connected(network, 9, activation='sigmoid', name='Hidden_Layer_1')
network = fully_connected(network, 3, activation='softmax', name='Output_Layer')
network = regression(network, batch_size=1, optimizer='sgd', learning_rate=0.2)
model = tflearn.DNN(network)
model.fit(data, labels, show_metric=True, run_id='iris_dataset', validation_set=0.1, n_epoch=2000)
I want to know if there's any other built-in function in tflearn (or in any other module, for that matter) that I can use to modify the value of my classes from words to numbers. I don't think manually modifying the datasets would be productive.
I'm a beginner in tflearn and neural networks also. Any help would be appreciated. Thanks.
Use LabelEncoder from the sklearn library:
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
df = pd.read_csv('iris_data.csv',header=None)
df.columns=['Sepal Length','Sepal Width','Petal Length','Petal Width','Class']
enc=LabelEncoder()
df['Class']=enc.fit_transform(df['Class'])
print(df.head(5))
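If you want one-hot encoding, note that the result has one column per class, so it cannot be assigned back to the single 'Class' column. Older scikit-learn versions required label-encoding first; since version 0.20, OneHotEncoder also accepts string categories directly. It expects a 2D input, hence the double brackets:
enc_1=OneHotEncoder()
one_hot=enc_1.fit_transform(df[['Class']]).toarray()
print(one_hot[:5])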
These encoders first sort the words in alphabetical order then assign them labels. If you want to see which label is assigned to which class, do:
for k in list(enc.classes_):
    print('name ::{}, label ::{}'.format(k,enc.transform([k])))
If you want to save this dataframe as a csv file, do:
df.to_csv('Processed_Irisdataset.csv',sep=',')
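Here is a small self-contained sketch of the same idea that you can run without the CSV file. The inline DataFrame is made up from the rows shown in the question, with only one feature column kept for brevity:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up sample mirroring the rows in the question (an assumption).
df = pd.DataFrame({
    "Sepal Length": [5.5, 6.0, 5.4, 6.9, 6.4],
    "Class": ["iris-versicolor", "iris-versicolor", "iris-setosa",
              "iris-versicolor", "iris-virginica"],
})

enc = LabelEncoder()
labels = enc.fit_transform(df["Class"])   # integer codes, classes sorted alphabetically
print(labels)
print(dict(zip(enc.classes_, enc.transform(enc.classes_))))

# Going back from codes to the original names
print(enc.inverse_transform(labels))

# One-hot encoding: 2D input, one output column per class
one_hot = OneHotEncoder().fit_transform(df[["Class"]]).toarray()
print(one_hot.shape)
```

Keeping the encoder object around is useful, since inverse_transform lets you turn network predictions back into class names later.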
The simplest solution is to map the column with a dict of all possible values:
df['Class'] = df['Class'].map({'iris-versicolor': 1, 'iris-setosa': 0, 'iris-virginica': 2})
print (df)
Sepal Length Sepal Width Petal Length Petal Width Class
0 89 5.5 2.5 4.0 1.3 1
1 85 6.0 3.4 4.5 1.6 1
2 31 5.4 3.4 1.5 0.4 0
3 52 6.9 3.1 4.9 1.5 1
4 111 6.4 2.7 5.3 1.9 2
If you want to generate the dictionary from all unique values:
d = {v:k for k, v in enumerate(df['Class'].unique())}
print (d)
{'iris-versicolor': 0, 'iris-virginica': 2, 'iris-setosa': 1}
df['Class'] = df['Class'].map(d)
print (df)
Sepal Length Sepal Width Petal Length Petal Width Class
0 89 5.5 2.5 4.0 1.3 0
1 85 6.0 3.4 4.5 1.6 0
2 31 5.4 3.4 1.5 0.4 1
3 52 6.9 3.1 4.9 1.5 0
4 111 6.4 2.7 5.3 1.9 2
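One thing to watch with map: any value that is missing from the dictionary silently becomes NaN. A quick check like the following sketch can catch that. The misspelled 'Iris-setosa' row is made up to trigger the case:

```python
import pandas as pd

# Made-up column with one label that is not in the mapping (an assumption).
df = pd.DataFrame({"Class": ["iris-versicolor", "iris-setosa",
                             "iris-virginica", "Iris-setosa"]})
mapping = {"iris-setosa": 0, "iris-versicolor": 1, "iris-virginica": 2}

codes = df["Class"].map(mapping)   # values without a key become NaN

# List the original labels that were not covered by the mapping
unmapped = df.loc[codes.isna(), "Class"].unique()
print(unmapped)
```

If unmapped is empty, it is safe to assign codes back to df['Class'] and cast it to int.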