How to optimise the structure and parameters of ANN (MLPClassifier) - python

For my University project, I was asked to optimise the structure and parameters of ANN using one or more of the following methods:
Random Search
Meta Learning
Adaptive Boosting
Cascade Correlation
Here is the original code to improve:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
nof_prin_components = 200
pca = PCA(n_components=nof_prin_components, whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
nohn = 200 # nof hidden neurons
clf = MLPClassifier(hidden_layer_sizes=(nohn,), solver='sgd', activation='tanh',
                    batch_size=256, early_stopping=True).fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
print(classification_report(y_test, y_pred))
I have no problems implementing Random Search and Grid Search: they make sense to me, they are well documented, and there are lots of examples of how to use them. Here's how I've implemented it:
When it comes to the rest of the methods, I have no idea how to use them. I can't find any useful examples that I can adapt to my solution.
The question is: what's the easiest of the listed methods (except Random Search) to implement and describe in the report? How can I implement it along with my MLPClassifier?
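For reference, a random search over the PCA + MLPClassifier setup from the question might look roughly like the sketch below. This is not the asker's implementation (which is not shown above); the digits dataset, the parameter ranges, and the small n_iter are assumptions chosen only to keep the example self-contained and quick to run.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Stand-in data (assumption): the question's X and y are not available.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Putting PCA and the MLP in one pipeline means PCA is refit on each CV fold.
pipe = Pipeline([
    ("pca", PCA(whiten=True)),
    ("mlp", MLPClassifier(solver="sgd", activation="tanh", batch_size=256,
                          early_stopping=True, max_iter=200, random_state=0)),
])

# Illustrative search space (assumption): structure and parameters together.
param_distributions = {
    "pca__n_components": [16, 32, 48],
    "mlp__hidden_layer_sizes": [(50,), (100,), (200,), (100, 50)],
    "mlp__learning_rate_init": loguniform(1e-3, 1e-1),
    "mlp__alpha": loguniform(1e-5, 1e-2),
}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=5, cv=3,
                            random_state=0)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(test_accuracy)
```

The "step__parameter" naming is how scikit-learn routes parameters to pipeline steps, so both the network structure (hidden_layer_sizes) and its training parameters can be sampled in one search.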

Related

Using Random Forest algorithm, I have an over-fitting problem and my model does not seem to generalise well. How can I fix this?

I am using the Random Forest algorithm in Python to classify a large dataset with a large number of features.
It seems that the model is not generalizing well and the problem is overfitting, meaning the model is too complex for the given dataset and captures noise in the training data. I don't know what I can do.
This is my code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Load dataset and create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create and fit the Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

Random Forest further improvement [closed]

Following Jason Brownlee's tutorials, I developed my own random forest classifier code, pasted below. I would like to know what further improvements I can make to improve the accuracy of my code.
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05, shuffle = True, random_state=0)
scaler = StandardScaler()
x_train = scaler.fit_transform(X_train)
x_test = scaler.transform(X_test)
# get a list of models to evaluate
def get_models():
    models = dict()
    # consider tree depths from 1 to 7 and None=full
    depths = [i for i in range(1,8)] + [None]
    for n in depths:
        models[str(n)] = RandomForestClassifier(max_depth=n)
    return models
# evaluate model using cross-validation
def evaluate_model(model, X, y):
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model and collect the results
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X, y)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
The data, X is a matrix of (140,20000) and y is (140,) categorical.
I got the following results but would like to explore how to improve accuracy further.
>1 0.573 (0.107)
>2 0.650 (0.089)
>3 0.647 (0.118)
>4 0.676 (0.101)
>5 0.708 (0.103)
>6 0.698 (0.124)
>7 0.726 (0.121)
>None 0.700 (0.107)
Here's what stands out to me:
You split the data but do not use the splits.
You're scaling the data, but tree-based methods like random forests do not need this step.
You are doing your own tuning loop, instead of using sklearn.model_selection.GridSearchCV. This is fine, but it can get quite fiddly (imagine wanting to step over another hyperparameter).
If you use GridSearchCV you don't need to do your own cross validation.
You're using accuracy for evaluation, which is often not a great metric for multi-class classification, especially with imbalanced classes. Weighted F1 is usually a better choice.
If you're doing cross validation, you need to put the scaler in the CV loop (e.g. using a pipeline) because otherwise the scaler has seen the validation data... but you don't need a scaler for this learning algorithm so this point is moot.
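To illustrate the last point for learners that do need scaling, here is a minimal sketch of putting the scaler inside the CV loop with a pipeline. The synthetic dataset and the choice of LogisticRegression are assumptions made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# The scaler is fit on the training part of each fold only,
# so it never sees that fold's validation data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1_weighted")
print(scores.mean(), scores.std())
```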
I would probably do something like this:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
X, y = make_classification()
# Split the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, shuffle=True, random_state=0)
# Make things for the cross validation.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
param_grid = {'max_depth': np.arange(3, 8)}
model = RandomForestClassifier(random_state=1)
# Create and train the cross validation.
clf = GridSearchCV(model, param_grid,
                   scoring='f1_weighted',
                   cv=cv, verbose=3)
clf.fit(X_train, y_train)
Take a look at clf.cv_results_ for the scores etc, which you can plot if you want. By default GridSearchCV trains a final model on the best hyperparameters, so you can make predictions with clf.
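As a rough sketch of what inspecting the results could look like: the smaller grid, the plain 5-fold CV, and the use of pandas for display are assumptions here to keep the example short, not part of the code above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=1),
                   {"max_depth": [3, 5, 7]},
                   scoring="f1_weighted", cv=5)
clf.fit(X_train, y_train)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read.
results = pd.DataFrame(clf.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))
print(clf.best_params_)

# With the default refit=True, clf predicts with the best model found.
y_pred = clf.predict(X_test)
```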
Almost forgot... you asked about improving the model :) Here are some ideas:
The above will help you tune more hyperparameters (e.g. max_features, n_estimators, and min_samples_leaf). But don't get too carried away with hyperparameter tuning.
You could try transforming some features (columns in X), or adding new ones.
Look for more data, e.g. more rows, higher quality labels, etc.
Address any issues with class imbalance.
Try a more sophisticated algorithm, like gradient boosted trees (there are models in sklearn, or take a look at xgboost).
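A couple of these ideas can be tried with very little code. The sketch below compares a class-weighted random forest with sklearn's GradientBoostingClassifier under the same cross-validation; the synthetic data and all hyperparameter values are assumptions for illustration, so the numbers say nothing about your dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

candidates = {
    # class_weight="balanced" reweights classes inversely to their frequency.
    "random forest (balanced)": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=1),
    "gradient boosting": GradientBoostingClassifier(random_state=1),
}

mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, scoring="f1_weighted", cv=cv)
    mean_scores[name] = scores.mean()
    print("%s: %.3f (%.3f)" % (name, scores.mean(), scores.std()))
```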

Simple CV for classic classification models

I'm looking for the easiest way to teach my students how to perform 10-fold CV for standard classifiers in sklearn such as logistic regression, kNN, decision tree, AdaBoost, SVM, etc.
I was hoping there was a method that created the folds for them instead of having to loop like below:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
X=df1.drop(['Unnamed: 0','ID','target'],axis=1).values
y=df1.target.values
for train_index, test_index in sss.split(X,y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = LogisticRegressionCV()
    clf.fit(X_train, y_train)
    train_predictions = clf.predict(X_test)
    acc = accuracy_score(y_test, train_predictions)
    print(acc)
Seems like there should be an easier way.
I think you are asking whether there is an existing method for 10-fold cross-validation. The sklearn documentation explains cross-validation and how to use it:
Cross-validation: evaluating estimator performance
Besides that, you can also make use of the sklearn modules for cross validation
Various splitting techniques with modules
Model validation with cross validation
To include a code example, which should work with your code, import the required library
from sklearn.model_selection import cross_val_score
and add this line instead of your loop:
print(cross_val_score(clf, X, y, cv=10))
By the way, your n_splits is set to 1, so your code evaluates a single train/test split rather than 10 folds.
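For teaching purposes, the same one-liner can be looped over several classifiers. This is only a sketch: the iris dataset and the default hyperparameters are assumptions, and StratifiedKFold is passed explicitly so that the folds are shuffled and reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): replace with your own X and y.
X, y = load_iris(return_X_y=True)

# 10 stratified folds, shared by all classifiers for a fair comparison.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "SVM": SVC(),
}

mean_scores = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv)  # one accuracy per fold
    mean_scores[name] = scores.mean()
    print("%s: %.3f (%.3f)" % (name, scores.mean(), scores.std()))
```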

KFold Cross Validation does not fix overfitting

I separate the features into X and y, split the data with k-fold cross-validation, and then preprocess the train and test sets. After that I fit the training data to my Random Forest Regressor model and calculate the confidence score. Why do I preprocess after splitting? Because people tell me that it is more correct to do it that way, and I'm keeping to that principle for the sake of my model's performance.
This is my first time using KFold cross-validation. My model overfits, and I thought I could fix that with cross-validation. I'm still confused about how to use it: I have read the documentation and some articles, but I don't really understand how to apply it to my model. I tried anyway, and my model still overfits. Whether I use a train/test split or cross-validation, my model's score is still 0.999. I don't know what my mistake is, since I'm very new to this method, but I think I may have done it wrong, so it does not fix the overfitting. Please tell me what's wrong with my code and how to fix it.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss
avo_sales = pd.read_csv('avocados.csv')
avo_sales.rename(columns = {'4046':'small PLU sold',
                            '4225':'large PLU sold',
                            '4770':'xlarge PLU sold'},
                 inplace= True)
avo_sales.columns = avo_sales.columns.str.replace(' ','')
# axis must be passed by keyword in recent pandas versions
x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'], axis=1))
y = np.array(avo_sales.TotalBags)
# X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]
    impC = SimpleImputer(strategy='most_frequent')
    X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
    X_test[:,8] = impC.transform(X_test[:,8].reshape(-1,1)).ravel()
    imp = SimpleImputer(strategy='median')
    X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
    X_test[:,1:8] = imp.transform(X_test[:,1:8])
    le = LabelEncoder()
    X_train[:,8] = le.fit_transform(X_train[:,8])
    X_test[:,8] = le.transform(X_test[:,8])
    rfr = RandomForestRegressor()
    rfr.fit(X_train, y_train)
    confidence = rfr.score(X_test, y_test)
    print(confidence)
The reason you're overfitting is that a non-regularized tree-based model will keep adjusting to the data until it fits the training samples almost perfectly. See for example this image:
As you can see, this does not generalize well. If you don't specify arguments that regularize the trees, the model will fit the test data poorly because it will basically just learn the noise in the training data. There are many ways to regularize trees in sklearn, you can find them here. For instance:
max_features
min_samples_leaf
max_depth
With proper regularization, you can get a model that generalizes well to the test data. Look at a regularized model for instance:
To regularize your model, instantiate the RandomForestRegressor class like this:
rfr = RandomForestRegressor(max_features=0.5, min_samples_leaf=4, max_depth=6)
These argument values are arbitrary, it's up to you to find the ones that fit your data best. You can use domain-specific knowledge to choose these values, or a hyperparameter tuning search like GridSearchCV or RandomizedSearchCV.
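A search over these regularization arguments could be sketched as follows. The synthetic regression data and the grid values are assumptions for illustration only; on real data you would choose the grid yourself.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Noisy synthetic data (assumption) standing in for the avocado dataset.
X, y = make_regression(n_samples=300, n_features=10, noise=20.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Each combination is scored with 5-fold CV on the training data only.
param_grid = {
    "max_features": [0.5, 1.0],
    "min_samples_leaf": [1, 4],
    "max_depth": [6, None],
}
search = GridSearchCV(RandomForestRegressor(n_estimators=50, random_state=0),
                      param_grid, cv=5)
search.fit(X_train, y_train)

train_r2 = search.score(X_train, y_train)
test_r2 = search.score(X_test, y_test)
print(search.best_params_)
print("train R^2: %.3f, test R^2: %.3f" % (train_r2, test_r2))
```

Comparing the train and test R^2 of the selected model gives a quick indication of how much overfitting remains.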
Other than that, imputing missing values with the most frequent value and the median might bring a lot of noise into your data. I would advise against it unless you have no other choice.
While @NicolasGervais's answer gets to the bottom of why your specific model is overfitting, I think there is a conceptual misunderstanding with regard to cross-validation in the original question; you seem to think that:
Cross-validation is a method that improves the performance of a machine learning model.
But this is not the case.
Cross validation is a method that is used to estimate the performance of a given model on unseen data. By itself, it cannot improve the accuracy.
In other words, the respective scores can tell you if your model is overfitting the training data, but simply applying cross-validation does not make your model better.
Example:
Let's look at a dataset with 10 points, and fit a line through it:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
X = np.random.randint(0,10,10)
Y = np.random.randint(0,10,10)
fig = plt.figure(figsize=(1,10))
def line(x, slope, intercept):
    return slope * x + intercept
for i in range(5):
    # note that this is not technically 5-fold cross-validation
    # because I allow the same datapoint to go into the test set
    # several times. For illustrative purposes it is fine imho.
    test_indices = np.random.choice(np.arange(10),2)
    train_indices = list(set(range(10))-set(test_indices))
    # get train and test sets
    X_train, Y_train = X[train_indices], Y[train_indices]
    X_test, Y_test = X[test_indices], Y[test_indices]
    # training set has one feature and multiple entries
    # so, reshape(-1,1)
    X_train, Y_train, X_test, Y_test = X_train.reshape(-1,1), Y_train.reshape(-1,1), X_test.reshape(-1,1), Y_test.reshape(-1,1)
    # fit and evaluate linear regression
    reg = LinearRegression().fit(X_train, Y_train)
    score_train = reg.score(X_train, Y_train)
    score_test = reg.score(X_test, Y_test)
    # extract coefficients from model:
    slope, intercept = reg.coef_[0], reg.intercept_[0]
    print(score_test)
    # show train and test sets
    plt.subplot(5,1,i+1)
    plt.scatter(X_train, Y_train, c='k')
    plt.scatter(X_test, Y_test, c='r')
    # draw regression line
    plt.plot(np.arange(10), line(np.arange(10), slope, intercept))
    plt.ylim(0,10)
    plt.xlim(0,10)
    plt.title('train: {:.2f} test: {:.2f}'.format(score_train, score_test))
You can see that the scores on training and test set are vastly different. You can also see that the estimated parameters vary a lot with the change of train and test set.
That does not make your linear model any better at all.
But now you know exactly how bad it is :)
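For comparison, a proper 5-fold version of the same experiment, where each point lands in the test set exactly once, can be sketched with sklearn's KFold and cross_validate. The fixed random seed is an assumption added only to make the run reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Ten random points, as in the example above (seeded here for repeatability).
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(10, 1))
y = rng.integers(0, 10, size=10)

# Five folds of two points each; every point is tested exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(LinearRegression(), X, y, cv=cv,
                         return_train_score=True)

# Scores are R^2, the same metric LinearRegression.score reports.
for train_r2, test_r2 in zip(results["train_score"], results["test_score"]):
    print("train: {:.2f} test: {:.2f}".format(train_r2, test_r2))
```

As before, the scores only report how the model performs on held-out data; nothing in this procedure changes the model itself.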

How to implement/use Artificial immune system(AIS) in python?

I'm new to machine learning algorithms and classification techniques.
I have created a dataset and trained a model with an SVM in Python using the sklearn module.
But now I have to change my approach from SVM to an artificial immune system. My question is: is there a module for AIS in Python that I can use, just like sklearn provides SVM?
If there is none, where can I find an example or help on implementing one?
Below is my code in SVM, in case anyone would need it.
# In the name of GOD
# SeyyedMahdi Hassanpour
# SeyyedMahdihp@gmail.com
# SeyyedMahdihp @ github
import numpy as np
from sklearn import svm, model_selection
import pandas as pd
df = pd.read_csv('final_dataset123456.csv')
x = np.array(df.drop(['label'], axis=1))
y = np.array(df['label'])
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.36, random_state=39)
clf = svm.SVC()
clf.fit(x_train, y_train)
accuracy = clf.score(x_test, y_test)
print(accuracy)
ar = [0,0,0,0,0,0,0,0,0,0,0,0,0.2,0.058824,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.25,0,0,0,0,0.020833,0.2,0.090909,0,0.032258,0,0,0,0,0,0.0625,0,0,0,0.058333,0,0,0.1,0,0.125,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
br = [0.5,1,1,0.254902,0.853933,1,1,0.254902,1,0.27451,0.2,1,0.4,0.176471,1,1,1,1,0.625,1,0.125,1,0.393939,0.857143,0.052632,1,0.75,0.847826,1,1,0.583333,0.7,1,1,1,0.729167,0.6,0.818182,1,0.193548,0.333333,1,0.674419,1,1,1,0.8,1,1,0.2,0.37037,1,0.8,0.529412,0.375,1,1,0.23913,1,1,1,1,0.666667,1,1,1,1,0,1,0,1,0.23913,0.7,0.7,1,1,1,1,1,1,1,1,0.23913,1,1,1,1,1,1,1,1,1,0.666667,1,0.7,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1]
example_measures = np.array([ar,br])
example_measures = example_measures.reshape(len(example_measures), -1)
prediction = clf.predict(example_measures)
print(prediction)
There are many artificial immune system applications implemented on the WEKA platform. You can download them from SourceForge.
Here is the link:
https://sourceforge.net/directory/?q=artificial+immune+system
