Python random forest and machine learning - improvements
I am quite new to using Python for machine learning. I come from a background of programming in Fortran, so as you may imagine, Python is quite a leap. I work in chemistry and have become involved in cheminformatics (applying data science techniques to chemistry). As such, Python's extensive machine learning libraries are important to me. I also need my code to be efficient. I have written a script which runs and seems to work OK. What I would like to know is:
1. How best to improve it and make it more efficient?
2. Any suggestions on alternative formulations to those I have used, and if possible a reason why another route may be superior?
I tend to work with continuous data and regression models.
Any suggestions would be great, and thank you in advance.
import scipy.stats  # needed for scipy.stats.linregress used at the end of the script
import math
import numpy as np
import pandas as pd
import plotly.plotly as py
import os.path
import sys
from time import time
from sklearn import preprocessing, metrics, cross_validation
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
fname = str(raw_input('Please enter the input file name containing total dataset and descriptors (assumes csv file, column headings and first column are labels\n'))
if os.path.isfile(fname):
    SubFeAll = pd.read_csv(fname, sep=",")
else:
    sys.exit("ERROR: input file does not exist")
#SubFeAll = pd.read_csv(fname, sep=",")
SubFeAll = SubFeAll.fillna(SubFeAll.mean()) # replace the NA values with the mean of the descriptor
header = SubFeAll.columns.values # Use the column headers as the descriptor labels
SubFeAll.head()
# Set the numpy global random number seed (similar effect to random_state)
np.random.seed(1)
# Random Forest results initialised
RFr2 = []
RFmse = []
RFrmse = []
# Predictions results initialised
RFpredictions = []
metcount = 0
# Give the array from pandas to numpy
npArray = np.array(SubFeAll)
print header.shape
npheader = np.array(header[1:-1])
print("Array shape X = %d, Y = %d " % (npArray.shape))
datax, datay = npArray.shape
# Print specific nparray values to check the data
print("The first element of the input data set, as a minial check please ensure this is as expected = %s" % npArray[0,0])
# Split the data into: names labels of the molecules ; y the True results ; X the descriptors for each data point
names = npArray[:,0]
X = npArray[:,1:-1].astype(float)
y = npArray[:,-1] .astype(float)
X = preprocessing.scale(X)
print X.shape
# Open output files
train_name = "Training.csv"
test_name = "Predictions.csv"
fi_name = "Feature_importance.csv"
with open(train_name,'w') as ftrain, open(test_name,'w') as fpred, open(fi_name,'w') as ffeatimp:
ftrain.write("This file contains the training information for the Random Forest models\n")
ftrain.write("The code use a ten fold cross validation 90% training 10% test at each fold so ten training sets are used here,\n")
ftrain.write("Interation %d ,\n" %(metcount+1))
fpred.write("This file contains the prediction information for the Random Forest models\n")
fpred.write("Predictions are made over a ten fold cross validation hence training on 90% test on 10%. The final prediction are return iteratively over this ten fold cros validation once,\n")
fpred.write("optimised parameters are located via a grid search at each fold,\n")
fpred.write("Interation %d ,\n" %(metcount+1))
ffeatimp.write("This file contains the feature importance information for the Random Forest model,\n")
ffeatimp.write("Interation %d ,\n" %(metcount+1))
# Begin the K-fold cross validation over ten folds
kf = KFold(datax, n_folds=10, shuffle=True, random_state=0)
print "------------------- Begining Ten Fold Cross Validation -------------------"
for train, test in kf:
        XTrain, XTest, yTrain, yTest = X[train], X[test], y[train], y[test]
        ytestdim = yTest.shape[0]
        print("The test set values are : ")
        i = 0
        if ytestdim%5 == 0:
            while i < ytestdim:
                print round(yTest[i],2),'\t', round(yTest[i+1],2),'\t', round(yTest[i+2],2),'\t', round(yTest[i+3],2),'\t', round(yTest[i+4],2)
                ftrain.write(str(round(yTest[i],2))+','+str(round(yTest[i+1],2))+','+str(round(yTest[i+2],2))+','+str(round(yTest[i+3],2))+','+str(round(yTest[i+4],2))+',\n')
                i += 5
        elif ytestdim%4 == 0:
            while i < ytestdim:
                print round(yTest[i],2),'\t', round(yTest[i+1],2),'\t', round(yTest[i+2],2),'\t', round(yTest[i+3],2)
                ftrain.write(str(round(yTest[i],2))+','+str(round(yTest[i+1],2))+','+str(round(yTest[i+2],2))+','+str(round(yTest[i+3],2))+',\n')
                i += 4
        elif ytestdim%3 == 0:
            while i < ytestdim:
                print round(yTest[i],2),'\t', round(yTest[i+1],2),'\t', round(yTest[i+2],2)
                ftrain.write(str(round(yTest[i],2))+','+str(round(yTest[i+1],2))+','+str(round(yTest[i+2],2))+',\n')
                i += 3
        elif ytestdim%2 == 0:
            while i < ytestdim:
                print round(yTest[i],2), '\t', round(yTest[i+1],2)
                ftrain.write(str(round(yTest[i],2))+','+str(round(yTest[i+1],2))+',\n')
                i += 2
        else:
            while i < ytestdim:
                print round(yTest[i],2)
                ftrain.write(str(round(yTest[i],2))+',\n')
                i += 1
        print "\n"
        # Random forest grid search parameters
        print "------------------- Beginning Random Forest Grid Search -------------------"
        rfparamgrid = {"n_estimators": [10], "max_features": ["auto", "sqrt", "log2"], "max_depth": [5,7]}
        rf = RandomForestRegressor(random_state=0, n_jobs=2)
        RfGridSearch = GridSearchCV(rf, param_grid=rfparamgrid, scoring='mean_squared_error', cv=10)
        start = time()
        RfGridSearch.fit(XTrain, yTrain)
        # Get best random forest parameters
        print("GridSearchCV took %.2f seconds for %d candidate parameter settings" % (time() - start, len(RfGridSearch.grid_scores_)))
        RFtime = time() - start
        #print(RfGridSearch.grid_scores_) # Diagnostic
        print("n_estimators = %d " % RfGridSearch.best_params_['n_estimators'])
        ne = RfGridSearch.best_params_['n_estimators']
        print("max_features = %s " % RfGridSearch.best_params_['max_features'])
        mf = RfGridSearch.best_params_['max_features']
        print("max_depth = %d " % RfGridSearch.best_params_['max_depth'])
        md = RfGridSearch.best_params_['max_depth']
        ftrain.write("Random Forest,\n")
        ftrain.write("RF search time, %s ,\n" % (str(RFtime)))
        ftrain.write("Number of Trees, %s ,\n" % str(ne))
        ftrain.write("Number of features at split, %s ,\n" % str(mf))
        ftrain.write("Max depth of tree, %s ,\n" % str(md))
        # Train random forest and predict with optimised parameters
        print("\n\n------------------- Starting optimised RF training -------------------")
        optRF = RandomForestRegressor(n_estimators=ne, max_features=mf, max_depth=md, random_state=0)
        optRF.fit(XTrain, yTrain)  # Train the model
        RFfeatimp = optRF.feature_importances_
        indices = np.argsort(RFfeatimp)[::-1]
        print("Training R2 = %5.2f" % optRF.score(XTrain, yTrain))
        print("Starting optimised RF prediction")
        RFpreds = optRF.predict(XTest)
        print("The predicted values now follow :")
        RFpredsdim = RFpreds.shape[0]
        i = 0
        if RFpredsdim%5 == 0:
            while i < RFpredsdim:
                print round(RFpreds[i],2),'\t', round(RFpreds[i+1],2),'\t', round(RFpreds[i+2],2),'\t', round(RFpreds[i+3],2),'\t', round(RFpreds[i+4],2)
                i += 5
        elif RFpredsdim%4 == 0:
            while i < RFpredsdim:
                print round(RFpreds[i],2),'\t', round(RFpreds[i+1],2),'\t', round(RFpreds[i+2],2),'\t', round(RFpreds[i+3],2)
                i += 4
        elif RFpredsdim%3 == 0:
            while i < RFpredsdim:
                print round(RFpreds[i],2),'\t', round(RFpreds[i+1],2),'\t', round(RFpreds[i+2],2)
                i += 3
        elif RFpredsdim%2 == 0:
            while i < RFpredsdim:
                print round(RFpreds[i],2), '\t', round(RFpreds[i+1],2)
                i += 2
        else:
            while i < RFpredsdim:
                print round(RFpreds[i],2)
                i += 1
        print "\n"
        RFr2.append(optRF.score(XTest, yTest))
        RFmse.append(metrics.mean_squared_error(yTest, RFpreds))
        RFrmse.append(math.sqrt(RFmse[metcount]))
        print("Random Forest prediction statistics for fold %d are; MSE = %5.2f RMSE = %5.2f R2 = %5.2f\n\n" % (metcount+1, RFmse[metcount], RFrmse[metcount], RFr2[metcount]))
        ftrain.write("Random Forest prediction statistics for fold %d are, MSE =, %5.2f, RMSE =, %5.2f, R2 =, %5.2f,\n\n" % (metcount+1, RFmse[metcount], RFrmse[metcount], RFr2[metcount]))
        ffeatimp.write("Feature importance rankings from random forest,\n")
        for i in range(RFfeatimp.shape[0]):
            ffeatimp.write("%d. , feature %d , %s, (%f),\n" % (i + 1, indices[i], npheader[indices[i]], RFfeatimp[indices[i]]))
        metcount += 1
        ftrain.write("Fold %d, \n" % (metcount))
        print "------------------- Next Fold %d -------------------" % (metcount+1)
        # Store the predictions for this fold, stepping through the current test set order (j)
        j = 0
        for itest in test:
            RFpredictions.append(RFpreds[j])
            j += 1

    # After the ten folds: write out the collected predictions alongside the labels and true values
    lennames = names.shape[0]
    lenpredictions = len(RFpredictions)
    lentrue = y.shape[0]
    if lennames == lenpredictions == lentrue:
        fpred.write("Names/Label,, Prediction Random Forest,, True Value,\n")
        for i in range(0, lennames):
            fpred.write(str(names[i])+",,"+str(RFpredictions[i])+",,"+str(y[i])+",\n")
    else:
        fpred.write("ERROR - names, prediction and true value array size mismatch. Dumping arrays for manual inspection in predictions.csv\n")
        fpred.write("Arrays printed in the order names/labels, RF predictions and true values\n")
        fpred.write(str(names)+"\n")
        fpred.write(str(RFpredictions)+"\n")
        fpred.write(str(y)+"\n")
        sys.exit("ERROR - names, prediction and true value array size mismatch. Dumping arrays for manual inspection in predictions.csv")
    print "Final averaged Random Forest metrics : "
    RFamse = sum(RFmse)/10
    RFmse_sd = np.std(RFmse)
    RFarmse = sum(RFrmse)/10
    RFrmse_sd = np.std(RFrmse)
    RFslope, RFintercept, RFr_value, RFp_value, RFstd_err = scipy.stats.linregress(RFpredictions, y)
    RFR2 = RFr_value**2
    print "Average Mean Squared Error = ", RFamse, " +/- ", RFmse_sd
    print "Average Root Mean Squared Error = ", RFarmse, " +/- ", RFrmse_sd
    print "R2 Final prediction against True values = ", RFR2
    fpred.write("\n")
    fpred.write("FINAL PREDICTION STATISTICS,\n")
    fpred.write("Random Forest average MSE, %s, +/-, %s,\n" % (str(RFamse), str(RFmse_sd)))
    fpred.write("Random Forest average RMSE, %s, +/-, %s,\n" % (str(RFarmse), str(RFrmse_sd)))
    fpred.write("Random Forest slope, %s, Random Forest intercept, %s,\n" % (str(RFslope), str(RFintercept)))
    fpred.write("Random Forest standard error, %s,\n" % (str(RFstd_err)))
    fpred.write("Random Forest R, %s,\n" % (str(RFr_value)))
    fpred.write("Random Forest R2, %s,\n" % (str(RFR2)))
    ftrain.close()
    fpred.close()
    ffeatimp.close()
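A note on the imports used above: sklearn.cross_validation and sklearn.grid_search were deprecated and later removed in favour of sklearn.model_selection, so on a current scikit-learn the equivalent imports would look roughly like the sketch below (shown only as a hedged reference; the KFold API also changed, so the sample count is no longer passed to the constructor).

# Rough modern equivalents of the deprecated imports (scikit-learn >= 0.20), for reference only
from sklearn.model_selection import train_test_split, KFold, GridSearchCV

kf = KFold(n_splits=10, shuffle=True, random_state=0)
# for train, test in kf.split(X): ...
# and GridSearchCV scoring would be 'neg_mean_squared_error' instead of 'mean_squared_error'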
You can also add feature selection to your data:
scikit-learn feature selection
Some feature selection techniques are provided in scikit-learn, and you can use them to improve some aspects of your data-mining project. A minimal sketch is given below.
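As an illustration of the suggestion above, here is a minimal sketch, assuming a reasonably recent scikit-learn and using synthetic data in place of the descriptor matrix from the question (all variable names here are hypothetical). It keeps the descriptors most associated with the continuous target before fitting the random forest.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the descriptor matrix X and continuous target y
X, y = make_regression(n_samples=200, n_features=50, n_informative=10, random_state=0)

# Keep the 10 descriptors with the strongest univariate association with the target (F-test)
selector = SelectKBest(score_func=f_regression, k=10)
X_reduced = selector.fit_transform(X, y)
print("Selected descriptor column indices: %s" % selector.get_support(indices=True))

# Fit the regressor on the reduced descriptor set
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_reduced, y)
print("Training R2 on reduced features: %.3f" % rf.score(X_reduced, y))

For a random forest specifically, sklearn.feature_selection.SelectFromModel wrapped around a RandomForestRegressor is an alternative that selects descriptors by the forest's own feature importances.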
Related
Training a BERT and Running out of memory - Google Colab
I keep running out of memory even after i bought google colab pro which has 25gb RAM usage. I have no idea why is this happening. I tried every kernel possible (Google colab, Google colab pro, Kaggle kernel, Amazon Sagemaker, Google Cloud Platform). I reduced my batch size to 8, no success whatsoever. My goal is to train Bert in Deep Pavlov (with Russian text classification extension) to predict emotion of the tweet. It is a multiclass classification with 5 classes Here is my whole code: !pip3 install deeppavlov import pandas as pd train_df = pd.read_csv('train_pikabu.csv') test_df = pd.read_csv('test_pikabu.csv') val_df = pd.read_csv('validation_pikabu.csv') from deeppavlov.dataset_readers.basic_classification_reader import BasicClassificationDatasetReader # read data from particular columns of `.csv` file data = BasicClassificationDatasetReader().read( data_path='./', train='train_pikabu.csv', valid="validation_pikabu_a.csv", test="test_pikabu.csv", x = 'content', y = 'emotions' ) from deeppavlov.dataset_iterators.basic_classification_iterator import BasicClassificationDatasetIterator # initializing an iterator iterator = BasicClassificationDatasetIterator(data, seed=42, shuffle=True) !python -m deeppavlov install squad_bert from deeppavlov.models.preprocessors.bert_preprocessor import BertPreprocessor bert_preprocessor = BertPreprocessor(vocab_file="./bert/vocab.txt", do_lower_case=False, max_seq_length=256) from deeppavlov.core.data.simple_vocab import SimpleVocabulary vocab = SimpleVocabulary(save_path="./binary_classes.dict") iterator.get_instances(data_type="train") vocab.fit(iterator.get_instances(data_type="train")[1]) from deeppavlov.models.preprocessors.one_hotter import OneHotter one_hotter = OneHotter(depth=vocab.len, single_vector=True # means we want to have one vector per sample ) from deeppavlov.models.classifiers.proba2labels import Proba2Labels prob2labels = Proba2Labels(max_proba=True) from deeppavlov.models.bert.bert_classifier import BertClassifierModel from deeppavlov.metrics.accuracy import sets_accuracy bert_classifier = BertClassifierModel( n_classes=vocab.len, return_probas=True, one_hot_labels=True, bert_config_file="./bert/bert_config.json", pretrained_bert="./bert/bert_model.ckpt", save_path="sst_bert_model/model", load_path="sst_bert_model/model", keep_prob=0.5, learning_rate=1e-05, learning_rate_drop_patience=5, learning_rate_drop_div=2.0 ) # Method `get_instances` returns all the samples of particular data field x_valid, y_valid = iterator.get_instances(data_type="valid") # You need to save model only when validation score is higher than previous one. # This variable will contain the highest accuracy score best_score = 0. patience = 2 impatience = 0 # let's train for 3 epochs for ep in range(3): nbatches = 0 for x, y in iterator.gen_batches(batch_size=8, data_type="train", shuffle=True): x_feat = bert_preprocessor(x) y_onehot = one_hotter(vocab(y)) bert_classifier.train_on_batch(x_feat, y_onehot) print("Batch done\n") nbatches += 1 if nbatches % 1 == 0: # validating every 100 batches y_valid_pred = bert_classifier(bert_preprocessor(x_valid)) score = sets_accuracy(y_valid, vocab(prob2labels(y_valid_pred))) print("Batches done: {}. Valid Accuracy: {}".format(nbatches, score)) y_valid_pred = bert_classifier(bert_preprocessor(x_valid)) score = sets_accuracy(y_valid, vocab(prob2labels(y_valid_pred))) print("Epochs done: {}. Valid Accuracy: {}".format(ep + 1, score)) if score > best_score: bert_classifier.save() print("New best score. 
Saving model.") best_score = score impatience = 0 else: impatience += 1 if impatience == patience: print("Out of patience. Stop training.") break It runs up to 1 batch and then crushes.
Python LSTM Bitcoin prediction flatlines
I'm currently trying to build a "simple" LSTM model that takes historical Bitcoin data, learns from that and then tries to predict the future X steps in advance. I've build it on the idea that A + B + C = D so B + C + D should be E. (I think that's a very simple idea behind an LSTM model. I might be wrong however i'm pretty new to it.) I managed to build the basics in python (I'm fairly new to python) but something seems off by the prediction. For some reason many of the predictions i test / make end up flatlining. I have a theory on why but we have no idea if it's correct and even less idea on how to solve it. My theory is that within a sequence the model learns to put more importance / weight on the last digit in the sequence because with Bitcoin prices the future price (in 1 minute) is probably pretty close to the price now. That's try the predicted values keeps getting closer to the real value eventually being equal and thus flatlining in a graph. (I don't know if that makes sense but thats what i tought anyway.) I've also added a screenshot of my graph from a few days ago. Almost all predictions however end similar to this graph. This is just a more extreme example as demonstration. Here is my code, can someone please explain why it flatlines and what i did wrong? import numpy as np from matplotlib import pyplot import pandas as pd from sklearn.preprocessing import MinMaxScaler from keras.models import Sequential from keras.layers import Dense from keras.layers import LSTM from keras.layers import Dropout import yfinance as yf from sklearn.preprocessing import MinMaxScaler from math import sqrt from sklearn.metrics import mean_squared_error # Create output sets X + Y from given input-set # with inputset : a 1-dimensional list of floats # with N : the number of lookback values to use for X # with Gap : the number of point skipped between X and Y # Y: is equal to input, (although the first N are missing) # X: for each y of Y a corresponding set of size N is created # composed of the N values preceeding y. def create_lookback(inputset, n=1, gap=0): print("create_lookback with n=%d gap=%d" % (n,gap)) print(" - length of inputset = %d" % len(inputset)) dataX, dataY = [], [] for i in range(len(inputset) - (n+gap)): a = inputset[i:(i + n), 0] dataX.append(a) dataY.append(inputset[i + n+gap, 0]) print(" - length of dataY = %d" % len(dataY)) data_x = np.array(dataX) xret = data_x.reshape(data_x.shape[0], 1, data_x.shape[1]) return xret, np.array(dataY) # Train model based on given training-set + Test-set def create_model(trainX,trainY,testX,testY): model = Sequential() model.add(LSTM(units = 100, input_shape=(trainX.shape[1], trainX.shape[2], ))) model.add(Dropout(0.2)) #model.add(LSTM(30, return_sequences=True)) #model.add(Dropout(0.1)) model.add(Dense(1)) model.compile(loss='mae', optimizer='adam') history = model.fit(trainX, trainY, epochs=100, batch_size=5, validation_data=(testX, testY), verbose=1, shuffle=False) return model # Evaluate given X / Y set. 
# - Calculate RMSE # - Generate visual line-plot to screen def show_result(scaler,yhat,setY,txt): print("Show %s result" % txt) yhat_inverse = scaler.inverse_transform(yhat.reshape(-1, 1)) testY_inverse = scaler.inverse_transform(setY.reshape(-1, 1)) if len(testY_inverse) == len(yhat_inverse): rmse = sqrt(mean_squared_error(testY_inverse, yhat_inverse)) print(' RMSE %s : %.3f' % (txt,rmse)) pyplot.plot(yhat_inverse, label='predict '+txt) pyplot.plot(testY_inverse, label='actual '+txt, alpha=0.5) pyplot.legend() pyplot.show() # Extrapoleer is dutch for Extrapolate def extrapoleer(i,model,tup,toekomst): if(i == 0): return setX = np.array([[tup]]) y = model.predict(setX) y_float = y[0][0] tup_new = np.append(tup[1:], y_float) toekomst.append(y_float) extrapoleer(i-1, model, tup_new,toekomst) # --- end of defined functions # -- start of main flow data_grid_1 = yf.download('BTC-USD', start="2021-04-14",end="2021-04-15", interval="1m"); data_grid_2 = yf.download('BTC-USD', period="12h", interval="1m"); dataset_1 = data_grid_1.iloc[:, 1:2].values dataset_2 = data_grid_2.iloc[:, 1:2].values scaler = MinMaxScaler(feature_range = (0, 1)) scaled = scaler.fit_transform(dataset_1) # 70% of dataset_1 is used to train ; 30% to test train_size = int(len(scaled) * 0.7) test_size = len(scaled) - train_size train, test = scaled[0:train_size,:], scaled[train_size:len(scaled),:] print("train: %d test: %d" % (len(train), len(test))) scaled_2 = scaler.fit_transform(dataset_2) look_back_n = 3 look_back_gap = 0 trainX, trainY = create_lookback(train, look_back_n, look_back_gap) testX, testY = create_lookback(test, look_back_n, look_back_gap) testX_2, testY_2 = create_lookback(scaled_2, look_back_n, look_back_gap) model = create_model(trainX,trainY,testX,testY) yhat_1 = model.predict(testX) yhat_2 = model.predict(testX_2) show_result(scaler,yhat_1,testY,"test") show_result(scaler,yhat_2,testY_2,"test2") last_n = testY_2[-look_back_n:] #toekomst = Future in dutch toekomst = [] #aantal = Amount in Dutch, this indicates the amount if steps you want to future predict aantal = 30 extrapoleer(aantal, model, last_n, toekomst) print("Resultaat van %d voorspelde punten in de toekomst: " % aantal) print(toekomst) yhat_2_plus = np.append(yhat_2,toekomst) show_result(scaler,yhat_2_plus,testY_2,"test2-plus")
Retrieve cross validation performance (AUC) on h2o AutoML for holdout dataset
I am training a binary classification model with h2o AutoML using the default cross-validation (nfolds=5). I need to obtain the AUC score for each holdout fold in order to compute the variability. This is the code I am using: h2o.init() prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv") # convert columns to factors prostate['CAPSULE'] = prostate['CAPSULE'].asfactor() prostate['RACE'] = prostate['RACE'].asfactor() prostate['DCAPS'] = prostate['DCAPS'].asfactor() prostate['DPROS'] = prostate['DPROS'].asfactor() # set the predictor and response columns predictors = ["AGE", "RACE", "VOL", "GLEASON"] response_col = "CAPSULE" # split into train and testing sets train, test = prostate.split_frame(ratios = [0.8], seed = 1234) aml = H2OAutoML(seed=1, max_runtime_secs=100, exclude_algos=["DeepLearning", "GLM"], nfolds=5, keep_cross_validation_predictions=True) aml.train(predictors, response_col, training_frame=prostate) leader = aml.leader I check that leader is not a StackedEnsamble model (for which the validation metrics are not available). Anyway, I am not able to retrieve the five AUC scores. Any idea on how to do so?
Here's how it's done: import h2o from h2o.automl import H2OAutoML h2o.init() # import prostate dataset prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv") # convert columns to factors prostate['CAPSULE'] = prostate['CAPSULE'].asfactor() prostate['RACE'] = prostate['RACE'].asfactor() prostate['DCAPS'] = prostate['DCAPS'].asfactor() prostate['DPROS'] = prostate['DPROS'].asfactor() # set the predictor and response columns predictors = ["AGE", "RACE", "VOL", "GLEASON"] response_col = "CAPSULE" # split into train and testing sets train, test = prostate.split_frame(ratios = [0.8], seed = 1234) # run AutoML for 100 seconds aml = H2OAutoML(seed=1, max_runtime_secs=100, exclude_algos=["DeepLearning", "GLM"], nfolds=5, keep_cross_validation_predictions=True) aml.train(x=predictors, y=response_col, training_frame=prostate) # Get the leader model leader = aml.leader There is a caveat to mention here about cross-validated AUC -- H2O currently stores two computations of CV AUC. One is an aggregated version (take the AUC of aggregated CV predictions), and the other is the "true" definition of cross-validated AUC (an average of the k AUCs from k-fold cross-validation). The latter is stored in an object which also contains the individual fold AUCs, as well as the standard deviation across the folds. If you're wondering why we do this, there's some historical & technical reasons why we have two versions, as well as a ticket open to only every report the latter. The first one is what you get when you do this (and also what appears on the AutoML Leaderboard). # print CV AUC for leader model print(leader.model_performance(xval=True).auc()) If you want the fold-wise AUCs so you can compute or view their mean and variability (standard deviation), you can do that by looking here: # print CV metrics summary leader.cross_validation_metrics_summary() Output: Cross-Validation Metrics Summary: mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid ----------- ---------- ----------- ------------ ------------ ------------ ------------ ------------ accuracy 0.71842104 0.06419111 0.7631579 0.6447368 0.7368421 0.7894737 0.65789473 auc 0.7767409 0.053587236 0.8206676 0.70905924 0.7982079 0.82538515 0.7303846 aucpr 0.6907578 0.0834025 0.78737605 0.7141305 0.7147677 0.67790955 0.55960524 err 0.28157896 0.06419111 0.23684211 0.35526314 0.2631579 0.21052632 0.34210527 err_count 21.4 4.8785243 18.0 27.0 20.0 16.0 26.0 --- --- --- --- --- --- --- --- precision 0.61751753 0.08747421 0.675 0.5714286 0.61702126 0.7241379 0.5 r2 0.20118153 0.10781976 0.3014902 0.09386432 0.25050205 0.28393403 0.07611712 recall 0.84506994 0.08513061 0.84375 0.9142857 0.9354839 0.7241379 0.8076923 rmse 0.435928 0.028099842 0.41264254 0.47447023 0.42546 0.41106534 0.4560018 specificity 0.62579334 0.15424488 0.70454544 0.41463414 0.6 0.82978725 0.58 See the whole table with table.as_data_frame() Here's what the leaderboard looks like (storing aggregated CV AUCs). In this case, because the data is so small (300 rows), there's a noticeable difference between the two reported between the two reported CV AUC values, however for larger datasets, they should be much closer estimates. 
# print the whole Leaderboard (all CV metrics for all models) lb = aml.leaderboard print(lb) That will print the top of the leaderboard: model_id auc logloss aucpr mean_per_class_error rmse mse --------------------------------------------------- -------- --------- -------- ---------------------- -------- -------- XGBoost_grid__1_AutoML_20200924_200634_model_2 0.769716 0.565326 0.668827 0.290806 0.436652 0.190665 GBM_grid__1_AutoML_20200924_200634_model_4 0.762993 0.56685 0.666984 0.279145 0.437634 0.191524 XGBoost_grid__1_AutoML_20200924_200634_model_9 0.762417 0.570041 0.645664 0.300121 0.440255 0.193824 GBM_grid__1_AutoML_20200924_200634_model_6 0.759912 0.572651 0.636713 0.30097 0.440755 0.194265 StackedEnsemble_BestOfFamily_AutoML_20200924_200634 0.756486 0.574461 0.646087 0.294002 0.441413 0.194845 GBM_grid__1_AutoML_20200924_200634_model_7 0.754153 0.576821 0.641462 0.286041 0.442533 0.195836 XGBoost_1_AutoML_20200924_200634 0.75411 0.584216 0.626074 0.289237 0.443911 0.197057 XGBoost_grid__1_AutoML_20200924_200634_model_3 0.753347 0.57999 0.629876 0.312056 0.4428 0.196072 GBM_grid__1_AutoML_20200924_200634_model_1 0.751706 0.577175 0.628564 0.273603 0.442751 0.196029 XGBoost_grid__1_AutoML_20200924_200634_model_8 0.749446 0.576686 0.610544 0.27844 0.442314 0.195642 [28 rows x 7 columns]
I submitted the following task https://h2oai.atlassian.net/browse/PUBDEV-8984 This is when you want to order your grid search for a specific metric. def sort_grid(grid,metric): #input: grid and metric to order if metric == 'accuracy': id = 0 elif metric == 'auc': id = 1 elif metric=='err': id = 2 elif metric == 'err_count': id=3 elif metric=='f0point5': id=4 elif metric=='f1': id=5 elif metric =='f2': id=6 elif metric =='lift_top_group': id=7 elif metric == 'logloss': id=8 elif metric == 'max_per_class_error': id=9 elif metric == 'mcc': metric=9 elif metric =='mena_per_class_accuracy': id=10 elif metric == 'mean_per_class_error': id=11 elif metric == 'mse': id =12 elif metric == 'pr_auc': id=13 elif metric == 'precision': id=14 elif metric == 'r2': id=15 elif metric =='recall': id=16 elif metric == 'rmse': id = 17 elif metric == 'specificity': id = 18 else: return 0 model_ids = [] cross_val_values = [] number_of_models = len(grid.model_ids) number_of_models for i in range(number_of_models): modelo_grid = grid[i] mean = np.array(modelo_grid.cross_validation_metrics_summary()[[1]]) cross_val= mean[0][id] model_id = grid.model_ids[i] model_ids.append(model_id) cross_val_values.append(cross_val) df = pd.DataFrame( {'Model_IDs': model_ids, metric: cross_val_values} ) df = df.sort_values([metric], ascending=False) best_model = h2o.get_model(df.iloc[0,0]) return df, best_model #outputs: ordered grid in pandas dataframe and best model I used this for a binary classification model
How can I tune my neural network to avoid overfitting the mnist data set?
!!!!!!!!!TL;DR at the bottom!!!!!!!! In an attempt to learn the in's and out's of ML, I have been implementing a neural network optimizer in c++ and wrapped it with swig as a python module. Of course, the first problem I tackled was XOR via the following snip of code: 2 input layers, 2 hidden layers, 1 output layer. from MikeLearn import NeuralNetwork from MikeLearn import ClassificationOptimizer import time #======================================================= # Training Set #======================================================= X = [[0,1],[1,0],[1,1],[0,0]] Y = [[1],[1],[0],[0]] nIn = len(X[0]) nOut = len(Y[0]) #======================================================= # Model #======================================================= verbosity = 0 #Initualize neural network # NeuralNetwork([nInputs, nHidden1, nHidden2,..,nOutputs],['Activation1','Activation2'...] N = NeuralNetwork([nIn,2,nOut],['sigmoid','sigmoid']) N.setLoggerVerbosity(verbosity) #Initialize the classification optimizer #ClassificationOptimizer(NeuralNetwork,Xtrain,Ytrain) Opt = ClassificationOptimizer(N,X,Y) Opt.setLoggerVerbosity(verbosity) start_time = time.time(); #fit data #fit(nEpoch,LearningRate) E = Opt.fit(10000,0.1) print("--- %s seconds ---" % (time.time() - start_time)) #Make a prediction print(Opt.predict(X)) This snippet of code yields the following output (Correct answer would be [1,1,0,0]) --- 0.10273098945617676 seconds --- ((0.9398755431175232,), (0.9397522211074829,), (0.0612373948097229,), (0.04882470518350601,)) >>> Looks great! Now for the issue. The following snippet of code tries to learn from the mnist dataset, but suffers very obviously from overfitting. ~750 input (28X28 pixels), 50 hidden, 10 output from MikeLearn import NeuralNetwork from MikeLearn import ClassificationOptimizer import matplotlib.pyplot as plt import numpy as np import pickle import time #======================================================= # Data Set #======================================================= #load the data dictionary modeldata = pickle.load( open( "mnist_data.p", "rb" ) ) X = modeldata['X'] Y = modeldata['Y'] #normalize data X = np.array(X) X = X/255 X = X.tolist() #training set X1 = X[0:49999] Y1 = Y[0:49999] #validation set X2 = X[50000:59999] Y2 = Y[50000:59999] #number of inputs/outputs nIn = len(X[0]) #~750 nOut = len(Y[0]) #=10 #======================================================= # Model #======================================================= verbosity = 1 #Initualize neural network # NeuralNetwork([nInputs, nHidden1, nHidden2,..,nOutputs],['Activation1','Activation2'...] 
N = NeuralNetwork([nIn,50,nOut],['sigmoid','sigmoid']) N.setLoggerVerbosity(verbosity) #Initialize optimizer #ClassificationOptimizer(NeuralNetwork,Xtrain,Ytrain) Opt = ClassificationOptimizer(N,X1,Y1) Opt.setLoggerVerbosity(verbosity) start_time = time.time(); #fit data #fit(nEpoch,LearningRate) E = Opt.fit(10,0.1) print("--- %s seconds ---" % (time.time() - start_time)) #================================ #Final Accuracy on training set #================================ XL = Opt.predict(X1) correct = 0 for i,x in enumerate(XL): if XL[i].index(max(XL[i])) == Y[i].index(max(Y1[i])): correct = correct + 1 print("Training set Correct = " + str(correct)) Accuracy = correct/len(XL)*100; print("Accuracy = " + str(Accuracy) + '%') #================================ #Final Accuracy on validation set #================================ XL = Opt.predict(X2) correct = 0 for i,x in enumerate(XL): if XL[i].index(max(XL[i])) == Y[i].index(max(Y2[i])): correct = correct + 1 print("Testing set Correct = " + str(correct)) Accuracy = correct/len(XL)*100; print("Accuracy = " + str(Accuracy)+'%') That snippet of code yields the following output which shows the training accuracy and validation accuracy. ------------------------- Epoch 9 ------------------------- E= 0.00696964 E= 0.350509 E= 3.49568e-05 E= 4.09073e-06 E= 1.38491e-06 E= 0.229873 E= 3.60186e-05 E= 0.000115187 E= 2.29978e-06 E= 2.69165e-06 --- 27.400235176086426 seconds --- Training set Correct = 48435 Accuracy = 96.87193743874877% Testing set Correct = 982 Accuracy = 9.820982098209821% The training set accuracy is great, but then the testing set is no better than a random guess. Any idea what could be causing this? TL;DR Solved XOR with a model 2 inputs, 2 hidden, 1 output and sigmoid activation functions. Good results. Tried to solve the Mnist data set with a model of 750 inputs (28X28 pixels), 50 hidden, 10 output and sigmoid activation functions. Severe overfitting issue. 95% accuracy on the training set, 10% accuracy on validation set. Any Idea what is causing this?
The cause of overfitting is a combination of the data and the model (the network in this case). During training the network was 'lazy' and latched onto aspects of the data that work well on the training set but do not generalise. It is difficult, if not impossible, to point out exactly which nodes/weights in the trained network are responsible for the overfitting, but we can reduce it with several tricks: regularisation; drop-out (easier to implement); changing the network architecture (fewer layers, fewer nodes, more dimension reduction). https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/ To get an idea of regularisation, try the TensorFlow playground: https://playground.tensorflow.org/ A visualisation of dropout: https://yusugomori.com/projects/deep-learning/dropout-relu Besides trying out regularisation techniques, also experiment with different network architectures. A minimal dropout sketch follows below.
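To make the drop-out suggestion concrete, here is a minimal sketch assuming Keras (the asker's own C++/SWIG framework is not available here) of a small MNIST-style classifier with dropout between the dense layers; the layer sizes mirror the question's 784-50-10 setup.

# Minimal sketch: dropout between dense layers to reduce overfitting (Keras assumed)
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(50, activation='sigmoid', input_shape=(784,)))  # 28x28 = 784 pixel inputs
model.add(Dropout(0.5))  # randomly silence half of the hidden units during training
model.add(Dense(10, activation='softmax'))  # 10 digit classes
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit(X_train, Y_train, epochs=10, batch_size=32, validation_data=(X_val, Y_val))

Dropout only perturbs training; at prediction time the full network is used, which is what makes it a cheap regulariser to bolt onto an existing architecture.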
KNN classifier not working in Python on Raspberry Pi
I am writing a KNN classifier taken from here for character recognition from accelerometer and gyroscopic data.But, the below functions are not working correctly and prediction is not happening.Are there any mistakes in below code? kindly guide me. trainingset-> training data with 20 samples(10=A,10=B). testset-> live reading taken for recognition. #-- KNN Classifier Functions ---------- def loaddataset(): global trainingset with open('imudata.csv','rb') as csvfile: lines = csv.reader(csvfile) dataset = list(lines) for x in range(len(dataset)): trainingset.append(dataset[x]) def euclideandistance(instance1,instance2,length): distance = 0 for x in range(length-1): instance1[x] = float(instance1[x]) instance2[x] = float(instance2[x]) for x in range(length-1): distance += pow((instance1[x]-instance2[x]),2) return math.sqrt(distance) def getneighbours(trainingset,testinstance,k): distances = [] length = len(testinstance)-1 for x in range(len(trainingset)): dist = euclideandistance(testinstance, trainingset[x],length) #print(trainingset[x][-1],dist) distances.append((trainingset[x],dist)) #print(distances) distances.sort(key=operator.itemgetter(1)) #print(distances) neighbours = [] print('k='+repr(k)+'length of distances='+repr(len(distances))) for x in range(k): neighbours.append(distances[x][0]) return neighbours def getresponse(neighbours): classvotes = {} for x in range(len(neighbours)): response = neighbours[x][-1] if response in classvotes: classvotes[response] += 1 else: classvotes[response] = 1 sortedvotes = sorted(classvotes.iteritems(), key=operator.itemgetter(1), reverse=True) return sortedvotes[0][0] def getaccuracy(testset, predictions): correct = 0 for x in range(len(testset)): if testset[x][-1] is predictions[x]: correct +=1 return ((correct/float(len(testset))) * 100.0) #------- END of KNN Classifier Functions ------------- My main compare function is def compare(): loaddataset() testset.append(testdata) print 'Train set: '+ repr(len(trainingset)) print 'Test set: '+ repr(len(testset)) predictions=[] k = len(trainingset) for x in range(len(testset)): neighbours = getneighbours(trainingset,testset[x],k) result = getresponse(neighbours) predictions.append(result) print('>Predicted=' +repr(result)+', actual=' + repr(testset[x][-1])) accuracy = getaccuracy(testset, predictions) print('Accuracy: '+repr(accuracy)+'%') My output is Train set: 20 Test set: 1 k=20 length of distance=20 >Predicted='A', actual='B' Accuracy: 0.0% My sample data packet: -1.1945864763443935e-16,1.0000000000000031,0.81335962823925234,1.2678119727931405,4.6396523259663871,3,1.0000000000000013,108240.99999999988,328.99999999999966,4.3008487686466931e-16,1.000000000000002,0.73006871826334618,0.88693535629714804,4.3903300136708818,15,1.0000000000000011,108240.99999999977,328.99999999999932,1.990977460573989e-16,1.0000000000000009,0.8120281400849243,1.3556881217171162,4.2839744646260876,9,1.0000000000000004,108240.99999999994,328.99999999999983,-3.4217816017322454e-16,1.0000000000000009,0.7842111273340705,1.0882622268942712,4.4762484049613418,4,1.0000000000000004,108241.00000000038,329.00000000000114,2.6996304550155782e-18,1.000000000000004,0.76504908035654873,1.1890598964371606,4.2138613873737967,7,1.000000000000002,108241.0000000001,329.00000000000028,7.154020705791282e-17,1.0,0.83945423805187047,1.4309844267934049,3.7008217934312198,6,1.0,108240.99999999983,328.99999999999949,-0.66014932688009009,0.48967404184734276,0.083592048161537938,A I am from hardware and dont know much about KNN, thatswhy I am asking for 
corrections to my code, if any. I added my dataset here.
I can see from your data that the number of samples is much smaller than the number of features, which may hurt prediction accuracy; the number of samples needs to be much higher. You can't expect to predict everything correctly; algorithms have their own accuracies. Try to check the correctness of this code using another well-known dataset such as Iris, or use the built-in KNN classifier from scikit-learn (a minimal sketch follows below).
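For the scikit-learn suggestion, a minimal sketch using the Iris dataset mentioned above might look like this (the split ratio and the value of k are arbitrary choices for illustration):

# Minimal sketch: built-in KNN classifier from scikit-learn, sanity-checked on Iris
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

# A small odd k is a common default; note that setting k to the whole training set
# (as in the question, k = len(trainingset)) makes the vote include every sample.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
print("Accuracy: %.2f%%" % (100.0 * accuracy_score(y_test, predictions)))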