Python neural network accuracy - correct implementation?

I wrote a simple neural net/MLP and I'm getting some strange accuracy values and wanted to double check things.
This is my intended setup: a feature matrix with 913 samples and 192 features, i.e. shape (913, 192). I'm classifying 2 outcomes, so my labels are binary with shape (913, 1). One hidden layer with 100 units (for now). All activations use tanh, the loss uses L2 regularization on the weights, and everything is optimized with SGD.
The code is below. It was written in Python with the Keras framework (http://keras.io/), but my question isn't specific to Keras.
input_size = 192
hidden_size = 100
output_size = 1
lambda_reg = 0.01
learning_rate = 0.01
num_epochs = 100
batch_size = 10
model = Sequential()
model.add(Dense(input_size, hidden_size, W_regularizer=l2(lambda_reg), init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(hidden_size, output_size, W_regularizer=l2(lambda_reg), init='uniform'))
model.add(Activation('tanh'))
sgd = SGD(lr=learning_rate, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd, class_mode="binary")
history = History()
model.fit(features_all, labels_all, batch_size=batch_size, nb_epoch=num_epochs, show_accuracy=True, verbose=2, validation_split=0.2, callbacks=[history])
score = model.evaluate(features_all, labels_all, show_accuracy=True, verbose=1)
I have 2 questions:
This is my first time using Keras, so I want to double check that the code I wrote is actually correct for what I want it to do in terms of my choice of parameters and their values etc.
Using the code above, I get training and test set accuracy hovering around 50-60%. Maybe I'm just using bad features, but I wanted to test to see what might be wrong, so I manually set all the labels and features to something that should be predictable:
labels_all[:500] = 1
labels_all[500:] = 0
features_all[:500] = np.ones(192)*500
features_all[500:] = np.ones(192)
So I set the first 500 samples to have a label of 1, and everything else is labelled 0. I manually set every feature to 500 for each of the first 500 samples, and every feature of the remaining samples to 1.
When I run this, I get a training accuracy of around 65% and a validation accuracy of around 0%. I was expecting both accuracies to be extremely high/almost perfect - is that incorrect? My thinking was that the samples with extremely high feature values all share the label 1, while the samples with low feature values all get the label 0.
Mostly I'm just wondering if my code/model is incorrect or whether my logic is wrong
thanks!

I don't know that library, so I can't tell you if this is correctly implemented, but it looks legit.
I think your problem lies with the activation function - tanh(500) = 1 and tanh(1) = 0.76. That difference seems too small to me. Try using -1 instead of 500 for testing purposes, and normalize your real data to something around [-2, 2]. If you need the full range of real numbers, try a linear activation function. If you only care about the positive half of the real numbers, I propose softplus or ReLU. I've checked, and all those functions are provided by Keras.
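For example, a minimal sketch of that normalization, using the features_all array from the question (plain standardization rather than an exact [-2, 2] rescaling):
import numpy as np

# Standardize each feature to zero mean / unit variance, which keeps most
# values roughly within [-2, 2] and out of tanh's saturated region.
mean = features_all.mean(axis=0)
std = features_all.std(axis=0) + 1e-8   # avoid division by zero for constant features
features_all = (features_all - mean) / std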
You can try thresholding your output too - answers of 0.75 when expecting 1 and 0.25 when expecting 0 are valid, but they may hurt your reported accuracy.
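A minimal sketch of that kind of manual thresholding, reusing the names from the question's code (the 0.5 cutoff is just an example):
# Threshold the raw predictions yourself, then compare them with the labels.
preds = (model.predict(features_all) > 0.5).astype(int)
manual_accuracy = (preds == labels_all).mean()
print(manual_accuracy)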
Also, try tweaking your parameters. Based on my own experience, I can propose that you use:
learning rate = 0.1
lambda in L2 = 0.2
number of epochs = 250 and bigger
batch size around 20-30
momentum = 0.1
learning rate decay about 10e-2 or 10e-3
I'd say that learning rate, number of epochs, momentum and lambda are the most important factors here - in order from most to least important.
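For reference, a rough sketch of those values plugged into the question's code (same old Keras API as above; these are starting points, not tuned results):
sgd = SGD(lr=0.1, decay=0.01, momentum=0.1, nesterov=True)   # learning rate, decay and momentum as suggested
model.compile(loss='mean_squared_error', optimizer=sgd, class_mode="binary")
model.fit(features_all, labels_all, batch_size=25, nb_epoch=250,   # batch size and epoch count as suggested
          show_accuracy=True, verbose=2, validation_split=0.2)
# and l2(0.2) instead of l2(0.01) in the two Dense layers for the suggested lambda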
PS. I've just spotted that you're initializing your weights uniformly. I can't tell you exactly why, but my intuition tells me that this is a bad idea; I'd go with random initial weights.

Related

Keras loss value very high and not decreasing

Firstly, I know that similar questions have been asked before, but mainly for classification problems. Mine is a regression-style problem.
I am trying to train a neural network using Keras to evaluate chess positions using Stockfish evaluations. The input is boards as (12, 8, 8) arrays (representing the piece placement for each individual piece type) and the output is the evaluation in pawns. When training, the loss stagnates at around 500,000-600,000. I have a little over 12 million boards + evaluations and I train on all the data at once. The loss function is MSE.
This is my current code:
model = Sequential()
model.add(Dense(16, activation = "relu", input_shape = (12, 8, 8)))
model.add(Dropout(0.2))
model.add(Dense(16, activation = "relu"))
model.add(Dense(10, activation = "relu"))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(1, activation = "linear"))
model.compile(optimizer = "adam", loss = "mean_squared_error", metrics = ["mse"])
model.summary()
# model = load_model("model.h5")
boards = np.load("boards.npy")
evals = np.load("evals.npy")
perf = model.fit(boards, evals, epochs = 10).history
model.save("model.h5")
plt.figure(dpi = 600)
plt.title("Loss")
plt.plot(perf["loss"])
plt.show()
This is the output of a previous epoch:
145856/398997 [=========>....................] - ETA: 26:23 - loss: 593797.4375 - mse: 593797.4375
The loss will remain at 570,000-580,000 upon further fitting, which is not ideal. The loss should decrease by a few more orders of magnitude if I am not wrong.
What is the problem and how can I fix it to make the model learn better?
I would suspect that your evaluation data contains very big values, like 100,000 pawns when one of the sides has a forced win. Then, if your model predicts something like 0 in the same position, the squared error is very high, and this pushes the MSE up as well. You might want to check your evaluation data and ensure the values are in some limited range, like [-20, 20].
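For instance, a minimal sketch of that clipping, reusing the evals array from the question:
import numpy as np

evals = np.load("evals.npy")
evals = np.clip(evals, -20.0, 20.0)   # bound mate-like scores to a limited range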
Furthermore, evaluating a chess position is a very complex problem. It looks like your model has too few parameters for the task. Possible improvements:
Increase the number of neurons in your dense layers (say to 300, 200, 100).
Increase the number of hidden layers (say to 10).
Use convolutional layers (a rough sketch is shown right after this list).
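A minimal convolutional sketch, assuming tf.keras and treating the 12 piece planes as image channels (the layer sizes are illustrative only, not tuned):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

# Move the 12 piece planes to the channel axis: (N, 12, 8, 8) -> (N, 8, 8, 12)
boards_nhwc = np.transpose(boards, (0, 2, 3, 1))

model = Sequential([
    Conv2D(64, (3, 3), activation="relu", padding="same", input_shape=(8, 8, 12)),
    Conv2D(64, (3, 3), activation="relu", padding="same"),
    Flatten(),
    Dense(256, activation="relu"),
    Dense(1, activation="linear"),
])
model.compile(optimizer="adam", loss="mean_squared_error")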
Besides this, you might want to create a simple "baseline model" to better evaluate the performance of your neural network. This baseline model could be just a Python function that runs on the input data and evaluates the position by counting material (bishop = 3 pawns, rook = 5, etc.). Then you can run this function on your dataset and compute its MSE. If your neural network produces a smaller MSE than this baseline model, then it is really learning some useful patterns.
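A minimal sketch of such a baseline, using the boards and evals arrays from the question and under the purely hypothetical assumption that planes 0-5 hold the white pawn/knight/bishop/rook/queen/king occupancies and planes 6-11 the corresponding black pieces (adjust to your actual encoding):
import numpy as np

# Standard material values in pawns; the king counts as 0 here.
PIECE_VALUES = np.array([1, 3, 3, 5, 9, 0], dtype=np.float32)

counts = boards.reshape(len(boards), 12, -1).sum(axis=2)   # pieces per plane, shape (N, 12)
white = counts[:, :6] @ PIECE_VALUES
black = counts[:, 6:] @ PIECE_VALUES
baseline_preds = white - black                             # material balance in pawns

baseline_mse = np.mean((baseline_preds - evals) ** 2)
print("baseline MSE:", baseline_mse)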
I also recommend the following book: "Neural Networks For Chess: The magic of deep and reinforcement learning revealed" by Dominik Klein. The book contains a description of network architecture used in AlphaZero chess engine and a neural network used in Stockfish.

Measuring uncertainty in a Bayesian Neural Network

Hi everybody,
I'm getting started with TensorFlow Probability and I'm having some difficulty interpreting my Bayesian neural network's outputs.
I'm working on a regression case and started from the example provided in the TensorFlow notebook here: https://blog.tensorflow.org/2019/03/regression-with-probabilistic-layers-in.html?hl=fr
As I want to know the uncertainty of my network's predictions, I dived directly into example 4 with Aleatoric & Epistemic Uncertainty. You can find my code below:
def negative_loglikelihood(targets, estimated_distribution):
    return -estimated_distribution.log_prob(targets)

def posterior_mean_field(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size  # total number of parameters (weights and biases)
    c = np.log(np.expm1(1.))
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(2 * n, dtype=dtype,
                                 initializer=lambda shape, dtype: random_gaussian_initializer(shape, dtype),
                                 trainable=True),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            # The Normal distribution with location loc and scale parameters.
            tfd.Normal(loc=t[..., :n],
                       scale=1e-5 + 0.01 * tf.nn.softplus(c + t[..., n:])),
            reinterpreted_batch_ndims=1)),
    ])

def prior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t, scale=1),
            reinterpreted_batch_ndims=1)),
    ])

def build_model(param):
    model = keras.Sequential()
    for i in range(param["n_layers"]):
        name = "n_units_l" + str(i)
        num_hidden = param[name]
        model.add(tfp.layers.DenseVariational(units=num_hidden, make_prior_fn=prior,
                                              make_posterior_fn=posterior_mean_field,
                                              kl_weight=1 / len(X_train), activation="relu"))
    model.add(tfp.layers.DenseVariational(units=2, make_prior_fn=prior,
                                          make_posterior_fn=posterior_mean_field,
                                          activation="relu", kl_weight=1 / len(X_train)))
    model.add(tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t[..., :1],
                                                                 scale=1e-3 + tf.math.softplus(0.01 * t[..., 1:]))))
    lr = param["learning_rate"]
    optimizer = optimizers.Adam(learning_rate=lr)
    model.compile(
        loss=negative_loglikelihood,
        optimizer=optimizer,
        metrics=[keras.metrics.RootMeanSquaredError()],
    )
    return model
I think I have the same network as in the TFP example; I just added a few hidden layers with different numbers of units. I also added a factor of 0.01 in front of the softplus in the posterior, as suggested here, which allows the network to reach good performance.
Not able to get reasonable results from DenseVariational
The performance of the model is very good (less than 1% error), but I have some questions:
As Bayesian neural networks "promise" to measure the uncertainty of their predictions, I was expecting bigger errors on high-variance predictions. I plotted the absolute error versus the variance, and the results are not good enough in my opinion. Of course, the model is better at low variance, but I can still get really bad predictions at low variance, and therefore I cannot really use the standard deviation to filter out bad predictions. Why is my Bayesian neural network struggling to give me the uncertainty?
The previous network was trained for 2000 epochs, and we can notice a strange phenomenon: a vertical bar at the lowest stdv. If I increase the number of epochs up to 25000, my results get better on both the training and validation sets.
But the vertical-bar phenomenon that we may notice in figure 1 becomes much more obvious. It seems that the more I increase the number of epochs, the more all the output variances converge to 0.68. Is that a case of overfitting? Why this value of 0.6931571960449219, and why can't I get a lower stdv? Since the phenomenon starts appearing at 2000 epochs, am I already overfitting at 2000 epochs?
At this point the stdv is totally useless. So is there some kind of trade-off? With few epochs my model performs worse but gives me some insight into the uncertainty (even if I think it's not sufficient), whereas with many epochs I get better performance but no more uncertainty information, since all outputs have the same stdv.
Sorry for the long post and the language mistakes.
Thank you in advance for your help and any feedback.
I solved the problem of why my uncertainty could not get lower than 0.6931571960449219.
Actually this value converges to log(2). This is due to the relu activation function on my last DenseVariational layer.
Indeed, the scale of the tfd.Normal is a softplus (tf.math.softplus).
And softplus is implemented as softplus(x) = log(exp(x) + 1). Since my x never goes negative, my minimum uncertainty is log(2).
A plain linear activation function solved the problem, and my uncertainty behaves normally now.
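A quick sanity check of that minimum, as a plain NumPy sketch:
import numpy as np

def softplus(x):
    return np.log(np.exp(x) + 1.0)

print(softplus(0.0))   # 0.6931471805599453
print(np.log(2.0))     # 0.6931471805599453, so softplus(x) >= log(2) whenever x >= 0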

Neural network versus random forest performance discrepancy

I want to run some experiments with neural networks using PyTorch, so I tried a simple one as a warm-up exercise, and I cannot quite make sense of the results.
The exercise attempts to predict the rating of 1000 TPTP problems from various statistics about the problems, such as the number of variables, maximum clause length, etc. The data file https://github.com/russellw/ml/blob/master/test.csv is quite straightforward: 1000 rows, with the final column being the rating. I started off with some tens of input columns, all numbers scaled to the range 0-1, and progressively deleted features to see if the result still held; it does, all the way down to one input column (the others are in previous versions in the Git history).
I started off using separate training and test sets, but have set aside the test set for the moment, because the question of whether training performance generalizes to testing doesn't arise until training performance has been obtained in the first place.
Simple linear regression on this data set has a mean squared error of about 0.14.
I implemented a simple feedforward neural network, code in https://github.com/russellw/ml/blob/master/test_nn.py and copied below, that after a couple of hundred training epochs also has a mean squared error of 0.14.
So I tried changing the number of hidden layers from 1 to 2 to 3, using a few different optimizers, tweaking the learning rate, switching the activation functions from relu to tanh to a mixture of both, increasing the number of epochs to 5000, increasing the number of hidden units to 1000. At this point, it should easily have had the ability to just memorize the entire data set. (At this point I'm not concerned about overfitting. I'm just trying to get the mean squared error on training data to be something other than 0.14.) Nothing made any difference. Still 0.14. I would say it must be stuck in a local optimum, but that's not supposed to happen when you've got a couple million weights; it's supposed to be practically impossible to be in a local optimum for all parameters simultaneously. And I do get slightly different sequences of numbers on each run. But it always converges to 0.14.
Now the obvious conclusion would be that 0.14 is as good as it gets for this problem, except that it stays the same even when the network has enough memory to just memorize all the data. But the clincher is that I also tried a random forest, https://github.com/russellw/ml/blob/master/test_rf.py
... and the random forest has a mean squared error of 0.01 on the original data set, degrading gracefully as features are deleted, still 0.05 on the data with just one feature.
Nowhere in the lore of machine learning is it said 'random forests vastly outperform neural nets', so I'm presumably doing something wrong, but I can't see what it is. Maybe it's something as simple as just missing a flag or something you need to set in PyTorch. I would appreciate it if someone could take a look.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

# data
df = pd.read_csv("test.csv")
print(df)
print()

# separate the output column
y_name = df.columns[-1]
y_df = df[y_name]
X_df = df.drop(y_name, axis=1)

# numpy arrays
X_ar = np.array(X_df, dtype=np.float32)
y_ar = np.array(y_df, dtype=np.float32)

# torch tensors
X_tensor = torch.from_numpy(X_ar)
y_tensor = torch.from_numpy(y_ar)

# hyperparameters
in_features = X_ar.shape[1]
hidden_size = 100
out_features = 1
epochs = 500

# model
class Net(nn.Module):
    def __init__(self, hidden_size):
        super(Net, self).__init__()
        self.L0 = nn.Linear(in_features, hidden_size)
        self.N0 = nn.ReLU()
        self.L1 = nn.Linear(hidden_size, hidden_size)
        self.N1 = nn.Tanh()
        self.L2 = nn.Linear(hidden_size, hidden_size)
        self.N2 = nn.ReLU()
        self.L3 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = self.L0(x)
        x = self.N0(x)
        x = self.L1(x)
        x = self.N1(x)
        x = self.L2(x)
        x = self.N2(x)
        x = self.L3(x)
        return x

model = Net(hidden_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# train
print("training")
for epoch in range(1, epochs + 1):
    # forward
    output = model(X_tensor)
    cost = criterion(output, y_tensor)

    # backward
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

    # print progress
    if epoch % (epochs // 10) == 0:
        print(f"{epoch:6d} {cost.item():10f}")
print()

output = model(X_tensor)
cost = criterion(output, y_tensor)
print("mean squared error:", cost.item())
Can you please print the shape of your input?
I would say check these things first:
Check that your target y has the shape (-1, 1); I don't know if PyTorch throws an error in this case. You can use y.reshape(-1, 1) if it isn't 2-dimensional (see the sketch after this list).
Your learning rate is high. Usually when using Adam the default value is good enough, or simply try lowering your learning rate; 0.1 is a high value for a learning rate to start with.
Place optimizer.zero_grad() as the first line inside the for loop.
Normalize/standardize your data (this is usually good for NNs).
Remove outliers from your data (my opinion: I think this can't affect a random forest so much, but it can affect NNs badly).
Use cross-validation (maybe skorch can help you here; it's a scikit-learn wrapper for PyTorch and easy to use if you know Keras).
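A minimal sketch applying the first three points to the question's code (variable names taken from the question; this is an illustration, not a tuned fix):
# Make the target 2-D so it matches the model output's shape inside MSELoss.
y_tensor = torch.from_numpy(y_ar).reshape(-1, 1)

# Use Adam's default learning rate instead of 0.1.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(1, epochs + 1):
    optimizer.zero_grad()                 # clear gradients at the top of the loop
    output = model(X_tensor)
    cost = criterion(output, y_tensor)
    cost.backward()
    optimizer.step()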
Notice that a random forest regressor, or any other regressor, can outperform neural nets in some cases. There are some fields where neural nets are the heroes, like image classification or NLP, but you need to be aware that a simple regression algorithm can outperform them, usually when your data is not big enough.

What should I do to get low average loss?

I'm a student in hydraulic engineering, working on a neural network during my internship, so this is something new for me.
I created my neural network, but it gives me a high loss and I don't know what the problem is. You can see the code below:
def create_model():
    model = Sequential()
    # Adding the input layer
    model.add(Dense(26, activation='relu', input_shape=(n_cols,)))
    # Adding the hidden layer
    model.add(Dense(60, activation='relu'))
    model.add(Dense(60, activation='relu'))
    model.add(Dense(60, activation='relu'))
    # Adding the output layer
    model.add(Dense(2))
    # Compiling the RNN
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
    return model

kf = KFold(n_splits=5, shuffle=True)
model = create_model()
scores = []
for i in range(5):
    result = next(kf.split(data_input), None)
    input_train = data_input[result[0]]
    input_test = data_input[result[1]]
    output_train = data_output[result[0]]
    output_test = data_output[result[1]]
    # Fitting the RNN to the Training set
    model.fit(input_train, output_train, epochs=5000, batch_size=200, verbose=2)
    predictions = model.predict(input_test)
    scores.append(model.evaluate(input_test, output_test))

print('Scores from each Iteration: ', scores)
print('Average K-Fold Score :', np.mean(scores))
And when I execute my code, the result looks like this:
Scores from each Iteration: [[93.90406122928908, 0.8907562990148529], [89.5892979597845, 0.8907563030218878], [81.26530176050522, 0.9327731132507324], [56.46526102659081, 0.9495798339362905], [54.314151876112994, 0.9579831877676379]]
Average K-Fold Score : 38.0159922589274
Can anyone help me, please? What can I do to make the loss lower?
There are several issues, both with your questions and with your code...
To start with, in general we cannot say that an MSE loss of X value is low or high. Unlike the accuracy in classification problems which is by definition in [0, 1], the loss is not similarly bounded, so there is no general way of saying that a particular value is low or high, as you imply here (it always depends on the specific problem).
Having clarified this, let's go to your code.
First, judging from your loss='mean_squared_error', it would seem that you are in a regression setting, in which accuracy is meaningless; see What function defines accuracy in Keras when the loss is mean squared error (MSE)?. You have not shared what exact problem you are trying to solve here, but if it is indeed a regression one (i.e. prediction of some numeric value), you should get rid of metrics=['accuracy'] in your model compilation, and possibly change your last layer to a single unit, i.e. model.add(Dense(1)).
Second, as your code currently is, you don't actually fit independent models from scratch in each of your CV folds (which is the very essence of CV); in Keras, model.fit works cumulatively, i.e. it does not "reset" the model each time it is called, but it continues fitting from the previous call. That's exactly why if you see your scores, it is evident that the model is significantly better in the later folds (which already gives a hint for improving: add more epochs). To fit independent models as you should do for a proper CV, you should move create_model() inside the for loop.
Third, your usage of np.mean() here is again meaningless, as you average both the loss and the accuracy (i.e. apples with oranges) together; the fact that from 5 values of loss between 54 and 94 you end up with an "average" of 38 should have already alerted you that you are attempting something wrong. Truth is, if you dismiss the accuracy metric, as argued above, you would not have this problem here.
All in all, here is how it seems that your code should be in principle (but again, I have not the slightest idea of the exact problem you are trying to solve, so some details might be different):
def create_model():
    model = Sequential()
    # Adding the input layer
    model.add(Dense(26, activation='relu', input_shape=(n_cols,)))
    # Adding the hidden layer
    model.add(Dense(60, activation='relu'))
    model.add(Dense(60, activation='relu'))
    model.add(Dense(60, activation='relu'))
    # Adding the output layer
    model.add(Dense(1))  # change to 1 unit
    # Compiling the RNN
    model.compile(optimizer='adam', loss='mean_squared_error')  # dismiss accuracy
    return model

kf = KFold(n_splits=5, shuffle=True)
scores = []
for i in range(5):
    result = next(kf.split(data_input), None)
    input_train = data_input[result[0]]
    input_test = data_input[result[1]]
    output_train = data_output[result[0]]
    output_test = data_output[result[1]]
    # Fitting the RNN to the Training set
    model = create_model()  # move create_model here
    model.fit(input_train, output_train, epochs=10000, batch_size=200, verbose=2)  # increase the epochs
    predictions = model.predict(input_test)
    scores.append(model.evaluate(input_test, output_test))

print('Loss from each Iteration: ', scores)
print('Average K-Fold Loss :', np.mean(scores))

Keras RNN (GRU, LSTM) produces plateau and then improvement

I'm new to Keras (with TensorFlow backend) and am using it to do some simple sentiment analysis on user reviews. For some reason, my recurrent neural network is producing some unusual results that I do not understand.
First, my data is a straightforward sentiment analysis training and test set from the UCI ML archive. There were 2061 training instances, which is small. The data looks like this:
text label
0 So there is no way for me to plug it in here i... 0
1 Good case, Excellent value. 1
2 Great for the jawbone. 1
3 Tied to charger for conversations lasting more... 0
4 The mic is great. 1
Second, here is an FFNN implementation that produces good results.
# FFNN model.
# Build the model.
model_ffnn = Sequential()
model_ffnn.add(layers.Embedding(input_dim=V, output_dim=32))
model_ffnn.add(layers.GlobalMaxPool1D())
model_ffnn.add(layers.Dense(10, activation='relu'))
model_ffnn.add(layers.Dense(1, activation='sigmoid'))
model_ffnn.summary()
# Compile and train.
model_ffnn.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
EPOCHS = 50
history_ffnn = model_ffnn.fit(x_train, y_train, epochs=EPOCHS,
                              batch_size=128, validation_split=0.2, verbose=3)
As you can see, the learning curves produce a smooth improvement as the number of epochs increases.
Third, here is the problem. I trained a recurrent neural network with a GRU, as shown below. I also tried an LSTM and saw the same results.
# GRU model.
# Build the model.
model_gru = Sequential()
model_gru.add(layers.Embedding(input_dim=V, output_dim=32))
model_gru.add(layers.GRU(units=32))
model_gru.add(layers.Dense(units=1, activation='sigmoid'))
model_gru.summary()
# Compile and train.
model_gru.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
EPOCHS = 50
history_gru = model_gru.fit(x_train, y_train, epochs=EPOCHS,
                            batch_size=128, validation_split=0.2, verbose=3)
However, the learning curves are quite unusual. You can see a plateau where neither the loss nor the accuracy improve up to about epoch 17, and then the model starts learning and improving. I have never seen this type of plateau at the start of training before.
Can anyone explain why this plateau is occurring, why it stops and gives way to gradual learning, and how I can avoid it?
Following the comment by @Gerges Dib, I tried out different learning rates in increasing order:
lr = 0.0001
lr = 0.001 (the default learning rate for RMSprop)
lr = 0.01
lr = 0.05
lr = 0.1
This is very interesting. It looks like the plateau was caused by the optimizer's learning rate being too low: the parameters were stuck in a local optimum until they could break out. I have not seen this pattern before.
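For anyone wanting to reproduce this, a minimal sketch of setting the RMSprop learning rate explicitly (old-style Keras API as in the question; newer Keras versions use learning_rate instead of lr):
from keras import optimizers

model_gru.compile(optimizer=optimizers.RMSprop(lr=0.01),
                  loss='binary_crossentropy', metrics=['acc'])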
