Simple neural network refuses to overfit - python

I wrote this super simple piece of code:
from keras.models import Sequential
from keras.layers import Dense

# Single linear layer: d inputs -> 1 output, trained full-batch
model = Sequential()
model.add(Dense(1, input_dim=d, activation='linear'))
model.compile(loss='mse', optimizer='adam')
model.fit(X_train, y_train, epochs=10000, batch_size=n)
test_mse = model.evaluate(X_test, y_test)
print('test mse is {}'.format(test_mse))
X_train is an n-by-d numpy matrix and y_train is an n-by-1 numpy matrix.
This is basically the simplest linear neural network you could think of: one layer, input dimension d, and a single number as output.
It simply refuses to overfit. Even after an insane number of epochs (10k, as you can see), the training loss sits at around 0.17.
I expect the loss to be zero. Why do I expect that? Because in my case d is much greater than n, so I have far more degrees of freedom than equations. As a further piece of evidence, when I actually solve X_train @ w = y_train using numpy.linalg.lstsq, the maximum absolute value of X_train @ w - y_train is something like 1e-14.
So this system is definitely solvable. I expected to see zero loss or very close to zero loss, but I don't. Why?
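For reference, a minimal sketch of the lstsq check mentioned above, assuming X_train and y_train are the same numpy arrays used in the code:

import numpy as np

# Solve the (underdetermined, since d > n) system X_train w = y_train exactly.
w, residuals, rank, sv = np.linalg.lstsq(X_train, y_train, rcond=None)

# Maximum absolute residual; roughly 1e-14 here, i.e. an exact interpolating solution exists.
print(np.abs(X_train @ w - y_train).max())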


Computing the loss (MSE) for every iteration over time in TensorFlow

I want to use TensorBoard to plot the mean squared error (y-axis) for every iteration over a given time frame (x-axis), say 5 minutes.
However, I can only plot the MSE once per epoch and set a callback that stops training after 5 minutes. This does not, however, solve my problem.
I have searched for ways to set a maximum number of iterations rather than epochs in model.fit, but without luck. I know that the number of iterations is the number of batches needed to complete one epoch, but since I want to tune batch_size, I prefer to work in terms of iterations.
My code currently looks like the following:
import tensorflow as tf
from tensorflow import keras
import tensorflow_addons as tfa

input_size = len(train_dataset.keys())
output_size = 10
hidden_layer_size = 250
n_epochs = 3
weights_initializer = keras.initializers.GlorotUniform()

# A function that trains and validates the model and returns the MSE
def train_val_model(run_dir, hparams):
    model = keras.models.Sequential([
        # Layer to be used as an entry point into the network
        keras.layers.InputLayer(input_shape=[len(train_dataset.keys())]),
        # Dense layer 1
        keras.layers.Dense(hidden_layer_size, activation='relu',
                           kernel_initializer=weights_initializer,
                           name='Layer_1'),
        # Dense layer 2
        keras.layers.Dense(hidden_layer_size, activation='relu',
                           kernel_initializer=weights_initializer,
                           name='Layer_2'),
        # Activation is linear since we are doing regression
        keras.layers.Dense(output_size, activation='linear', name='Output_layer')
    ])
    # Use the stochastic gradient descent optimizer; change batch_size to get batch GD, SGD or mini-batch SGD
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.0,
                                        nesterov=False)
    # Compile the model
    model.compile(optimizer=optimizer,
                  loss='mean_squared_error',       # mean of squared errors between labels and predictions
                  metrics=['mean_squared_error'])  # mean squared error between y_true and y_pred
    # Initialize the TimeStopping callback
    time_stopping_callback = tfa.callbacks.TimeStopping(seconds=5*60, verbose=1)
    # Train the network
    history = model.fit(normed_train_data, train_labels,
                        epochs=n_epochs,
                        batch_size=hparams['batch_size'],
                        verbose=1,
                        #validation_split=0.2,
                        callbacks=[tf.keras.callbacks.TensorBoard(run_dir + "/Keras"),
                                   time_stopping_callback])
    return history

#train_val_model("logs/sample", {'batch_size': len(normed_train_data)})
train_val_model("logs/sample1", {'batch_size': 1})
%tensorboard --logdir_spec=BSG:logs/sample,SGD:logs/sample1
resulting in: (screenshot omitted: the MSE is only plotted once per epoch)
The desired output should look something like this: (screenshot omitted: the MSE plotted for every iteration)
The loss you see in TensorBoard is only written at the end of each epoch by default, which is why you cannot get a point for every iteration out of the box. If you want to tune the batch size, run for a set number of epochs and evaluate. Start from 16 and jump up in powers of 2 to see how far you can push your network. A bigger batch size is often said to improve performance, but the effect is usually not substantial enough to focus on it alone; tune other parts of the network first.
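As a rough illustration of that sweep, here is a minimal sketch reusing the train_val_model function defined in the question (the log-directory names are placeholders):

# Try batch sizes in powers of 2, logging each run to its own TensorBoard directory.
for batch_size in [16, 32, 64, 128, 256]:
    run_dir = "logs/batch_size_{}".format(batch_size)
    train_val_model(run_dir, {'batch_size': batch_size})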
The answer was actually quite simple.
tf.keras.callbacks.TensorBoard has an update_freq argument that controls when losses and metrics are written to TensorBoard. The default is 'epoch', but you can change it to 'batch', or to an integer if you want to write every n batches. See the documentation for more information: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TensorBoard
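For example, a minimal sketch of how the TensorBoard callback inside the question's train_val_model could be changed (variable names are reused from that function; the log path is unchanged):

# Write losses/metrics to TensorBoard after every batch instead of once per epoch.
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_dir + "/Keras", update_freq='batch')
# ...or only every 50 batches:
# tensorboard_cb = tf.keras.callbacks.TensorBoard(run_dir + "/Keras", update_freq=50)

history = model.fit(normed_train_data, train_labels,
                    epochs=n_epochs,
                    batch_size=hparams['batch_size'],
                    verbose=1,
                    callbacks=[tensorboard_cb, time_stopping_callback])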

Keras LSTM neural network for Time Series Predictions shows nan during model fit

I am training a neural network to predict a whole day of availability (144 samples, 6 features) from yesterday's availability (144 samples). I'm having trouble finding good resources or explanations on how to define a neural network for time-series prediction in a regression setting. The training is set up as a supervised learning problem. My definition of the neural network is:
lstm_neurons = 30
model = Sequential()
model.add(LSTM(lstm_neurons * 2, input_shape=(self.train_x.shape[1], self.train_x.shape[2]), return_sequences=True))
model.add(LSTM(lstm_neurons * 2))
model.add(Dense(len_day, activation='softmax'))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=[rmse, 'mae', 'mape'])
I am training for 20 epochs with a batch size of 200, and the datasets used have the following shapes:
Train X (9631, 144, 6)
Train Y (9631, 144)
Test X (137, 144, 6)
Test Y (137, 144)
Validation X (3990, 144, 6)
Validation Y (3990, 144)
All of this produces NaN values during training for the loss, RMSE, MAE, etc. Although this looks like a problem, I can still use the resulting model to generate predictions, and they look good-ish.
The first question to ask: are you trying to predict the time series by interpreting availability as a probability measure?
The softmax activation function would work best under that scenario, but you may be misspecifying it when you are in fact attempting to forecast an interval time series, which is likely why you are obtaining NaN readings in your results.
This example might be of use to you: in it, an LSTM is used to forecast weekly fluctuations in hotel cancellations.
Similarly to your case, X_train and X_val are reshaped as (samples, time steps, features):
X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
X_val = np.reshape(X_val, (X_val.shape[0], 1, X_val.shape[1]))
The LSTM network is defined as follows:
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense

# Generate LSTM network
model = tf.keras.Sequential()
model.add(LSTM(4, input_shape=(1, previous)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, Y_train, epochs=20, batch_size=1, verbose=2)
As you can see, the mean squared error is used as the loss function since the cancellation variable in question is interval (i.e. can take on a wide range of values and is not necessarily restricted by any particular scale).
I can only speculate as I have not seen your data or results, but you may be going wrong by defining softmax as your activation function when it is not appropriate - I suspect this is the case as you are also using mean squared error as the loss measurement.
In the above example, the Dense layer does not specify an activation function, so it defaults to a linear activation.
In terms of validating whether your LSTM time series forecast is effective, a potentially good idea is to compare the findings to those of a simpler time series model, e.g. ARIMA.
Using our example, ARIMA performed better when forecasting for Hotel 1, but LSTM performed better when forecasting for Hotel 2:
H1 Results

Reading    ARIMA     LSTM
MDA          0.86     0.80
RMSE        57.95    63.89
MFE        -12.72   -54.25

H2 Results

Reading    ARIMA     LSTM
MDA          0.86     0.80
RMSE       274.07    95.28
MFE        156.32    38.65
Finally, when creating your datasets using the train and validation sets, you must also ensure that you are using the correct previous parameter, i.e. the number of time periods going back with which you choose to regress against the observations at time t.
For instance, you are currently using only yesterday's availability, but you might find that the model improves when using the previous 5 or 10 days.
# Number of previous
previous = 5
X_train, Y_train = create_dataset(train, previous)
X_val, Y_val = create_dataset(val, previous)
In your situation, the first thing I would check is the use of the softmax activation function, and work from there.
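As a hedged sketch of that first change only, here is the question's model with the softmax removed so the output layer defaults to a linear activation (lstm_neurons, len_day and self.train_x are taken from the question; the custom rmse metric is omitted):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(lstm_neurons * 2,
               input_shape=(self.train_x.shape[1], self.train_x.shape[2]),
               return_sequences=True))
model.add(LSTM(lstm_neurons * 2))
# Default (linear) activation instead of softmax, since this is a regression task.
model.add(Dense(len_day))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae', 'mape'])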

Why is my accuracy always 0.2 in this simple code?

I am new to this field and am trying to re-run an example LSTM code copied from the internet. The accuracy of the LSTM model is always 0.2, but the predicted output is essentially correct, which means the accuracy should be 1. Could anyone tell me why?
from numpy import array
from keras.models import Sequential
from keras.layers import Dense, LSTM

length = 5
seq = array([i/float(length) for i in range(length)])
print(seq)
X = seq.reshape(length, 1, 1)
y = seq.reshape(length, 1)
# define LSTM configuration
n_neurons = length
n_batch = 1000
n_epoch = 1000
# create LSTM
model = Sequential()
model.add(LSTM(n_neurons, input_shape=(1, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
# train LSTM
model.fit(X, y, epochs=n_epoch, batch_size=n_batch)  # , verbose=2)
train_loss, train_acc = model.evaluate(X, y)
print('Training set accuracy:', train_acc)
result = model.predict(X, batch_size=n_batch, verbose=0)
for value in result:
    print('%.1f' % value)
You are measuring accuracy, but you are training a regressor. This means your output is a floating-point number, not a fixed categorical value.
If you change the last print to use 3 decimals of precision (print('%.3f' % value)) you will see that the predicted values are really close to the ground truth, but not exactly the same, which is why the accuracy is low:
0.039
0.198
0.392
0.597
0.788
For some reason, the accuracy metric being used (sparse_categorical_accuracy) counts the 0.0 target and the 0.039 (or similar) prediction as a hit instead of a miss, which is why you are getting 20% instead of 0%.
If you change the sequence to not contain zero, you will have 0% accuracy, which is less confusing:
seq = array([i/float(length) for i in range(1, length+1)])
Finally, to correct this, you can use, for example, mae instead of accuracy as the metric, and you will see the error going down:
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
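A minimal sketch of this fix applied to the question's script; only the metric and the final print change, and everything else from the question is assumed unchanged:

# Track mean absolute error instead of accuracy for this regression task.
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
model.fit(X, y, epochs=n_epoch, batch_size=n_batch)

train_loss, train_mae = model.evaluate(X, y)
print('Training set MAE:', train_mae)

# Print predictions with 3 decimals to see how close they are to the targets.
for value in model.predict(X, batch_size=n_batch, verbose=0):
    print('%.3f' % value)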
Another option would be to switch to a categorical framework (changing your floats to categorical values).
Hope this helps! I will edit the answer if I can dig into why sparse_categorical_accuracy counts the 0 as a hit rather than a miss.

Constant Output and Prediction Syntax with LSTM Keras Network

I am new to neural networks and have two, probably pretty basic, questions. I am setting up a generic LSTM network to predict the future of a sequence, based on multiple features.
My training data therefore has the shape (number of training sequences, length of each sequence, number of features for each timestep).
Or, to make it more specific, something like (2000, 10, 3).
I am trying to predict the value of one feature, not of all three.
Problem:
If I make my network deeper and/or wider, the only output I get is the constant mean of the values to be predicted. Take this setup for example:
z0 = Input(shape=[None, len(dataset[0])])
z = LSTM(32, return_sequences=True, activation='softsign', recurrent_activation='softsign')(z0)
z = LSTM(32, return_sequences=True, activation='softsign', recurrent_activation='softsign')(z)
z = LSTM(64, return_sequences=True, activation='softsign', recurrent_activation='softsign')(z)
z = LSTM(64, return_sequences=True, activation='softsign', recurrent_activation='softsign')(z)
z = LSTM(128, activation='softsign', recurrent_activation='softsign')(z)
z = Dense(1)(z)

model = Model(inputs=z0, outputs=z)
print(model.summary())
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
history = model.fit(trainX, trainY, validation_split=0.1, epochs=200, batch_size=32,
                    callbacks=[ReduceLROnPlateau(factor=0.67, patience=3, verbose=1, min_lr=1E-5),
                               EarlyStopping(patience=50, verbose=1)])
If I just use one layer, like:
z0 = Input(shape=[None, len(dataset[0])])
z = LSTM(4, activation='softsign', recurrent_activation='softsign')(z0)
z = Dense(1)(z)

model = Model(inputs=z0, outputs=z)
print(model.summary())
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
history = model.fit(trainX, trainY, validation_split=0.1, epochs=200, batch_size=32,
                    callbacks=[ReduceLROnPlateau(factor=0.67, patience=3, verbose=1, min_lr=1E-5),
                               EarlyStopping(patience=200, verbose=1)])
The predictions are somewhat reasonable, at least they are not constant anymore.
Why does that happen? Around 2000 samples is not that many, but in the case of overfitting I would expect the predictions to match the training data almost perfectly...
EDIT: Solved. As stated in the comments, it's just that Keras always expects batches.
When I use:
`test=model.predict(trainX[0])`
to get the prediction for the first sequence, I get a dimension error:
"Error when checking : expected input_1 to have 3 dimensions, but got array with shape (3, 3)"
I need to feed in an array of sequences like:
`test=model.predict(trainX[0:1])`
This is a workaround, but I am not really sure whether it has any deeper meaning or is just a syntax thing...
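For what it's worth, a minimal sketch of the same idea without slicing: Keras expects a batch dimension, so a single sequence can be wrapped in one explicitly (np.expand_dims is plain NumPy, not something from my original code):

import numpy as np

# trainX[0] has shape (timesteps, features); predict() expects (batch, timesteps, features).
single_sequence = np.expand_dims(trainX[0], axis=0)  # shape (1, timesteps, features)
test = model.predict(single_sequence)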
This is because you have not normalised your input data.
Any neural network model will initially have its weights initialised around zero. Since your training dataset contains only positive values, the model will try to adjust its weights to predict only positive values. However, the activation function (in your case softsign) saturates and maps those outputs towards 1, so the model can do little except adjust the bias. That is why you are getting an almost constant line around the average value of the dataset.
To fix this, you can use a general tool like scikit-learn to pre-process your data. If you are using a pandas DataFrame, something like this will help:
data_df = (data_df - data_df.mean()) / data_df.std()
Or, to keep the normalisation parameters inside the model, you can consider adding a batch normalization layer to your model.
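A minimal sketch of the scikit-learn route mentioned above (StandardScaler is standard sklearn; the variable names trainX and valX are placeholders rather than taken from the question):

from sklearn.preprocessing import StandardScaler

n_samples, n_timesteps, n_features = trainX.shape

# Fit the scaler on the training data flattened to 2D, then restore the 3D shape.
scaler = StandardScaler()
trainX_scaled = scaler.fit_transform(
    trainX.reshape(-1, n_features)).reshape(n_samples, n_timesteps, n_features)

# Apply the same scaler (without refitting) to any validation or test data.
valX_scaled = scaler.transform(
    valX.reshape(-1, n_features)).reshape(valX.shape)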

Python neural network accuracy - correct implementation?

I wrote a simple neural net/MLP and I'm getting some strange accuracy values and wanted to double check things.
This is my intended setup: a features matrix with 913 samples and 192 features, shape (913, 192). I'm classifying 2 outcomes, so my labels are binary and have shape (913, 1). One hidden layer with 100 units (for now). All activations use tanh, all layers use L2 regularization, and everything is optimized with SGD.
The code is below. It was written in Python with the Keras framework (http://keras.io/), but my question isn't specific to Keras.
# Note: this uses the old Keras 0.x API (Dense(input_dim, output_dim, ...), nb_epoch, show_accuracy).
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.optimizers import SGD
from keras.regularizers import l2
from keras.callbacks import History

input_size = 192
hidden_size = 100
output_size = 1
lambda_reg = 0.01
learning_rate = 0.01
num_epochs = 100
batch_size = 10

model = Sequential()
model.add(Dense(input_size, hidden_size, W_regularizer=l2(lambda_reg), init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(hidden_size, output_size, W_regularizer=l2(lambda_reg), init='uniform'))
model.add(Activation('tanh'))

sgd = SGD(lr=learning_rate, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd, class_mode="binary")

history = History()
model.fit(features_all, labels_all, batch_size=batch_size, nb_epoch=num_epochs,
          show_accuracy=True, verbose=2, validation_split=0.2, callbacks=[history])
score = model.evaluate(features_all, labels_all, show_accuracy=True, verbose=1)
I have 2 questions:
This is my first time using Keras, so I want to double check that the code I wrote is actually correct for what I want it to do in terms of my choice of parameters and their values etc.
Using the code above, I get training and test set accuracy hovering around 50-60%. Maybe I'm just using bad features, but I wanted to test to see what might be wrong, so I manually set all the labels and features to something that should be predictable:
labels_all[:500] = 1
labels_all[500:] = 0
features_all[:500] = np.ones(192)*500
features_all[500:] = np.ones(192)
So I set the first 500 samples to have a label of 1, and everything else is labelled 0. I set all the features to 500 for each of the first 500 samples, and all the features for the remaining samples to 1.
When I run this, I get a training accuracy of around 65% and a validation accuracy of around 0%. I was expecting both accuracies to be extremely high, almost perfect - is that expectation wrong? My thinking was that the samples with extremely high feature values all share the label 1, while the samples with low feature values get the label 0.
Mostly I'm just wondering if my code/model is incorrect or whether my logic is wrong
thanks!
I don't know that library, so I can't tell you if this is correctly implemented, but it looks legit.
I think your problem lies with the activation function: tanh(500) = 1 and tanh(1) = 0.76. This difference seems too small to me. Try using -1 instead of 500 for testing purposes, and normalize your real data to something like [-2, 2]. If you need the full range of real numbers, try a linear activation function. If you only care about the positive half of the real numbers, I propose softplus or ReLU. I've checked, and all of those functions are provided with Keras.
You can also try thresholding your output: answering 0.75 when expecting 1, or 0.25 when expecting 0, is valid, but it may impact your accuracy.
Also, try tweaking your parameters. Based on my own experience, I can propose that you use:
learning rate = 0.1
lambda in L2 = 0.2
number of epochs = 250 and bigger
batch size around 20-30
momentum = 0.1
learning rate decay about 10e-2 or 10e-3
I'd say that learning rate, number of epochs, momentum and lambda are the most important factors here - in order from most to least important.
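To make those suggestions concrete, here is a minimal sketch of the values to plug into the question's script (same old-style Keras API, same variable names); these are just one choice within the ranges above, not a verified configuration:

# Replace the corresponding values before building the model, since lambda_reg
# is baked into the Dense layers when they are constructed.
learning_rate = 0.1   # suggested learning rate
lambda_reg = 0.2      # suggested L2 lambda
num_epochs = 250      # suggested number of epochs
batch_size = 25       # within the suggested 20-30 range

# Normalize the features to roughly zero mean and unit variance (about [-2, 2]) before training.
features_all = (features_all - features_all.mean(axis=0)) / features_all.std(axis=0)

# Optimizer with the suggested momentum and learning-rate decay.
sgd = SGD(lr=learning_rate, decay=1e-2, momentum=0.1, nesterov=True)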
PS. I've just spotted that you're initializing your weights uniformly (is that even a word? I'm not a native speaker...). I can't tell you why, but my intuition tells me that this is a bad idea. I'd go with random initial weights.
