I have a data set with two inputs x1, x2 and an output vector of 46 attributes: 1 binary value (0/1) and 45 real numbers. I would like to use a different loss function for the binary value than for the 45 real numbers, namely binary cross-entropy and mean squared error.
My knowledge of Keras is very limited, so I am not even sure if this is the architecture I want. Is this the right way of doing this?
First, the preprocessing:
# imports assumed for this snippet
import pandas
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# load dataset
dataframe = pandas.read_csv("inputs.csv", delim_whitespace=True, header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:2]
Y = dataset[:,3:]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2,
random_state=123)
y_train_L, y_train_R = y_train[:,0], y_train[:,1:]
y_train_L = y_train_L.reshape(-1,1)
scalarX, scalarY_L, scalarY_R = MinMaxScaler(), MinMaxScaler(), MinMaxScaler()
scalarX.fit(x_train)
scalarY_L.fit(y_train_L)
scalarY_R.fit(y_train_R)
x_train = scalarX.transform(x_train)
y_train_L = scalarY_L.transform(y_train_L)
y_train_R = scalarY_R.transform(y_train_R)
where y_train_L (the left part) holds just the binary values and y_train_R holds the real numbers. I had to split them because of how the architecture is defined:
# define and fit the final model
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(x_train.shape[1],))
first = Dense(46, activation='relu')(inputs)
#last
layer45 = Dense(45, activation='linear')(first)
layer1 = Dense(1, activation='tanh')(first)
out = [layer1,layer45]
#end last
model = Model(inputs=inputs,outputs=out)
model.compile(loss=['binary_crossentropy','mean_squared_error'], optimizer='adam')
model.fit(x_train, [y_train_L,y_train_R], epochs=1000, verbose=1)
Xnew = scalarX.transform(x_test)
y_test_L, y_test_R = y_test[:,0], y_test[:,1:]
y_test_L = y_test_L.reshape(-1,1)
y_test_L=scalarY_L.transform(y_test_L)
y_test_R=scalarY_R.transform(y_test_R)
# make a prediction
ynew = model.predict(Xnew)
loss=['binary_crossentropy','mean_squared_error'] expects two separate target arrays, hence model.fit(x_train, [y_train_L, y_train_R]).
Then I have to do all the 'funny' tricks to get the predicted values and compare them side by side, because ynew = model.predict(Xnew) returns a list of two arrays, one for the binary values and one for the real numbers.
ynew = model.predict(Xnew)
# show the inputs and predicted outputs
print("SCALED VALUES")
for i in range(len(Xnew)):
    print("X=%s\n P=%s,%s\n A=%s,%s" % (Xnew[i], ynew[0][i], ynew[1][i], y_test_L[i], y_test_R[i]))
inversed_X_test = scalarX.inverse_transform(Xnew)
inversed_Y_test_L = scalarY_L.inverse_transform(y_test_L)
inversed_Y_test_R = scalarY_R.inverse_transform(y_test_R)
inversed_y_predicted_L = scalarY_L.inverse_transform(ynew[0])
inversed_y_predicted_R = scalarY_R.inverse_transform(ynew[1])
print("REAL VALUES")
for i in range(len(inversed_X_test)):
    print("X=%s\n P=%s,%s\n A=%s,%s" % (inversed_X_test[i], inversed_y_predicted_L[i], inversed_y_predicted_R[i], inversed_Y_test_L[i], inversed_Y_test_R[i]))
Questions:
Can I achieve this in a cleaner way?
How can I measure the loss? I would like to create a chart of the loss values during training.
1) The way you define your model seems correct and there is no 'cleaner' way of doing it (I would argue that Keras' functional API is as clean as it gets)
2) To visualize training loss, store the history of training in a variable:
history = model.fit(...)
This history object will contain the train and validation losses for each epoch; you can use it to make plots.
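For illustration, a minimal plotting sketch (assuming matplotlib is available; the exact keys in history.history depend on your output layer names, so inspect them first):
import matplotlib.pyplot as plt

history = model.fit(x_train, [y_train_L, y_train_R],
                    validation_split=0.2, epochs=1000, verbose=1)

print(history.history.keys())  # per-output losses appear here, e.g. 'loss', 'val_loss', ...

plt.plot(history.history['loss'], label='train loss')
if 'val_loss' in history.history:
    plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()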
3) In your classification output (layer1), you want to use a sigmoid activation instead of tanh. The sigmoid function returns values between 0 and 1, tanh returns values between -1 and 1. Your binary_crossentropy loss function expects the former.
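Concretely, point 3 is a one-line change to the classification head:
# sigmoid squashes the output into (0, 1), which is what binary_crossentropy expects
layer1 = Dense(1, activation='sigmoid')(first)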
Related
I'm trying to test an LSTM model on the following time series:
As you can see it is stationary and periodic (not that this matters, but it should be pretty easy for a neural net to pick up). This is in fact a coordinate of a simple pendulum vs time.
The steps for preprocessing are the following:
Scale this array using MinMaxScaler.
My model will predict x[t] using x[t-1] up to x[t-5]
# imports assumed for this snippet; x is the 1-D array of pendulum coordinates
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X = scaler.fit_transform(x.reshape(-1,1))
lookback = 5
features=1
model_input, labels = [],[]
for i in range(X.shape[0]-lookback):
    model_input.append(X[i:i+lookback])
    labels.append(X[i+lookback])
model_input = np.asarray(model_input)
labels = np.asarray(labels)
model_input.shape, labels.shape
which returns ((495, 5, 1), (495, 1)); this makes sense because my t has 500 steps.
Then I build and train the model:
#train on the first 400 steps, predict on the next 100
train_in, train_out = model_input[:400], labels[:400]
test_out = labels[400:]
# imports assumed for this snippet
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(64, input_shape = (lookback, features))) #input shape is (batch, timesteps, features)
model.add(Dense(1))
model.compile(optimizer = 'adam', loss = 'mse')
#train
model.fit(train_in, train_out, epochs = 30)
Finally, I want to test my model. I don't want to simply call predict on the true test inputs here. I want to use the last 5 coordinates of the training set to generate a prediction for the first step of the test set. Then I will use this prediction as an input to calculate the next position, and so on...
Here is the code:
#now we make predictions
preds = []
preds_input = train_in[-1:] #to make the first prediction on the test set, we start with the last batch of the training set
for i in range(test_out.shape[0]):
    # the next step is the prediction on preds_input
    next_step = model.predict(preds_input, verbose=0)
    # append next_step to preds
    preds.append(next_step)
    # append next_step to preds_input and remove the first value so it keeps shape (1, 5, 1)
    preds_input = np.append(preds_input, next_step.reshape(1,1,1), axis=1)
    preds_input = preds_input[:, 1:, :]
I then rescaled the predictions and the testing data using inverse_transform and plotted the results.
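For reference, that rescaling step might look like this (a sketch; preds holds the (1, 1) arrays returned by predict):
# stack the per-step predictions into shape (n_steps, 1) and undo the MinMax scaling
preds_arr = np.asarray(preds).reshape(-1, 1)
preds_rescaled = scaler.inverse_transform(preds_arr)
test_rescaled = scaler.inverse_transform(test_out)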
This is what I got
I'm not able to understand why my model performed so poorly. The pattern is simple and it should've been able to pick it up. Any help would be great!
I want to train a Siamese Network to compare vectors for similarity.
My dataset consists of pairs of vectors and a target column with "1" if they are the same and "0" otherwise (binary classification):
import pandas as pd
# Define train and test sets.
X_train_val = pd.read_csv("train.csv")
print(X_train_val.head())
y_train_val = X_train_val.pop("class")
print(y_train_val.value_counts())
# Keep 50% of X_train_val in validation set.
X_train, X_val = X_train_val[:991], X_train_val[991:]
y_train, y_val = y_train_val[:991], y_train_val[991:]
del X_train_val, y_train_val
# Split our data to 'left' and 'right' inputs (one for each side Siamese).
X_left_train, X_right_train = X_train.iloc[:, :200], X_train.iloc[:, 200:]
X_left_val, X_right_val = X_val.iloc[:, :200], X_val.iloc[:, 200:]
assert X_left_train.shape == X_right_train.shape
# Repeat for test set.
X_test = pd.read_csv("test.csv")
y_test = X_test.pop("class")
print(y_test.value_counts())
X_left_test, X_right_test = X_test.iloc[:, :200], X_test.iloc[:, 200:]
returns
v0 v1 v2 ... v397 v398 v399 class
0 0.003615 0.013794 0.030388 ... -0.093931 0.106202 0.034870 0.0
1 0.018988 0.056302 0.002915 ... -0.007905 0.100859 -0.043529 0.0
2 0.072516 0.125697 0.111230 ... -0.010007 0.064125 -0.085632 0.0
3 0.051016 0.066028 0.082519 ... 0.012677 0.043831 -0.073935 1.0
4 0.020367 0.026446 0.015681 ... 0.062367 -0.022781 -0.032091 0.0
1.0 1060
0.0 923
Name: class, dtype: int64
1.0 354
0.0 308
Name: class, dtype: int64
The rest of my script is as follows:
import keras
import keras.backend as K
from keras.layers import Dense, Dropout, Input, Lambda
from keras.models import Model
def euclidean_distance(vectors):
    """
    Find the Euclidean distance between two vectors.
    """
    x, y = vectors
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    # Epsilon is a small value that makes very little difference to the value of the denominator, but ensures that it isn't exactly zero.
    return K.sqrt(K.maximum(sum_square, K.epsilon()))

def contrastive_loss(y_true, y_pred):
    """
    Distance-based loss function that tries to ensure that data samples that are semantically similar are embedded closer together.
    See:
    * https://gombru.github.io/2019/04/03/ranking_loss/
    """
    margin = 1
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))

def accuracy(y_true, y_pred):
    """
    Compute classification accuracy with a fixed threshold on distances.
    """
    return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))

def create_base_network(input_dim: int, dense_units: int, dropout_rate: float):
    input1 = Input(input_dim, name="encoder")
    x = input1
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu", name="Embeddings")(x)
    return Model(input1, x)

def build_siamese_model(input_dim: int):
    shared_network = create_base_network(input_dim, dense_units=128, dropout_rate=0.1)
    left_input = Input(input_dim)
    right_input = Input(input_dim)
    # Since this is a siamese nn, both sides share the same network.
    encoded_l = shared_network(left_input)
    encoded_r = shared_network(right_input)
    # The Euclidean distance layer outputs a value close to 0 when the two inputs are similar and a larger value otherwise.
    distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
    siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
    siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
    return siamese_net
model = build_siamese_model(X_left_train.shape[1])
es_callback = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, verbose=0)
history = model.fit(
    [X_left_train, X_right_train],
    y_train,
    validation_data=([X_left_val, X_right_val], y_val),
    epochs=100,
    callbacks=[es_callback],
    verbose=1,
)
I have plotted the contrastive loss vs epoch and model accuracy vs epoch:
The validation line is almost flat, which seems odd to me (overfitted?).
After changing the dropout of the shared network from 0.1 to 0.5, I get the following results:
Somehow it looks better, but yields bad predictions as well.
My questions are:
Most examples of Siamese Networks I've seen so far involve embedding layers (text pairs) and/or convolution layers (image pairs). My input pairs are actual vector representations of some text, which is why I used Dense layers for the shared network. Is this the proper approach?
The output layer of my Siamese Network is as follows:
distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
but someone on the internet suggested this instead:
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="sigmoid")(distance) # returns the class probability
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
I'm not sure which one to trust nor the difference between them (except that the former returns the distance and the latter returns the class probability). In my experiments, I get poor results with binary_crossentropy.
EDIT:
After following PlzBePython's suggestions, I came up with the following:
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="linear")(distance)
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
Thank you for your help!
This is less of an answer and more writing my thoughts down and hoping they can help find an answer.
In general, everything you do seems pretty reasonable to me.
Regarding your Questions:
1:
Embedding or feature extraction layers are never a must, but they almost always make it easier to learn the intended task. You can think of them as providing your distance model with a comprehensive summary of a sentence instead of its raw words. This also makes your model independent of the position of a word. In your case, creating the summary/important features of a sentence and embedding similar sentences close to each other is done by the same network. Of course, this can also work, and I don't even think it's a bad approach. However, I would maybe increase the network size.
2:
In my opinion, those two loss functions are not too different. Binary cross-entropy is defined as

BCE = -(1/N) * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

while contrastive loss (margin = 1) is

L = (1/N) * sum_i [ y_i * d_i^2 + (1 - y_i) * max(1 - d_i, 0)^2 ]

where p_i is the predicted probability and d_i is the predicted distance.
So you basically swap a log function for a hinge function.
The only real difference comes from the distance calculation. You were probably advised to use some kind of L1 distance, since L2 distance is supposed to perform worse in higher dimensions (see for example here) and your dimensionality is 128. Personally, I would rather go with L1 in your case, but I don't think it's a dealbreaker.
What I would try is:
increase the margin parameter. "1" always results in a pretty low loss in the false positive case. This could slow down training in general
try out embedding into the [-inf, inf] space (change last layer embedding activation to "linear")
change "binary_crossentropy" loss into "keras.losses.BinaryCrossentropy(from_logits=True)" and last activation from "sigmoid" to "linear". This should actually not make a difference, but I've made some weird experiences with the keras binary crossentropy function and from_logits seems to help sometimes
increase the number of parameters
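A minimal sketch of the third suggestion, reusing the encoded_l / encoded_r and left_input / right_input tensors from build_siamese_model in the question (illustrative only, not tuned):
import keras
import keras.backend as K
from keras.layers import Dense, Lambda
from keras.models import Model

# L1 distance between the two embeddings, followed by a raw (linear) logit output
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="linear")(distance)  # logit in (-inf, inf), no sigmoid here

siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),  # applies the sigmoid internally
    optimizer="adam",
)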
Lastly, a validation accuracy of 90% actually looks pretty good to me. Keep in mind that by the time the validation accuracy is calculated in the first epoch, the model has already done about 60 weight updates (batch_size = 32). That means, especially in the first epoch, a validation accuracy that is higher than the training accuracy (which is calculated during training) is to be expected. This can also sometimes cause the misbelief that training loss is increasing faster than validation loss.
EDIT
I recommended "linear" in the last layer, because tensorflow recommends it ("from_logits"=True which requires value in [-inf, inf]) for Binary Crossentropy. In my experience, it converges better.
I have a pre-trained PyTorch model. I need to calculate the gradient of the loss with respect to the network's inputs using this model (without training again and only using the pre-trained model).
I wrote the following code, but I am not sure whether it is correct.
# torch imports assumed; load_data, MyDataset, MyModel, default_transform and loss_function are user-defined
import torch
from torch.utils.data import DataLoader

test_X, test_y = load_data(mode='test')
testset_original = MyDataset(test_X, test_y, transform=default_transform)
testloader = DataLoader(testset_original, batch_size=32, shuffle=True)
model = MyModel(device=device).to(device)
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model_state_dict'])
gradient_losses = []
for i, data in enumerate(testloader):
    inputs, labels = data
    inputs = inputs.to(device)
    labels = labels.to(device)
    inputs.requires_grad = True
    output = model(inputs)
    loss = loss_function(output)
    loss.backward()
    gradient_losses.append(inputs.grad)
My question is, does this list gradient_losses actually storing what I wish to store? If not, what is the correct way to do that?
does this list gradient_losses actually storing what I wish to store?
Yes, if you are looking to get the derivative of the loss with respect to the input, then that seems to be the correct way to do it. Here is a minimal example; take f(x) = a*x. Then df/dx = a.
>>> x = torch.rand(10, requires_grad=True)
>>> y = torch.rand(10)
>>> a = torch.tensor([3.], requires_grad=True)
>>> loss = a*x - y
>>> loss.mean().backward()
>>> x.grad
tensor([0.3000, 0.3000, ..., 0.3000, 0.3000])
Which, in this case, is equal to a / len(x).
Do note, each gradient you extract with input.grad will be averaged over the whole batch, and won't be a gradient over each individual input.
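If you want the gradient of each sample's own loss, without the 1/N factor that the mean introduces, one option is to sum the per-sample losses before calling backward. A small sketch of the idea, reusing the toy example above:
import torch

x = torch.rand(10, requires_grad=True)
y = torch.rand(10)
a = torch.tensor([3.], requires_grad=True)

# summing instead of averaging keeps each sample's gradient unscaled by the batch size
loss = a * x - y
loss.sum().backward()
x.grad  # tensor of 3.0s, i.e. d(loss_i)/d(x_i) = a for every sample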
Also, you don't need to .clone() your input gradients as they are not part of the model and won't get zeroed by model.zero_grad().
I want to set up a Keras model (TensorFlow backend) for a multi-class classification problem with 4 different classes. I have both labeled and unlabeled data.
I have worked out the case in which I only train with the labeled data and my model looks something like this:
# imports assumed for this snippet
from tensorflow import keras
from tensorflow.keras import layers, optimizers

# create model
inputs = keras.Input(shape=(len(config.variables), ))
X = layers.Dense(units=200, activation="relu")(inputs)
output = layers.Dense(units=4, activation="softmax", name="output")(X)
model = keras.Model(inputs=inputs, outputs=output)
model.compile(optimizer=optimizers.Adam(1e-4), loss=loss_function, metrics=["accuracy"])
# train model
model.fit(
    x=train_data,
    y=train_class_labels,
    batch_size=200,
    epochs=200,
    verbose=2,
    validation_split=0.2,
    sample_weight=class_weights,
)
I have functioning models with two different losses, namely categorical_crossentropy and sparse_categorical_crossentropy; depending on the loss function, my train_class_labels were either in one-hot representation (e.g. [[0,1,0,0], [0,0,0,1], ...]) or in integer representation (e.g. [0,0,2,1,0,3, ...]), and everything worked fine. class_weights is some weight vector ([0.78, 1.34, ...]).
Now for my further plans I need to include the unlabeled data in the training process but I need it to be ignored by the loss function.
What I have tried:
setting the labels of the unlabeled data to [0,0,0,0] when using categorical_crossentropy as a loss, because I thought my unlabeled data would then be ignored by the loss function. Somehow this changed the predictions after training.
I also tried setting the weights of the unlabeled data to 0, but that did not have the desired effect either.
I concluded that I need to somehow mark my unlabeled data and customize my loss function so that it can be told to ignore those samples. Something like:
def custom_loss(y_true, y_pred):
    if y_true == labeled data:
        return normal loss function
    if y_true == unlabeled data:
        return 0
These are some snippets that I found, but they do not seem to work:
def custom_loss(y_true, y_pred):
    loss = losses.sparse_categorical_crossentropy(y_true, y_pred)
    return K.switch(K.flatten(K.equal(y_true, -1)), K.zeros_like(loss), loss)

def custom_loss2(y_true, y_pred):
    idx = tf.not_equal(y_true, -1)
    y_true = tf.boolean_mask(y_true, idx)
    y_pred = tf.boolean_mask(y_pred, idx)
    return losses.sparse_categorical_crossentropy(y_true, y_pred)
In those examples I set the labels from the unlabeled data to -1 so train_class_labels would look something like this: [0,-1,2,0,3, ... ]
But when using the first loss function I just get NaNs, and when using the second one I get the following error:
Invalid argument: logits and labels must have the same first dimension, got logits shape [1,5000] and labels shape [5000]
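For what it's worth, a masked variant that sidesteps both problems might look like the sketch below (an illustration only, assuming the unlabeled rows are marked with -1 and the model has 4 softmax outputs):
import tensorflow as tf

def masked_sparse_categorical_crossentropy(y_true, y_pred):
    # y_true: (batch,) integer labels, with -1 marking unlabeled samples
    y_true = tf.reshape(tf.cast(y_true, tf.int32), [-1])
    mask = tf.not_equal(y_true, -1)
    # replace -1 with a valid dummy class so the loss op never sees an invalid index
    safe_labels = tf.where(mask, y_true, tf.zeros_like(y_true))
    loss = tf.keras.losses.sparse_categorical_crossentropy(safe_labels, y_pred)
    # zero out the contribution of the unlabeled samples
    return loss * tf.cast(mask, loss.dtype)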
I think that setting the labels to [0,0,0,0] would be just fine, because the loss is calculated as the sum of the log losses of your instances per class (in your case the loss would be 0 for instances with no label).
I don't understand why you are inserting non labeled data in your training in a supervised setting.
I think that the differences that you obtain are due to the batch size and to the gradient step. If there are instances that do not contribute to the gradient descent, the loss calculated would be different than before, and then you get the difference in prediction.
Basically there would be less informative instances per batch.
If you use the size of the whole dataset as the batch size, there would be no difference from a previous training run without the unlabeled instances (provided that run also used batch size = size of the dataset).
I'm currently working with a time series dataset of 46 rows of meteorological measurements, taken approximately every 3 hours over one week. My explanatory variables (X) consist of 26 variables, and some variables have different units of measurement (degrees, millimeters, g/m3, etc.). My variable to explain (y) is a single variable, temperature.
My goal is to predict the temperature (y) over a 12h-24h window using the full set of variables (X).
For that I used Keras (TensorFlow backend) and Python, with an MLP regressor model:
# imports assumed for this snippet
import numpy
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor

X = df_forcast_cap.loc[:, ~df_forcast_cap.columns.str.startswith('l')]
X = X.drop(['temperature_Y'],axis=1)
y = df_forcast_cap['temperature_Y']
y = pd.DataFrame(data=y)
# normalize the dataset X
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit_transform(X)
normalized = scaler.transform(X)
# normalize the dataset y
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit_transform(y)
normalized = scaler.transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# define base model
def norm_model():
    # create model
    model = Sequential()
    model.add(Dense(26, input_dim=26, kernel_initializer='normal', activation='relu'))  # 26 is the number of neurons in the hidden layer
    #model.add(Dense(6, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# evaluate model with standardized dataset
estimator = KerasRegressor(build_fn=norm_model, epochs=(100), batch_size=5, verbose=1)
kfold = KFold(n_splits=10, random_state=seed)
results = cross_val_score(estimator, X, y, cv=kfold)
print(results)
[-0.00454741 -0.00323181 -0.00345096 -0.00847261 -0.00390925 -0.00334816
-0.00239754 -0.00681044 -0.02098541 -0.00140129]
# invert predictions
X_train = scaler.inverse_transform(X_train)
y_train = scaler.inverse_transform(y_train)
X_test = scaler.inverse_transform(X_test)
y_test = scaler.inverse_transform(y_test)
results = scaler.inverse_transform(results)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
Results: -0.01 (0.01) MSE
(1) I read that cross-validation is not adapted to time series prediction. So I'm wondering which other techniques exist and which one is better adapted to time series.
(2) Secondly, I decided to normalize my data because my X dataset is composed of different metrics (degrees, millimeters, g/m3, etc.) and my variable to explain y is in degrees. This way, I know that I have to deal with a more complicated interpretation of the MSE, because its result won't be in the same unit as my y variable. But for the next step of my study I need to save the y values predicted by the MLP model, and I need these values to be in degrees. So I tried to invert the normalization, but without success: when I print my results, the predicted values are still in normalized format (see my code above). Does anyone see my mistake(s)?
The model that you present above is looking at a single instance of 26 measurements to make a prediction. From your description it seems that you would like to make predictions from a sequence of these measurements. I'm not sure if I fully understood the description but I'll assume that you have a sequence of 46 measurements, each with 26 values that you believe should be good predictors of the temperature. If that is the case, the input shape of your model should be (46, 26,). The 46 here is called time_steps, 26 is the number of features.
For a time series you need to select a model design. There are two approaches: a recurrent network or a convolutional network (or a mixture of the two). A convolutional network is typically used to detect patterns in the input data that may be located somewhere within it. For instance, suppose you want to detect a given shape in an image: convolutional networks are a good starting point. Recurrent networks update their internal state after each time step. They can detect patterns as well as a convolutional network can, but you can think of them as being less position-independent.
A simple example of a convolutional approach:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import Sequential, Model
average_tmp = 0.0
model = Sequential([
    InputLayer(input_shape=(46, 26,)),
    Conv1D(16, 4),
    Conv1D(32, 4),
    Conv1D(64, 2),
    Conv1D(128, 4),
    MaxPooling1D(),
    Flatten(),
    Dense(256, activation='relu'),
    Dense(1, bias_initializer=keras.initializers.Constant(average_tmp)),
])
model.compile('adam', 'mse')
model.summary()
A mixed approach would replace the Flatten layer above with an LSTM layer. That would probably be a reasonable starting point for experimenting.
(1) I read that cross-validation is not adapted to time series prediction. So I'm wondering which other techniques exist and which one is better adapted to time series.
Cross-validation is a technique that is very well suited to this problem. If you try the example model above, I can almost guarantee that it will overfit your dataset very significantly. Cross-validation can help you determine the right regularisation parameters for your model in order to avoid overfitting.
Examples of regularisation techniques that you probably want to consider (a sketch follows this list):
Saving the model weights at the epoch with lower validation score.
Dropout and/or BatchNormalization.
kernel regularisation.
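A minimal sketch of how those three could look in Keras (layer sizes and values are illustrative, not tuned):
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.InputLayer(input_shape=(46, 26)),
    layers.Conv1D(32, 4, kernel_regularizer=regularizers.l2(1e-4)),  # kernel regularisation
    layers.BatchNormalization(),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),  # dropout
    layers.Dense(1),
])
model.compile('adam', 'mse')

# keep the weights from the epoch with the lowest validation loss
checkpoint = keras.callbacks.ModelCheckpoint(
    'best_weights.h5', monitor='val_loss', save_best_only=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[checkpoint])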
(2) Secondly, I decided to normalize my data because my X dataset is composed of different metrics (degrees, millimeters, g/m3, etc.) and my variable to explain y is in degrees.
Good call. It will avoid your model spending training cycles trying to discover the bias at very high values starting from the random initialisation.
This way, I know that I have to deal with a more complicated interpretation of the MSE, because its result won't be in the same unit as my y variable.
This is orthogonal. The inputs are not assumed to be in the same unit as y. In a DNN we assume that we can create a combination of linear transformations of the weights (plus non-linear activations); that carries no implicit assumption about units.
But for the next step of my study I need to save the y values predicted by the MLP model, and I need these values to be in degrees. So I tried to invert the normalization, but without success: when I print my results, the predicted values are still in normalized format (see my code above). Does anyone see my mistake(s)?
scaler.inverse_transform(results) should do the trick.
It doesn't make sense to inverse transform the inputs X_ and Y_. And it would probably help you keep your code straight to not use the same variable name for both the X and Y scalers.
It is also possible to refrain from scaling Y. If you choose to do so, I'd suggest that you initialise the output layer bias with the mean of the Ys.
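For illustration, keeping the two scalers apart might look like this (a sketch; variable names and the fitted model are only placeholders):
from sklearn.preprocessing import MinMaxScaler

scaler_X = MinMaxScaler(feature_range=(0, 1))
scaler_y = MinMaxScaler(feature_range=(0, 1))

X_scaled = scaler_X.fit_transform(X)
y_scaled = scaler_y.fit_transform(y)

# ... train the model on X_scaled / y_scaled ...

# predictions come back in the normalized space, so undo only the y scaling
y_pred_scaled = model.predict(X_test_scaled)  # model: a regressor fitted on the scaled data
y_pred_degrees = scaler_y.inverse_transform(y_pred_scaled.reshape(-1, 1))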