To test a nonlinear sequential model using Keras,
I made some random data x1,x2,x3
and y = a + b*x1 + c*x2^2 + d*x3^3 + e (a,b,c,d,e are constants).
The loss gets low really quickly, but the model actually predicts a pretty wrong number. I've done the same with a linear model using similar code and it worked fine. Maybe the Sequential model is designed wrong. Here is my code:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras import initializers
# y = 3*x1 + 5*x2 + 10
def gen_sequential_model():
    model = Sequential([
        Input(3, name='input_layer'),
        Dense(16, activation='relu', name='hidden_layer1', kernel_initializer=initializers.RandomNormal(mean=0.0, stddev=0.05, seed=42)),
        Dense(16, activation='relu', name='hidden_layer2', kernel_initializer=initializers.RandomNormal(mean=0.0, stddev=0.05, seed=42)),
        Dense(1, activation='relu', name='output_layer', kernel_initializer=initializers.RandomNormal(mean=0.0, stddev=0.05, seed=42)),
    ])
    # optimizer and loss are assumed here; the post does not show the compile call
    model.compile(optimizer='adam', loss='mse')
    return model
def gen_linear_regression_dataset(numofsamples=500, a=3, b=5, c=7, d=9, e=11):
    X = np.random.rand(numofsamples,3)
    # y = a + bx1 + cx2^2 + dx3^3+ e
    for idx in range(numofsamples):
        X[idx][1] = X[idx][1]**2
        X[idx][2] = X[idx][2]**3
    coef = np.array([b,c,d])
    bias = e
    y = a + np.matmul(X,coef.transpose()) + bias
    return X, y
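def plot_loss_curve(history):
    import matplotlib.pyplot as plt
    plt.figure(figsize = (15,10))
    # one curve per legend entry below
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.legend(['train','test'],loc = 'upper right')
    plt.show()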
def predict_new_sample(model, x, a=3, b=5, c=7, d=9, e=11):
    x = x.reshape(1,3)
    y_pred = model.predict(x)[0][0]
    y_actual = a + b*x[0][0] + c*(x[0][1]**2) + d*(x[0][2]**3) + e
    print("y actual value: ", y_actual)
    print("y pred value: ", y_pred)
model = gen_sequential_model()
X,y = gen_linear_regression_dataset(numofsamples=2000)
history = model.fit(X,y,epochs = 100, verbose=2, validation_split=0.3)
predict_new_sample(model, np.array([0.7,0.5,0.5]))
Epoch 99/100
44/44 - 0s - loss: 1.0631e-10 - val_loss: 9.9290e-11
Epoch 100/100
44/44 - 0s - loss: 1.0335e-10 - val_loss: 9.3616e-11
y actual value: 20.375
y pred value: 25.50001
Why is my predicted value so different from the real value?
Despite the improper use of activation = 'relu' in the last layer and the use of non-recommended kernel initializations, your model works fine, and the reported metrics are true and not flukes.
The problem is not in the model; the problem is that your data generating function does not return what you intend it to return.
First, in order to see that your model indeed learns what you have asked it to learn, let's run your code as is and then use your data generating function to produce a sample:
X, y_true = gen_linear_regression_dataset(numofsamples=1)
print(X, y_true)
# result:
[[0.37454012 0.90385769 0.39221343]] [25.72962531]
So for this particular X, the true output is 25.72962531; let's now pass this X to the model using your predict_new_sample function:
predict_new_sample(model, X)
# result:
y actual value: 22.134424269890232
y pred value: 25.729633
Well, the predicted output 25.729633 is extremely close to the true one as calculated above (25.72962531); thing is, your function thinks that the true output should be 22.134424269890232, which is demonstrably not the case.
What has happened is that your gen_linear_regression_dataset function returns the data X after you have calculated the squared and cubic components, which is not what you want; you want the returned data X to be before calculating the square & cube components, so that your model learns how to do this itself.
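To make the mismatch concrete, here is a small NumPy sketch (no Keras involved) that reproduces the two "actual" values above by hand. The names x_raw, x_returned and target are mine, chosen only for illustration; the constants are the defaults from the question.

```python
import numpy as np

a, b, c, d, e = 3, 5, 7, 9, 11

# Raw random inputs, i.e. what the model should have been given
x_raw = np.array([0.37454012, 0.95071431, 0.73199394])
# What the original generator returned instead: x2 already squared, x3 already cubed
x_returned = np.array([x_raw[0], x_raw[1] ** 2, x_raw[2] ** 3])

def target(x):
    """y = a + b*x1 + c*x2^2 + d*x3^3 + e, meant to be applied to raw inputs."""
    return a + b * x[0] + c * x[1] ** 2 + d * x[2] ** 3 + e

# Label the generator attached to this sample (and that the model learned)
y_training_label = target(x_raw)
# Value predict_new_sample computes when it is handed the returned X:
# the square and cube get applied a second time
y_checked_against = target(x_returned)

print(y_training_label, y_checked_against)
```

The first number matches the label 25.7296 and the second matches the 22.1344 reported by predict_new_sample, so the model and the check were simply not using the same definition of X.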
So, you need to change the function as follows:
def gen_linear_regression_dataset(numofsamples=500, a=3, b=5, c=7, d=9, e=11):
    X_init = np.random.rand(numofsamples,3) # data to be returned
    # y = a + bx1 + cx2^2 + dx3^3+ e
    X = X_init.copy() # temporary data
    for idx in range(numofsamples):
        X[idx][1] = X[idx][1]**2
        X[idx][2] = X[idx][2]**3
    coef = np.array([b,c,d])
    bias = e
    y = a + np.matmul(X,coef.transpose()) + bias
    return X_init, y
After modifying the function and re-training the model (you'll notice that the validation error ends up somewhat higher, ~ 1.3), we have
X, y_true = gen_linear_regression_dataset(numofsamples=1)
print(X)
# result:
[[0.37454012 0.95071431 0.73199394]]
predict_new_sample(model, X)
# result:
y actual value: 25.729625308532768
y pred value: 25.443237
which is consistent. You will still not be getting perfect predictions of course, especially for unseen data (and remember that the error is now higher):
predict_new_sample(model, np.array([0.07,0.6,0.5]))
# result:
y actual value: 17.995
y pred value: 19.69147
As commented briefly above, you should really change your model to get rid of the kernel initializers (i.e. use the default, recommended ones) and use the correct activation function for your last layer:
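def gen_sequential_model():
    model = Sequential([
        Input(3, name='input_layer'),
        Dense(16, activation = 'relu', name = 'hidden_layer1'),
        Dense(16, activation = 'relu', name = 'hidden_layer2'),
        Dense(1, activation = 'linear', name = 'output_layer'),
    ])
    # optimizer and loss are assumed here; the compile call is not shown in the question
    model.compile(optimizer='adam', loss='mse')
    return model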
You'll discover that you get a better validation error and better predictions:
predict_new_sample(model, np.array([0.07,0.6,0.5]))
# result:
y actual value: 17.995
y pred value: 18.272991
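As a side note on why 'relu' is considered the wrong choice for a regression output, here is a tiny NumPy sketch. The pre-activation values are made up for illustration. In this particular problem all targets are positive, which is probably why the original model still trained, but in general a relu output can never produce a negative value and passes no gradient back whenever its input is negative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # derivative of relu with respect to its input: 1 where z > 0, else 0
    return (z > 0).astype(float)

# Hypothetical pre-activations of the output unit for four samples
pre_activation = np.array([-2.0, -0.5, 0.5, 2.0])

relu_out = relu(pre_activation)          # negative values are clipped to 0
linear_out = pre_activation              # 'linear' passes everything through
grad_through_relu = relu_grad(pre_activation)

print(relu_out, linear_out, grad_through_relu)
```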
Nice catch from @desertnaut.
Just to add a few things on top of @desertnaut's solution that seem to improve the results:
Scale your data (even though it is already in the 0-1 range, it seems to add a little boost)
Add Dropout between layers
Increase number of epochs (150 -200 ?)
Add reduce learning rate on plateau (give it some try)
Add more units to the layers
import tensorflow as tf
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                                                 patience=5, min_lr=0.001)
def gen_sequential_model():
    model = Sequential([
        Dense(64, activation = 'relu', name = 'hidden_layer1'),
        Dense(64, activation = 'relu', name = 'hidden_layer2'),
        Dense(1, name = 'output_layer')
    ])
    # optimizer and loss are assumed here, as in the question
    model.compile(optimizer='adam', loss='mse')
    return model
model = gen_sequential_model()
history = model.fit(X, y, epochs = 200, verbose=2, validation_split=0.2, callbacks=[reduce_lr])
predict_new_sample(model, x=np.array([0.07, 0.6, 0.5]))
y actual value: 17.995
y pred value: 17.710054
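For anyone wondering what "reduce learning rate on plateau" does, here is a simplified pure-Python sketch of the rule. It is my own approximation for illustration, not the Keras implementation, and plateau_schedule is a hypothetical helper. One thing it makes visible: if the optimizer already starts at 0.001 (the Keras default for Adam, assuming that is what the model is compiled with), then min_lr=0.001 leaves the callback no room to reduce anything, so a smaller min_lr is worth trying.

```python
def plateau_schedule(val_losses, lr, factor=0.2, patience=5, min_lr=0.0):
    """Simplified reduce-on-plateau rule.

    Returns the learning rate in effect after each epoch.
    """
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # shrink the learning rate, but never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Validation loss that stops improving after the first epoch
stalled = [1.0] * 12

with_room = plateau_schedule(stalled, lr=0.01, factor=0.2, patience=5, min_lr=0.0)
no_room = plateau_schedule(stalled, lr=0.001, factor=0.2, patience=5, min_lr=0.001)

print(with_room)
print(no_room)
```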
I have the following model:
def make_model(lr_rate=0.001):
    inp = tf.keras.Input(shape=(height,width,depth))
    x = tf.keras.layers.Conv2D(64, (4, 4),strides=(1,1),use_bias=False,padding='same',name='Conv1')(inp)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
    x = tf.keras.layers.Conv2D(32, (3, 3),strides=(1,1),use_bias=False,padding='same',name='Conv2')(x)
    x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
    x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(512)(x)
    x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
    x = tf.keras.layers.Dense(S1 * S2 * 13,name='scores',activation = 'softmax')(x)
    x = tf.keras.layers.Reshape((S1,S2,13))(x)
    model = tf.keras.Model(inputs=inp,outputs = x)
    optimizer_ = tf.keras.optimizers.Adam(learning_rate=lr_rate)
    return model
I am using a custom weighted categorical crossentropy (CCC) loss, because the "normal" CCC does not work with weighting in my case: my output has too many dimensions. The loss itself seems to work fine when I call it on a test sample, but for some reason it sometimes turns NaN:
def weighted_categorical_crossentropy(weights = the_weights):
    weights = K.variable(weights)
    def loss(y_true, y_pred):
        # scale predictions so that the class probas of each sample sum to 1
        y_pred /= K.sum(y_pred[...,:], axis=-1, keepdims=True)
        # clip to prevent NaN's and Inf's
        y_pred = K.clip(y_pred[...,:], K.epsilon(), 1 - K.epsilon())
        # calc
        loss = y_true[...,:] * K.log(y_pred[...,:]) * weights
        loss = K.sum(-K.sum(loss, 2))
        return loss
    return loss
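For checking the loss by hand on a test sample, a NumPy mirror of the same arithmetic can help. This is only a sketch: weighted_cce_numpy is a hypothetical name, the toy shapes and weights are made up, and eps=1e-7 assumes the default K.epsilon(). It also shows one property worth knowing: the clipping step does not remove NaNs that are already present in y_pred, so a NaN coming out of the network passes straight through the loss.

```python
import numpy as np

def weighted_cce_numpy(y_true, y_pred, weights, eps=1e-7):
    """NumPy mirror of the Keras loss above (same steps, same order)."""
    # scale predictions so the class probabilities of each cell sum to 1
    y_pred = y_pred / y_pred.sum(axis=-1, keepdims=True)
    # clip to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # weighted log-likelihood, summed over everything
    return -np.sum(y_true * np.log(y_pred) * weights)

# Toy example: batch of 1, a 1x2 grid, 3 classes (hypothetical sizes)
weights = np.array([1.0, 2.0, 3.0])
y_true = np.array([[[[1, 0, 0], [0, 0, 1]]]], dtype=float)
y_pred = np.full((1, 1, 2, 3), 1.0 / 3.0)

loss_value = weighted_cce_numpy(y_true, y_pred, weights)

# If the network output is already NaN, the loss is NaN despite the clip
nan_pred = np.full_like(y_pred, np.nan)
nan_loss = weighted_cce_numpy(y_true, nan_pred, weights)

print(loss_value, nan_loss)
```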
During training the first few epochs work fine, but at some point (usually between epoch 5-10) I suddenly get a nan loss and the training crashes and throws the following exception:
InvalidArgumentError: Graph execution error:
Detected at node 'assert_greater_equal/Assert/AssertGuard/Assert' defined at (most recent call last):
Node: 'assert_greater_equal/Assert/AssertGuard/Assert'
assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] [x (model_10/reshape_10/Reshape:0) = ] [[[[-nan -nan -nan...]]]...] [y (Cast_4/x:0) = ] [0]
[[{{node assert_greater_equal/Assert/AssertGuard/Assert}}]] [Op:__inference_train_function_70732]
I think it has something to do with my activation functions: the loss turns NaN, and training then crashes because of the metrics.
So I really need to find out how to stop my loss from returning NaN. I have this issue with both ReLU and LeakyReLU, but with LeakyReLU my model trains better (or at least quicker in the first few epochs). I have already checked my data and there are no NaN values in it.
I am using TensorFlow/Keras and I would like to use the input in the loss function,
as per the answer here:
Custom loss function in Keras based on the input data
I have created my loss function as follows:
def custom_Loss_with_input(inp_1):
    def loss(y_true, y_pred):
        b = K.mean(inp_1)
        return y_true - b
    return loss
and set up the model with the layers and all ending like this
model = Model(inp_1, x)
model.compile(loss=custom_Loss_with_input(inp_1), optimizer= Ada)
return model
Nevertheless, i get the following error:
TypeError: Cannot convert a symbolic Keras input/output to a numpy array. This error may indicate that you're trying to pass a symbolic value to a NumPy call, which is not supported. Or, you may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model.
Any advice on how to eliminate this error?
Thanks in advance
You can use add_loss to pass external layers to your loss, in your case the input tensor.
Here is an example:
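import numpy as np
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
def CustomLoss(y_true, y_pred, input_tensor):
    b = K.mean(input_tensor)
    return K.mean(K.square(y_true - y_pred)) + b
X = np.random.uniform(0,1, (1000,10))
y = np.random.uniform(0,1, (1000,1))
inp = Input(shape=(10,))
hidden = Dense(32, activation='relu')(inp)
out = Dense(1)(hidden)
# the target is fed as a second input so that the loss can see it
target = Input((1,))
model = Model([inp,target], out)
model.add_loss( CustomLoss( target, out, inp ) )
model.compile(loss=None, optimizer='adam')
model.fit(x=[X,y], y=None, epochs=3)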
If your loss is composed of different parts and you want to track them you can add different losses corresponding to the loss parts. In this way, the losses are printed at the end of each epoch and are stored in model.history.history. Remember that the final loss minimized during training is the sum of the various loss parts.
def ALoss(y_true, y_pred):
    return K.mean(K.square(y_true - y_pred))
def BLoss(input_tensor):
    b = K.mean(input_tensor)
    return b
X = np.random.uniform(0,1, (1000,10))
y = np.random.uniform(0,1, (1000,1))
inp = Input(shape=(10,))
hidden = Dense(32, activation='relu')(inp)
out = Dense(1)(hidden)
target = Input((1,))
model = Model([inp,target], out)
model.add_loss(ALoss( target, out ))
model.add_metric(ALoss( target, out ), name='a_loss')
model.add_loss(BLoss( inp ))
model.add_metric(BLoss( inp ), name='b_loss')
model.compile(loss=None, optimizer='adam')
model.fit(x=[X,y], y=None, epochs=3)
To use the model in inference mode (removing the target from inputs):
final_model = Model(model.input[0], model.output)
I'm trying to implement a loss function that uses the representations of the intermediate layers. As far as I know, a Keras custom loss function only accepts two input arguments (y_true and y_pred). How can I define a loss function with @tf.function and use it for a model that has been defined via Keras?
Any help would be appreciated.
This is a simple workaround to pass additional variables to your loss function. In our case, we pass the hidden output of one of our layers (x1). This output can be used to do something inside the loss function (I do a dummy operation).
def mse(y_true, y_pred, hidden):
    error = y_true-y_pred
    return K.mean(K.square(error)) + K.mean(hidden)
X = np.random.uniform(0,1, (1000,10))
y = np.random.uniform(0,1, 1000)
inp = Input((10,))
true = Input((1,))
x1 = Dense(32, activation='relu')(inp)
x2 = Dense(16, activation='relu')(x1)
out = Dense(1)(x2)
m = Model([inp,true], out)
m.add_loss( mse( true, out, x1 ) )
m.compile(loss=None, optimizer='adam')
m.fit(x=[X, y], y=None, epochs=3)
## final fitted model to compute predictions
final_m = Model(inp, out)
So my question is, if I have something like:
model = Model(inputs = input, outputs = [y1,y2])
model.compile(loss = my_loss ...)
I have only seen my_loss as a dictionary of independent losses, with the final loss defined as the sum of those. But in a multitask model, can I define a single loss function that takes all the predicted/true values, so that I can multiply them (for instance)?
This is the loss I am trying to define:
def my_loss(y_true1, y_true2, y_pred1, y_pred2):
    final_loss = binary_crossentropy(y_true1, y_pred1) + y_true1 * categorical_crossentropy(y_true2, y_pred2)
    return final_loss
Usually, the parameters of the loss function are y_true and y_pred, where y_pred is either y1 or y2. But now I need both to compute the loss, so how can I define this loss function and pass all the parameters to it: y_true1, y_true2, y_pred1, y_pred2?
My current model that I want to change its loss:
x = Input(shape=(n, ))
shared = Dense(32)(x)
sub1 = Dense(16)(shared)
sub2 = Dense(16)(shared)
y1 = Dense(1)(sub1, activation='sigmoid')
y2 = Dense(4)(sub2, activation='softmax')
model = Model(inputs = input, outputs = [y1,y2])
model.compile(loss = ['binary_crossentropy', 'categorical_crossentropy'] ...) #THIS LINE I WANT TO CHANGE IT
I'm not sure if I'm understanding correctly, but I'll try.
The loss function must contain both the predicted and the actual data -- it's a way to measure the error between what your model is predicting and the true data. However, the predicted and actual data do not need to be one-dimensional. You can make y_pred a tensor that contains both y_pred1 and y_pred2. Likewise, y_true can be a tensor that contains both y_true1 and y_true2.
As far as I know, loss functions should return a single number. That's why loss functions often have a mean or a sum to add up all of the losses for individual data points.
Here's an example of mean square error that will work for more than 1D:
import keras.backend as K
def my_loss(y_true, y_pred):
    # this example is mean squared error
    # works if y_pred and y_true are greater than 1D
    return K.mean(K.square(y_pred - y_true))
Here's another example of a loss function that I think is closer to your question (although I cannot comment on whether or not it's a good loss function):
def my_loss(y_true, y_pred):
    # calculate mean(abs(y_pred1*y_pred2 - y_true1*y_true2))
    # this will work for 2D inputs of y_pred and y_true
    return K.mean(K.abs(K.prod(y_pred, axis = 1) - K.prod(y_true, axis = 1)))
You can concatenate two outputs into a single tensor with keras.layers.Concatenate. That way you can still have a loss function with only two arguments.
In the model you wrote above, the y1 output shape is (None, 1) and the y2 output shape is (None, 4). Here's an example of how you could write your model so that the output is a single tensor that concatenates y1 and y2 into a shape of (None, 5):
from keras import Model
from keras.layers import Input, Dense
from keras.layers import Concatenate
input_layer = Input(shape=(n, ))
shared = Dense(32)(input_layer)
sub1 = Dense(16)(shared)
sub2 = Dense(16)(shared)
y1 = Dense(1, activation='sigmoid')(sub1)
y2 = Dense(4, activation='softmax')(sub2)
mergedOutput = Concatenate()([y1, y2])
model = Model(inputs=input_layer, outputs=mergedOutput)
Below, I show an example for how you could rewrite your loss function. I wasn't sure which of the 5 columns of the output to call y_true1 vs. y_true2, so I guessed that y_true1 was column 1 and y_true2 was the remaining 4 columns. The same column structure would apply to y_pred1 and y_pred2.
from keras import losses
def my_loss(y_true, y_pred):
    final_loss = (losses.binary_crossentropy(y_true[:, 0], y_pred[:, 0]) +
                  y_true[:, 0] *
                  losses.categorical_crossentropy(y_true[:, 1:], y_pred[:,1:]))
    return final_loss
Finally, you can compile the model without any major changes from normal:
model.compile(optimizer='adam', loss=my_loss)
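If you want to sanity-check the column slicing before training, a NumPy version of the same per-sample formula is easy to evaluate by hand. This is only a sketch under my column assumption (column 0 is the binary task, columns 1 to 4 are the 4-class task); combined_loss_numpy is a hypothetical helper and it does not reproduce Keras's exact numerics or reduction.

```python
import numpy as np

def combined_loss_numpy(y_true, y_pred, eps=1e-7):
    """Per-sample loss: BCE on column 0 plus y_true1 * CCE on columns 1..4."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    t1, p1 = y_true[:, 0], y_pred[:, 0]
    t2, p2 = y_true[:, 1:], y_pred[:, 1:]
    bce = -(t1 * np.log(p1) + (1 - t1) * np.log(1 - p1))
    cce = -np.sum(t2 * np.log(p2), axis=1)
    # the categorical term only counts for samples where y_true1 == 1
    return bce + t1 * cce

# Two toy samples: the first has y_true1 = 1, the second y_true1 = 0
y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 0, 0, 0, 1]], dtype=float)
y_pred = np.array([[0.5, 0.25, 0.25, 0.25, 0.25],
                   [0.5, 0.25, 0.25, 0.25, 0.25]])

per_sample = combined_loss_numpy(y_true, y_pred)
print(per_sample)
```

For the first sample both terms contribute (log 2 + log 4), while for the second only the binary term does (log 2), which is the gating behaviour the question asks for.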
I'm building a simple neural network that takes 3 values and gives 2 outputs.
I'm getting an accuracy of 67.5% and an average cost of 0.05
I have a training dataset of 1000 examples and 500 testing examples. I plan on making a larger dataset in the near future.
A little while ago I managed to get an accuracy of about 82% and sometimes a bit higher, but the cost was quite high.
I've been experimenting with adding another layer which is currently in the model and that is the reason I have got the loss under 1.0
I'm not sure what is going wrong, I'm new to Tensorflow and NNs in general.
Here is my code:
import tensorflow as tf
import numpy as np
import sys
sys.path.insert(0, '.../Dataset/Testing/')
sys.path.insert(0, '.../Dataset/Training/')
#other files
from TestDataNormaliser import *
from TrainDataNormaliser import *
learning_rate = 0.01
trainingIteration = 10
batchSize = 100
displayStep = 1
x = tf.placeholder("float", [None, 3])
y = tf.placeholder("float", [None, 2])
#layer 1
w1 = tf.Variable(tf.truncated_normal([3, 4], stddev=0.1))
b1 = tf.Variable(tf.zeros([4]))
y1 = tf.matmul(x, w1) + b1
#layer 2
w2 = tf.Variable(tf.truncated_normal([4, 4], stddev=0.1))
b2 = tf.Variable(tf.zeros([4]))
#y2 = tf.nn.sigmoid(tf.matmul(y1, w2) + b2)
y2 = tf.matmul(y1, w2) + b2
w3 = tf.Variable(tf.truncated_normal([4, 2], stddev=0.1))
b3 = tf.Variable(tf.zeros([2]))
y3 = tf.nn.sigmoid(tf.matmul(y2, w3) + b3) #sigmoid
#wO = tf.Variable(tf.truncated_normal([2, 2], stddev=0.1))
#bO = tf.Variable(tf.zeros([2]))
a = y3 #tf.nn.softmax(tf.matmul(y2, wO) + bO) #y2
a_ = tf.placeholder("float", [None, 2])
#cost function
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y * tf.log(a)))
#cross_entropy = -tf.reduce_sum(y*tf.log(a))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
init = tf.global_variables_initializer() #initialises tensorflow
with tf.Session() as sess:
    sess.run(init) #runs the initialiser
    writer = tf.summary.FileWriter(".../Logs")
    merged_summary = tf.summary.merge_all()
    for iteration in range(trainingIteration):
        avg_cost = 0
        totalBatch = int(len(trainArrayValues)/batchSize) #1000/100
        #totalBatch = 10
        for i in range(batchSize):
            start = i
            end = i + batchSize #100
            xBatch = trainArrayValues[start:end]
            yBatch = trainArrayLabels[start:end]
            #feeding training data
            sess.run(optimizer, feed_dict={x: xBatch, y: yBatch})
            i += batchSize
            avg_cost += sess.run(cross_entropy, feed_dict={x: xBatch, y: yBatch})/totalBatch
        if iteration % displayStep == 0:
            print("Iteration:", '%04d' % (iteration + 1), "cost=", "{:.9f}".format(avg_cost))
    print("Training complete")
    predictions = tf.equal(tf.argmax(a, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(predictions, "float"))
    print("Accuracy:", accuracy.eval({x: testArrayValues, y: testArrayLabels}))
A few important notes:
You don't have non-linearities between your layers. This means you're training a network which is equivalent to a single-layer network, just with a lot of wasted computation. This is easily solved by adding a simple non-linearity, e.g. tf.nn.relu after each matmul/+ bias line, e.g. y2 = tf.nn.relu(y2) for all bar the last layer.
You are using a numerically unstable cross entropy implementation. I'd encourage you to use tf.nn.sigmoid_cross_entropy_with_logits and remove your explicit sigmoid call (the input to your sigmoid function is what is generally referred to as the logits, or 'logistic units').
It seems you are not shuffling your dataset as you go. This could be particularly bad given your choice of optimizer, which leads us to...
Stochastic gradient descent is not great. For a boost without adding too much complication, consider using MomentumOptimizer instead. AdamOptimizer is my go-to, but play around with them.
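The first point (no non-linearities means the stack is equivalent to a single layer) is easy to verify numerically. The sketch below uses NumPy with random weights of the same shapes as in the question (3 to 4 to 4 to 2) and ignores the final sigmoid, which is applied after the last layer in both cases; all variable names are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three linear layers with the shapes from the question
w1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
w2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)
w3, b3 = rng.normal(size=(4, 2)), rng.normal(size=2)

x = rng.normal(size=(5, 3))  # a batch of 5 samples

# Output of the stacked layers (before the final sigmoid)
stacked = ((x @ w1 + b1) @ w2 + b2) @ w3 + b3

# The same map written as ONE linear layer
w = w1 @ w2 @ w3
b = (b1 @ w2 + b2) @ w3 + b3
single = x @ w + b

print(np.allclose(stacked, single))
```

So without something like tf.nn.relu in between, the extra layers add parameters but no extra expressive power.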
When it comes to writing clean, maintainable code, I'd also encourage you to consider the following:
Use higher level APIs, e.g. tf.layers. It's good you know what's going on at a variable level, but it's easy to make a mistake with all that replicated code, and the default values with the layer implementations are generally pretty good
Consider using the tf.data.Dataset API for your data input. It's a bit scary at first, but it handles a lot of things like batching, shuffling, repeating epochs etc. very nicely
Consider using something like the tf.estimator.Estimator API for handling session runs, summary writing and evaluation.
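To illustrate what the input pipeline takes care of, here is a rough NumPy sketch of shuffled, non-overlapping mini-batches for one epoch. iterate_minibatches is a hypothetical helper and the toy data is made up. Compare it with the loop in the question, which (as far as I can tell) sets start = i for i in range(batchSize), and so trains on heavily overlapping windows over roughly the first 200 rows rather than on the whole dataset.

```python
import numpy as np

def iterate_minibatches(features, labels, batch_size, rng):
    """Yield shuffled, non-overlapping mini-batches covering one epoch."""
    order = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]

rng = np.random.default_rng(0)
features = np.arange(20, dtype=float).reshape(10, 2)  # 10 toy samples, 2 features
labels = np.arange(10)

batches = list(iterate_minibatches(features, labels, batch_size=4, rng=rng))

# Every sample appears exactly once per epoch
seen = sorted(int(v) for _, batch_labels in batches for v in batch_labels)
print(len(batches), seen)
```

Calling the generator again with the same rng gives a different order for the next epoch, which is the shuffling-as-you-go mentioned above.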
With all those changes, you might have something that looks like the following (I've left your code in so you can roughly see the equivalent lines).
For graph construction:
def get_logits(features):
    """tf.layers API is cleaner and has better default values."""
    # #layer 1
    # w1 = tf.Variable(tf.truncated_normal([3, 4], stddev=0.1))
    # b1 = tf.Variable(tf.zeros([4]))
    # y1 = tf.matmul(x, w1) + b1
    x = tf.layers.dense(features, 4, activation=tf.nn.relu)
    # #layer 2
    # w2 = tf.Variable(tf.truncated_normal([4, 4], stddev=0.1))
    # b2 = tf.Variable(tf.zeros([4]))
    # y2 = tf.matmul(y1, w2) + b2
    x = tf.layers.dense(x, 4, activation=tf.nn.relu)
    # w3 = tf.Variable(tf.truncated_normal([4, 2], stddev=0.1))
    # b3 = tf.Variable(tf.zeros([2]))
    # y3 = tf.nn.sigmoid(tf.matmul(y2, w3) + b3) #sigmoid
    # N.B Don't take a non-linearity here.
    logits = tf.layers.dense(x, 1, activation=None)
    # remove unnecessary final dimension, batch_size * 1 -> batch_size
    logits = tf.squeeze(logits, axis=-1)
    return logits
def get_loss(logits, labels):
    """tf.nn.sigmoid_cross_entropy_with_logits is numerically stable."""
    # #cost function
    # cross_entropy = tf.reduce_mean(-tf.reduce_sum(y * tf.log(a)))
    # average the per-example losses to get a scalar loss
    return tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        logits=logits, labels=labels))
def get_train_op(loss):
    """There are better options than standard SGD. Try the following."""
    learning_rate = 1e-3
    # optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    # MomentumOptimizer requires a momentum value; 0.9 is a common choice
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    # optimizer = tf.train.AdamOptimizer(learning_rate)
    return optimizer.minimize(loss)
def get_inputs(feature_data, label_data, batch_size, n_epochs=None,
               shuffle=True):
    """
    Get features and labels for training/evaluation.

    feature_data: numpy array of feature data.
    label_data: numpy array of label data
    batch_size: size of batch to be returned
    n_epochs: number of epochs to train for. None will result in repeating
        forever/until stopped
    shuffle: bool flag indicating whether or not to shuffle.
    """
    dataset = tf.data.Dataset.from_tensor_slices(
        (feature_data, label_data))
    dataset = dataset.repeat(n_epochs)
    if shuffle:
        dataset = dataset.shuffle(len(feature_data))
    dataset = dataset.batch(batch_size)
    features, labels = dataset.make_one_shot_iterator().get_next()
    return features, labels
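On the numerical stability point: the sketch below compares a naive sigmoid cross entropy with the rearranged formula max(x, 0) - x*z + log(1 + exp(-|x|)) that the TensorFlow documentation gives for tf.nn.sigmoid_cross_entropy_with_logits (x are the logits, z the labels). It is plain NumPy with hypothetical helper names, just to show where the naive version breaks.

```python
import numpy as np

def naive_sigmoid_ce(logits, labels):
    # sigmoid first, then log: breaks once the sigmoid saturates to exactly 0 or 1
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def stable_sigmoid_ce(logits, labels):
    # algebraically the same quantity, but never takes log of 0
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

# A moderate logit: both versions agree
moderate_logit, positive_label = np.array([2.0]), np.array([1.0])
naive_ok = naive_sigmoid_ce(moderate_logit, positive_label)
stable_ok = stable_sigmoid_ce(moderate_logit, positive_label)

# A confidently wrong prediction: the naive version overflows to inf
large_logit, negative_label = np.array([40.0]), np.array([0.0])
with np.errstate(divide="ignore"):
    naive_bad = naive_sigmoid_ce(large_logit, negative_label)
stable_bad = stable_sigmoid_ce(large_logit, negative_label)

print(naive_ok, stable_ok, naive_bad, stable_bad)
```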
For session running you could use this like you have (what I'd call 'the hard way')...
features, labels = get_inputs(
    trainArrayValues, trainArrayLabels, batchSize, n_epochs=trainingIteration, shuffle=True)
logits = get_logits(features)
loss = get_loss(logits, labels)
train_op = get_train_op(loss)
init = tf.global_variables_initializer()
# monitored sessions have the `should_stop` method, which works with datasets
with tf.train.MonitoredSession() as sess:
    while not sess.should_stop():
        # get both loss and optimizer step in the same session run
        loss_val, _ = sess.run([loss, train_op])
# save variables etc, do evaluation in another graph with different inputs?
but I think you're better off using a tf.estimator.Estimator, though some people prefer tf.keras.Models.
def model_fn(features, labels, mode):
    logits = get_logits(features)
    loss = get_loss(logits, labels)
    train_op = get_train_op(loss)
    predictions = tf.greater(logits, 0)
    accuracy = tf.metrics.accuracy(labels, predictions)
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, train_op=train_op,
        eval_metric_ops={'accuracy': accuracy}, predictions=predictions)
def train_input_fn():
    return get_inputs(trainArrayValues, trainArrayLabels, batchSize)
def eval_input_fn():
    return get_inputs(
        testArrayValues, testArrayLabels, batchSize, n_epochs=1, shuffle=False)
# Where variables and summaries will be saved to
model_dir = './model'
estimator = tf.estimator.Estimator(model_fn, model_dir)
estimator.train(train_input_fn, max_steps=max_steps)
Note if you use estimators the variables will be saved after training, so you won't need to re-train each time. If you want to reset, just delete the model_dir.
I see that you are using a softmax loss with sigmoidal activation functions in the last layer. Now let me explain the difference between softmax activations and sigmoidal.
You are now allowing the output of the network to be y=(0, 1), y=(1, 0), y=(0, 0) and y=(1, 1). This is because your sigmoidal activations "squish" each element in y between 0 and 1. Your loss function, however, assumes that your y vector sums to one.
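What you need to do here is either to use the sigmoidal cross entropy function, which penalises each element of the output independently and looks like this:
cross_entropy = -tf.reduce_sum(y * tf.log(a) + (1 - y) * tf.log(1 - a))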
Or, if you want a to sum to one, you need to use softmax activations in your final layer (to get your a's) instead of sigmoids, which is implemented like this
# y3 here should be the output of the last layer without the sigmoid
exp_out = tf.exp(y3)
a = exp_out / tf.reduce_sum(exp_out, axis=1, keepdims=True)
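A quick NumPy check of the difference, with made-up logits for two samples: each sigmoid output is squashed into (0, 1) on its own, so the pair does not have to sum to one, whereas the row-wise softmax always does. The helper names are mine, and the max subtraction in softmax is only there to avoid overflow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtract the row maximum for numerical safety; the result is unchanged
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

# Hypothetical last-layer outputs for two samples with two classes
logits = np.array([[2.0, 1.0],
                   [-1.0, -3.0]])

sigmoid_sums = sigmoid(logits).sum(axis=1)
softmax_sums = softmax(logits).sum(axis=1)

print(sigmoid_sums, softmax_sums)
```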
Ps. I'm using my phone on a train so please excuse typos