I am trying to create a very simple neural network that reads in information with the shape 1x2048 and classifies it into two categories (object or not object). The graph structure, however, deviates from what I believe I have coded. The dense layers should be included in the scope of "inner_layer" and should be receiving their input from the "input" placeholder. Instead, TF seems to be treating them as independent layers which do not receive any information from "input".
Also, when trying to use TensorBoard summaries I get an error telling me that I have not fed inputs for the apparent placeholders of the dense layers. When omitting TensorBoard, everything works as I expected based on the code.
I have spent a lot of time trying to find the problem, but I think I must be overlooking something very basic.
The graph I get in tensorboard is on this image.
Which corresponds to the following code:
keep_prob = 0.5
# Graph Structure
## Placeholders for input
with tf.name_scope('input'):
    x_ = tf.placeholder(tf.float32, shape = [None, transfer_values_train.shape[1]], name = "input1")
    y_ = tf.placeholder(tf.float32, shape = [None, num_classes], name = "labels")
## Dense Layer one with 2048 nodes
with tf.name_scope('inner_layers'):
    first_layer = tf.layers.dense(x_, units = 2048, activation=tf.nn.relu, name = "first_dense")
    dropout_layer = tf.nn.dropout(first_layer, keep_prob, name = "dropout_layer")
    #readout layer, without softmax
    y_conv = tf.layers.dense(dropout_layer, units = 2, activation=tf.nn.relu, name = "second_dense")
# Evaluation and training
with tf.name_scope('cross_entropy'):
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels = y_ , logits = y_conv),
                                   name = "cross_entropy_layer")
with tf.name_scope('trainer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
with tf.name_scope('accuracy'):
    prediction = tf.argmax(y_conv, axis = 1)
    correct_prediction = tf.equal(prediction, tf.argmax(y_, axis = 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Does anyone have an idea why the graph is so different from what you would expect based on the code?
The graph rendering in tensorboard may be a bit confusing (initially), but it's correct. Take a look at this picture where I've left only the inner_layers part of your graph:
You may notice that:
The first_dense and second_dense are actually the name scopes themselves (generated by tf.layers.dense function; see also this question).
Their input/output tensors are inside the inner_layers scope and wire correctly to the dropout_layer. Here, in each of dense layers, live the corresponding linear ops: MatMul, BiasAdd, Relu.
Both scopes also include the variables (kernel and bias each), that are shown separately from inner_layers. They encapsulate the ops related specifically to variable, such as read, assign, initialize, etc. The linear ops in first_dense depend on the variable ops of first_dense, and second_dense likewise.
The reason for this separation is that in distributed settings the variables are managed by a different task called the parameter server. It usually runs on a different device (CPU as opposed to GPU), sometimes even on a different machine. In other words, in TensorFlow variable management is by design separate from matrix computation.
Having said that, I'd love to see a mode in tensorflow that would not split the scope into variables and ops and keep them coupled.
Other than this the graph perfectly matches the code.
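The split between variable names and op names can be reproduced in a few lines. This is a minimal sketch, assuming TensorFlow 2.x with the tf.compat.v1 graph-mode API available. It creates a variable directly with tf.compat.v1.get_variable instead of rebuilding the asker's model, and the names x, kernel and out are illustrative only. It shows that tf.name_scope prefixes ordinary ops but not variables created through get_variable, which matches the separation TensorBoard draws.

```python
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    x = tf.compat.v1.placeholder(tf.float32, shape=[None, 4], name="x")
    with tf.name_scope("inner_layers"):
        # get_variable ignores name_scope, so the variable name gets no prefix.
        kernel = tf.compat.v1.get_variable("kernel", shape=[4, 2])
        # Ordinary ops do pick up the enclosing name_scope as a prefix.
        out = tf.matmul(x, kernel, name="MatMul")

print(kernel.name)   # name of the variable
print(out.op.name)   # name of the matmul op
```

The variable ends up outside the inner_layers scope while the op that consumes it stays inside, even though both were created in the same with block.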
In the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" there's a code fragment I can't wrap my head around. Until that chapter, the models were only explicitly using layers, be it in a sequential fashion or a functional one. But in chapter 16, there's this:
import tensorflow_addons as tfa
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]
sampler = tfa.seq2seq.sampler.TrainingSampler()
decoder_cell = keras.layers.LSTMCell(512)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler,
                                                 output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(
    decoder_embeddings, initial_state=encoder_state,
    sequence_length=sequence_lengths)
Y_proba = tf.nn.softmax(final_outputs.rnn_output)
model = keras.models.Model(
    inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
    outputs=[Y_proba])
And then he just runs the model in a standard way:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
X = np.random.randint(100, size=10*1000).reshape(1000, 10)
Y = np.random.randint(100, size=15*1000).reshape(1000, 15)
X_decoder = np.c_[np.zeros((1000, 1)), Y[:, :-1]]
seq_lengths = np.full([1000], 15)
history = model.fit([X, X_decoder, seq_lengths], Y, epochs=2)
I have trouble understanding the code starting at the line embeddings = keras.layers.Embedding(vocab_size, embed_size). The author creates an Embedding layer which he immediately calls on encoder_inputs and decoder_inputs. Then he does basically the same with the LSTM layer, which he calls on the previously created encoder_embeddings, and the tensors returned by this operation are used in the code slightly below. What I don't get is how those tensors are trained. It looks like he's not using the layers that create them in the model, but if so, how come the embeddings are learned and the whole model converges?
To understand the overall flow, you need to understand what happens under the hood. TensorFlow uses graph execution when building the model. When you pass [encoder_inputs, decoder_inputs, sequence_lengths] as the inputs and [Y_proba] as the output, the model doesn't immediately start training; first it is built. "Built" here means that a computational graph is constructed and stored. The graph is recorded as each layer is called on the symbolic inputs, keras.models.Model(...) collects everything between the inputs and the outputs, and model.compile() then attaches the loss and the optimizer to it.
Let me explain further. Suppose I want to compute a + b = c, b + d = 2, and finally c * d = 6 using a computational graph. TensorFlow will first create three nodes for it. What will it look like? See the picture below.
As you can see in the picture above, TensorFlow does exactly the same thing when you pass your inputs and outputs. The picture above depicts the forward pass. TensorFlow then uses the same graph for the backward pass. See the figure below.
So the computational graph is built first, and the same graph is then used to compute both the forward pass and the backward pass.
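To make the forward and backward pass concrete, here is a minimal sketch in plain Python. It is not TensorFlow's actual implementation, and the values a = 1, b = 2, d = 2 and the node name e are chosen only for illustration. The graph is c = a + b followed by e = c * d, and the backward pass walks the same nodes in reverse using the chain rule.

```python
# Forward pass: evaluate the nodes in the order they were defined.
a, b, d = 1.0, 2.0, 2.0
c = a + b          # first node
e = c * d          # second node, the final output

# Backward pass: walk the same graph in reverse, applying the chain rule.
de_de = 1.0
de_dc = de_de * d  # e = c * d, so de/dc = d
de_dd = de_de * c  # e = c * d, so de/dd = c
de_da = de_dc * 1.0  # c = a + b, so dc/da = 1
de_db = de_dc * 1.0  # c = a + b, so dc/db = 1

print(e, de_da, de_db, de_dd)
```

An optimizer would then use gradients such as de_da to update whichever of these values are trainable parameters.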
The same procedure builds the computational graph of your complete model. How does that work in your case specifically? The model asks how Y_proba is produced. The graph consists of the operations and tensors created between the inputs and outputs. The Embedding layer is created and applied to the inputs encoder_inputs and decoder_inputs to obtain encoder_embeddings and decoder_embeddings, respectively. The LSTM layer is applied to encoder_embeddings to produce encoder_outputs, state_h, and state_c. These tensors are then passed as inputs to the BasicDecoder layer, which combines the decoder_embeddings, encoder_state (constructed from state_h and state_c), and sequence_lengths to produce final_outputs, final_state, and final_sequence_lengths. Finally, the softmax function is applied to the rnn_output of final_outputs to produce the final output Y_proba.
All the entities named in the paragraph above become intermediate nodes in the computational graph.
So it starts with the inputs and works its way down to Y_proba. While the graph is being built, the weights and other parameters of the model are also initialized. The graph is built once, which makes it easy to compute the forward pass and the backward pass.
How are these layers trained and optimized for convergence?
When you specify inputs=[encoder_inputs, decoder_inputs, sequence_lengths] and outputs=[Y_proba], the model knows the intermediate layers that are used to compute Y_proba from encoder_inputs, decoder_inputs and sequence_lengths. These intermediate layers are the Embedding and LSTM layers, as well as the TrainingSampler, LSTMCell, Dense and BasicDecoder layers. These layers are automatically included in the computation graph of the model, allowing the optimizer to update the parameters of these layers during training.
This is an example of the Keras Functional API. In this style of defining a model, you write the blueprints first, and then use it later. Think of it like wiring a circuit: while you're connecting things, there's no electricity flowing through them (the electricity corresponds to data in our metaphor). Later, you turn on the power source, and electricity flows through.
This is how the Functional API works as well. First, let's read the last line:
model = keras.models.Model(
    inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
    outputs=[Y_proba])
This says "Hey Keras, I need a model whose inputs are encoder_inputs, decoder_inputs, and sequence_lengths, and they will eventually produce Y_proba." The details of how they will produce this output are defined above. Let's look at the specific lines you're having trouble with:
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
The first of these says, "Keras, give me a layer that will produce embeddings". embeddings is a layer object. The next two lines are the wiring that we talked about: you can connect layers before any data flows through them, and that's the crux of the Functional API. So the second line says, "Keras, encoder_inputs, which is an Input, will connect to (and go through) the Embedding layer I just created, and the result of that will be in a variable I call encoder_embeddings."
The rest of the code follows the same logic: you're connecting the wires together, before you compile your model and eventually fit it with the data.
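The wiring idea can be checked on a smaller model. The following is a rough sketch, assuming TensorFlow 2.x with tf.keras. It is not the book's model: the tensorflow_addons BasicDecoder is swapped for a plain LSTM decoder, and the sizes and the names enc_in, dec_in and logits are made up for illustration. It shows that layers which were only ever called on symbolic inputs are picked up by keras.models.Model automatically.

```python
import tensorflow as tf
from tensorflow import keras

vocab_size, embed_size = 100, 8  # illustrative sizes, not from the book

enc_in = keras.layers.Input(shape=[None], dtype="int32")
dec_in = keras.layers.Input(shape=[None], dtype="int32")

# One Embedding layer object, wired to two different inputs.
embeddings = keras.layers.Embedding(vocab_size, embed_size)
enc_emb = embeddings(enc_in)
dec_emb = embeddings(dec_in)

encoder = keras.layers.LSTM(16, return_state=True)
_, state_h, state_c = encoder(enc_emb)

# Plain LSTM decoder standing in for the BasicDecoder used in the book.
decoder = keras.layers.LSTM(16, return_sequences=True)
dec_out = decoder(dec_emb, initial_state=[state_h, state_c])
logits = keras.layers.Dense(vocab_size)(dec_out)

# Only inputs and outputs are named; Keras traces everything in between.
model = keras.models.Model(inputs=[enc_in, dec_in], outputs=[logits])

print([type(layer).__name__ for layer in model.layers])
print(embeddings in model.layers, encoder in model.layers)
```

Because the Embedding and LSTM layers appear in model.layers, their weights are part of the model's trainable weights, so the optimizer updates them during fit even though they were never passed to the Model constructor explicitly.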
I've trained the following model for some timeseries in Keras:
input_layer = Input(batch_shape=(56, 3864))
first_layer = Dense(24, input_dim=28, activation='relu',
first_layer = Dropout(0.3)(first_layer)
second_layer = Dense(12, activation='relu')(first_layer)
second_layer = Dropout(0.3)(second_layer)
out = Dense(56)(second_layer)
model_1 = Model(input_layer, out)
Then I defined a new model with the trained layers of model_1 and added dropout layers with a different rate, drp, to it:
input_2 = Input(batch_shape=(56, 3864))
first_dense_layer = model_1.layers[1](input_2)
first_dropout_layer = model_1.layers[2](first_dense_layer)
new_dropout = Dropout(drp)(first_dropout_layer)
snd_dense_layer = model_1.layers[3](new_dropout)
snd_dropout_layer = model_1.layers[4](snd_dense_layer)
new_dropout_2 = Dropout(drp)(snd_dropout_layer)
output = model_1.layers[5](new_dropout_2)
model_2 = Model(input_2, output)
Then I'm getting the prediction results of these two models as follows:
result_1 = model_1.predict(test_data, batch_size=56)
result_2 = model_2.predict(test_data, batch_size=56)
I was expecting to get completely different results because the second model has new dropout layers and these two models are different (IMO), but that's not the case. Both are generating the same result. Why is that happening?
As I mentioned in the comments, the Dropout layer is turned off in inference phase (i.e. test mode), so when you use model.predict() the Dropout layers are not active. However, if you would like to have a model that uses Dropout both in training and inference phase, you can pass training argument when calling it, as suggested by François Chollet:
# ...
new_dropout = Dropout(drp)(first_dropout_layer, training=True)
# ...
Alternatively, if you have already trained your model and now want to use it in inference mode while keeping the Dropout layers (and possibly other layers that behave differently in the training and inference phases, such as BatchNormalization) active, you can define a backend function that takes the model's inputs as well as the Keras learning phase:
from keras import backend as K
func = K.function(model.inputs + [K.learning_phase()], model.outputs)
# to use it pass 1 to set the learning phase to training mode
outputs = func([input_arrays] + [1.])
Your question has a simple solution in recent versions of TensorFlow: you can set the training argument of the call method to True.
When the model or the layer is called with training=True, TensorFlow keeps the Dropout layer active even in inference mode.
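A minimal sketch of what such a call can look like, assuming TensorFlow 2.x with tf.keras. The small model, its layer sizes and the all-ones input are made up for illustration and are not the asker's model_2.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy model with one Dropout layer.
inputs = tf.keras.Input(shape=(8,))
hidden = tf.keras.layers.Dense(16, activation="relu")(inputs)
hidden = tf.keras.layers.Dropout(0.5)(hidden)
outputs = tf.keras.layers.Dense(1)(hidden)
model = tf.keras.Model(inputs, outputs)

x = np.ones((4, 8), dtype="float32")

# training=False (what predict() uses): Dropout is a no-op, output is repeatable.
det_a = model(x, training=False).numpy()
det_b = model(x, training=False).numpy()

# training=True: Dropout samples a fresh random mask on every call.
sto_a = model(x, training=True).numpy()
sto_b = model(x, training=True).numpy()

print(np.allclose(det_a, det_b))
print(np.allclose(sto_a, sto_b))
```

Calling the model directly, rather than through predict(), is what makes it possible to pass the training flag here.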
As there are already some working code solutions above, I will simply add a few more details regarding dropout during inference to prevent confusion.
Based on the original paper, Dropout layers turn off neuron nodes (set their outputs to zero) during training to reduce overfitting. However, once training is finished and we start testing the model, we do not 'touch' any neurons, so all units take part in the decision at inference time. Because no units are dropped any more, the summed input that each downstream unit receives is larger than it was during training. To compensate, a scaling factor is applied to balance the network. To be more precise, if a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p during the prediction stage.
I'm having differences of the outputs when comparing a model with its stored protobuf version (via this conversion script). For debugging I'm comparing both layers respectively. For the weights and the actual layer output during a test sequence I receive the identical outputs, thus I'm not sure how to access the hidden layers.
Here is how I load the layers
input = graph.get_tensor_by_name("lstm_1_input_1:0")
layer1 = graph.get_tensor_by_name("lstm_1_1/kernel:0")
layer2 = graph.get_tensor_by_name("lstm_1_1/recurrent_kernel:0")
layer3 = graph.get_tensor_by_name("time_distributed_1_1/kernel:0")
output = graph.get_tensor_by_name("activation_1_1/div:0")
Here is how I tried to show the respective elements.
show weights:
with tf.Session(graph=graph) as sess:
    print(sess.run(layer1))
    print(sess.run(layer2))
    print(sess.run(layer3))
show outputs:
with tf.Session(graph=graph) as sess:
    y_out, l1_out, l2_out, l3_out = sess.run([output, layer1, layer2, layer3], feed_dict={input: X_test})
With this code sess.run(layer1) == sess.run(layer1,feed_dict={input:X_test}) which shouldn't be true.
Can someone help me out?
When you run sess.run(layer1), you're telling tensorflow to compute the value of layer1 tensor, which is ...
layer1 = graph.get_tensor_by_name("lstm_1_1/kernel:0")
... according to your definition. Note that LSTM kernel is the weights variable. It does not depend on the input, that's why you get the same result with sess.run(layer1, feed_dict={input:X_test}). It's not like tensorflow is computing the output if the input is provided -- it's computing the specified tensor(s), in this case layer1.
When does input matter then? When there is a dependency on it. For example:
sess.run(output). It simply won't work without an input, or without some tensor from which the input can be computed.
The optimization op, such as tf.train.AdamOptimizer(...).minimize(loss). Running this op will change layer1, but it also needs the input to do so.
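The difference between fetching a weight and fetching a layer output can be seen on a toy graph. This is a minimal sketch, assuming TensorFlow 2.x with the tf.compat.v1 session API. The graph, the tensor name hidden and the shapes are made up and are unrelated to the asker's LSTM model.

```python
import numpy as np
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    x = tf.compat.v1.placeholder(tf.float32, shape=[None, 3], name="x")
    kernel = tf.compat.v1.get_variable("kernel", shape=[3, 2])
    hidden = tf.matmul(x, kernel, name="hidden")  # depends on x
    init = tf.compat.v1.global_variables_initializer()

X_test = np.ones((1, 3), dtype=np.float32)

with tf.compat.v1.Session(graph=graph) as sess:
    sess.run(init)
    # The kernel is a variable: feeding x does not change what is returned.
    w_no_feed = sess.run(kernel)
    w_with_feed = sess.run(kernel, feed_dict={x: X_test})
    # The layer output is looked up by name and does need the input.
    hidden_tensor = graph.get_tensor_by_name("hidden:0")
    hidden_out = sess.run(hidden_tensor, feed_dict={x: X_test})
    try:
        sess.run(hidden_tensor)  # no feed: the placeholder has no value
        failed_without_feed = False
    except tf.errors.InvalidArgumentError:
        failed_without_feed = True

print(np.array_equal(w_no_feed, w_with_feed), hidden_out.shape, failed_without_feed)
```

To inspect a hidden layer's output in the real model, the tensor to fetch would be the output of the layer's op rather than its kernel; the exact tensor names depend on the exported graph.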
Maybe you can try TensorBoard and examine your graph to find the outputs of the hidden layers.
There have been some answers about adding L1-regularization to the weights of one hidden layer. However, what I want is not only sparseness of the weights, but also sparseness of the representation of one hidden layer. What I want is something like the code below. Is this feasible, or can I only add L1-regularization on the weights?
import tensorflow as tf
HIDDEN = tf.contrib.layers.dense(input_layer, n_nodes)
loss = meansq #or other loss calculation
l1_regularizer = tf.contrib.layers.l1_regularizer(scale=0.005, scope=None)
regularization_penalty = tf.contrib.layers.apply_regularization(l1_regularizer, HIDDEN)
regularized_loss = loss + regularization_penalty
regularized_loss = loss + regularization_penalty
This idea is from the sparse representation of the book Deep Learning written by Goodfellow and Bengio.
If you are using tf.contrib.layers, the fully_connected function accepts a weights_regularizer argument, so your code should look like this:
l1 = tf.contrib.layers.l1_regularizer(scale=0.005, scope=None)
hidden = tf.contrib.layers.fully_connected(inputs, n_nodes, weights_regularizer=l1)
That said, tf.contrib.layers has been mostly moved to the core API, so you should be using tf.layers.dense instead with kernel_regularizer argument.
The code above will regularize the weights in the layer. If you want to regularize both weights and the layer output, you can use the same tf.contrib.layers.l1_regularizer or create a different one with different parameters. Something like this should work for you:
l1 = tf.contrib.layers.l1_regularizer(scale=0.005, scope=None)
hidden = tf.contrib.layers.fully_connected(inputs, n_nodes, weights_regularizer=l1)
hidden_reg = l1(hidden)
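For reference, an L1 regularizer with a given scale evaluates to scale times the sum of absolute values of the tensor it is applied to, and the resulting penalty still has to be added to the loss. The following NumPy sketch spells that arithmetic out for both the weights and the hidden activations. The shapes, the random data, the ReLU activation and the mean-squared loss are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 3))     # batch of 5 examples, 3 features
weights = rng.normal(size=(3, 4))    # hidden layer with 4 units
targets = rng.normal(size=(5, 4))    # made-up regression targets
scale = 0.005

# Hidden representation (ReLU activation assumed).
hidden = np.maximum(inputs @ weights, 0.0)

# Unregularized loss: mean squared error, as a stand-in.
loss = np.mean((hidden - targets) ** 2)

# L1 penalty on the weights encourages sparse weights.
weight_penalty = scale * np.sum(np.abs(weights))
# L1 penalty on the activations encourages a sparse representation.
activity_penalty = scale * np.sum(np.abs(hidden))

regularized_loss = loss + weight_penalty + activity_penalty
print(loss, weight_penalty, activity_penalty, regularized_loss)
```

In the TensorFlow snippet above, hidden_reg plays the role of activity_penalty and would be added to the loss in the same way.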
I'm trying to get into tensorflow, setting up a network and then feeding data to it. For some reason I end up with the error message ValueError: setting an array element with a sequence. I made a minimal example of what I'm trying to do:
import tensorflow as tf
K = 10
lchild = tf.placeholder(tf.float32, shape=(K))
rchild = tf.placeholder(tf.float32, shape=(K))
parent = tf.nn.tanh(tf.add(lchild, rchild))
input = [ tf.Variable(tf.random_normal([K])),
          tf.Variable(tf.random_normal([K])) ]
with tf.Session() as sess :
    print(sess.run([parent], feed_dict={ lchild: input[0], rchild: input[1] }))
Basically, I'm setting up a network with place holders and a sequence of input embeddings that I want to learn, and then I try to run the network, feeding the input embeddings into it. From what I can tell by searching for the error message, there might be something wrong with my feed_dict, but I can't see any obvious mismatches in eg. dimensionality.
So, what did I miss, or how did I get this completely backwards?
EDIT: I've edited the above to clarify that the input represents embeddings that need to be learned. I guess the question can be asked more sharply as: Is it possible to use placeholders for parameters?
The inputs should be numpy arrays.
So, instead of tf.Variable(tf.random_normal([K])), simply write np.random.randn(K) and everything should work as expected.
EDIT (The question was clarified after my answer):
It is possible to use placeholders as parameters but in a slightly different way. For example:
lchild = tf.placeholder(tf.float32, shape=(K))
rchild = tf.placeholder(tf.float32, shape=(K))
parent = tf.nn.tanh(tf.add(lchild, rchild))
loss = <some loss that depends on the parent tensor or lchild/rchild>
# Compute gradients with respect to the input variables
grads = tf.gradients(loss, [lchild, rchild])
inputs = [np.random.randn(K), np.random.randn(K)]
for i in range(<number of iterations>):
    np_grads = sess.run(grads, feed_dict={lchild: inputs[0], rchild: inputs[1]})
    inputs[0] -= 0.1 * np_grads[0]
    inputs[1] -= 0.1 * np_grads[1]
It is not however the best or easiest way to do this. The main problem with it is that at every iteration you need to copy numpy arrays in and out of the session (which is running potentially on a different device like GPU).
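The manual update loop can be tried without TensorFlow at all. The sketch below is plain NumPy under stated assumptions: the loss sum(parent ** 2) is a hypothetical choice made only so that there is something to minimize, and its gradient is derived by hand rather than obtained from tf.gradients.

```python
import numpy as np

K = 10
rng = np.random.default_rng(0)
inputs = [rng.standard_normal(K), rng.standard_normal(K)]

def loss_and_grad(lchild, rchild):
    """Hypothetical loss sum(parent**2) and its gradient w.r.t. each child."""
    parent = np.tanh(lchild + rchild)
    loss = np.sum(parent ** 2)
    # d(loss)/d(lchild) = 2 * parent * (1 - parent**2); same for rchild.
    grad = 2.0 * parent * (1.0 - parent ** 2)
    return loss, grad

initial_loss, _ = loss_and_grad(*inputs)
for _ in range(100):
    _, grad = loss_and_grad(*inputs)
    # Plain gradient descent on the "embeddings" themselves.
    inputs[0] -= 0.1 * grad
    inputs[1] -= 0.1 * grad
final_loss, _ = loss_and_grad(*inputs)

print(initial_loss, final_loss)
```

The arrays in inputs are the learned parameters here, which is exactly the role a tf.Variable plays inside a TensorFlow graph.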
Placeholders generally are used to feed the data external to the model (like texts or images). The way to solve it using tensorflow utilities would be something like:
lchild = tf.Variable(tf.random_normal([K]))
rchild = tf.Variable(tf.random_normal([K]))
parent = tf.nn.tanh(tf.add(lchild, rchild))
loss = <some loss that depends on the parent tensor or lchild/rchild>
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
for i in range(<number of iterations>):
    sess.run(train_op)
# Retrieve the weights back to numpy:
np_lchild = sess.run(lchild)