I am trying to construct a Keras model model_B that outputs the output of another Keras model model_A. Now, the output of model_A is computed from the concatenation of several tensors coming from multiple Keras embedding layers with different vocabulary sizes. Models model_A and model_B are essentially the same.
Problem: When I train model_A, everything works fine. However, when I train model_B on the same dataset, I get the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError:
indices[1] = 3 is not in [0, 2) [[{{node model_1/embedding_1/embedding_lookup}}]]
Essentially, the error is saying that the index of a word is outside of the expected vocabulary, but this is not the case. Could someone clarify why this happens?
Here is a reproducible example of the problem:
from keras.layers import Input, Dense, Lambda, Concatenate, Embedding
from keras.models import Model
import numpy as np
# Constants
A = 2
vocab_sizes = [2, 4]
# Architecture
X = Input(shape=(A,))
embeddings = []
for a in range(A):
    X_a = Lambda(lambda x: x[:, a])(X)
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(X_a)
    embeddings.append(embedding)
h = Concatenate()(embeddings)
h = Dense(1)(h)
# Model A
model_A = Model(inputs=X, outputs=h)
model_A.compile('sgd', 'mse')
# Model B
Y = Input(shape=(A,))
model_B = Model(inputs=Y, outputs=model_A(Y))
model_B.compile('sgd', 'mse')
# Dummy dataset
x = np.array([[vocab_sizes[0] - 1, vocab_sizes[1] - 1]])
y = np.array([1])
# Train models
model_A.fit(x, y, epochs=10) # Works well
model_B.fit(x, y, epochs=10) # Fails
From the error above, it somehow seems that the input x[:, 1] is wrongly being fed to the first embedding layer with vocabulary size 2, as opposed to the second. Interestingly, when I swap the vocabulary sizes (e.g. set vocab_sizes = [4, 2]) it works, supporting the previous hypothesis.
For some weird reason, slicing the tensor inside a loop is causing this error.
You can replace the slicing with tf.split, make the necessary adjustments, and it will work well:
Extra imports:
import tensorflow as tf
from keras.layers import Flatten
# Architecture
X = Input(shape=(A,))
X_as = Lambda(lambda x: tf.split(x, A, axis=1))(X)
embeddings = []
for a, x in enumerate(X_as):
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(x)
    embeddings.append(embedding)
h = Concatenate(axis=1)(embeddings)
h = Flatten()(h)
h = Dense(1)(h)
Why does this happen?
Well, it's very hard to guess. My assumption is that the system is trying to apply the Lambda layer using the current value of the variable a instead of the value it had when the layer was created (this should not be happening, I guess, but I had exactly this problem once when loading a model: one of the variables kept its last value when loading the model instead of having a looped value).
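You can see this late-binding behavior in plain Python, independent of Keras:

fns = [lambda: a for a in range(3)]
print([f() for f in fns])  # prints [2, 2, 2]: every lambda sees the final value of a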
One thing that supports this explanation is trying constants instead of a:
# Architecture
X = Input(shape=(A,))
embeddings = []
X_a1 = Lambda(lambda x: x[:, 0], name = 'lamb_'+str(0))(X)
X_a2 = Lambda(lambda x: x[:, 1], name = 'lamb_'+str(1))(X)
xs = [X_a1, X_a2]
for a, X_a in enumerate(xs):
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(X_a)
    embeddings.append(embedding)
h = Concatenate()(embeddings)
h = Dense(1)(h)
Solution if you want to avoid tf.split
Another thing that works (and supports the explanation that the Lambda might be using the last value of a in your code for model_B) is performing the entire loop inside the Lambda layer; this way, a never takes an unexpected value:
# Architecture
X = Input(shape=(A,))
X_as = Lambda(lambda x: [x[:, a] for a in range(A)])(X)
embeddings = []
for a, X_a in enumerate(X_as):
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(X_a)
    embeddings.append(embedding)
h = Concatenate()(embeddings)
h = Dense(1)(h)
I believe the following is happening:
(1) When you do the initial "for loop" over the Lambda function, you are initializing the constant tensors which feed into the "strided_slice" operator that extracts [:,0] or [:,1] correctly. Using the global variable "a" in the Lambda function is probably "risky" but works okay in this instance. Furthermore, I believe that the function is stored in bytecode as "lambda x: x[:, a]", so it will look up whatever the value of "a" is at the time of evaluation. Since "a" could be anything by then, this can be problematic in some cases.
(2) When you build the first model (model_A), the constant tensors are not reinitialized, so the lambda functions (strided_slice operators) keep the correct values (0 and 1) which were initialized in the "for loop."
(3) When you build the second model (model_B), the constant tensors are reinitialized. However, at this time, the value of "a" is 1 (as stated by some of the other commentary), because that is its final value after the original "for loop." In fact, you can set a=0 just before defining model_B, and you'll actually get behavior corresponding to both Lambdas extracting [:,0] and feeding it to the embedding layers. My speculation is that this difference in behavior is related to calling model_A(Y) in this case (whereas for the first model, you only specified the output layer "h" and never called the model itself as the output, a difference I believe was also suggested by other commentary).
I'll say that I verified this state of affairs by putting some print statements in the file "framework/constant_op.py" during the operator initialization step, and I obtained debug output with values and sequences consistent with what I stated above.
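If this explanation is right, a fix consistent with it (my own sketch, not from the original post) is to bind the loop variable at definition time with a default argument, so each Lambda keeps its own copy of a even if the constants are reinitialized:

for a in range(A):
    X_a = Lambda(lambda x, a=a: x[:, a])(X)  # a=a freezes the current value of a
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(X_a)
    embeddings.append(embedding)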
I hope this helps.
Related
Problem
I am using the Model API to create a Keras network that takes two inputs and produces one output. When training the network I get the following error:
Error when checking model input: the list of Numpy arrays that you
are passing to your model is not the size the model expected. Expected
to see 2 array(s), but instead got the following list of 1 arrays:
Despite this error, the input X array has a shape of (2,8), and the output y array has a shape of (1,4).
Things already tried
There are a number of similar questions on SO; however, their solutions largely revolve around ensuring X and y are NumPy arrays. As seen in my implementation, I have already done that, so I do not believe this is a duplicate question.
Implementation
I have defined the model as follows:
opt = Adam(lr = alpha)
input = Input(shape=(input_dim_,))
delta = Input(shape=[1])
l1 = Dense(units = 1024, input_dim = input_dim_, activation = "relu")(input)
l2 = Dense(units=512, activation="relu")(l1)
def loss_function(y, y_pred):
    y_pred = K.clip(y_pred, 1e-8, 1 - 1e-8)
    return K.sum(-y * K.log(y_pred) * delta)
if model_type == "actor":
    out = Dense(units=output_dim_, activation="softmax")(l2)
    model = Model(input=[input, delta], output=[out])
    model.compile(loss=loss_function, optimizer=opt)
And train the model by doing the following:
X = [s_t,delta]
X = np.array(X)
actor.fit(X,y,verbose=0)
You are not passing the data correctly in fit:
actor.fit(X,y,verbose=0)
Here X should be a list containing two NumPy arrays, each corresponding to one of your inputs (your model has two inputs). So it should be more like this:
X = [np.array(s_t), np.array(delta)]
actor.fit(X, y, verbose=0)
Then it should work.
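For instance, with dummy data (the shapes here are my assumption based on the question, not from the original code):

import numpy as np

s_t = np.zeros((4, 8))    # 4 samples, 8 features each, matching Input(shape=(8,))
delta = np.ones((4, 1))   # 4 samples, one delta value each, matching Input(shape=[1])
y = np.zeros((4, 4))      # 4 samples, 4 outputs

actor.fit([s_t, delta], y, verbose=0)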
I've made a sequential model in Keras for generating musical sequences. Something very simple, with an LSTM and a dense softmax layer. I have 333 possible musical events.
I know that model.fit() needs all training data in memory, which is a problem if it is one-hot encoded. So I give the model an integer as input, transform this to a one-hot encoding in a Lambda layer, and then use sparse categorical cross-entropy for the loss. Because each batch would be transformed to a one-hot encoding on the fly, I thought this would sort out my memory issues. But instead, it hangs at the beginning of fitting and fills up my memory, even with a small batch size. Evidently, I'm not understanding something about how Keras works, which is not surprising, given that I'm new to it (and on that note, please point out anything too naive in my code).
1) What is happening behind the scenes? What is it about Keras that I'm not understanding? It seems like Keras is going ahead and running the Lambda layer on all of my training examples before doing any training.
2) How can I solve this and make Keras do it truly on the fly? Can I solve it with model.fit(), which I'm currently using, or do I need model.fit_generator(), which to me looks like it could solve this rather easily? (See the generator sketch after the code below.)
Here is some of my code:
def musicmodel(Tx, n_a, n_values):
    """
    Arguments:
    Tx -- length of a sequence in the corpus
    n_a -- the number of activations used in our model (for the LSTM)
    n_values -- number of unique values in the music data

    Returns:
    model -- a keras model
    """
    # Define the input with a shape
    X = Input(shape=(Tx,))
    # Define a0 and c0, the initial hidden and cell states for the LSTM
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')
    a = a0
    c = c0
    # Create an empty list to append the outputs to while iterating
    outputs = []
    # Loop over the time steps
    for t in range(Tx):
        # Select the "t"-th time step from X
        x = Lambda(lambda x: x[:, t])(X)
        # We need the class represented in one-hot fashion:
        x = Lambda(lambda x: tf.one_hot(K.cast(x, dtype='int32'), n_values))(x)
        # We then reshape x to be (1, n_values)
        x = reshapor(x)
        # Perform one step of the LSTM cell
        a, _, c = LSTM_cell(x, initial_state=[a, c])
        # Apply densor to the hidden state output of the LSTM cell
        out = densor(a)
        # Add the output to "outputs"
        outputs.append(out)
    # Create the model instance
    model = Model(inputs=[X, a0, c0], outputs=outputs)
    return model
I then fit my model:
model = musicmodel(Tx, n_a, n_values)
opt = Adam(lr=0.01, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
a0 = np.zeros((m, n_a))
c0 = np.zeros((m, n_a))
model.fit([X, a0, c0], list(Y), validation_split=0.25, epochs=600, verbose=2, batch_size=4)
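To question (2): as a rough sketch of the fit_generator route (the names, shapes, and batching scheme below are my assumptions, not from the original code), a Python generator can yield one batch at a time so that nothing is materialized up front:

def batch_generator(X, a0, c0, Y_list, batch_size):
    # Yield random mini-batches indefinitely; Keras pulls them on demand.
    m = X.shape[0]
    while True:
        idx = np.random.randint(0, m, batch_size)
        inputs = [X[idx], a0[idx], c0[idx]]
        targets = [y[idx] for y in Y_list]  # one target array per time step
        yield inputs, targets

model.fit_generator(batch_generator(X, a0, c0, list(Y), batch_size=4),
                    steps_per_epoch=max(1, X.shape[0] // 4), epochs=600)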
In Tensorflow 1.9, I want to create a network and then recursively feed the output (the prediction) of the network back into the input of the network. During this loop, I want to store the predictions made by the network in a list.
Here is my attempt:
# Define the number of steps over which to loop the network
num_steps = 5
# Define the network weights
weights_1 = np.random.uniform(0, 1, [1, 10]).astype(np.float32)
weights_2 = np.random.uniform(0, 1, [10, 1]).astype(np.float32)
# Create a variable to store the predictions, one for each loop
predictions = tf.Variable(np.zeros([num_steps, 1]), dtype=np.float32)
# Define the initial prediction to feed into the loop
initial_prediction = np.array([[0.1]], dtype=np.float32)
x = initial_prediction
# Loop through the predictions
for step_num in range(num_steps):
    x = tf.matmul(x, weights_1)
    x = tf.matmul(x, weights_2)
    predictions[step_num-1].assign(x)
# Define the final prediction
final_prediction = x
# Start a session
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Make the predictions
last_pred, all_preds = sess.run([final_prediction, predictions])
print(last_pred)
print(all_preds)
And this prints out:
[[48.8769]]
[[0.]
[0.]
[0.]
[0.]
[0.]]
So whilst the value of final_prediction appears correct, the value of predictions is not what I would expect. It seems that predictions is never actually assigned to, despite the line predictions[step_num-1].assign(x).
Please can somebody explain to me why this isn't working, and what I should be doing instead? Thanks!
This happens because assign is just a TF op like any other, and as such is only executed if needed. Since nothing on the path to final_prediction relies on the assign op, and predictions is just a variable, the assignment is never executed.
I think the most straightforward solution would be to replace the line
predictions[step_num-1].assign(x)
by
x = predictions[step_num-1].assign(x)
This works because assign also returns the value it is assigning. Now, to compute final_prediction, TF actually needs to "go through" the assign op, so the assignments will be carried out.
Another option would be to use tf.control_dependencies which is a way to "force" TF to compute specific ops when it is computing other ones. However in this case it could be a bit icky because the op we want to force (assign) depends on values that are being computed within the loop and I'm not sure about the order in which TF does stuff in this case. The following should work:
for step_num in range(num_steps):
    x = tf.matmul(x, weights_1)
    x = tf.matmul(x, weights_2)
    with tf.control_dependencies([predictions[step_num-1].assign(x)]):
        x = tf.identity(x)
We use tf.identity as a no-op just to have something to wrap with control_dependencies. I think this is the more flexible of the two options. However, it comes with some caveats discussed in the docs.
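With the control_dependencies version in place, rerunning the session code from the question should show the assignments taking effect (a quick check, using the same session setup as in the question):

sess = tf.Session()
sess.run(tf.global_variables_initializer())
last_pred, all_preds = sess.run([x, predictions])
print(all_preds)  # the rows of predictions should now be populated, not all zeros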
I have been following a tutorial that shows how to make a word2vec model.
This tutorial uses this piece of code:
similarity = merge([target, context], mode='cos', dot_axes=0) (no other info was given, but I suppose this comes from keras.layers)
Now, I've researched a bit on the merge method but I couldn't find much about it.
From what I understand, it has been replaced by a number of layers like layers.Add(), layers.Concatenate(), ...
What should I use? There's .Dot(), which has an axis parameter (which seems to be correct) but no mode parameter.
What can I use in this case?
The Dot layer in Keras now supports built-in Cosine similarity using the normalize = True parameter.
From the Keras Docs:
keras.layers.Dot(axes, normalize=True)
normalize: Whether to L2-normalize samples along the dot product axis before taking the dot product. If set to True, then the output of the dot product is the cosine proximity between the two samples.
Source
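For reference, a minimal sketch of how the Dot layer might replace the old merge call (the embedding size of 300 is my assumption):

from keras.layers import Input, Dot
from keras.models import Model

target = Input(shape=(300,))
context = Input(shape=(300,))
# axes=1 takes the dot product along the feature axis; normalize=True
# L2-normalizes both inputs first, so the result is the cosine similarity.
similarity = Dot(axes=1, normalize=True)([target, context])
model = Model(inputs=[target, context], outputs=similarity)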
There are a few things that are unclear from the Keras documentation that I think are crucial to understand:
For each function in the Keras documentation for Merge, there are a lowercase and an uppercase version defined, e.g. add() and Add().
On Github, farizrahman4u outlines the differences:
Merge is a layer.
Merge takes layers as input
Merge is usually used with Sequential models
merge is a function.
merge takes tensors as input.
merge is a wrapper around Merge.
merge is used in Functional API
Using Merge:
left = Sequential()
left.add(...)
left.add(...)
right = Sequential()
right.add(...)
right.add(...)
model = Sequential()
model.add(Merge([left, right]))
model.add(...)
using merge:
a = Input((10,))
b = Dense(10)(a)
c = Dense(10)(a)
d = merge([b, c])
model = Model(a, d)
To answer your question, since Merge has been deprecated, we have to define and build a layer ourselves for the cosine similarity. In general this will involve using those lowercase functions, which we wrap within a Lambda to create a layer that we can use within a model.
I found a solution here:
from keras import backend as K

def cosine_distance(vests):
    x, y = vests
    x = K.l2_normalize(x, axis=-1)
    y = K.l2_normalize(y, axis=-1)
    return -K.mean(x * y, axis=-1, keepdims=True)

def cos_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)
distance = Lambda(cosine_distance, output_shape=cos_dist_output_shape)([processed_a, processed_b])
Depending on your data, you may want to remove the L2 normalization. What is important to note about the solution is that it is built using the Keras backend API, e.g. K.mean(). I think this is necessary when defining a custom layer or even a loss function.
Hope I was clear, this was my first SO answer!
Maybe this will help you
(I spent a lot of time making sure that these are the same thing):
import tensorflow as tf
with tf.device('/CPU:' + str(0)):
    print(tf.losses.CosineSimilarity()([1.0, 1.0, 1.0, -1.0], [4.0, 4.0, 4.0, 5.0]))
    print(tf.keras.layers.dot([tf.Variable([[1.0, 1.0, 1.0, -1.0]]), tf.Variable([[4.0, 4.0, 4.0, 5.0]])], axes=1, normalize=True))
Output (pay attention to the sign: the loss version returns the negative of the cosine similarity, so that minimizing the loss maximizes similarity):
tf.Tensor(-0.40964404, shape=(), dtype=float32)
tf.Tensor([[0.40964404]], shape=(1, 1), dtype=float32)
If you alter the last code block of the tutorial as follows, you can see that the (average) loss is decreasing nicely with the Dot solution suggested by SantoshGuptaz7 (comment in the question above):
display_after_epoch = 10000
display_after_epoch_2 = 10 * display_after_epoch
loss_sum = 0
for cnt in range(epochs):
    idx = np.random.randint(0, len(labels)-1)
    arr_1[0,] = word_target[idx]
    arr_2[0,] = word_context[idx]
    arr_3[0,] = labels[idx]
    loss = model.train_on_batch([arr_1, arr_2], arr_3)
    loss_sum += loss
    if cnt % display_after_epoch == 0 and cnt != 0:
        print("\nIteration {}, loss={}".format(cnt, loss_sum / cnt))
        loss_sum = 0
    if cnt % display_after_epoch_2 == 0:
        sim_cb.run_sim()
Currently I am having trouble making a few elements of a variable non-trainable. That is, given a variable such as x,
x = tf.Variable(tf.zeros([2, 2]))
I wish to train only x[0,0] and x[1,1] while keeping x[0,1] and x[1,0] fixed during training.
Currently TensorFlow does provide the option to make any variable non-trainable by using trainable=False or tf.stop_gradient(). However, these methods make all elements of x non-trainable. My question is how to obtain this selectivity.
There is no selective lack of update for now; however, you can achieve this effect indirectly by explicitly specifying the variables that should be updated. Both .minimize and all the gradient functions accept a list of variables you want to optimize over; just create a list omitting some of them, for example:
v1 = tf.Variable( ... ) # we want to freeze it in one op
v2 = tf.Variable( ... ) # we want to freeze it in another op
v3 = tf.Variable( ... ) # we always want to train this one
loss = ...
optimizer = tf.train.GradientDescentOptimizer(0.1)
op1 = optimizer.minimize(loss,
    var_list=[v for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES) if v != v1])
op2 = optimizer.minimize(loss,
    var_list=[v for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES) if v != v2])
and now you can call them whenever you want to train with respect to a subset of the variables. Note that this might require two separate optimizers if you are using Adam or some other method that gathers statistics (and you will end up with separate statistics per optimizer!). However, if there is just one set of frozen variables per training run, everything is straightforward with var_list.
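For example, with Adam that could look like this (a sketch; each optimizer instance keeps its own moment estimates):

opt1 = tf.train.AdamOptimizer(0.001)
opt2 = tf.train.AdamOptimizer(0.001)
op1 = opt1.minimize(loss, var_list=[v2, v3])  # v1 is frozen for this op
op2 = opt2.minimize(loss, var_list=[v1, v3])  # v2 is frozen for this op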
However, there is no way to freeze training of a subset of a single variable; TensorFlow always treats a variable as a single unit. You have to specify your computations differently to achieve this. One way is to:
create a binary mask M with 1's where you want to stop updates over X
create separate variable X', which is non-trainable, and tf.assign to it value of X
output X'*M + (1-M)*X
for example:
x = tf.Variable( ... )
xp = tf.Variable( ..., trainable=False)
m = tf.constant( ... )  # mask
cp = tf.assign(xp, x)   # copy the current value of x into the frozen copy
with tf.control_dependencies([cp]):
    x_frozen = m*xp + (1-m)*x
and you just use x_frozen instead of x. Note that we need the control dependency because tf.assign can execute asynchronously, and here we want to make sure it always has the most up-to-date value of x.
You can use the tf.stop_gradient trick to prevent masked tf.Variable elements from being trained. For example:
x = tf.Variable(tf.zeros([2, 2]))
mask = tf.constant([[1, 0], [0, 1]], dtype=x.dtype)
x = mask * x + tf.stop_gradient((1 - mask) * x)
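A quick check of this behavior (my own illustration; I rename the variable to x_var to keep a handle on it, since the snippet above rebinds x):

import tensorflow as tf

x_var = tf.Variable(tf.zeros([2, 2]))
mask = tf.constant([[1., 0.], [0., 1.]])
x = mask * x_var + tf.stop_gradient((1 - mask) * x_var)

# Pull every entry of x toward 1; only the diagonal should actually move,
# because the off-diagonal path goes through tf.stop_gradient.
loss = tf.reduce_sum(tf.square(x - tf.ones([2, 2])))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)
    print(sess.run(x_var))  # off-diagonal entries remain 0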