I am having trouble making a few elements of a variable non-trainable. That is, given a variable such as x,
x= tf.Variable(tf.zeros([2,2]))
I wish to train only x[0,0] and x[1,1] while keeping x[0,1] and x[1,0] fixed during training.
TensorFlow does provide the option to make a whole variable non-trainable by using trainable=False or tf.stop_gradient(). However, these methods make all elements of x non-trainable. My question is: how do I obtain this selectivity?
There is no selective update for now; however, you can achieve this effect indirectly by explicitly specifying the variables that should be updated. Both .minimize and all the gradient functions accept a list of variables you want to optimize over - just create a list omitting some of them, for example
v1 = tf.Variable( ... ) # we want to freeze it in one op
v2 = tf.Variable( ... ) # we want to freeze it in another op
v3 = tf.Variable( ... ) # we always want to train this one
loss = ...
optimizer = tf.train.GradientDescentOptimizer(0.1)
op1 = optimizer.minimize(loss,
    var_list=[v for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES) if v != v1])
op2 = optimizer.minimize(loss,
    var_list=[v for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES) if v != v2])
and now you can call them whenever you want to train with respect to a subset of the variables. Note that this might require two separate optimizers if you are using Adam or some other method that gathers statistics (and you will end up with separate statistics per optimizer!). However, if there is just one set of frozen variables per training run, everything is straightforward with var_list.
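For instance, a minimal sketch of the two-optimizer variant (this assumes the TF1 graph API used above; the variable shapes, loss and learning rate are made up for illustration):
import tensorflow as tf

v1 = tf.Variable(tf.zeros([3]), name="v1")
v2 = tf.Variable(tf.zeros([3]), name="v2")
v3 = tf.Variable(tf.zeros([3]), name="v3")
loss = tf.reduce_sum(tf.square(v1) + tf.square(v2) + tf.square(v3))

# one Adam optimizer per frozen set, since each optimizer keeps its own slot statistics
opt1 = tf.train.AdamOptimizer(0.01)
opt2 = tf.train.AdamOptimizer(0.01)
trainables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
op1 = opt1.minimize(loss, var_list=[v for v in trainables if v is not v1])  # v1 stays frozen
op2 = opt2.minimize(loss, var_list=[v for v in trainables if v is not v2])  # v2 stays frozen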
However, there is no way to freeze training of a subset of a variable's elements; TensorFlow always treats a variable as a single unit. You have to specify your computations differently to achieve this. One way is to:
create a binary mask M with 1's where you want to stop updates to X
create a separate, non-trainable variable X', and tf.assign the value of X to it
output X'*M + (1-M)*X
for example:
x = tf.Variable( ... )
xp = tf.Variable( ..., trainable=False)
m = tf.constant( ... )  # mask
cp = tf.assign(xp, x)
with tf.control_dependencies([cp]):
    x_frozen = m*xp + (1-m)*x
and you just use x_frozen instead of x. Note that we need the control dependency because tf.assign can execute asynchronously, and here we want to make sure xp always holds the most up-to-date value of x.
You can use the tf.stop_gradient trick to prevent the masked elements of a tf.Variable from being trained. For example:
x = tf.Variable(tf.zeros([2, 2]))
mask = tf.constant([[1, 0], [0, 1]], dtype=x.dtype)
x = mask * x + tf.stop_gradient((1 - mask) * x)
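A quick way to verify the effect (a self-contained sketch; the ones initializer and the sum-of-squares loss are arbitrary choices for the demo): the gradient with respect to the underlying variable should be exactly zero at the masked-out off-diagonal positions.
import tensorflow as tf

x_var = tf.Variable(tf.ones([2, 2]))
mask = tf.constant([[1., 0.], [0., 1.]])
x = mask * x_var + tf.stop_gradient((1 - mask) * x_var)

loss = tf.reduce_sum(tf.square(x))
grad = tf.gradients(loss, [x_var])[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))  # [[2. 0.] [0. 2.]] -- no gradient reaches the off-diagonal entries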
For example, consider the following torch.no_grad() snippet:
x = torch.tensor([1.], requires_grad=True)
with torch.no_grad():
    b = torch.tensor([1.])
    y = x * 2 + b
How do y and b know they should be initialized with requires_grad=False?
AFAIK, from the source code for torch.no_grad(), it will finally call torch._C._set_grad_enabled(False). I am stuck here because I do not know what happened with this call. Is it setting a global variable that tensors can access when initializing and thus tensors know it does not require gradient for now?
For b: a tensor constructed directly from data never requires grad by default, no matter the context (grad or no-grad). For y: torch._C._set_grad_enabled(False) deactivates the save_for_backward calls that store the tensors needed to compute backprop later and stops creating the nodes and edges of the C++ backprop graph, so anything computed inside the block does not require grad.
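A small demonstration of both parts (a sketch; the printed values reflect standard PyTorch behavior):
import torch

x = torch.tensor([1.], requires_grad=True)

with torch.no_grad():
    b = torch.tensor([1.])  # a freshly constructed tensor defaults to requires_grad=False anyway
    y = x * 2 + b           # computed while grad mode is disabled
print(b.requires_grad, y.requires_grad)  # False False

z = x * 2 + b               # the same computation with grad mode enabled
print(z.requires_grad)      # True: the result now tracks gradients back to x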
I am trying to construct a Keras model model_B that outputs the output of another Keras model model_A. Now, the output of model_A is computed from the concatenation of several tensors coming from multiple Keras embedding layers with different vocabulary sizes. Models model_A and model_B are essentially the same.
Problem: When I train model_A, everything works fine. However, when I train model_B on the same dataset, I get the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError:
indices[1] = 3 is not in [0, 2) [[{{node model_1/embedding_1/embedding_lookup}}]]
Essentially, the error is saying that the index of a word is outside of the expected vocabulary, but this is not the case. Could someone clarify why this happens?
Here is a reproducible example of the problem:
from keras.layers import Input, Dense, Lambda, Concatenate, Embedding
from keras.models import Model
import numpy as np
# Constants
A = 2
vocab_sizes = [2, 4]
# Architecture
X = Input(shape=(A,))
embeddings = []
for a in range(A):
    X_a = Lambda(lambda x: x[:, a])(X)
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(X_a)
    embeddings.append(embedding)
h = Concatenate()(embeddings)
h = Dense(1)(h)
# Model A
model_A = Model(inputs=X, outputs=h)
model_A.compile('sgd', 'mse')
# Model B
Y = Input(shape=(A,))
model_B = Model(inputs=Y, outputs=model_A(Y))
model_B.compile('sgd', 'mse')
# Dummy dataset
x = np.array([[vocab_sizes[0] - 1, vocab_sizes[1] - 1]])
y = np.array([1])
# Train models
model_A.fit(x, y, epochs=10) # Works well
model_B.fit(x, y, epochs=10) # Fails
From the error above, it somehow seems that the input x[:, 1] is wrongly being fed to the first embedding layer with vocabulary size 2, as opposed to the second. Interestingly, when I swap the vocabulary sizes (e.g. set vocab_sizes = [4, 2]) it works, supporting the previous hypothesis.
For some weird reason, slicing the tensor inside a Python loop is causing this error.
You can replace your slicing with tf.split, make the necessary adjustments, and it will work well:
Extra imports:
import tensorflow as tf
from keras.layers import Flatten
# Architecture
X = Input(shape=(A,))
X_as = Lambda(lambda x: tf.split(x, A, axis=1))(X)
embeddings = []
for a, x in enumerate(X_as):
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(x)
    embeddings.append(embedding)
h = Concatenate(axis=1)(embeddings)
h = Flatten()(h)
h = Dense(1)(h)
Why does this happen?
Well, it's very hard to guess. My assumption is that the system is trying to apply the lambda layer using the actual variable a instead of the value you gave it before (this should not be happening, I guess, but I had exactly this problem once when loading a model: one of the variables kept its last value when loading the model instead of having a looped value).
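The suspected mechanism is ordinary Python late binding in closures; a plain-Python illustration (independent of Keras):
# each lambda closes over the *name* a, so all of them see its final value
fns = [lambda x: x + a for a in range(3)]
print([f(0) for f in fns])  # [2, 2, 2]

# binding a as a default argument captures its value at definition time instead
fns = [lambda x, a=a: x + a for a in range(3)]
print([f(0) for f in fns])  # [0, 1, 2]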
One thing that supports this explanation is trying constants instead of a:
#Architecture
X = Input(shape=(A,))
embeddings = []
X_a1 = Lambda(lambda x: x[:, 0], name = 'lamb_'+str(0))(X)
X_a2 = Lambda(lambda x: x[:, 1], name = 'lamb_'+str(1))(X)
xs = [X_a1, X_a2]
for a, X_a in enumerate(xs):
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(X_a)
    embeddings.append(embedding)
h = Concatenate()(embeddings)
h = Dense(1)(h)
Solution if you want to avoid tf.split
Another thing that works (and supports the explanation that the Lambda might be using the last value of a in your code for model_B) is putting the entire loop inside the Lambda layer; this way, a doesn't get any unexpected values:
#Architecture
X = Input(shape=(A,))
X_as = Lambda(lambda x: [x[:, a] for a in range(A)])(X)
embeddings = []
for a, X_a in enumerate(X_as):
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(X_a)
    embeddings.append(embedding)
h = Concatenate()(embeddings)
h = Dense(1)(h)
I believe the following is happening:
(1) When you do the initial "for loop" over the Lambda function, you are initializing the constant tensors that feed into the "strided_slice" operator, which extracts either the [:,0] or [:,1] correctly. Using the global variable "a" in the Lambda function is probably "risky" but works okay in this instance. Furthermore, I believe the function is stored in bytecode as "lambda x: x[:, a]", so it will look up whatever the value of "a" is at the time of evaluation. "a" could be anything, so it might be problematic in some cases.
(2) When you build the first model (model_A), the constant tensors are not reinitialized, so the lambda functions (strided_slice operator) has the correct values (0 and 1) which were initialized in the "for loop."
(3) When you build the second model (model_B), the constant tensors are reinitialized. However, at this time, the value of "a" is 1 (as stated in some of the other commentary), because that is its final value after the original "for loop." In fact, you can set a=0 just before defining model_B, and you'll actually get behavior in which both Lambdas extract [:,0] and feed it to the embedding layers. My speculation is that this difference in behavior is related to calling model_A(Y) in this case (whereas in the first model, you only specified the output layer "h" and didn't call the Model class to produce the output - a difference I believe was also suggested by other commentary).
I'll say that I verified this state of affairs by putting in some print statements in the file "frameworks/constant_op.py" during the operator initialization step and obtained debug statements with values and sequences consistent with what I stated above.
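If that late-binding reading is right, binding a as a default argument inside each Lambda should also avoid the problem (an untested sketch of the questioner's architecture with only the Lambda line changed):
# Architecture
X = Input(shape=(A,))
embeddings = []
for a in range(A):
    # a=a freezes the current index into the lambda instead of looking it up at call time
    X_a = Lambda(lambda x, a=a: x[:, a])(X)
    embedding = Embedding(input_dim=vocab_sizes[a], output_dim=1)(X_a)
    embeddings.append(embedding)
h = Concatenate()(embeddings)
h = Dense(1)(h)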
I hope this helps.
In Tensorflow 1.9, I want to create a network and then recursively feed the output (the prediction) of the network back into the input of the network. During this loop, I want to store the predictions made by the network in a list.
Here is my attempt:
# Define the number of steps over which to loop the network
num_steps = 5
# Define the network weights
weights_1 = np.random.uniform(0, 1, [1, 10]).astype(np.float32)
weights_2 = np.random.uniform(0, 1, [10, 1]).astype(np.float32)
# Create a variable to store the predictions, one for each loop
predictions = tf.Variable(np.zeros([num_steps, 1]), dtype=np.float32)
# Define the initial prediction to feed into the loop
initial_prediction = np.array([[0.1]], dtype=np.float32)
x = initial_prediction
# Loop through the predictions
for step_num in range(num_steps):
    x = tf.matmul(x, weights_1)
    x = tf.matmul(x, weights_2)
    predictions[step_num-1].assign(x)
# Define the final prediction
final_prediction = x
# Start a session
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Make the predictions
last_pred, all_preds = sess.run([final_prediction, predictions])
print(last_pred)
print(all_preds)
And this prints out:
[[48.8769]]
[[0.]
[0.]
[0.]
[0.]
[0.]]
So whilst the value of final_prediction appears correct, the value of predictions is not what I would expect. It seems that predictions is never actually assigned to, despite the line predictions[step_num-1].assign(x).
Please can somebody explain to me why this isn't working, and what I should be doing instead? Thanks!
This happens because assign is just a TF op like any other, and as such is only executed if needed. Since nothing on the path to final_prediction relies on the assign op, and predictions is just a variable, the assignment is never executed.
I think the most straightforward solution would be to replace the line
predictions[step_num-1].assign(x)
by
x = predictions[step_num-1].assign(x)
This works because assign also returns the value it is assigning. Now, to compute final_prediction TF actually needs to "go through" the assign op so the assignments should be carried out.
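A tiny self-contained illustration of that point (a sketch with made-up values): anything computed from the tensor returned by assign forces the assignment to run.
v = tf.Variable(0.0)
a = v.assign(5.0)          # a both performs the assignment and yields the assigned value
b = a + 1.0                # b depends on a, so evaluating b executes the assign

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(b))     # 6.0
    print(sess.run(v))     # 5.0 -- the assignment was carried out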
Another option would be to use tf.control_dependencies which is a way to "force" TF to compute specific ops when it is computing other ones. However in this case it could be a bit icky because the op we want to force (assign) depends on values that are being computed within the loop and I'm not sure about the order in which TF does stuff in this case. The following should work:
for step_num in range(num_steps):
    x = tf.matmul(x, weights_1)
    x = tf.matmul(x, weights_2)
    with tf.control_dependencies([predictions[step_num-1].assign(x)]):
        x = tf.identity(x)
We use tf.identity as a noop just to have something to wrap with control_dependencies. I think this is the more flexible option between the two. However it comes with some caveats discussed in the docs.
I created a function func that contains some variables. Now I want to use this function both standalone and through tf.map_fn, and I want to keep the same set of variables in both cases. But apparently tf.map_fn appends map to the current variable scope, and hence the variable scope of the standalone case no longer matches the tf.map_fn case. So the following code throws an error, because the variable mul1/map/weights does not exist before it is requested with reuse=True.
import tensorflow as tf
D = 5
batch_size = 1
def func(x):
    W = tf.get_variable(initializer=tf.constant_initializer(1), shape=[D, 1], dtype=tf.float32, trainable=True, name="weights")
    y = tf.matmul(x, W)
    return y
x = tf.placeholder(tf.float32, [batch_size, 5])
x_cat = tf.placeholder(tf.float32, [None, batch_size, 5])
with tf.variable_scope("mul1") as mul1_scope:
y_sum = func(x)
with tf.variable_scope(mul1_scope, reuse=True):
cost = tf.map_fn(lambda x: func(x), x_cat)
Here I want to run gradient updates only on the variables under the mul1/map scope. I could probably use tf.assign after every update to copy the values to the variables under the mul1 scope (which is used only for the feed-forward step), but that's a rather painful way to do variable sharing. So I was wondering if there is a better way to solve this. Any help would be much appreciated!
Say in tensorflow, I created a variable by
C = tf.Variable(tf.random_uniform([n_sample, n_sample], -1, 1), name='C'),
now I want to get a pointer to the first column of the variable. Is there any way I could do that? Would tf.slice(C, [0,0], [n_sample,1]) give me what I want, or will it just create another variable with the value stored in C?
The reason I want to do this is that my optimization function depends on both C and each column of C.
As far as I know, you can't really get access to the data itself (i.e. a pointer). The reasoning is that the code is meant to be data-agnostic, so that TensorFlow can move the data around to different CPUs or GPUs without you worrying about that part (you can specify which device to use, but that gets cumbersome).
So tf.slice would be the correct function to use.
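A quick check of that point (a sketch using the question's C and n_sample): the slice is re-evaluated from C's current value on every run, so it behaves like a view of the variable rather than a frozen copy.
first_col = tf.slice(C, [0, 0], [n_sample, 1])

sess = tf.Session()
sess.run(tf.initialize_all_variables())
before = sess.run(first_col)                        # the current first column of C
sess.run(C.assign(tf.zeros([n_sample, n_sample])))
after = sess.run(first_col)                         # now all zeros, reflecting the update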
You could do:
for i in range(n_sample):
    curr_slice = tf.slice(C, [0, i], [n_sample, 1])
    do_something(curr_slice)
This isn't the most efficient version, but it's what you asked for in the comments:
for i in range(n_sample):
    curr_slice = tf.slice(C, [0, i], [n_sample, 1])
    # y is assumed to be a scalar tf.Variable used as an accumulator
    y.assign_add(tf.nn.l2_loss(tf.sub(curr_slice, tf.matmul(X, curr_slice))) + lamb * tf.nn.l2_loss(curr_slice))
loss = tf.reduce_sum(y)
The vectorized approach is much cleaner:
loss = tf.nn.l2_loss(X - tf.matmul(X, C)) + lamb * tf.nn.l2_loss(C)
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
sess.run(train_step)
You might need to supply some of the values by creating placeholders.
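For example, a hypothetical sketch where the data matrix is fed through a placeholder (X_ph and X_data are made-up names, not from the answer above):
X_ph = tf.placeholder(tf.float32, [None, n_sample])
loss = tf.nn.l2_loss(X_ph - tf.matmul(X_ph, C)) + lamb * tf.nn.l2_loss(C)
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
sess.run(train_step, feed_dict={X_ph: X_data})  # X_data: a numpy array with n_sample columns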
Alternatively, I couldn't find this in skflow yet, but in scikit-learn it's a simple three-liner:
from sklearn.linear_model import Ridge
clf = Ridge(alpha=1.0)
clf.fit(X, W)