I have a TensorFlow 2.x model whose purpose is to dynamically choose a computational path. (A schematic drawing of the model accompanied the original question.)
The only trainable block is the Decision Module (DM), which is essentially a fully connected layer with a single binary output (0 or 1; it is made differentiable using a technique called Improved Semantic Hashing). Nets A and B share the same network architecture.
During training, I feed a batch of images forward up to the DM's output, and then process the decision image by image, routing each image to the chosen net (A or B). The per-image predictions are concatenated into a single tensor, which is used to evaluate the performance. Here's the training code (sigma is the output of the DM; model comprises the feature extractor and the DM):
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

# @tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        # training=True is only needed if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        _, sigma = model(images, training=True)
        out = []
        for img, s in zip(images, sigma):
            if s == 0:
                o = binary_classifier_model_a(tf.expand_dims(img, axis=0), training=False)
            else:
                o = binary_classifier_model_b(tf.expand_dims(img, axis=0), training=False)
            out.append(o)
        predictions = tf.concat(out, axis=0)
        loss = loss_object(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(labels, predictions)
The problem: when running this code, gradients comes back as [None, None].
What I know so far:
- The first part of the model (up to the DM's output) is differentiable; I tested this by running only that section, applying a loss function (MSE), and calling tape.gradient, and I got actual gradients.
- I tried choosing a single (constant) net, e.g. net A, and simply multiplying its output by s (which is either 0 or 1) instead of the if-else block. In this case I also got gradients (a sketch of that test follows below).
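For reference, here is roughly what that second test looks like (a sketch; I'm assuming the shapes broadcast):

with tf.GradientTape() as tape:
    _, sigma = model(images, training=True)
    # constant path through net A; sigma is the only thing the DM contributes
    predictions = sigma * binary_classifier_model_a(images, training=False)
    loss = loss_object(labels, predictions)

grads = tape.gradient(loss, model.trainable_variables)  # actual gradients here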
My concern is that such a thing might not be possible; quoting from the official docs:
x = tf.constant(1.0)
v0 = tf.Variable(2.0)
v1 = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    if x > 0.0:
        result = v0
    else:
        result = v1**2
Depending on the value of x in the above example, the tape either
records result = v0 or result = v1**2. The gradient with respect to
x is always None.
dx = tape.gradient(result, x)
print(dx)
>> None
I'm not 100% sure that this is my case, but I wanted to ask here for the experts' opinion.
Is what I'm trying to do possible? And if yes - what should I change in order for this to work?
Thanks
You correctly identified the issue. The control statement of the conditional is not differentiable, so you lose your link to the model variables that produced sigma.
In your case, because you state that sigma is either 1 or 0, you can use the value of sigma as a mask and skip the conditional statement (and even the loop):
with tf.GradientTape() as tape:
    _, sigma = model(images, training=True)
    predictions = (1.0 - sigma) * binary_classifier_model_a(images, training=False) \
                  + sigma * binary_classifier_model_b(images, training=False)
    loss = loss_object(labels, predictions)
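One shape caveat (an assumption on my part, since the exact shapes aren't shown): for the mask to apply per image, sigma has to broadcast against the classifiers' outputs, e.g.:

# Hypothetical shapes: sigma comes out as (batch,) while each classifier
# returns (batch, n_classes); reshape so the mask broadcasts per image.
sigma = tf.reshape(sigma, (-1, 1))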
It seems the solution to your problem is to use control-flow operations. Try tf.where. You can implement your condition by doing something like this:
a = tf.constant([1, 1])
b = tf.constant([2, 2])
p = tf.constant([True, False])
x = tf.where(p, a + b, a * b)
For more information, refer to the tf.where documentation.
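Applied to the question's setup, a sketch could look like the following (assuming both nets accept the full batch and sigma has shape (batch, 1)). Note that tf.where only propagates gradients through the selected values, not through the boolean condition, so on its own this does not restore the gradient to sigma; the masking trick in the answer above does.

pred_a = binary_classifier_model_a(images, training=False)
pred_b = binary_classifier_model_b(images, training=False)
# Per-image selection; the (batch, 1) condition broadcasts across the class axis in TF 2.x
predictions = tf.where(tf.equal(sigma, 0.0), pred_a, pred_b)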
I'm puzzled by the fact that I've seen blocks of code that require tf.GradientTape().watch() to work and blocks that seem to work without it.
For example, this block of code requires the watch() function:
x = tf.ones((2,2))  # Note: in the original posting of the question,
                    # I didn't include this key line.
with tf.GradientTape() as t:
    # Record the actions performed on tensor x with `watch`
    t.watch(x)
    # Define y as the sum of the elements in x
    y = tf.reduce_sum(x)
    # Let z be the square of y
    z = tf.square(y)

# Get the derivative of z wrt the original input tensor x
dz_dx = t.gradient(z, x)
But, this block does not:
with tf.GradientTape() as tape:
    logits = model(images, training=True)
    loss_value = loss_object(labels, logits)

loss_history.append(loss_value.numpy().mean())
grads = tape.gradient(loss_value, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
What is the difference between these two cases?
The documentation of watch says the following:
Ensures that tensor is being traced by this tape.
Any trainable variable that is accessed within the context of the tape is watched by default. This means we can calculate the gradient with respect to that trainable variable by calling t.gradient(loss, variable). Check the following example:
def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets, training=True)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
Hence there is no need to use tape.watch in the above code. But sometimes we need to calculate gradients with respect to a non-trainable tensor. In those cases we need to watch it:
with tf.GradientTape() as t:
    t.watch(images)
    predictions = cnn_model(images)
    loss = tf.keras.losses.categorical_crossentropy(expected_class_output, predictions)

gradients = t.gradient(loss, images)
In the above code, images is the input to the model, not a trainable variable. I need to calculate the gradient of the loss with respect to the images, and hence I need to watch them.
Although the answer and comment given above were helpful, I think this code most clearly illustrates what I needed to know.
# Define a 2x2 array of 1's
x = tf.ones((2,2))
x1 = tf.Variable([[1,1],[1,1]], name='x1', dtype=tf.float32)

with tf.GradientTape(persistent=True) as t:
    # Define y as the sum of the elements in x and x1
    y = tf.reduce_sum(x) + tf.reduce_sum(x1)
    # Let z be the square of y
    z = tf.square(y)

# Get the derivative of z wrt x (a plain tensor) and x1 (a Variable)
dz_dx = t.gradient(z, x)
dz_dx1 = t.gradient(z, x1)

# Print results
print("dz_dx =", dz_dx)
print("dz_dx1 =", dz_dx1)
print("type(x) =", type(x))
print("type(x1) =", type(x1))
which gives,
dz_dx = None
dz_dx1 = tf.Tensor(
[[16. 16.]
[16. 16.]], shape=(2, 2), dtype=float32)
type(x) = <class 'tensorflow.python.framework.ops.EagerTensor'>
type(x1) = <class 'tensorflow.python.ops.resource_variable_ops.ResourceVariable'>
In TensorFlow, all Variables are given the property trainable=True by default, so they are automatically watched. I neglected (out of ignorance) to indicate originally that x was not a Variable but an EagerTensor, which is not watched by default.
model.trainable_variables is a list of trainable variables, which are watched by gradient tape, whereas a regular Tensor is not watched without specifically telling the tape to watch it.
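For completeness, a minimal sketch (my addition) showing that watching the EagerTensor is all that's missing in that case:

x = tf.ones((2, 2))
with tf.GradientTape() as t:
    t.watch(x)  # explicitly trace the non-variable EagerTensor
    z = tf.square(tf.reduce_sum(x))
print(t.gradient(z, x))  # a tensor of 8.0s instead of None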
I am new to TensorFlow. In a textbook example I saw the following code, designed to train a simple linear model using the TensorFlow 2.x API:
m = tf.Variable(0.)
b = tf.Variable(0.)

def predict_y_value(x):
    y = m * x + b
    return y

def squared_error(y_pred, y_true):
    return tf.reduce_mean(tf.square(y_pred - y_true))

learning_rate = 0.05
steps = 500

for i in range(steps):
    with tf.GradientTape() as tape:
        predictions = predict_y_value(x_train)
        loss = squared_error(predictions, y_train)
    gradients = tape.gradient(loss, [m, b])
    m.assign_sub(gradients[0] * learning_rate)
    b.assign_sub(gradients[1] * learning_rate)

print("m: %f, b: %f" % (m.numpy(), b.numpy()))
Why is it necessary to include the computation of the loss inside the with tf.GradientTape() as tape block, while the line gradients = tape.gradient(loss, [m, b]) is outside the with block?
I understand that this may be Python-specific, but the construction seems unclear to me. What is the role of the Python context manager here?
From tensorflow docs,
By default GradientTape will automatically watch any trainable variables that are accessed inside the context.
Intuitively, this approach enhances flexibility A LOT. For example, it allows you to write (pseudo)code as follows:
inputs, labels = get_training_batch()
inputs_preprocessed = some_tf_ops(inputs)

with tf.GradientTape() as tape:
    pred = model(inputs_preprocessed)
    loss = compute_loss(labels, pred)

grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))

# For example, let's attach a model that takes the above model's output as input
next_step_inputs, next_step_labels = process(pred)

with tf.GradientTape() as tape:
    pred = another_model(next_step_inputs)
    another_loss = compute_loss(next_step_labels, pred)

grads = tape.gradient(another_loss, another_model.trainable_variables)
optimizer.apply_gradients(zip(grads, another_model.trainable_variables))
The above example may look complicated, but it illustrates situations that require extreme flexibility:
- We don't want some_tf_ops and process to play a role in the gradient computation, since they are just preprocessing steps.
- We want to compute gradients for multiple models that are related to each other.
One practical example would be training GANs, although simpler implementations are possible.
Putting tape.gradient outside the GradientTape context exits the recording context, and calling it releases the tape's resources for the garbage collector.
Note 2 equivalent examples:
with tf.GradientTape() as t:
    loss = loss_fn()
with tf.GradientTape() as t:
    loss += other_loss_fn()
t.gradient(loss, ...)  # Only differentiates other_loss_fn, not loss_fn
The following is equivalent to the above
with tf.GradientTape() as t:
    loss = loss_fn()
    t.reset()
    loss += other_loss_fn()
t.gradient(loss, ...)  # Only differentiates other_loss_fn, not loss_fn
(Source: the tf.GradientTape.reset documentation.)
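A related aside (my addition, standard GradientTape behavior): a non-persistent tape releases its resources as soon as gradient() is called once; pass persistent=True if you need several calls.

x = tf.Variable(3.0)
with tf.GradientTape(persistent=True) as t:
    y = x * x  # x**2
    z = y * y  # x**4
print(t.gradient(y, x))  # 6.0
print(t.gradient(z, x))  # 108.0
del t  # drop the reference so the resources can be freed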
I have the following function
def msfe(ys, ts):
    ys = ys.detach().numpy()  # output from the network
    ts = ts.detach().numpy()  # targets (true labels)
    pred_class = (ys >= 0.5)
    n_0 = sum(ts == 0)  # number of negative examples
    n_1 = sum(ts == 1)  # number of positive examples
    FPE = sum((ts == 0)[[bool(p) for p in (pred_class == 1)]]) / n_0  # false positive error
    FNE = sum((ts == 1)[[bool(p) for p in (pred_class == 0)]]) / n_1  # false negative error
    loss = FPE**2 + FNE**2
    loss = torch.tensor(loss, dtype=torch.float64, requires_grad=True)
    return loss
and I wonder whether PyTorch's autograd works properly here, since ys and ts do not have the grad flag.
So my question is: do all of the variables (FPE, FNE, ys, ts, n_1, n_0) have to be tensors before optimizer.step() works, or is it okay that only the final value (loss) is?
All of the values you want to optimise via optimizer.step() need to have gradient.
In your case it would be y predicted by the network, so you shouldn't detach it (from the graph).
Usually you don't change your targets, so those don't need gradients. You shouldn't have to detach them either; tensors don't require gradient by default and won't be backpropagated through.
Loss will have gradient if at least one of its ingredients has gradient.
Overall, you rarely need to take care of this manually.
BTW: don't use numpy with PyTorch; there is rarely a case to do so. You can perform most of the operations you would do on a numpy array on a PyTorch tensor.
BTW 2: there is no such thing as Variable in PyTorch anymore, only tensors that require gradient and those that don't.
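For illustration (a minimal sketch of the post-0.4 style):

t = torch.randn(3, requires_grad=True)  # a plain tensor that tracks gradients
loss = (t ** 2).sum()
loss.backward()
print(t.grad)  # populated; no Variable wrapper needed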
Non-differentiability
1.1 Problems with existing code
Indeed, you are using functions which are not differentiable (namely >= and ==). Those will give you trouble only in the case of your outputs, as those require gradient (you can use == and >= on targets though).
Below I have attached your loss function and outlined problems in it in the comments:
# Gradient can't propagate if you detach and work in another framework.
# Most Python constructs should be fine; detaching will ruin it though.
def msfe(outputs, targets):
    # outputs = outputs.detach().numpy()  # Do not detach, no need to do that
    # targets = targets.detach().numpy()  # No need for numpy either
    pred_class = outputs >= 0.5  # This one is non-differentiable

    # n_0 = sum(targets == 0)  # Do not use sum, there is a pytorch function for that
    # n_1 = sum(targets == 1)
    n_0 = torch.sum(targets == 0)  # Those are not differentiable, but...
    n_1 = torch.sum(targets == 1)  # it does not matter, as those are targets

    # FPE = sum((targets==0)[[bool(p) for p in (pred_class==1)]])/n_0  # Do not use Python bools
    # FNE = sum((targets==1)[[bool(p) for p in (pred_class==0)]])/n_1  # Stay within PyTorch

    # Those two below are non-differentiable due to the == sign as well
    FPE = torch.sum((targets == 0.0) * (pred_class == 1.0)).float() / n_0
    FNE = torch.sum((targets == 1.0) * (pred_class == 0.0)).float() / n_1

    # This is obviously fine
    loss = FPE ** 2 + FNE ** 2

    # Loss should be a tensor already, don't do things like that:
    # gradient will not be propagated, you get a new leaf tensor
    # whose gradient is always 1 and that's all.
    # loss = torch.tensor(loss, dtype=torch.float64, requires_grad=True)
    return loss
1.2 Possible solution
So you need to get rid of the three non-differentiable parts. You could in principle approximate them with the continuous outputs of your network (provided you are using sigmoid as the activation). Here is my take:
def msfe_approximation(outputs, targets):
    n_0 = torch.sum(targets == 0)  # gradient does not flow through it, that's okay
    n_1 = torch.sum(targets == 1)  # same as above
    FPE = torch.sum((targets == 0) * outputs).float() / n_0
    FNE = torch.sum((targets == 1) * (1 - outputs)).float() / n_1
    return FPE ** 2 + FNE ** 2
Notice that to minimize FPE, the outputs will be pushed toward zero at the indices where the targets are zero. Similarly for FNE: if a target is 1, the network will try to output 1 there as well.
Notice the similarity of this idea to BCELoss (binary cross-entropy); a comparison is sketched below.
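For comparison, a sketch of the built-in alternative (my addition, not part of the original reasoning):

bce = torch.nn.BCELoss()
loss = bce(outputs, targets.float())  # expects sigmoid outputs in [0, 1] and float targets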
And lastly, an example you can run, just as a sanity check:
if __name__ == "__main__":
    model = torch.nn.Sequential(
        torch.nn.Linear(30, 100),
        torch.nn.ReLU(),
        torch.nn.Linear(100, 200),
        torch.nn.ReLU(),
        torch.nn.Linear(200, 1),
        torch.nn.Sigmoid(),
    )
    optimizer = torch.optim.Adam(model.parameters())
    targets = torch.randint(high=2, size=(64, 1))  # random targets
    inputs = torch.rand(64, 30)  # random data
    for _ in range(1000):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = msfe_approximation(outputs, targets)
        print(loss)
        loss.backward()
        optimizer.step()

    print(((model(inputs) >= 0.5) == targets).float().mean())
I am trying to implement a Siamese network with a ranking loss between two images. If I define my own loss, would I be able to do the backpropagation step as follows? When I run it, it sometimes seems to give the same results that the single network gives.
with torch.set_grad_enabled(phase == 'train'):
    outputs1 = model(inputs1)
    outputs2 = model(inputs2)
    preds1 = outputs1
    preds2 = outputs2

    alpha = 0.02
    w_r = torch.tensor(1).cuda(async=True)

    y_i, y_j, predy_i, predy_j = labels1, labels2, outputs1, outputs2
    batchRankLoss = torch.tensor([max(0, alpha - delta(y_i[i], y_j[i]) * predy_i[i] - predy_j[i])
                                  for i in range(batchSize)], dtype=torch.float)
    rankLossPrev = torch.mean(batchRankLoss)
    rankLoss = Variable(rankLossPrev, requires_grad=True)

    loss1 = criterion(outputs1, labels1)
    loss2 = criterion(outputs2, labels2)

    # total loss = loss1 + loss2 + w_r * rankLoss
    totalLoss = torch.add(loss1, loss2)
    w_r = w_r.type(torch.LongTensor)
    rankLossPrev = rankLossPrev.type(torch.LongTensor)
    mult = torch.mul(w_r.type(torch.LongTensor), rankLossPrev).type(torch.FloatTensor)
    totalLoss = torch.add(totalLoss, mult.cuda(async=True))

    # backward + optimize only if in training phase
    if phase == 'train':
        totalLoss.backward()
        optimizer.step()

    running_loss += totalLoss.item() * inputs1.size(0)
You have several lines where you generate new tensors from a constructor or via a cast to another data type. When you do this, you disconnect the chain of operations through which you'd like the backward() command to differentiate.
This cast disconnects the graph because casting is non-differentiable:
w_r = w_r.type(torch.LongTensor)
Building a Tensor from a constructor will disconnect the graph:
batchRankLoss = torch.tensor([max(0, alpha - delta(y_i[i], y_j[i]) * predy_i[i] - predy_j[i])
                              for i in range(batchSize)], dtype=torch.float)
From the docs, wrapping a Tensor in a Variable will set the grad_fn to None (also disconnecting the graph):
rankLoss = Variable(rankLossPrev,requires_grad=True)
Assuming that your criterion function is differentiable, gradients currently flow backward only through loss1 and loss2. Your other gradients will only flow as far as mult before they are stopped by the call to type(). This is consistent with your observation that your custom loss doesn't change the output of your neural network.
To allow gradients to flow backward through your custom loss, you'll have to code the same logic while avoiding type() casts, and calculate rank_loss without using a list comprehension, e.g.:
rank_loss = torch.mean(torch.clamp(alpha - delta(y_i, y_j) * predy_i - predy_j, min=0))  # assumes delta() works on the full batch
w_r = 1.0
loss1 = criterion(outputs1, labels1)
loss2 = criterion(outputs2, labels2)
total_loss = loss1 + loss2 + w_r * rank_loss
if phase == 'train':
    total_loss.backward()
    optimizer.step()
You don't have to create a tensor over and over again. If you have different weights for each loss and weights are just constants, you can simply write:
total_loss = weight_1 * loss1 + weight_2 * loss2 + weight_3 * rank_loss
That weight is an untrainable constant anyway; it makes no sense to create a Variable and set requires_grad=True, because the weights are just constants.
Please upgrade to PyTorch 0.4.1, in which you don't have to wrap everything in Variable.
I've written my first TensorFlow program (using my own data). It works well, at least it doesn't crash! But I'm getting weird accuracy values of either 0 or 1.
.................................
the previous part of the code is only about handling the CSV file and getting the data into the correct format/shapes
......................................................
# Tensorflow
x = tf.placeholder(tf.float32, [None, len(Training_Data[0])], name='Train_data')  # each input has length 457
y_ = tf.placeholder(tf.float32, [None, numberOFClasses], name='Labels')
#w = tf.Variable(tf.zeros([len(Training_Data[0]), numberOFClasses]), name='Weights')
w = tf.Variable(tf.truncated_normal([len(Training_Data[0]), numberOFClasses], stddev=1./10), name='Weights')
b = tf.Variable(tf.zeros([numberOFClasses]), name='Biases')

model = tf.add(tf.matmul(x, w), b)
y = tf.nn.softmax(model)

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
#cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for j in range(len(train_data)):
    if np.shape(train_data) == (batchSize, numberOFClasses):
        sess.run(train_step, feed_dict={x: train_data[j],
                                        y_: np.reshape(train_labels[j], (batchSize, numberOFClasses))})

correct_prediction = tf.equal(tf.arg_max(y, 1), tf.arg_max(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

accuracy_vector = []
current_class = []
for i in range(len(Testing_Data)):
    if np.shape(Testing_Labels[i]) == (numberOFClasses,):
        accuracy_vector.append(sess.run(accuracy,
                                        feed_dict={x: np.reshape(Testing_Data[i], (1, 457)),
                                                   y_: np.reshape(Testing_Labels[i], (1, 19))}))
        current_class.append(int(Test_Raw[i][-1]))
Plotting accuracy_vector delivers the following (plot omitted; the values are all exactly 0 or 1).
Any idea what I'm missing here?
Thanks a lot for any hint!
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
tf.nn.softmax_cross_entropy_with_logits wants unscaled logits.
From the doc:
WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results.
This means that the line y = tf.nn.softmax(model) is wrong.
Instead, you want to pass unscaled logits to that function, thus:
y = model
Moreover, once you fix this problem, if the network doesn't learn, try lowering the learning rate from 0.01 to something around 1e-3 or 1e-4 (1e-2 is usually a "high" learning rate).
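Putting both suggestions together, the relevant lines would become something like this (a sketch, reusing the variable names from the question):

logits = tf.add(tf.matmul(x, w), b)  # unscaled logits go into the loss
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
train_step = tf.train.GradientDescentOptimizer(1e-3).minimize(cross_entropy)
# Apply tf.nn.softmax(logits) separately if you need probabilities for readout.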
You're testing on batches of size 1, so each prediction is either right or wrong, and the accuracy can only be 0 or 1:

accuracy_vector.append(sess.run(accuracy,
                                feed_dict={x: np.reshape(Testing_Data[i], (1, 457)),
                                           y_: np.reshape(Testing_Labels[i], (1, 19))}))
Just use a bigger batch size :
accuracy_vector.append(sess.run(accuracy,
                                feed_dict={x: np.reshape(Testing_Data[i:i+batch_size], (batch_size, 457)),
                                           y_: np.reshape(Testing_Labels[i:i+batch_size], (batch_size, 19))}))