In my data set, every entry (event) has a weight. This weight is derived from several quantities, but it basically represents how important the event is for the data, and it must be accounted for.
How can I use these weights when training in TensorFlow? I don't want to simply use them as another feature.
Thanks
One simple solution is to multiply the computed cost for each example by its weight, before computing the overall cost for a mini-batch.
Let's say you have the following:
# Vector of features per example.
x = tf.placeholder(tf.float32, shape=[batch_size, num_features])
# Scalar weight per example.
x_weights = tf.placeholder(tf.float32, shape=[batch_size])
# Vector of outputs per example.
y = tf.placeholder(tf.float32, shape=[batch_size, num_outputs])
# ...
logits = ...
# Insert appropriate cost function here.
cost = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits)
The computed cost tensor is a vector of length batch_size. You can perform an element-wise multiplication with x_weights to get a weighted cost per example, then sum over the batch and divide by batch_size to get a scalar.
overall_cost = tf.reduce_sum(tf.multiply(cost, x_weights)) / batch_size
Finally you can use overall_cost as the value to minimize in your optimizer.
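To make the arithmetic concrete, here is a minimal sketch in plain Python (no TensorFlow) of the same idea: compute a cost per example, multiply it by that example's weight, then average over the batch. The helper names and the toy batch are hypothetical and only illustrate the calculation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot):
    """Per-example softmax cross-entropy against a one-hot label."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def weighted_batch_cost(batch_logits, batch_labels, weights):
    """Multiply each example's cost by its weight, then average over the batch."""
    costs = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
    return sum(c * w for c, w in zip(costs, weights)) / len(costs)

# Hypothetical toy batch: two examples, two classes each.
batch_logits = [[2.0, 0.0], [0.0, 2.0]]
batch_labels = [[1.0, 0.0], [1.0, 0.0]]  # the second example is misclassified
uniform = weighted_batch_cost(batch_logits, batch_labels, [1.0, 1.0])
upweighted = weighted_batch_cost(batch_logits, batch_labels, [1.0, 3.0])
```

Raising the weight of the misclassified example increases its share of the batch cost, so the optimizer is pushed harder to fix it. Dividing by the sum of the weights instead of the batch size is another possible normalization, depending on how the weights are scaled.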
I am performing multi-label image classification in PyTorch, and would like to compute the gradients of all outputs at ground truth labels for each input with respect to the input. I would preferably like to do this in a single backward pass for a batch of inputs.
For example:
inputs = torch.randn((4,3,224,224), requires_grad=True) # Batch of 4 inputs (requires_grad so input gradients can be computed)
targets = torch.tensor([[1,0,1],[1,0,0],[0,0,1],[1,1,0]]) # Labels for each input
outputs = model(inputs) # 4 x 3 vector
Here, I want to find the gradient of:
outputs[0,0] and outputs[0,2] with respect to inputs[0]
outputs[1,0] with respect to inputs[1]
outputs[2,2] with respect to inputs[2]
outputs[3,0] and outputs[3,1] with respect to inputs[3]
Is there any way to do this in a single backward pass?
If my targets were one-hot, i.e., there was only one label per input, I could use:
gt_classes = torch.where(targets==1)[1]
gather_outputs = torch.gather(outputs, 1, gt_classes.unsqueeze(-1))
grads = torch.autograd.grad(torch.unbind(gather_outputs), inputs)[0] # 4 x 3 x 224 x 224
This gives gradient of output[i,gt_classes[i]] with respect to input[i].
For my case, it looks like the is_grads_batched argument from torch.autograd.grad might be relevant, but it's not very clear how it is to be used.
I am handling a time-series dataset with n timesteps, m features and k objects.
As a result, my feature tensor has a shape of (n,k,m), while my targets have a shape of (n,m).
I want to predict the targets for every timestep and object, but with the same weights for every object. My loss function looks like this:
average_loss = loss_func(prediction, labels)
sum_loss = loss_func(sum(prediction), sum(labels))
loss = loss_weight * average_loss + (1-loss_weight) * sum_loss
My plan is to make sure not only that I predict every item as well as possible, but also that the sum of all items is predicted well. loss_weight is a constant.
Currently I am doing this kind of ugly solution:
features = local_batch.squeeze(dim = 0)
labels = torch.unsqueeze(local_labels.squeeze(dim = 0), 1)
prediction = net(features)
I set my batch size to 1 and squeeze that dimension away, so that the k objects become my batch.
My network looks like this:
import torch
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)  # hidden layer
        self.predict = torch.nn.Linear(n_hidden, n_output)  # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))  # activation function for hidden layer
        x = self.predict(x)  # linear output
        return x
How do I make sure I do a reasonable convolution over the object dimension, in order to keep the same weights for all objects, without committing to a batch size of 1? Also, how do I achieve the same loss function, where I compute the loss of the prediction sum against the target sum for every timestep?
It's not exactly ugly -- I would do the same but generalize it a bit for batch size >1 using view.
# Using your notation
n, k, m = local_batch.shape
features = local_batch.view(n * k, m)
# -1 lets view infer the network's output size per object
prediction = net(features).view(n, k, -1)
With the prediction back in shape (n, k, n_output), implementing your loss function should not be difficult.
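As a rough illustration of the loss from the question once predictions are arranged per timestep and per object, here is a plain-Python sketch. It assumes one scalar prediction per object and uses mean squared error as a stand-in for loss_func; both are assumptions, and the function names are hypothetical. In PyTorch the same steps would be tensor operations, for example summing over the object dimension with prediction.sum(dim=1).

```python
def mse(a, b):
    """Mean squared error between two equal-length flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(prediction, labels, loss_weight):
    """prediction, labels: nested lists of shape (n timesteps, k objects)."""
    # Per-item term: flatten (n, k) into n*k values.
    flat_pred = [p for row in prediction for p in row]
    flat_lab = [l for row in labels for l in row]
    average_loss = mse(flat_pred, flat_lab)
    # Sum term: add up the k objects within each timestep, then compare the sums.
    sum_loss = mse([sum(row) for row in prediction],
                   [sum(row) for row in labels])
    return loss_weight * average_loss + (1 - loss_weight) * sum_loss

# Hypothetical values: n = 2 timesteps, k = 2 objects.
labels = [[1.0, 2.0], [3.0, 4.0]]
prediction = [[2.0, 1.0], [3.0, 4.0]]  # the two errors cancel in the per-timestep sum
loss = combined_loss(prediction, labels, loss_weight=0.5)
```

In this toy case the per-item term is nonzero while the sum term is zero, which shows why the two terms carry different information.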
I have two tensors from which I am calculating Spearman's rank correlation, and I would like PyTorch to automatically adjust the values in these tensors so that the Spearman's rank correlation becomes as high as possible.
I have explored autograd but nothing I've found has explained it simply enough.
Initialized tensors:
a=Var(torch.randn(20,1),requires_grad=True)
psfm_s=Var(torch.randn(12,20),requires_grad=True)
How can I set up a loop that keeps adjusting the values in these two tensors to get the highest Spearman's rank correlation between the 2 lists I build from them, while having PyTorch do the work? I just need a pointer on where to go. Thank you!
I'm not familiar with Spearman's Rank Correlation, but if I understand your question you're asking how to use PyTorch to solve problems other than deep networks?
If that's the case then I'll provide a simple least squares example which I believe should be informative to your effort.
Consider a set of 200 measurements of 10 dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least squares approach dictates we can accomplish this by finding the matrix M and vector b which minimize |(y - (M x+b))²|
The following example code generates some example data and then uses PyTorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim
# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise
# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))
# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)
for i in range(500):
    # compute loss that we want to minimize
    y_hat = torch.matmul(M, x) + b
    loss = torch.mean((y - y_hat)**2)
    # zero the gradients of the parameters referenced by the optimizer (M and b)
    optimizer.zero_grad()
    # compute new gradients
    loss.backward()
    # update parameters M and b
    optimizer.step()
    if (i + 1) % 100 == 0:
        # scale learning rate by factor of 0.9 every 100 steps
        optimizer.param_groups[0]['lr'] *= 0.9
        print('step', i + 1, 'mse:', loss.item())
# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed-form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems that are not necessarily neural-network related. I also chose to explicitly define the matrix M and vector b here rather than using an equivalent nn.Linear layer, since I think that would just confuse things.
In your case you want to maximize something so make sure to negate your objective function before calling backward.
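The negation trick can be seen in a tiny plain-Python sketch. The objective below is hypothetical (it is not Spearman's rank correlation), and the hand-written gradient stands in for what loss.backward() would compute.

```python
def objective(x):
    """Hypothetical quantity to maximize; its peak is at x = 3."""
    return -(x - 3.0) ** 2

def grad_of_negated_objective(x):
    """Analytic gradient of loss(x) = -objective(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)

x = 0.0
lr = 0.1
for _ in range(200):
    # Gradient descent on the negated objective is gradient ascent on the objective.
    x -= lr * grad_of_negated_objective(x)
```

After the loop, x has moved from 0 toward the maximizer at 3, even though every update was an ordinary minimization step.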
I'm following this tutorial for tensorflow:
It describes the implementation of the cross entropy function as:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
First, tf.log computes the logarithm of each element of y. Next, we
multiply each element of y_ with the corresponding element of
tf.log(y). Then tf.reduce_sum adds the elements in the second
dimension of y, due to the reduction_indices=1 parameter. Finally,
tf.reduce_mean computes the mean over all the examples in the batch.
It is my understanding, from reading the tutorial, that both the actual and predicted values of y are 2D tensors. There is one row per MNIST vector used, and each vector has size 784, which gives the number of columns.
The quote above says that "we multiply each element of y_ with the corresponding element of tf.log(y)".
My question is: are we doing traditional matrix multiplication here, i.e., row times column? The sentence suggests that we are not.
The traditional matrix multiplication is only used when calculating the model hypothesis as seen in the code to multiply x by W:
y = tf.nn.softmax(tf.matmul(x, W) + b)
The code y_ * tf.log(y) in the code block:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y),
reduction_indices=[1]))
performs an element-wise multiplication of the original targets => y_ with the log of the predicted targets => y.
The cross-entropy loss measures how far the predicted class probabilities are from the true labels: it is small when the model assigns high probability to the correct class.
It is this measure that the optimizer (gradient descent is a popular example) minimizes in order to find the values of W and b that improve the performance of the classifier. We say the loss is minimized because the lower the loss, the better the model fits the training data.
We are doing element-wise multiplication here: y_ * tf.log(y)
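A small worked example in plain Python may help show the difference from matrix multiplication. The batch values are hypothetical; the three steps mirror the element-wise product, the reduce_sum over the class axis, and the reduce_mean over the batch.

```python
import math

# Hypothetical batch of 2 examples over 3 classes.
y_true = [[0.0, 1.0, 0.0],
          [1.0, 0.0, 0.0]]  # one-hot targets (y_ in the tutorial)
y_pred = [[0.2, 0.7, 0.1],
          [0.5, 0.3, 0.2]]  # predicted probabilities (y)

# Element-wise product: entry (i, j) of y_true times entry (i, j) of log(y_pred).
# The result keeps the 2 x 3 shape of the inputs; no row is combined with a column.
elementwise = [[t * math.log(p) for t, p in zip(t_row, p_row)]
               for t_row, p_row in zip(y_true, y_pred)]

# Sum over the class axis and negate: one loss value per example.
per_example = [-sum(row) for row in elementwise]

# Mean over the batch: a single scalar.
cross_entropy = sum(per_example) / len(per_example)
```

Because the targets are one-hot, each per-example value reduces to the negative log of the probability assigned to the correct class.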
I'm looking at the policy gradients sample in this notebook: https://github.com/ageron/handson-ml/blob/master/16_reinforcement_learning.ipynb
The relevant code is here:
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs)
outputs = tf.nn.sigmoid(logits) # probability of action 0 (left)
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
y = 1. - tf.to_float(action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)
...
# Run training over a bunch of instances of inputs
for step in range(n_max_steps):
    action_val, gradients_val = sess.run([action, gradients], feed_dict={X: obs.reshape(1, n_inputs)})
    ...
# Then weight each gradient by the action values, average, and feed them back into training_op to apply_gradients()
The above works fine, as each run() returns different gradients.
I'd like to batch all this and feed an array of inputs into run() instead of one input at a time (my environment is different from the one in the sample, so it makes sense for me to batch and improve performance), i.e.:
action_val, gradients_val = sess.run([action, gradients], feed_dict={X: obs_array})
Where obs_array has shape [n_instances, n_inputs].
The problem is that optimizer.compute_gradients(cross_entropy) seems to return a single gradient, even though cross_entropy is a 1d tensor of shape [None, 1]. action_val does return a 1d tensor of actions, as expected - one action per instance in the batch.
Is there any way for me to get an array of gradients, one per instance in the batch?
The problem is that optimizer.compute_gradients(cross_entropy) seems to return a single gradient, even though cross_entropy is a 1d tensor of shape [None, 1].
That happens by design, as the gradient terms for each tensor are automatically aggregated. Gradient computation operations such as optimizer.compute_gradients and the low-level primitive tf.gradients make a sum of all gradient operations, according to the default AddN aggregation method. This is fine for most cases of stochastic gradient descent.
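The effect of this aggregation can be illustrated with a one-parameter model in plain Python. The model, data, and function names are hypothetical; the point is only that the gradient of a summed loss is the sum of the per-instance gradients, so the individual values are no longer visible in the result.

```python
def per_example_grads(w, xs, ys):
    """d/dw of (w*x - y)^2, computed separately for each example."""
    return [2.0 * x * (w * x - y) for x, y in zip(xs, ys)]

def total_loss(w, xs, ys):
    """Sum of the per-example losses."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical one-parameter model and a batch of three examples.
w = 0.5
xs = [1.0, 2.0, 3.0]
ys = [1.0, 1.0, 1.0]

individual = per_example_grads(w, xs, ys)  # one gradient per instance
aggregated = sum(individual)               # what a summed gradient reports

# Central finite difference of the summed loss, as an independent check.
eps = 1e-6
numeric = (total_loss(w + eps, xs, ys) - total_loss(w - eps, xs, ys)) / (2 * eps)
```

Here the three per-instance gradients differ in sign and size, yet only their sum is recovered when differentiating the summed loss.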
Unfortunately, this means the gradient computation has to be made over a single batch, unless a custom gradient function is built or the TensorFlow API is extended to provide gradient computation without full aggregation. Changing the implementation of tf.gradients to do this does not seem trivial.
One trick that you might wish to employ for your reinforcement learning model is to perform multiple session runs in parallel. According to the FAQ, the Session API supports multiple concurrent steps, and will take advantage of the existing resources for parallel computation. The question Asynchronous computation in TensorFlow shows how to do this.
One weak solution I came up with is to create an array of gradient operations, one per instance in the batch, which I can then run all at the same time:
X = tf.placeholder(tf.float32, shape=[minibatch_size, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
hidden2 = tf.layers.dense(hidden, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden2, n_outputs)
outputs = tf.nn.sigmoid(logits) # probability of action 0
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
y = 1. - tf.to_float(action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
# Calculate gradients per batch instance - for minibatch training
batch_gradients = []
for instance_cross_entropy in tf.unstack(cross_entropy):
    instance_grads_and_vars = optimizer.compute_gradients(instance_cross_entropy)
    instance_gradients = [grad for grad, variable in instance_grads_and_vars]
    batch_gradients.append(instance_gradients)
# Calculate gradients for just one instance - for single instance training
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
# Create gradient placeholders
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
# In the end we only apply a single set of averaged gradients
training_op = optimizer.apply_gradients(grads_and_vars_feed)
...
while step < len(obs_array) - minibatch_size:
    action_array, batch_gradients_array = sess.run([action, batch_gradients], feed_dict={X: obs_array[step:step+minibatch_size]})
    for action_val, gradient in zip(action_array, batch_gradients_array):
        action_vals.append(action_val)
        current_gradients.append(gradient)
    step += minibatch_size
The main point is that I need to specify the batch size for the placeholder X. I can't leave it open-ended, otherwise unstack has no idea how many elements to unstack. I unstack cross_entropy to get the cross-entropy per instance, then call compute_gradients for each instance. During training I call run([action, batch_gradients], feed_dict={X: obs_array[step:step+minibatch_size]}), which gives me a separate set of gradients for each instance in the batch.
This is all well and good, but it doesn't give me much of a performance boost. I only get a max speedup of 2x. Increasing the batch size past 5 just scales the runtime of run() linearly, and gives no gain.
It's sad that TensorFlow can calculate and aggregate gradients over hundreds of instances blazingly fast, but requesting the gradients one by one is so much slower. I might need to dig into the source next...