Considering the example code,
I would like to know how to apply gradient clipping to this RNN, where there is a possibility of exploding gradients.
tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)
This is an example that could be used, but where do I introduce it?
In the definition of the RNN?
lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
# Split data because rnn cell needs a list of inputs for the RNN inner loop
_X = tf.split(0, n_steps, _X) # n_steps
tf.clip_by_value(_X, -1, 1, name=None)
But this doesn't make sense, as the tensor _X is the input, not the gradient, which is what should be clipped.
Do I have to define my own Optimizer for this or is there a simpler option?
Gradient clipping needs to happen after computing the gradients, but before applying them to update the model's parameters. In your example, both of those things are handled by the AdamOptimizer.minimize() method.
In order to clip your gradients you'll need to explicitly compute, clip, and apply them as described in this section in TensorFlow's API documentation. Specifically you'll need to substitute the call to the minimize() method with something like the following:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)
Despite what seems to be popular, you probably want to clip the whole gradient by its global norm:
optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))
Clipping each gradient matrix individually changes their relative scale but is also possible:
optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))
In TensorFlow 2, a tape computes the gradients, the optimizers come from Keras, and we don't need to store the update op because it runs automatically without passing it to a session:
optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
    loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))
It's easy for tf.keras!
optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)
This optimizer will clip all gradients to values between [-1.0, 1.0].
See the docs.
This is actually explained properly in the documentation:
Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them you can instead use the optimizer in three steps:
1. Compute the gradients with compute_gradients().
2. Process the gradients as you wish.
3. Apply the processed gradients with apply_gradients().
And in the example they provide they use these 3 steps:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
Here MyCapper is any function that caps your gradient. The list of useful functions (other than tf.clip_by_value()) is here.
For those who would like to understand the idea of gradient clipping (by norm):
Whenever the gradient norm is greater than a particular threshold, we clip the gradient norm so that it stays within the threshold. This threshold is sometimes set to 5.
Let the gradient be g and the max_norm_threshold be j.
Now, if ||g|| > j, we do:
g = (j * g) / ||g||
This is what tf.clip_by_norm implements.
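As a quick sanity check, here is that rule written out in plain NumPy; this is just a sketch of the idea, not TensorFlow's actual implementation:
import numpy as np

def clip_by_norm(g, j):
    # scale g so that its L2 norm does not exceed the threshold j
    norm = np.linalg.norm(g)
    return g if norm <= j else (j * g) / norm

g = np.array([3.0, 4.0])       # ||g|| = 5
print(clip_by_norm(g, 5.0))    # unchanged: [3. 4.]
print(clip_by_norm(g, 1.0))    # rescaled to norm 1: [0.6 0.8]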
IMO the best solution is wrapping your optimizer with TF's estimator decorator tf.contrib.estimator.clip_gradients_by_norm:
original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)
This way you only have to define this once, and not run it after every gradients calculation.
Documentation:
https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm
Gradient clipping basically helps in the case of exploding gradients. If your loss is very high, exponentially large gradients can flow through the network and produce NaN values. To overcome this we clip the gradients within a specific range (-1 to 1, or any range that fits your conditions).
clipped_grads_and_vars = [(tf.clip_by_value(grad, -clip_range, clip_range), var) for grad, var in grads_and_vars]
where grads_and_vars are the pairs of gradients (which you calculate via optimizer.compute_gradients) and the variables they will be applied to, and clip_range is whatever bound you choose.
After clipping we simply apply the result using the optimizer:
optimizer.apply_gradients(clipped_grads_and_vars)
Method 1
If you are training your model using a custom training loop, then one update step will look like this:
'''
for loop over full dataset
X -> training samples
y -> labels
'''
optimizer = tf.keras.optimizers.Adam()
for x, y in train_Data:
    with tf.GradientTape() as tape:
        prob = model(x, training=True)
        # calculate loss
        train_loss_value = loss_fn(y, prob)
    # get gradients
    gradients = tape.gradient(train_loss_value, model.trainable_weights)
    # clip gradients if you want to clip by norm
    gradients = [(tf.clip_by_norm(grad, clip_norm=1.0)) for grad in gradients]
    # clip gradients via values
    gradients = [(tf.clip_by_value(grad, clip_value_min=-1.0, clip_value_max=1.0)) for grad in gradients]
    # apply gradients
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))
Method 2
Or you could simply replace the first line in the above code as below:
# for clipping by norm
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
# for clipping by value
optimizer = tf.keras.optimizers.Adam(clipvalue=0.5)
The second method will also work if you are using the model.compile -> model.fit pipeline.
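For instance, a minimal compile/fit sketch of that pipeline (the model, loss and data names here are placeholders assumed for illustration):
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
model.compile(
    optimizer=tf.keras.optimizers.Adam(clipnorm=1.0),  # each gradient is clipped to norm 1.0 before the update
    loss="sparse_categorical_crossentropy",
)
# model.fit(x_train, y_train, epochs=5)  # x_train / y_train are placeholders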
Related
In the train_step function of a Keras Functional Model, we have:
def train_step(self, data):
    # Unpack the data. Its structure depends on your model and
    # on what you pass to `fit()`.
    x, y = data

    with tf.GradientTape() as tape:
        y_pred = self(x, training=True)  # Forward pass
        # Compute the loss value
        # (the loss function is configured in `compile()`)
        loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)

    # Compute gradients
    trainable_vars = self.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    # Update weights
    self.optimizer.apply_gradients(zip(gradients, trainable_vars))
    # Update metrics (includes the metric that tracks the loss)
    self.compiled_metrics.update_state(y, y_pred)
    # Return a dict mapping metric names to current value
    return {m.name: m.result() for m in self.metrics}
I want to be able to change the self.weights inside the function. When it calls y_pred = self(x, training=True) it will use the original weights, but I want to change these weights before computing the gradient and applying self.optimizer.apply_gradients(zip(gradients, trainable_vars)). The idea is to apply the gradient based on the loss with previous weights to the new weights.
1- I first tried with self.set_weights(x) after computing the loss and before computing the gradients but this error arises: RuntimeError: Cannot get value inside Tensorflow graph function.
2- I saw that this could be resolved with run_eagerly=True but then WARNING:tensorflow:5 out of the last 5 calls to <function pfor.<locals>.f at 0x29d89bf70> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing
3- Moreover, the results with run_eagerly=True are quite different, so I am not sure if this solution is working correctly.
Is it possible to change the weights before the gradient computation as I want with Keras?
Extra clarification: let's imagine we have two arrays of weights A and B, and these arrays are correlated. I want to get the LOSS with A weights but then optimize B weights with LOSS. Optimizing B from the LOSS of A makes sense because these weight arrays are correlated.
I'm trying to implement linear classifier in PyTorch, using 1 layer with tensors W and b, softmax and cross entropy loss. For each batch I have to:
Calculate logits
Transform logits to probabilities with softmax
Compute most probable classes
Calculate cross entropy between true and predicted classes
Use an optimizer to change W and b
So far I have (I have flat MNIST loaded with Scikit-learn):
import torch
import torch.nn.functional as torch_f

# convert Numpy arrays to PyTorch tensor Variables
input_X_train = torch.from_numpy(X_train_flat).float().to(device)
input_X_val = torch.from_numpy(X_val_flat).float().to(device)
input_X_test = torch.from_numpy(X_test_flat).float().to(device)
input_y_train = torch.from_numpy(y_train).long().to(device)
input_y_val = torch.from_numpy(y_val).long().to(device)
input_y_test = torch.from_numpy(y_test).long().to(device)
# model parameters: W and b
W = torch.randn(input_dim, output_dim, device=device, dtype=dtype, requires_grad=True)
b = torch.randn(1, device=device, dtype=dtype, requires_grad=True)
BATCH_SIZE = 512
EPOCHS = 40
LEARNING_RATE = 1e-6
# create torch.optim.Adam optimizer for loss function minimization
optimizer = torch.optim.Adam([W, b], lr=LEARNING_RATE)
# create negative log loss function object for loss function evaluation
# use mean loss value from all batch samples
loss_fn = torch.nn.NLLLoss(reduction="mean")
for t in range(EPOCHS):
    # logits for input_X, resulting shape should be [input_X.shape[0], 10]
    logits = torch.matmul(input_X_train, W) + b
    # apply torch.nn.functional.softmax (torch_f.softmax) to logits
    probas = torch_f.softmax(logits, dim=1)
    # apply torch.argmax to find a class index with highest probability
    classes = torch.argmax(probas, dim=1)
    # loss should be a scalar number: average loss over all the objects with torch.mean()
    # PyTorch implements negative log loss (NLL) *without* log - you have to first compute log of
    # softmax, then negative log loss, which will swap sign
    # Use torch.nn.functional.log_softmax (torch_f.log_softmax) on top of the logits
    # It is identical to calculating cross-entropy (log and then NLL) on top of probas,
    # but is more numerically friendly (read the docs).
    log_probas = torch_f.log_softmax(logits, dim=1)
    loss = loss_fn(log_probas, input_y_train)
    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()
    # calculate backward gradients for backpropagation
    loss.backward()
    # Calling the step function on an Optimizer makes an update to its parameters
    optimizer.step()
For some reason, the W and b don't change. What am I doing wrong?
EDIT:
I've seen, and tried in the code above, e.g. this minimal working example: https://discuss.pytorch.org/t/minimal-working-example-of-optim-sgd/11623/2.
EDIT 2:
Gradients W.grad are often 0; I think it should not be like that. Probabilities of classes are definitely right (so it's not e.g. like this example), since I've checked the sum of every row and the probabilities of all classes for each sample sum to 1.
I am using tensorflow to optimize a simple least squares objective function of the form $\min_w \lVert Xw - Y \rVert^2$.
Here, Y is the target vector, X is the input matrix and the vector w represents the weights to be learned.
If I wanted to augment the initial objective function to impose an additional constraint on w1 (the first scalar value in the tensorflow Variable w, where X1 represents the first column of the feature matrix X), how would I achieve this in tensorflow?
One solution I can think of is to use tf.slice to index the first value of $w$ and add a penalty term on it to the original cost, but I am not convinced that it will have the desired effect on the weights.
I would appreciate input on whether something like this is possible in tensorflow and, if so, what the best way to implement it might be.
An alternate option would be to add weight constraints, and do it using an augmented Lagrangian objective but I would first like to explore the regularization option before going the Lagrangian route.
The current code I have for the initial objective function without additional regularization is the following:
train_x, train_y are the training data and training targets, respectively.
test_x, test_y are the testing data and testing targets, respectively.
#Sum of Squared Errs. Cost.
def costfunc(predicted,actual):
    return tf.reduce_sum(tf.square(predicted - actual))

#Mean Squared Error Calc.
def prediction(sess,X,y_,test_x,test_y):
    pred_y = sess.run(y_,feed_dict={X:test_x})
    mymse = tf.reduce_mean(tf.square(pred_y - test_y))
    mseval = sess.run(mymse)
    return mseval,pred_y

with tf.Session() as sess:
    X = tf.placeholder(tf.float32,[None,num_feat]) #Training Data
    Y = tf.placeholder(tf.float32,[None,1]) # Target Values
    W = tf.Variable(tf.ones([num_feat,1]),name="weights")

    init = tf.global_variables_initializer()
    sess.run(init)

    #Tensorflow ops and cost function definitions.
    y_ = tf.matmul(X,W)
    cost_history = np.empty(shape=[1],dtype=float)
    out_of_sample_cost_history = np.empty(shape=[1],dtype=float)
    cost = costfunc(y_,Y)
    learning_rate = 0.000001
    training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

    for epoch in range(training_epochs):
        sess.run(training_step,feed_dict={X:train_x,Y:train_y})
        cost_history = np.append(cost_history,sess.run(cost,feed_dict={X: train_x,Y: train_y}))
        out_of_sample_cost_history = np.append(out_of_sample_cost_history,sess.run(cost,feed_dict={X:test_x,Y:test_y}))

    MSETest,pred_test = prediction(sess,X,y_,test_x,test_y) #Predict on full testing set.
tf.slice will do. During optimization, the gradients flowing into w1 will be added together (because gradients add up at forks in the graph). Also, please check the graph on TensorBoard to confirm the structure.
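For illustration, here is a hedged sketch of what that could look like in the question's graph; the penalty weight lam and the quadratic form of the extra term are assumptions, not something the answer specifies:
# Sketch: penalize the first weight W[0, 0] on top of the original least-squares cost.
w1 = tf.slice(W, [0, 0], [1, 1])   # first scalar entry of W
lam = 0.1                          # assumed (hypothetical) penalty strength
augmented_cost = costfunc(y_, Y) + lam * tf.reduce_sum(tf.square(w1))
training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(augmented_cost)
The gradient of the penalty term and the gradient of the original cost both flow into w1 and are summed, which is the "gradients add up at forks" behaviour mentioned above.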
Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.
In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., updating the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
Here is a simple example:
import torch
from torch.autograd import Variable
import torch.optim as optim
def linear_model(x, W, b):
    return torch.matmul(x, W) + b
data, targets = ...
W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)
optimizer = optim.Adam([W, b])
for sample, target in zip(data, targets):
    # clear out the gradients of all Variables
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()
    optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)
for sample, target in zip(data, targets):
    # clear out the gradients of Variables
    # (i.e. W, b)
    W.grad.data.zero_()
    b.grad.data.zero_()

    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()

    W -= learning_rate * W.grad.data
    b -= learning_rate * b.grad.data
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully.
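A tiny demonstration of that accumulation on a toy tensor (not the answer's model):
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (w ** 2).sum()
loss.backward()
print(w.grad)          # tensor([2., 4.])

loss = (w ** 2).sum()
loss.backward()        # the new gradient is added to the existing one
print(w.grad)          # tensor([4., 8.])

w.grad.zero_()         # roughly what optimizer.zero_grad(set_to_none=False) does per parameter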
Although the idea can be derived from the chosen answer, I feel it is worth writing out explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() provides more freedom over how gradients are accumulated and applied by the optimizer in the training loop. This is crucial when the model or the input data is big and one actual training batch does not fit into the GPU.
Here in this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for the forward pass, which is followed by loss.backward(). This is limited by GPU memory.
gradient_accumulation_steps determines the actual training batch size: the loss from multiple forward passes is accumulated. This is NOT limited by GPU memory.
From this example, you can see how optimizer.zero_grad() may be followed by optimizer.step(), but NOT by loss.backward(). loss.backward() is invoked in every single iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated train batches equals gradient_accumulation_steps (line 227, inside the if block on line 219).
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
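As a rough sketch of that pattern (the model, loss_fn, loader and learning rate below are placeholders, not taken from the linked script):
import torch

# Placeholder model / data, just to make the pattern concrete.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]  # stand-in data

accumulation_steps = 4            # plays the role of gradient_accumulation_steps
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accumulation_steps   # scale so the accumulated gradient is an average
    loss.backward()                                     # gradients accumulate across iterations
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()          # one update per accumulated "big batch"
        optimizer.zero_grad()     # clear before the next accumulation window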
Also, someone asked about an equivalent method in TensorFlow. I guess tf.GradientTape serves the same purpose.
(I am still new to AI library, please correct me if anything I said is wrong)
zero_grad() clears the gradients accumulated in the last step, so each iteration of the loop starts fresh when you use gradient descent to decrease the error (or loss).
If you do not use zero_grad(), the accumulated gradients can make the loss increase instead of decrease as required.
For example:
If you use zero_grad() you will get the following output:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad() you will get the following output:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
You don't have to call zero_grad(); alternatively, one can decay the gradients, for example:
optimizer = some_pytorch_optimizer
# decay the grads :
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
            '''
            p.grad = p.grad / 2
This way the learning is much more continuous.
During forward propagation the weights are applied to the inputs, and after the first iteration the weights reflect what the model has learnt from the samples (inputs). When we start backpropagation we want to update the weights in order to minimize the cost function, so we clear the previous gradients in order to obtain better weights. We keep doing this during training, and we do not do it during testing because we already have the weights, learnt at training time, that best fit our data. Hope this clears things up!
In simple terms, we need zero_grad()
because when we start a training loop we do not want past gradients or past results to interfere with our current results. Because of how PyTorch works, it collects/accumulates the gradients on backpropagation, so past results may mix in and give us wrong results; therefore we set the gradients to zero every time we go through the loop.
Here is an example:
# let us write a training loop
torch.manual_seed(42)

epochs = 200
for epoch in range(epochs):
    model_1.train()
    y_pred = model_1(X_train)
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In this for loop, if we do not zero the gradients every time, the past values may get added up and change the result.
So we use zero_grad() to avoid facing wrong accumulated results.
I'm looking at the policy gradients sample in this notebook: https://github.com/ageron/handson-ml/blob/master/16_reinforcement_learning.ipynb
The relevant code is here:
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs)
outputs = tf.nn.sigmoid(logits) # probability of action 0 (left)
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
y = 1. - tf.to_float(action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)
...
# Run training over a bunch of instances of inputs
for step in range(n_max_steps):
    action_val, gradients_val = sess.run([action, gradients], feed_dict={X: obs.reshape(1, n_inputs)})
...
# Then weight each gradient by the action values, average, and feed them back into training_op to apply_gradients()
The above works fine, as each run() returns different gradients.
I'd like to batch all this, and feed an array of inputs into run() instead of one input at a time (my environment is different from the one in the sample, so it makes sense for me to batch and improve performance). I.e.:
action_val, gradients_val = sess.run([action, gradients], feed_dict={X: obs_array})
Where obs_array has shape [n_instances, n_inputs].
The problem is that optimizer.compute_gradients(cross_entropy) seems to return a single gradient, even though cross_entropy is a 1d tensor of shape [None, 1]. action_val does return a 1d tensor of actions, as expected - one action per instance in the batch.
Is there any way for me to get an array of gradients, one per instance in the batch?
The problem is that optimizer.compute_gradients(cross_entropy) seems to return a single gradient, even though cross_entropy is a 1d tensor of shape [None, 1].
That happens by design: the gradient terms for each tensor are automatically aggregated. Gradient-computation operations such as optimizer.compute_gradients and the low-level primitive tf.gradients sum the gradient contributions from every element of the loss, using the default AddN aggregation method. This is fine for most cases of stochastic gradient descent.
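As a small illustration of that aggregation (a TF1-style toy graph with assumed shapes, not code from the question):
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 3])   # a batch of inputs
w = tf.Variable(tf.ones([3, 1]))
per_example_loss = tf.square(tf.matmul(x, w))     # one loss entry per batch element

# tf.gradients sums the per-example contributions: the result has the shape
# of w, not one gradient per batch element.
grad_w = tf.gradients(per_example_loss, w)[0]
print(grad_w.shape)                               # (3, 1), independent of the batch size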
In the end, unfortunately, gradient computation will have to be made over a single batch, unless a custom gradient function is built or the TensorFlow API is extended to provide gradient computation without full aggregation. Changing the implementation of tf.gradients to do this does not seem trivial.
One trick that you might wish to employ for your reinforcement learning model is to perform multiple session runs in parallel. According to the FAQ, the Session API supports multiple concurrent steps, and will take advantage of the existing resources for parallel computation. The question Asynchronous computation in TensorFlow shows how to do this.
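A rough sketch of that idea, reusing the question's graph; the threading pattern and the obs_batches list are assumptions for illustration, not code from the FAQ:
import threading

results = [None] * len(obs_batches)   # obs_batches: hypothetical list of observation arrays

def run_one(i, obs):
    # Session.run is thread-safe, so several steps can execute concurrently on one session.
    results[i] = sess.run([action, gradients], feed_dict={X: obs.reshape(1, n_inputs)})

threads = [threading.Thread(target=run_one, args=(i, obs)) for i, obs in enumerate(obs_batches)]
for t in threads:
    t.start()
for t in threads:
    t.join()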
One weak solution I came up with is to create an array of gradient operations, one per instance in the batch, which I can then run all at the same time:
X = tf.placeholder(tf.float32, shape=[minibatch_size, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
hidden2 = tf.layers.dense(hidden, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden2, n_outputs)
outputs = tf.nn.sigmoid(logits) # probability of action 0
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
y = 1. - tf.to_float(action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
# Calculate gradients per batch instance - for minibatch training
batch_gradients = []
for instance_cross_entropy in tf.unstack(cross_entropy):
    instance_grads_and_vars = optimizer.compute_gradients(instance_cross_entropy)
    instance_gradients = [grad for grad, variable in instance_grads_and_vars]
    batch_gradients.append(instance_gradients)
# Calculate gradients for just one instance - for single instance training
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
# Create gradient placeholders
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
# In the end we only apply a single set of averaged gradients
training_op = optimizer.apply_gradients(grads_and_vars_feed)
...
while step < len(obs_array) - minibatch_size:
    action_array, batch_gradients_array = sess.run([action, batch_gradients], feed_dict={X: obs_array[step:step+minibatch_size]})
    for action_val, gradient in zip(action_array, batch_gradients_array):
        action_vals.append(action_val)
        current_gradients.append(gradient)
    step += minibatch_size
The main points are that I need to specify the batch size for placeholder X (I can't leave it open-ended, otherwise unstack has no idea how many elements to unstack), that I unstack cross_entropy to get a per-instance cross_entropy, and that I then call compute_gradients per instance. During training I run([action, batch_gradients], feed_dict={X: obs_array[step:step+minibatch_size]}), which gives me the separate gradients per batch.
This is all well and good, but it doesn't give me much of a performance boost. I only get a max speedup of 2x. Increasing the batch size past 5 just scales the runtime of run() linearly, and gives no gain.
It's sad that Tensorflow can calculate and aggregate gradients over hundreds of instances blazingly fast, but requesting the gradients one by one is so much slower. Might need to dig into the source next...