I'm trying to implement this article:
http://ronan.collobert.com/pub/matos/2008_deep_icml.pdf
Specifically, equation (3) from section 2.
In short, I want to do a pairwise distance computation for the features of each mini-batch and add this loss to the general network loss.
I only have the tensor of the batch (16 samples), the labels tensor of the batch, and the batch feature tensor.
After looking for quite a while I still couldn't figure out the following:
1) How do I divide the batch into positive (i.e. same label) and negative pairs? Since tensors are not iterable, I can't figure out how to get which sample has which label and then divide my vector, or get which indices of the tensor belong to each class.
2) How can I do a pairwise distance calculation for some of the indices in the batch tensor?
3) I also need to define a new distance function for negative examples.
Overall, I need to get which indices belong to which class, do a pairwise distance calculation for all positive pairs, do another calculation for all negative pairs, then sum it all up and add it to the network loss.
Any help (with one or more of the 3 issues) would be highly appreciated.
1)
You should do the pair sampling before feeding the data into a session. Give every pair a boolean label, say y = 1 for a matched pair, 0 otherwise (a sketch follows below).
2) 3) Just calculate both the positive and negative terms for every pair, and let the 0-1 label y choose which one to add to the loss.
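For 1), a minimal NumPy sketch of that offline pair sampling (outside the graph), assuming the batch features and labels are available as arrays; the helper name make_pairs and the random pairing strategy are only illustrative:
import numpy as np
def make_pairs(features, labels, n_pairs):
    # features: [batch_size, dim], labels: [batch_size]
    idx_a = np.random.randint(0, len(labels), n_pairs)
    idx_b = np.random.randint(0, len(labels), n_pairs)
    x1 = features[idx_a]
    x2 = features[idx_b]
    # y = 1 for a matched pair (same label), 0 otherwise
    y = (labels[idx_a] == labels[idx_b]).astype(np.uint8)
    return x1, x2, y
The resulting x1, x2, y can then be fed to the placeholders below.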
First create placeholders, y_ is for boolean labels.
dim = 64
x1_ = tf.placeholder('float32', shape=(None, dim))
x2_ = tf.placeholder('float32', shape=(None, dim))
y_ = tf.placeholder('uint8', shape=[None]) # uint8 for boolean
Then the loss tensor can be created by the following function.
def loss(x1, x2, y):
    # Euclidean distance between x1 and x2, per pair
    l2diff = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x1, x2)), axis=1))
    # you can try different margin parameters
    margin = tf.constant(1.)
    labels = tf.to_float(y)
    match_loss = tf.square(l2diff, 'match_term')
    mismatch_loss = tf.maximum(0., tf.subtract(margin, tf.square(l2diff)), 'mismatch_term')
    # if the label is 1, only match_loss counts, otherwise mismatch_loss
    loss = tf.add(tf.multiply(labels, match_loss),
                  tf.multiply((1 - labels), mismatch_loss), 'loss_add')
    loss_mean = tf.reduce_mean(loss)
    return loss_mean
loss_ = loss(x1_, x2_, y_)
Then feed your data (randomly generated, for example):
import numpy as np
batchsize = 4
x1 = np.random.rand(batchsize, dim)
x2 = np.random.rand(batchsize, dim)
y = np.array([0, 1, 1, 0])
sess = tf.Session()
l = sess.run(loss_, feed_dict={x1_: x1, x2_: x2, y_: y})
Short answer
I think the simplest way to do that is to sample the pairs offline (i.e. outside of the TensorFlow graph).
You create tf.placeholders for a batch of pairs along with their labels (positive or negative, i.e. same class or different class), and then you can compute the corresponding loss in TensorFlow.
With the code
You sample the pairs offline: you sample batch_size pairs of inputs, and output the left elements of the pairs with shape [batch_size, input_size] (and likewise the right elements). You also output the labels of the pairs (either positive or negative) with shape [batch_size, 1].
pairs_left = np.zeros((batch_size, input_size))
pairs_right = np.zeros((batch_size, input_size))
labels = np.zeros((batch_size, 1)) # ex: [[0.], [1.], [1.], [0.]] for batch_size=4
Then you create TensorFlow placeholders corresponding to these inputs. In your code, you will feed the previous inputs to these placeholders in the feed_dict argument of sess.run().
pairs_left_node = tf.placeholder(tf.float32, [batch_size, input_size])
pairs_right_node = tf.placeholder(tf.float32, [batch_size, input_size])
labels_node = tf.placeholder(tf.float32, [batch_size, 1])
Now we can perform a feedforward on the inputs (let's say your model is a linear model).
W = ... # shape [input_size, feature_size]
output_left = tf.matmul(pairs_left_node, W) # shape [batch_size, feature_size]
output_right = tf.matmul(pairs_right_node, W) # shape [batch_size, feature_size]
Finally we can compute the pairwise loss.
margin = 1.0  # margin for the mismatch term
l2_loss_pairs = tf.reduce_sum(tf.square(output_left - output_right), 1)  # shape [batch_size]
labels = tf.reshape(labels_node, [-1])  # flatten to [batch_size] so the products stay element-wise
positive_loss = l2_loss_pairs
negative_loss = tf.nn.relu(margin - l2_loss_pairs)
final_loss = tf.multiply(labels, positive_loss) + tf.multiply(1. - labels, negative_loss)
And that's it! You can now optimize this loss, given good offline sampling.
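As a usage sketch (not from the original answer), assuming W above is a tf.Variable and that pairs_left, pairs_right and labels have been filled offline as described; the optimizer and learning rate are only examples:
loss_to_minimize = tf.reduce_mean(final_loss)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss_to_minimize)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    _, l = sess.run([train_op, loss_to_minimize],
                    feed_dict={pairs_left_node: pairs_left,
                               pairs_right_node: pairs_right,
                               labels_node: labels})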
Related
Background
In TensorFlow 2, there is a class called GradientTape which is used to record operations on tensors, the result of which can then be differentiated and fed to some minimization algorithm. For example, from the documentation we have this example:
x = tf.constant(3.0)
with tf.GradientTape() as g:
    g.watch(x)
    y = x * x
dy_dx = g.gradient(y, x) # Will compute to 6.0
The docstring for the gradient method implies that the first argument can be not just a tensor, but a list of tensors:
def gradient(self,
             target,
             sources,
             output_gradients=None,
             unconnected_gradients=UnconnectedGradients.NONE):
    """Computes the gradient using operations recorded in context of this tape.

    Args:
      target: a list or nested structure of Tensors or Variables to be
        differentiated.
      sources: a list or nested structure of Tensors or Variables. `target`
        will be differentiated against elements in `sources`.
      output_gradients: a list of gradients, one for each element of
        target. Defaults to None.
      unconnected_gradients: a value which can either hold 'none' or 'zero' and
        alters the value which will be returned if the target and sources are
        unconnected. The possible values and effects are detailed in
        'UnconnectedGradients' and it defaults to 'none'.

    Returns:
      a list or nested structure of Tensors (or IndexedSlices, or None),
      one for each element in `sources`. Returned structure is the same as
      the structure of `sources`.

    Raises:
      RuntimeError: if called inside the context of the tape, or if called more
        than once on a non-persistent tape.
      ValueError: if the target is a variable or if unconnected gradients is
        called with an unknown value.
    """
In the above example, it is easy to see that y, the target, is the function to be differentiated, and x is the dependent variable the "gradient" is taken with respect to.
From my limited experience, it appears that the gradient method returns a list of tensors, one for each element of sources, and each of these gradients is a tensor with the same shape as the corresponding member of sources.
Question
The above description of the behavior of gradients makes sense if target contains a single 1x1 "tensor" to be differentiated, because mathematically a gradient vector should be the same dimension as the domain of the function.
However, if target is a list of tensors, the output of gradients is still the same shape. Why is this the case? If target is thought of as a list of functions, shouldn't the output resemble something like a Jacobian? How am I to interpret this behavior conceptually?
This is how tf.GradientTape().gradient() is defined. It has the same functionality as tf.gradients(), except that the latter can't be used in eager mode. From the docs of tf.gradients():
It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys
where xs are sources and ys are target.
Example 1:
So let's say target = [y1, y2] and sources = [x1, x2]. The result will be:
[dy1/dx1 + dy2/dx1, dy1/dx2 + dy2/dx2]
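As a quick check, a minimal TF 2 sketch of this summing behavior (the values and variable names are only illustrative):
import tensorflow as tf
x1 = tf.constant(2.0)
x2 = tf.constant(3.0)
with tf.GradientTape() as g:
    g.watch(x1)
    g.watch(x2)
    y1 = x1 * x2   # dy1/dx1 = x2, dy1/dx2 = x1
    y2 = x1 + x2   # dy2/dx1 = 1,  dy2/dx2 = 1
grads = g.gradient([y1, y2], [x1, x2])
print([t.numpy() for t in grads])  # [4.0, 3.0], i.e. [x2 + 1, x1 + 1]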
Example 2:
Compute gradients for loss-per-sample (tensor) vs reduced loss (scalar)
Let w, b be two variables.
xentropy = [y1, y2] # tensor
reduced_xentropy = 0.5 * (y1 + y2) # scalar
grads = [dy1/dw + dy2/dw, dy1/db + dy2/db]
reduced_grads = [d(reduced_xentropy)/dw, d(reduced_xentropy)/db]
              = [d(0.5 * (y1 + y2))/dw, d(0.5 * (y1 + y2))/db]
              = 0.5 * grads
Tensorflow example of the above snippet:
import tensorflow as tf
print(tf.__version__) # 2.1.0
inputs = tf.convert_to_tensor([[0.1, 0], [0.5, 0.51]]) # two two-dimensional samples
w = tf.Variable(initial_value=inputs)
b = tf.Variable(tf.zeros((2,)))
labels = tf.convert_to_tensor([0, 1])
def forward(inputs, labels, var_list):
    w, b = var_list
    logits = tf.matmul(inputs, w) + b
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    return xentropy
# `xentropy` has two elements (gradients of tensor - gradient
# of each sample in a batch)
with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])
    reduced_xentropy = tf.reduce_mean(xentropy)
grads = g.gradient(xentropy, [w, b])
print(xentropy.numpy()) # [0.6881597 0.71584916]
print(grads[0].numpy()) # [[ 0.20586157 -0.20586154]
# [ 0.2607238 -0.26072377]]
# `reduced_xentropy` is scalar (gradients of scalar)
with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])
    reduced_xentropy = tf.reduce_mean(xentropy)
grads_reduced = g.gradient(reduced_xentropy, [w, b])
print(reduced_xentropy.numpy()) # 0.70200443 <-- scalar
print(grads_reduced[0].numpy()) # [[ 0.10293078 -0.10293077]
# [ 0.1303619 -0.13036188]]
If you compute the loss (xentropy) for each element in a batch, the final gradient for each variable will be the sum of the per-sample gradients over the batch (which makes sense).
I am new to TensorFlow. I experimented with a DQN algorithm that has a section involving
a = tf.placeholder(tf.int32, shape = [None],name='A')
q = tf.reduce_sum(critic_q * tf.one_hot(a,n_outputs),axis=1,keepdims=True,name='Q')#Q value for chosen action
y = tf.placeholder(tf.float32, shape = [None],name='Y')
learning_rate = 1e-4
cost = tf.reduce_mean(tf.square(y-q))#mean squared error
global_step = tf.Variable(0,trainable=False,name='global_step')
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(cost,global_step=global_step)
and initialized the input y with y_batch=np.zeros(nbatch). The network hardly trained at all.
Then, I switched to defining y as
y = tf.placeholder(tf.float32, shape = [None,1],name='Y')
and initialized the input with y_batch=np.zeros(nbatch).reshape(-1,1), which worked nicely.
What was happening in the first implementation?
Every tensor has a rank (number of dimensions) and a set of dimensions.
A placeholder with shape [1] is a placeholder with rank 1 whose dimension at position 0 has size 1.
A placeholder with shape [None, 1] is a placeholder with rank 2, hence it has 2 dimensions. The first dimension (index 0) has unknown size (it will be resolved at runtime) while the second dimension (index 1) has the known size of 1.
In order to be compatible, tensors must have the same rank and compatible dimensions.
You can read a more complete assessment about the tensors shape here: https://pgaleone.eu/tensorflow/2018/07/28/understanding-tensorflow-tensors-shape-static-dynamic/#tensors-the-basic
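To see why the first version trains so poorly, here is a small NumPy sketch of what the shape mismatch likely does (an inference from standard broadcasting rules, not something stated above): subtracting a [None, 1] tensor from a [None] tensor broadcasts to a [batch, batch] matrix, so the squared error mixes every y with every q.
import numpy as np
nbatch = 4
y = np.zeros(nbatch)                  # shape (4,), like the first Y placeholder
q = np.arange(nbatch).reshape(-1, 1)  # shape (4, 1), like the Q tensor
print((y - q).shape)                  # (4, 4): every y paired with every q
print((y - q.reshape(-1)).shape)      # (4,): the intended element-wise difference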
I am using pyswarms PSO for neural network optimisation. I am trying to create a network with an input layer and an output layer.
# Store the features as X and the labels as y
X = np.random.randn(25000,20)
y = np.random.random_integers(0,2,25000)
# In[29]:
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
# In[58]:
print(X_train.shape)
print(y_train.shape)
# In[63]:
# Forward propagation
def forward_prop(params):
    """Forward propagation as objective function

    This computes for the forward propagation of the neural network, as
    well as the loss. It receives a set of parameters that must be
    rolled-back into the corresponding weights and biases.

    Inputs
    ------
    params: np.ndarray
        The dimensions should include an unrolled version of the
        weights and biases.

    Returns
    -------
    float
        The computed negative log-likelihood loss given the parameters
    """
    # Neural network architecture
    n_inputs = 20
    n_classes = 2

    # Roll-back the weights and biases
    W1 = params[0:40]

    # Perform forward propagation
    z1 = X.dot(W1)           # Pre-activation in Layer 1
    # a1 = np.tanh(z1)       # Activation in Layer 1
    # z2 = a1.dot(W2) + b2   # Pre-activation in Layer 2
    logits = z1              # Logits for Layer 2

    # Compute for the softmax of the logits
    exp_scores = np.exp(logits)
    probs = exp_scores / np.sum(exp_scores, axis=1)

    # Compute for the negative log likelihood
    N = 25000  # Number of samples
    corect_logprobs = -np.log(probs[range(N), y])
    loss = np.sum(corect_logprobs) / N
    return loss
# In[64]:
def f(x):
    # Compute for the negative log likelihood
    """Higher-level method to do forward_prop in the
    whole swarm.

    Inputs
    ------
    x: numpy.ndarray of shape (n_particles, dimensions)
        The swarm that will perform the search

    Returns
    -------
    numpy.ndarray of shape (n_particles, )
        The computed loss for each particle
    """
    n_particles = x.shape[0]
    j = [forward_prop(x[i]) for i in range(n_particles)]
    return np.array(j)
# In[65]:
# Initialize swarm
options = {'c1': 0.5, 'c2': 0.3, 'w':0.9}
# Call instance of PSO
dimensions = 20
optimizer = ps.single.GlobalBestPSO(n_particles=100, dimensions=dimensions, options=options)
# Perform optimization
cost, pos = optimizer.optimize(f, print_step=100, iters=1000, verbose=3)
I modified the code from the examples, but I am getting this error:
AxisError: axis 1 is out of bounds for array of dimension 1.
Moreover, this example implements a softmax function in the last layer. How should I use it with different loss functions?
Original code can be found here.
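For reference, a minimal sketch of where the AxisError comes from with the shapes above (a diagnosis of the shapes only, under the assumption dimensions=20, so each particle vector has 20 entries): W1 = params[0:40] stays a 1-D array of length 20, X.dot(W1) is then 1-D of shape (25000,), and np.sum(..., axis=1) on a 1-D array raises exactly that error.
import numpy as np
X = np.random.randn(25000, 20)
params = np.random.randn(20)   # one particle when dimensions=20
W1 = params[0:40]              # still only 20 values -> shape (20,)
logits = X.dot(W1)             # shape (25000,), i.e. 1-D
exp_scores = np.exp(logits)
np.sum(exp_scores, axis=1)     # AxisError: axis 1 is out of bounds for array of dimension 1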
I am trying to implement a simple autoencoder like the one below.
The number of input features is 2, and I want to build a sparse autoencoder for dimensionality reduction down to 1 feature. I selected the numbers of nodes as 2 (input), 8 (hidden), 1 (reduced feature), 8 (hidden), 2 (output) to add some more complexity than using only (2, 1, 2) nodes. The number of samples N is around 10000.
'DATA' is just a 2x10000 matrix containing integer values.
import tensorflow as tf
x = tf.placeholder(tf.float32, shape=[None, 2])
w1 = tf.Variable(tf.random_normal(shape=[2, 8]))
w2 = tf.Variable(tf.random_normal(shape=[8, 1]))
h1 = tf.nn.relu(tf.matmul(x, w1))
encoded = tf.matmul(h1, w2)
h2 = tf.nn.relu(encoded)
h3 = tf.nn.relu(tf.matmul(h2, tf.transpose(w2)))
y = tf.matmul(h3, tf.transpose(w1))
mse = tf.reduce_mean(tf.squared_difference(x, y))
learning_rate = 0.01  # example value
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(mse)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
fd = {x: DATA}  # DATA must have shape [N, 2] to match the placeholder
loss_value, reduced_feature = sess.run([mse, encoded], feed_dict=fd)
I have 2 questions about the implementation, as the result was quite different from what I expected.
Is this implementation correct? Will the variable 'reduced_feature' contain the reduced (1-D) feature derived from the 2 input features?
Should I add some sparsity condition if I want to use more hidden nodes than inputs? If yes, can you show some sample code for this task?
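As a hedged illustration of what a "sparsity condition" commonly looks like in this setting (one standard approach, not taken from the question's code): add an L1 penalty on the encoded activations to the reconstruction loss; sparsity_weight is a hypothetical hyperparameter.
# a minimal sketch, assuming the graph above (x, encoded, mse, learning_rate) is already built
sparsity_weight = 1e-3  # hypothetical value, to be tuned
sparsity_penalty = tf.reduce_mean(tf.abs(encoded))  # L1 penalty on the code
total_loss = mse + sparsity_weight * sparsity_penalty
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(total_loss)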
I am trying to use an LSTM with inputs that have different time steps (different numbers of frames). The input to rnn.static_rnn should be a sequence of tensors (not a single tensor!), so I should convert my input into a sequence. I tried to use tf.unstack and tf.split, but both of them need to know the exact size of the inputs, while one dimension of my inputs (the time steps) changes between inputs. The following is part of my code:
n_input = 256*256 # data input (img shape: 256*256)
n_steps = None # timesteps
batch_size = 1
# tf Graph input
x = tf.placeholder("float", [ batch_size , n_input,n_steps])
y = tf.placeholder("float", [batch_size, n_classes])
# Permuting batch_size and n_steps
x1 = tf.transpose(x, [2, 1, 0])
x1 = tf.transpose(x1, [0, 2, 1])
x3 = tf.unstack(x1, axis=0)
# or x3 = tf.split(x2, ?, 0)
# Define a lstm cell with tensorflow
lstm_cell = rnn.BasicLSTMCell(num_units=n_hidden, forget_bias=1.0)
# Get lstm cell output
outputs, states = rnn.static_rnn(lstm_cell, x3, dtype=tf.float32,sequence_length=None)
I got the following error when using tf.unstack:
ValueError: Cannot infer num from shape (?, 1, 65536)
Also, there are some discussions here and here, but none of them were useful for me. Any help is appreciated.
As explained here, tf.unstack does not work if the num argument is unspecified and cannot be inferred from the static shape.
In your code, after the transpositions, x1 has shape [n_steps, batch_size, n_input], and its size along axis=0 (n_steps) is None, so num cannot be inferred.
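To make that concrete, a minimal sketch of the failure (the shapes are only illustrative): unstacking along a dimension whose static size is unknown raises the same ValueError, because static_rnn needs a Python list with a known number of tensors.
import tensorflow as tf
x1 = tf.placeholder(tf.float32, [None, 1, 65536])  # axis 0 (the time steps) is unknown
try:
    x3 = tf.unstack(x1, axis=0)                    # num cannot be inferred from the shape
except ValueError as e:
    print(e)  # Cannot infer num from shape (?, 1, 65536)
A common workaround for variable-length sequences is tf.nn.dynamic_rnn, which accepts the 3-D tensor directly together with a sequence_length argument.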