Tensorflow embedding_lookup - python

I am trying to learn the word representation of the imdb dataset "from scratch" through the TensorFlow tf.nn.embedding_lookup() function. If I understand it correctly, I have to set up an embedding layer before the other hidden layer, and then when I perform gradient descent, the layer will "learn" a word representation in the weights of this layer. However, when I try to do this, I get a shape error between my embedding layer and the first fully-connected layer of my network.
def multilayer_perceptron(_X, _weights, _biases):
    with tf.device('/cpu:0'), tf.name_scope("embedding"):
        W = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), name="W")
        embedding_layer = tf.nn.embedding_lookup(W, _X)
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(embedding_layer, _weights['h1']), _biases['b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, _weights['h2']), _biases['b2']))
    return tf.matmul(layer_2, weights['out']) + biases['out']

x = tf.placeholder(tf.int32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])
pred = multilayer_perceptron(x, weights, biases)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred,y))
train_step = tf.train.GradientDescentOptimizer(0.3).minimize(cost)
init = tf.initialize_all_variables()
The error I get is:
ValueError: Shapes TensorShape([Dimension(None), Dimension(300), Dimension(128)])
and TensorShape([Dimension(None), Dimension(None)]) must have the same rank

The shape error arises because you are using a two-dimensional tensor, x, to index into a two-dimensional embedding tensor, W. Think of tf.nn.embedding_lookup() (and its close cousin tf.gather()) as taking each integer value i in x and replacing it with the row W[i, :]. From the error message, one can infer that n_input = 300 and embedding_size = 128. In general, the result of tf.nn.embedding_lookup() has a number of dimensions equal to rank(x) + rank(W) - 1… in this case, 3. The error arises when you try to multiply this result by _weights['h1'], which is a (two-dimensional) matrix.
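To make the rank rule concrete, here is a minimal sketch that uses NumPy integer-array indexing as a stand-in for tf.nn.embedding_lookup(). The values vocab_size = 1000 and batch_size = 4 are made up for illustration; n_input = 300 and embedding_size = 128 are the values inferred from the error message.

```python
import numpy as np

# vocab_size and batch_size are hypothetical; the other two sizes come from the error message.
vocab_size, embedding_size, n_input, batch_size = 1000, 128, 300, 4

rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(vocab_size, embedding_size))  # rank 2
x = rng.integers(0, vocab_size, size=(batch_size, n_input))    # rank 2, integer word ids

# Integer-array indexing replaces every id x[b, t] with the row W[x[b, t], :],
# which mirrors the gather that an embedding lookup performs.
embedded = W[x]

print(embedded.shape)                         # (4, 300, 128)
print(embedded.ndim == x.ndim + W.ndim - 1)   # True
```

The result has three dimensions, so it cannot be fed directly into a matrix multiplication with a two-dimensional weight matrix.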
How to fix this code depends on what you're trying to do, and why you are passing a matrix of inputs to the embedding. One common approach is to aggregate the embedding vectors for each input example into a single row per example using an operation like tf.reduce_sum(). For example, you might do the following:
W = tf.Variable(
    tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), name="W")
embedding_layer = tf.nn.embedding_lookup(W, _X)
# Reduce along dimension 1 (`n_input`) to get a single vector (row)
# per input example.
embedding_aggregated = tf.reduce_sum(embedding_layer, [1])
layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(
    embedding_aggregated, _weights['h1']), _biases['b1']))

Another possible solution: instead of adding the embedding vectors, concatenate them into a single vector and increase the number of neurons in the hidden layer.
I used:
embedding_aggregated = tf.reshape(embedding_layer, [-1, embedding_size * sequence_length])
I also changed the number of neurons in the hidden layer to embedding_size * sequence_length.
Observation: accuracy improved when using concatenation rather than addition.
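The difference between the two aggregation strategies is easiest to see on shapes. The following is a rough NumPy sketch with toy sizes that I made up (batch_size = 2, sequence_length = 3, embedding_size = 4); the sum and the reshape play the roles of tf.reduce_sum() and tf.reshape() respectively.

```python
import numpy as np

# Toy sizes, chosen only for illustration.
batch_size, sequence_length, embedding_size = 2, 3, 4

# Stand-in for the output of the embedding lookup: one vector per word.
embedded = np.arange(batch_size * sequence_length * embedding_size, dtype=np.float64)
embedded = embedded.reshape(batch_size, sequence_length, embedding_size)

# Option 1: add the word vectors of each example (word order is lost).
summed = embedded.sum(axis=1)

# Option 2: lay the word vectors of each example side by side (word order is kept).
concatenated = embedded.reshape(-1, sequence_length * embedding_size)

print(summed.shape)        # (2, 4)
print(concatenated.shape)  # (2, 12)
```

With concatenation, the matrix that multiplies the aggregated embedding needs sequence_length * embedding_size rows instead of embedding_size, and every input must have the same sequence length.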

Related

Use three transformations (average, max, min) of pretrained embeddings to a single output layer in Pytorch

I have developed a simple feed-forward neural network with PyTorch.
The neural network uses GloVe pre-trained embeddings in a frozen nn.Embedding layer.
Next, the embedding layer splits into three embeddings. Each split is a different transformation applied to the initial embedding layer. The three embeddings then feed three nn.Linear layers. Finally, I have a single output layer for a binary classification target.
The shape of the embedding tensor is [64,150,50]
-> 64: sentences in the batch,
-> 150: words per sentence,
-> 50: vector-size of a single word (pre-trained GloVe vector)
So after the transformation, the embedding layer splits into three layers of shape [64,50], where each of the 50 values is the torch.mean(), torch.max() or torch.min() over the 150 words of the sentence.
My questions are:
1. How can I feed the output layer from three different nn.Linear layers to predict a single target value [0,1]?
2. Is this efficient and helpful to the total predictive power of the model? Or is just selecting the average of the embeddings sufficient, with no improvement to be expected?
The forward() method of my PyTorch model is:
def forward(self, text):
    embedded = self.embedding(text)
    if self.use_pretrained_embeddings:
        embedded_average = torch.mean(embedded, dim=1)
        embedded_max = torch.max(embedded, dim=1)[0]
        embedded_min = torch.min(embedded, dim=1)[0]
    else:
        embedded = self.flatten_layer(embedded)
    input_layer = self.input_layer(embedded_average) #each Linear layer has the same value of hidden unit
    input_layer = self.activation(input_layer)
    input_layer_max = self.input_layer(embedded_max)
    input_layer_max = self.activation(input_layer_max)
    input_layer_min = self.input_layer(embedded_min)
    input_layer_min = self.activation(input_layer_min)
    #What should I do here? to exploit the weights of the 3 hidden layers
    output_layer = self.output_layer(input_layer)
    output_layer = self.activation_output(output_layer) #Sigmoid()
    return output_layer
After the proposed answer the function is:
def forward(self, text):
    embedded = self.embedding(text)
    if self.use_pretrained_embeddings:
        embedded_average = torch.mean(embedded, dim=1)
        embedded_max = torch.max(embedded, dim=1)[0]
        embedded_min = torch.min(embedded, dim=1)[0]
        #use of average embeddings transformation
        input_layer_average = self.input_layer(embedded_average)
        input_layer_average = self.activation(input_layer_average)
        #use of max embeddings transformation
        input_layer_max = self.input_layer(embedded_max)
        input_layer_max = self.activation(input_layer_max)
        #use of min embeddings transformation
        input_layer_min = self.input_layer(embedded_min)
        input_layer_min = self.activation(input_layer_min)
    else:
        embedded = self.flatten_layer(embedded)
    input_layer = torch.concat([input_layer_average, input_layer_max, input_layer_min], dim=1)
    input_layer = self.activation(input_layer)
    print("3",input_layer.shape) #[192,1] vs [64,1] -> output layer
    if self.n_layers !=0:
        for layer in self.layers:
            input_layer = layer(input_layer)
    output_layer = self.output_layer(input_layer)
    output_layer = self.activation_output(output_layer)
    return output_layer
This generates the following error:
ValueError: Using a target size (torch.Size([64, 1])) that is different to the input size (torch.Size([192, 1])) is deprecated. Please ensure they have the same size.
This is expected, since the concatenated layer has 3x as many rows as there are sentences in the batch (64). Is there a fix that resolves it?
Regarding 1: You can use torch.concat to concatenate the outputs along the appropriate dimension, and then e.g. map them to a single output using another linear layer.
Regarding 2: You will have to try it yourself and see whether this is useful.
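To illustrate what "the appropriate dimension" means here, below is a small sketch with NumPy arrays standing in for the three activated branch outputs (np.concatenate(..., axis=1) plays the role of torch.concat(..., dim=1)). The hidden size of 16 and all variable names are assumptions of mine, not taken from the question. A first dimension of 192 is what you would see if the branches were joined along the batch axis, although I can only guess that this is what happened in the asker's run.

```python
import numpy as np

batch_size, hidden = 64, 16   # hidden = 16 is a hypothetical layer width
rng = np.random.default_rng(0)

# Stand-ins for the activated outputs of the average, max and min branches.
h_avg = rng.normal(size=(batch_size, hidden))
h_max = rng.normal(size=(batch_size, hidden))
h_min = rng.normal(size=(batch_size, hidden))

# Joining along the feature axis keeps one row per sentence.
features = np.concatenate([h_avg, h_max, h_min], axis=1)

# Joining along the batch axis triples the number of rows instead.
stacked_rows = np.concatenate([h_avg, h_max, h_min], axis=0)

# The output layer must then accept 3 * hidden input features.
W_out = rng.normal(size=(3 * hidden, 1))
b_out = np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(features @ W_out + b_out)))   # sigmoid

print(features.shape, stacked_rows.shape, probs.shape)   # (64, 48) (192, 16) (64, 1)
```

In the PyTorch model this would mean constructing the layer that follows the concatenation with in_features equal to three times the hidden size, so that the prediction keeps the [64, 1] shape of the target.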

Why does this TensorFlow example not have a summation before the activation function?

I'm trying to understand a TensorFlow code snippet. What I've been taught is that we sum all the incoming inputs and then pass them to an activation function. Shown in the picture below is a single neuron. Notice that we compute a weighted sum of the inputs and THEN compute the activation.
In most examples of the multi-layer perceptron, they don't include the summation step. I find this very confusing.
Here is an example of one of those snippets:
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}
# Create model
def multilayer_perceptron(x):
    # Hidden fully connected layer with 256 neurons
    layer_1 = tf.nn.relu(tf.add(tf.matmul(x, weights['h1']), biases['b1']))
    # Hidden fully connected layer with 256 neurons
    layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1, weights['h2']), biases['b2']))
    # Output fully connected layer with a neuron for each class
    out_layer = tf.nn.relu(tf.matmul(layer_2, weights['out']) + biases['out'])
    return out_layer
In each layer, we first multiply the inputs by the weights. Afterwards, we add the bias term. Then we pass the result to tf.nn.relu. Where does the summation happen? It looks like we've skipped this!
Any help would be really great!
The last layer of your model, out_layer, outputs the probabilities of each class Prob(y=yi|X) and has shape [batch_size, n_classes]. To calculate these probabilities, the softmax function is applied. For each single input data point x that your model receives, it outputs a vector of probabilities y whose size is the number of classes. You then pick the class with the highest probability by applying argmax to the output vector, class=argmax(P(y|x)), which can be written in TensorFlow as y_pred = tf.argmax(out_layer, 1).
Consider a network with a single layer. You have an input matrix X of shape [n_samples, x_dimension] and you multiply it by some matrix W that has shape [x_dimension, model_output]. The summation that you're talking about is the dot product between a row of matrix X and a column of matrix W. The output will then have shape [n_samples, model_output]. To this output you apply the activation function (if it is the final layer you probably want softmax). Perhaps the picture that you've shown is a bit misleading.
Mathematically, the layer without bias can be described as $Y = XW$. Suppose that the first row of matrix X (the first row is a single input data point) is
$x = [x_1, x_2, \dots, x_n]$
and the first column of W is
$w = [w_1, w_2, \dots, w_n]^T$
where n is x_dimension. The result of this dot product is given by
$x \cdot w = x_1 w_1 + x_2 w_2 + \dots + x_n w_n = \sum_{i=1}^{n} x_i w_i$
which is your summation. You repeat this for each column in matrix W, and the result is a vector of size model_output (which corresponds to the number of columns in W). To this vector you add the bias (if needed) and then apply the activation.
The tf.matmul operator performs a matrix multiplication, which means that each element in the resulting matrix is a sum of products (which corresponds exactly to what you describe).
Take a simple example with a row vector and a column vector, as would be the case if you had exactly one neuron and an input vector (as per the graphic you shared above):
x = [2, 3, 1]
y = [3,
     1,
     2]
Then the result would be:
tf.matmul(x, y) = 2*3 + 3*1 + 1*2 = 11
There you can see the weighted sum.
p.s: tf.multiply performs element-wise multiplication, which is not what we want here.
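The same arithmetic can be checked in plain Python. This is only a sketch: the helper matmul below is a naive stand-in that I wrote for illustration, not TensorFlow's implementation. Note that tf.matmul itself expects inputs of rank 2 or higher, so the vectors are written as a [1, 3] matrix and a [3, 1] matrix in the second half.

```python
x = [2, 3, 1]   # inputs to a single neuron
w = [3, 1, 2]   # that neuron's weights

# The "summation" from the neuron diagram is a dot product.
weighted_sum = sum(xi * wi for xi, wi in zip(x, w))
print(weighted_sum)   # 11


def matmul(a, b):
    """Naive matrix product of nested lists: out[i][j] = sum over k of a[i][k] * b[k][j]."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


X = [[2, 3, 1]]        # shape [1, 3]: one sample, three features
W = [[3], [1], [2]]    # shape [3, 1]: three features, one neuron
print(matmul(X, W))    # [[11]]
```

Each entry of the product is one weighted sum, so a layer with many neurons computes one such sum per column of the weight matrix.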

Bi-LSTM Attention model in Keras

I am trying to make an attention model with Bi-LSTM using word embeddings. I came across How to add an attention mechanism in keras?, https://github.com/philipperemy/keras-attention-mechanism/blob/master/attention_lstm.py and https://github.com/keras-team/keras/issues/4962.
However, I am confused about the implementation of Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. So,
_input = Input(shape=[max_length], dtype='int32')
# get the embedding layer
embedded = Embedding(
    input_dim=30000,
    output_dim=300,
    input_length=100,
    trainable=False,
    mask_zero=False
)(_input)
activations = Bidirectional(LSTM(20, return_sequences=True))(embedded)
# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
I am confused here about which equation from the paper each of these steps corresponds to.
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
What will RepeatVector do?
attention = RepeatVector(20)(attention)
attention = Permute([2, 1])(attention)
sent_representation = merge([activations, attention], mode='mul')
Now, again I am not sure why this line is here.
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)
Since I have two classes, I will have the final softmax as:
probabilities = Dense(2, activation='softmax')(sent_representation)
attention = Flatten()(attention)
transforms your tensor of attention weights into a vector (of size max_length if your sequence size is max_length).
attention = Activation('softmax')(attention)
ensures that all the attention weights lie between 0 and 1 and that they sum to one.
attention = RepeatVector(20)(attention)
attention = Permute([2, 1])(attention)
sent_representation = merge([activations, attention], mode='mul')
RepeatVector repeats the attention weights vector (which is of size max_len) as many times as the size of the hidden state (20), so that the attention weights and the hidden states can be multiplied element-wise. The size of the tensor variable activations is max_len*20.
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)
This Lambda layer sums the weighted hidden state vectors to obtain the single vector that is used at the end.
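The sequence of layers can be followed step by step on a single sentence with NumPy. This is a sketch under assumptions: max_len = 5 is a toy value, the random scores stand in for the flattened Dense(1, activation='tanh') output, and units = 20 follows the answer. If the Bidirectional wrapper concatenates both directions, which I believe is its default, the last dimension of activations would be 40 and the repeat count would have to match that.

```python
import numpy as np

max_len, units = 5, 20   # toy sequence length; units as assumed in the answer
rng = np.random.default_rng(0)

activations = rng.normal(size=(max_len, units))   # hidden states of one sentence
scores = np.tanh(rng.normal(size=(max_len,)))     # stand-in for the flattened Dense(1) scores

# Activation('softmax'): weights in (0, 1) that sum to one over the time steps.
exp_scores = np.exp(scores - scores.max())
attention = exp_scores / exp_scores.sum()

# RepeatVector(units) + Permute([2, 1]): copy each weight across the hidden units.
attention_tiled = np.repeat(attention[:, None], units, axis=1)   # (max_len, units)

# Element-wise multiply, then sum over the time axis (the Lambda layer).
weighted = activations * attention_tiled
sent_representation = weighted.sum(axis=0)

print(attention_tiled.shape)       # (5, 20)
print(sent_representation.shape)   # (20,)
```

The result is a weighted average of the hidden states, which is the same as the matrix-vector product attention @ activations.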
Hope this helped!

Initializing a bias term in my nonlinear regression model using TensorFlow

I am trying to make a basic nonlinear regression model that will predict the return index of companies in the FTSE350.
I am unsure as to what my bias term should look like in terms of dimensions and whether I am using it properly in the calculations method:
w1 = tf.Variable(tf.truncated_normal([4, 10], mean=0.0, stddev=1.0, dtype=tf.float64))
b1 = tf.Variable(tf.constant(0.1, shape=[4,10], dtype = tf.float64))
w2 = tf.Variable(tf.truncated_normal([10, 1], mean=0.0, stddev=1.0, dtype=tf.float64))
b2 = tf.Variable(tf.constant(0.1, shape=[1], dtype = tf.float64))
def calculations(x, y):
    w1d = tf.matmul(x, w1)
    h1 = (tf.nn.sigmoid(tf.add(w1d, b1)))
    h1w2 = tf.matmul(h1, w2)
    activation = tf.add(tf.nn.sigmoid(tf.matmul(h1, w2)), b2)
    error = tf.reduce_sum(tf.pow(activation - y,2))/(len(x))
    return [ activation, error ]
My initial thoughts were that it should be the same size as my weights but I get this error:
ValueError: Dimensions must be equal, but are 251 and 4 for 'Add' (op: 'Add') with input shapes: [251,10], [4,10]
I've played around with different ideas but don't seem to be getting anywhere.
(My input data has 4 features)
The network structure I have attempted is 4 neurons in the input layer, 10 in the hidden layer, and 1 in the output layer, but I feel like I may have mixed up the dimensions in my weights layer too?
When you are constructing the layers for a feed-forward fully-connected neural network (like in your example), the shape of the biases should be equal to the number of nodes in the corresponding layer. So in your case, since your weight matrix has a shape of (4, 10), you have 10 nodes in that layer and you should be using:
b1 = tf.Variable(tf.constant(0.1, shape=[10], dtype=tf.float64))
The reason for this is when you do w1d = tf.matmul(x, w1), you are actually getting a matrix of shape (batch_size, 10) (if batch_size is the number of rows in your input matrix). This is because you are matrix multiplying a (batch_size, 4) matrix by a (4, 10) weight matrix. Then, you are adding a bias across each column of w1d, which can be represented as a 10-dimensional vector, which you would get if you made the shape of b1 [10].
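The shape argument can be checked quickly with NumPy, whose broadcasting rules behave like TensorFlow's for this case. This is a minimal sketch: the batch size of 251 is taken from the error message, and the arrays of ones are placeholders for the real data and weights.

```python
import numpy as np

batch_size, n_features, n_hidden = 251, 4, 10

x = np.ones((batch_size, n_features))   # placeholder input data
w1 = np.ones((n_features, n_hidden))    # placeholder weights

w1d = x @ w1                            # shape (251, 10)

# One bias per hidden node: broadcast across all 251 rows.
b1 = np.full((n_hidden,), 0.1)
pre_activation = w1d + b1
print(pre_activation.shape)             # (251, 10)

# A bias shaped like the weight matrix cannot be added to w1d.
bad_b1 = np.full((n_features, n_hidden), 0.1)
try:
    w1d + bad_b1
except ValueError as err:
    print("shape mismatch:", err)
```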
Without the non-linearity (sigmoid) afterward, this is called an affine transformation, which you can read more about here: https://en.wikipedia.org/wiki/Affine_transformation.
Another fantastic resource is the Stanford Deep Learning Tutorial, which has a good explanation of how these feed-forward models work here:
http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/.
Hope that helped!
I think your b1 should just be of dimension 10, and then your code should run.
Since 4 is the number of features and 10 is the number of neurons in your first layer (I think in terms of a neural net ...),
you must add a bias of dimension 10.
You can also view the biases as adding an extra feature with a constant value of 1.
See this PDF if you have time; it explains this very well: https://cs.stanford.edu/~quocle/tutorial1.pdf

Are these functions equivalent in TensorFlow?

I'm a newbie with TensorFlow and I've been studying it for the last few days.
I want to understand if the following two functions are equivalent or not:
1.
softmax = tf.add(tf.matmul(x, weights), biases, name=scope.name)
2.
softmax = tf.nn.softmax(tf.matmul(x, weights) + biases, name=scope.name)
If they are in fact different, what is the main difference?
softmax1 = tf.add(tf.matmul(x, weights), biases, name=scope.name)
is not equal to
softmax2 = tf.nn.softmax(tf.matmul(x, weights) + biases, name=scope.name)
since softmax1 has no softmax calculation at all while softmax2 does. See the Tensorflow API for tf.nn.softmax. The general idea of a softmax is that it normalizes the input by rescaling the whole data sequence ensuring their entries are in the interval (0, 1) and the sum is 1.
The only thing that is equal between the two statements is the basic calculation. + does the same thing tf.add does so tf.add(tf.matmul(x, weights), biases) is equal to tf.matmul(x, weights) + biases.
EDIT: To add some clarification (I think you do not really know what softmax is doing?):
tf.matmul(x, W) + bias
Calculates a matrix multiplication between x (your input vector) and W the weights for the current layer. Afterwards the bias is added.
This calculation models the activation of one layer. Additionally you have an activation function, like the sigmoid function which transforms your activation. So for one layer you normally do something like this:
h1 = tf.sigmoid(tf.matmul(x, W) + bias)
Here h1 would be the activation of this layer.
The softmax operation simply rescales your input. E.g., if you got this activation on your output layer:
output = [[1.0, 2.0, 3.0, 5.0, 0.5, 0.2]]
The softmax rescales this input so that the values fit in the interval (0, 1) and sum to 1:
tf.nn.softmax(output)
> [[ 0.01497873, 0.0407164 , 0.11067866, 0.81781083, 0.00908506,
0.00673038]]
tf.reduce_sum(tf.nn.softmax(output))
> 1.0
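For reference, the numbers above can be reproduced in plain Python. This is a sketch of the softmax formula, exp(v_i) divided by the sum of exp(v_j), not TensorFlow's actual implementation; subtracting the maximum first is a common trick for numerical stability and does not change the result.

```python
import math


def softmax(values):
    """Return exp(v) / sum(exp(v)) for a list of floats."""
    shift = max(values)                          # for numerical stability only
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0, 5.0, 0.5, 0.2]
probs = softmax(logits)

print([round(p, 4) for p in probs])   # [0.015, 0.0407, 0.1107, 0.8178, 0.0091, 0.0067]
```

The largest input (5.0) receives most of the probability mass, and the ordering of the inputs is preserved.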
