I have been following a tutorial that shows how to make a word2vec model.
This tutorial uses this piece of code:
similarity = merge([target, context], mode='cos', dot_axes=0) (no other info was given, but I suppose this comes from keras.layers)
Now, I've researched the merge method a bit, but I couldn't find much about it.
From what I understand, it has been replaced by a number of dedicated layers such as layers.Add(), layers.Concatenate(), and so on.
What should I use? There's layers.Dot(), which has an axes parameter (which seems to be what I need) but no mode parameter.
What can I use in this case?
The Dot layer in Keras now supports built-in cosine similarity via the normalize=True argument.
From the Keras Docs:
keras.layers.Dot(axes, normalize=True)
normalize: Whether to L2-normalize samples along the dot product axis before taking the dot product. If set to True, then the output of the dot product is the cosine proximity between the two samples.
Source
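As a concrete sketch (the vocabulary size, embedding dimension, and input pipeline below are placeholder assumptions, not values from the tutorial), the old merge([target, context], mode='cos', dot_axes=0) line can be replaced roughly like this:
from keras.layers import Input, Embedding, Reshape, Dot
from keras.models import Model

vocab_size, vector_dim = 10000, 300  # hypothetical sizes

input_target = Input((1,))
input_context = Input((1,))

embedding = Embedding(vocab_size, vector_dim)
target = Reshape((vector_dim,))(embedding(input_target))
context = Reshape((vector_dim,))(embedding(input_context))

# normalize=True L2-normalizes both vectors first, so the dot product is the cosine similarity.
similarity = Dot(axes=1, normalize=True)([target, context])

model = Model(inputs=[input_target, input_context], outputs=similarity)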
There are a few things that are unclear from the Keras documentation that I think are crucial to understanding:
For each function in the Keras documentation for Merge, there is a lowercase and an uppercase version defined, e.g. add() and Add().
On Github, farizrahman4u outlines the differences:
Merge is a layer.
Merge takes layers as input.
Merge is usually used with Sequential models.
merge is a function.
merge takes tensors as input.
merge is a wrapper around Merge.
merge is used in the Functional API.
Using Merge:
left = Sequential()
left.add(...)
left.add(...)
right = Sequential()
right.add(...)
right.add(...)
model = Sequential()
model.add(Merge([left, right]))
model.add(...)
Using merge:
a = Input((10,))
b = Dense(10)(a)
c = Dense(10)(a)
d = merge([b, c])
model = Model(a, d)
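For reference, in recent Keras releases the generic merge function has been removed as well; a rough equivalent of the snippet above (assuming the Keras 2 functional API) uses the dedicated lowercase helpers instead:
from keras.layers import Input, Dense, concatenate
from keras.models import Model

a = Input((10,))
b = Dense(10)(a)
c = Dense(10)(a)
# concatenate() is the functional counterpart of the Concatenate layer;
# add(), multiply(), dot(), etc. work the same way.
d = concatenate([b, c])
model = Model(a, d)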
To answer your question, since Merge has been deprecated, we have to define and build a layer ourselves for the cosine similarity. In general this will involve using those lowercase functions, which we wrap within a Lambda to create a layer that we can use within a model.
I found a solution here:
from keras import backend as K
from keras.layers import Lambda

def cosine_distance(vests):
    x, y = vests
    x = K.l2_normalize(x, axis=-1)
    y = K.l2_normalize(y, axis=-1)
    # Mean (not sum) of the elementwise product of the L2-normalized vectors,
    # negated, so the result is proportional to the negative cosine similarity.
    return -K.mean(x * y, axis=-1, keepdims=True)

def cos_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)

distance = Lambda(cosine_distance, output_shape=cos_dist_output_shape)([processed_a, processed_b])
Depending on your data, you may want to remove the L2 normalization. What is important to note about the solution is that it is built using Keras backend functions, e.g. K.mean() - I think this is necessary when defining custom layers or even loss functions.
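As a usage sketch (the input shapes and Dense branches are placeholders for whatever sub-networks produce processed_a and processed_b in your model), the Lambda layer plugs into a model like any other layer:
from keras.layers import Input, Dense, Lambda
from keras.models import Model

input_a = Input(shape=(10,))
input_b = Input(shape=(10,))

# Stand-ins for the two branches whose outputs we want to compare.
processed_a = Dense(16, activation='relu')(input_a)
processed_b = Dense(16, activation='relu')(input_b)

distance = Lambda(cosine_distance,
                  output_shape=cos_dist_output_shape)([processed_a, processed_b])

model = Model(inputs=[input_a, input_b], outputs=distance)
model.compile(optimizer='adam', loss='mse')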
Hope I was clear, this was my first SO answer!
Maybe this will help you (I spent a lot of time making sure that these two are the same thing):
import tensorflow as tf

with tf.device('/CPU:0'):
    print(tf.losses.CosineSimilarity()([1.0, 1.0, 1.0, -1.0], [4.0, 4.0, 4.0, 5.0]))
    print(tf.keras.layers.dot([tf.Variable([[1.0, 1.0, 1.0, -1.0]]), tf.Variable([[4.0, 4.0, 4.0, 5.0]])], axes=1, normalize=True))
Output (Pay attention to the sign):
tf.Tensor(-0.40964404, shape=(), dtype=float32)
tf.Tensor([[0.40964404]], shape=(1, 1), dtype=float32)
If you alter the last code block of the tutorial as follows, you can see that the (average) loss decreases nicely with the Dot solution suggested by SantoshGuptaz7 (in a comment on the question above):
display_after_epoch = 10000
display_after_epoch_2 = 10 * display_after_epoch
loss_sum = 0

for cnt in range(epochs):
    idx = np.random.randint(0, len(labels) - 1)
    arr_1[0,] = word_target[idx]
    arr_2[0,] = word_context[idx]
    arr_3[0,] = labels[idx]
    loss = model.train_on_batch([arr_1, arr_2], arr_3)
    loss_sum += loss

    if cnt % display_after_epoch == 0 and cnt != 0:
        print("\nIteration {}, loss={}".format(cnt, loss_sum / cnt))
        loss_sum = 0
    if cnt % display_after_epoch_2 == 0:
        sim_cb.run_sim()
Given a neural network with weights theta and inputs x, I am interested in calculating the partial derivatives of the network's output w.r.t. x, so that I can use the result when training the weights theta with a loss that depends both on the output and on these partial derivatives. I figured out how to calculate the partial derivatives following this post. I also found this post that explains how to use sympy to achieve something similar; however, adapting it to a neural-network context within PyTorch seems like a huge amount of work and a recipe for very slow code.
Thus, I tried something different, which failed. As a minimal example, I created a function (substituting my neural network)
theta = torch.ones([3], requires_grad=True, dtype=torch.float32)

def trainable_function(time):
    return theta[0] * time**3 + theta[1] * time**2 + theta[2] * time
Then, I defined a second function to give me partial derivatives:
def trainable_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = trainable_function(deriv_time)
    gradient = torch.autograd.grad(fun_value, deriv_time, create_graph=True, retain_graph=True)
    deriv_time.requires_grad = False
    return gradient
Given some noisy observations of the derivatives, I now try to train theta. For simplicity, I create a loss that only depends on the derivatives. In this minimal example, the derivatives are used directly as observations, not as regularization, to avoid complicated loss functions that are beside the point.
from torch.optim import Adam

def objective(train_times, observations):
    predictions = torch.squeeze(torch.tensor([trainable_derivative(a) for a in train_times]))
    return torch.sum((predictions - observations)**2)

optimizer = Adam([theta], lr=0.1)

for iteration in range(200):
    optimizer.zero_grad()
    loss = objective(data_times, noisy_targets)
    loss.backward()
    optimizer.step()
Unfortunately, when running this code, I get the error
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I suppose that when calculating the partial derivatives in the way I do, I do not really create a computational graph through which autodiff could differentiate. Thus, the connection to the parameters theta somehow gets lost, and it now looks to the optimizer as if the loss is completely independent of the parameters theta. However, I could be totally wrong.
Does anyone know how to fix this?
Is it possible to include this type of derivatives in the loss function in pytorch?
And if so, what would be the most pytorch-style way of doing this?
Many thanks for your help and advice, it is much appreciated.
For completeness:
To run the above code, some training data needs to be generated. I used the following code, which works perfectly and has been tested against the analytical derivatives:
true_a = 1
true_b = 1
true_c = 1

def true_function(time):
    return true_a * time**3 + true_b * time**2 + true_c * time

def true_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = true_function(deriv_time)
    return torch.autograd.grad(fun_value, deriv_time)
data_times = torch.linspace(0, 1, 500)
true_targets = torch.squeeze(torch.tensor([true_derivative(a) for a in data_times]))
noisy_targets = torch.tensor(true_targets) + torch.randn_like(true_targets)*0.1
Your approach to the problem appears overly complicated.
I believe that what you're trying to achieve is within reach in PyTorch.
I include here a simple code snippet that I believe showcases what you would like to do:
import torch
import torch.nn as nn
# Data and Function
torch.manual_seed(0)
input_dim = 1
output_dim = 2
n = 10 # batchsize
simple_function = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid())
t = (torch.arange(n).float() / n).view(n, 1)
x = torch.randn(n, output_dim)
t.requires_grad = True
# Actual computation
xhat = simple_function(t)
# The full Jacobian has shape (n, output_dim, n, input_dim); the indexing below
# picks out d(xhat_i)/d(t_i) for each sample i, giving an (n, output_dim) tensor.
jac = torch.autograd.functional.jacobian(simple_function, t, create_graph=True)
grad = jac[torch.arange(n), :, torch.arange(n), 0]
loss = (x - xhat).pow(2).sum() + grad.pow(2).sum()
loss.backward()
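As a rough sketch of how this could sit inside a training loop (continuing directly from the snippet above; the optimizer choice and number of steps are arbitrary assumptions):
import torch

optimizer = torch.optim.Adam(simple_function.parameters(), lr=0.01)

for step in range(100):
    optimizer.zero_grad()
    xhat = simple_function(t)
    # create_graph=True keeps the Jacobian differentiable w.r.t. the network
    # parameters, so the gradient penalty can be part of the loss.
    jac = torch.autograd.functional.jacobian(simple_function, t, create_graph=True)
    grad = jac[torch.arange(n), :, torch.arange(n), 0]
    loss = (x - xhat).pow(2).sum() + grad.pow(2).sum()
    loss.backward()
    optimizer.step()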
I am trying to construct a Keras model model_B that outputs the output of another Keras model model_A. Now, the output of model_A is computed from the concatenation of several tensors coming from multiple Keras embedding layers with different vocabulary sizes. Models model_A and model_B are essentially the same.
Problem: When I train model_A, everything works fine. However, when I train model_B on the same dataset, I get the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError:
indices[1] = 3 is not in [0, 2) [[{{node model_1/embedding_1/embedding_lookup}}]]
Essentially, the error is saying that the index of a word is outside of the expected vocabulary, but this is not the case. Could someone clarify why this happens?
Here is a reproducible example of the problem:
from keras.layers import Input, Dense, Lambda, Concatenate, Embedding
from keras.models import Model
import numpy as np
# Constants
A = 2
vocab_sizes = [2, 4]
# Architecture
X = Input(shape=(A,))
embeddings = []
for a in range(A):
    X_a = Lambda(lambda x: x[:, a])(X)
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(X_a)
    embeddings.append(embedding)
h = Concatenate()(embeddings)
h = Dense(1)(h)
# Model A
model_A = Model(inputs=X, outputs=h)
model_A.compile('sgd', 'mse')
# Model B
Y = Input(shape=(A,))
model_B = Model(inputs=Y, outputs=model_A(Y))
model_B.compile('sgd', 'mse')
# Dummy dataset
x = np.array([[vocab_sizes[0] - 1, vocab_sizes[1] - 1]])
y = np.array([1])
# Train models
model_A.fit(x, y, epochs=10) # Works well
model_B.fit(x, y, epochs=10) # Fails
From the error above, it somehow seems that the input x[:, 1] is wrongly being fed to the first embedding layer with vocabulary size 2, as opposed to the second. Interestingly, when I swap the vocabulary sizes (e.g. set vocab_sizes = [4, 2]) it works, supporting the previous hypothesis.
For some weird reason, building the slices in a Python loop causes this error.
You can replace your slicing with tf.split, make the necessary adjustments, and it will work well:
Extra imports:
import tensorflow as tf
from keras.layers import Flatten
# Architecture
X = Input(shape=(A,))
X_as = Lambda(lambda x: tf.split(x, A, axis=1))(X)
embeddings = []
for a, x in enumerate(X_as):
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(x)
    embeddings.append(embedding)
h = Concatenate(axis=1)(embeddings)
h = Flatten()(h)
h = Dense(1)(h)
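With that change, both models can be rebuilt and trained on the dummy data from the question (a quick sanity check, reusing the names defined there):
# Rebuild both models on top of the tf.split-based architecture above.
model_A = Model(inputs=X, outputs=h)
model_A.compile('sgd', 'mse')

Y = Input(shape=(A,))
model_B = Model(inputs=Y, outputs=model_A(Y))
model_B.compile('sgd', 'mse')

model_A.fit(x, y, epochs=10)  # works as before
model_B.fit(x, y, epochs=10)  # now also works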
Why does this happen?
Well, it's very hard to guess. My assumption is that the system is trying to apply the Lambda layer using the current value of the variable a instead of the value it had when the layer was defined (this should not be happening, I guess, but I had exactly this problem once when loading a model: one of the variables kept its last value when loading the model instead of having the looped value).
One thing that supports this explanation is trying constants instead of a:
# Architecture
X = Input(shape=(A,))
embeddings = []
X_a1 = Lambda(lambda x: x[:, 0], name='lamb_' + str(0))(X)
X_a2 = Lambda(lambda x: x[:, 1], name='lamb_' + str(1))(X)
xs = [X_a1, X_a2]
for a, X_a in enumerate(xs):
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(X_a)
    embeddings.append(embedding)
h = Concatenate()(embeddings)
h = Dense(1)(h)
Solution if you want to avoid tf.split
Another thing that works (and supports the explanation that the Lambda might be using the last value of a in your code for model_B) is moving the entire loop inside the Lambda layer; this way, a doesn't get any unexpected values:
# Architecture
X = Input(shape=(A,))
X_as = Lambda(lambda x: [x[:, a] for a in range(A)])(X)
embeddings = []
for a, X_a in enumerate(X_as):
    embedding = Embedding(input_dim=vocab_sizes[a],
                          output_dim=1)(X_a)
    embeddings.append(embedding)
h = Concatenate()(embeddings)
h = Dense(1)(h)
I believe the following is happening:
(1) When you do the initial "for loop" over the Lambda function, you are initializing the constant tensors which feed into the "strided_slice" operator that extracts either the [:,0] or [:,1] slice correctly. Using the global variable "a" in the Lambda function is probably "risky" but works okay in this instance. Furthermore, I believe the function is stored in bytecode as "lambda x: x[:, a]", so it will look up whatever the value of "a" is at the time of evaluation. "a" could be anything at that point, so this can be problematic in some cases (a minimal sketch of this behaviour follows after point (3)).
(2) When you build the first model (model_A), the constant tensors are not reinitialized, so the lambda functions (the strided_slice operators) have the correct values (0 and 1), which were initialized in the "for loop."
(3) When you build the second model (model_B), the constant tensors are reinitialized. However, at this time, the value of "a" is 1 (as stated in some of the other commentary), because that is its final value after the original "for loop." In fact, you can set a=0 just before defining model_B, and you'll get behavior in which both Lambdas extract [:,0] and feed it to the embedding layers. My speculation for this difference in behavior is that it is related to calling model_A(Y) in this case, whereas for the first model you only specified the output layer "h" and didn't call the model there to produce the output (a difference I believe was also suggested by other commentary).
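The late-binding behaviour described in points (1) and (3) is plain Python, independent of Keras or TensorFlow; a minimal sketch:
# Each lambda looks up `a` when it is *called*, not when it is defined,
# so both closures end up seeing the final loop value.
funcs = []
for a in range(2):
    funcs.append(lambda x: x[a])

print(funcs[0]([10, 20]), funcs[1]([10, 20]))  # prints: 20 20

# Binding `a` as a default argument freezes the intended value.
funcs = [lambda x, a=a: x[a] for a in range(2)]
print(funcs[0]([10, 20]), funcs[1]([10, 20]))  # prints: 10 20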
I'll say that I verified this state of affairs by putting in some print statements in the file "frameworks/constant_op.py" during the operator initialization step and obtained debug statements with values and sequences consistent with what I stated above.
I hope this helps.
I have two tensors from which I am calculating the Spearman's rank correlation, and I would like PyTorch to automatically adjust the values in these tensors so that the Spearman's rank correlation is as high as possible.
I have explored autograd but nothing I've found has explained it simply enough.
Initialized tensors:
a = Var(torch.randn(20, 1), requires_grad=True)
psfm_s = Var(torch.randn(12, 20), requires_grad=True)
How can I set up a loop that keeps adjusting the values in these two tensors so that the two lists I build from them reach the highest possible Spearman's rank correlation, with PyTorch doing the work? I just need a pointer in the right direction. Thank you!
I'm not familiar with Spearman's Rank Correlation, but if I understand your question you're asking how to use PyTorch to solve problems other than deep networks?
If that's the case then I'll provide a simple least squares example which I believe should be informative to your effort.
Consider a set of 200 measurements of 10-dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least-squares approach says we can accomplish this by finding the matrix M and vector b that minimize ||y - (Mx + b)||².
The following example code generates some example data and then uses pytorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim
# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise
# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))
# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)
for i in range(500):
    # compute loss that we want to minimize
    y_hat = torch.matmul(M, x) + b
    loss = torch.mean((y - y_hat)**2)

    # zero the gradients of the parameters referenced by the optimizer (M and b)
    optimizer.zero_grad()

    # compute new gradients
    loss.backward()

    # update parameters M and b
    optimizer.step()

    if (i + 1) % 100 == 0:
        # scale learning rate by factor of 0.9 every 100 steps
        optimizer.param_groups[0]['lr'] *= 0.9
        print('step', i + 1, 'mse:', loss.item())
# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed-form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems that are not necessarily neural-network related. I also chose to explicitly define the matrix M and vector b here rather than using an equivalent nn.Linear layer, since I think that would just confuse things.
In your case you want to maximize something so make sure to negate your objective function before calling backward.
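In code that just means flipping the sign before the backward pass. A minimal sketch (my_correlation is a placeholder; here it computes a Pearson correlation as a differentiable stand-in, since the exact Spearman rank correlation involves ranking and is not directly differentiable):
import torch

def my_correlation(u, v):
    # Placeholder objective: Pearson correlation of two 1-D tensors.
    u = u - u.mean()
    v = v - v.mean()
    return (u * v).sum() / (u.norm() * v.norm() + 1e-8)

a = torch.randn(20, requires_grad=True)
b = torch.randn(20, requires_grad=True)
optimizer = torch.optim.SGD([a, b], lr=0.1)

for _ in range(100):
    optimizer.zero_grad()
    loss = -my_correlation(a, b)  # negate: minimizing this maximizes the correlation
    loss.backward()
    optimizer.step()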
I am trying to build a custom loss function in Keras. Unfortunately I have little knowledge of TensorFlow. Is there a way I can convert the incoming tensors into a numpy array so I can compute my loss function?
Here is my function:
def getBalance(x_true, x_pred):
    x_true = np.round(x_true)
    x_pred = np.round(x_pred)

    NumberOfBars = len(x_true)
    NumberOfHours = NumberOfBars / 60

    TradeIndex = np.where(x_pred[:, 1] == 0)[0]

    # remove predictions that are not tradable
    x_true = np.delete(x_true[:, 0], TradeIndex)
    x_pred = np.delete(x_pred[:, 0], TradeIndex)

    CM = confusion_matrix(x_true, x_pred)

    correctPredictions = CM[0, 0] + CM[1, 1]
    wrongPredictions = CM[1, 0] + CM[0, 1]
    TotalTrades = correctPredictions + wrongPredictions
    Accuracy = (correctPredictions / TotalTrades) * 100

    return Accuracy
If it's not possible to use numpy arrays, what is the best way to compute that function with TensorFlow? Any direction would be greatly appreciated, thank you!
Edit 1:
Here are some details of my model. I am using an LSTM network with heavy dropout. The inputs are multi-variable, multi-time-step.
The outputs are a 2D array of binary digits (20000, 2).
model = Sequential()
model.add(Dropout(0.4, input_shape=(train_input_data_NN.shape[1], train_input_data_NN.shape[2])))
model.add(LSTM(30, dropout=0.4, recurrent_dropout=0.4))
model.add(Dense(2))
model.compile(loss='getBalance', optimizer='adam')
history = model.fit(train_input_data_NN, outputs_NN, epochs=50, batch_size=64, verbose=1, validation_data=(test_input_data_NN, outputs_NN_test))
EDIT 1: Here is an untested substitution (I took the liberty of normalizing the variable names):
def get_balance(x_true, x_pred):
    x_true = K.tf.round(x_true)
    x_pred = K.tf.round(x_pred)

    # didn't see the need for these
    # NumberOfBars = len(x_true)
    # NumberOfHours = NumberOfBars / 60

    trade_index = K.tf.not_equal(x_pred[:, 1], 0)

    # remove predictions that are not tradable
    x_true_tradeable = K.tf.boolean_mask(x_true[:, 0], trade_index)
    x_pred_tradeable = K.tf.boolean_mask(x_pred[:, 0], trade_index)

    cm = K.tf.confusion_matrix(x_true_tradeable, x_pred_tradeable)

    correct_predictions = cm[0, 0] + cm[1, 1]
    wrong_predictions = cm[1, 0] + cm[0, 1]
    total_trades = correct_predictions + wrong_predictions
    accuracy = (correct_predictions / total_trades) * 100

    return accuracy
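One more thing to watch: a custom loss has to be passed to compile as the function object, not as a string (Keras only resolves strings to its built-in losses), so the compile line would become something like:
# Pass the function itself, not the string 'getBalance'.
model.compile(loss=get_balance, optimizer='adam')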
Original Answer
Welcome to SO. As you might know, we need to compute the gradient of the loss function, and we can't compute gradients through numpy arrays (they are just constants).
What is done (in TensorFlow/Theano, the backends used with Keras) is automatic differentiation on tensors (e.g. tf.placeholder()). This is not the entire story, but what you should know at this point is that TensorFlow/Theano gives us gradients by default on operators like tf.max and tf.sum.
What that means for you is all the operations on tensors (y_true and y_pred) should be rewritten to use tf / theano operators.
I'll comment with what I think would be rewritten and you can substitute accordingly and test.
See tf.round used as K.tf.round where K is the reference to the keras backend imported as
import keras.backend as K
x_true = np.round(x_true)
x_pred = np.round(x_pred)
Grab the shape of the tensor x_true with K.shape. Computing the ratio over a constant can remain as it is. Here:
NumberOfBars = len(x_true)
NumberOfHours = NumberOfBars/60
See tf.where used as K.tf.where
TradeIndex = np.where( x_pred[:,1] == 0 )[0]
You could mask the tensor with a condition instead of deleting entries - see masking
##remove predictions that are not tradable
x_true = np.delete(x_true[:,0], TradeIndex)
x_pred = np.delete(x_pred[:,0], TradeIndex)
See tf.confusion_matrix
CM = confusion_matrix(x_true, x_pred)
The computations that follow are computations over constants and so remain essentially the same (conditioned on whatever changes have to be made given the new API).
Hopefully I can update this answer with a valid substitution that runs, but I hope this sets you on the right path.
A suggestion on coding style: I see you use three versions of variable naming in your code; choose one and stick with it.
I am trying to create a model in which I want to predict the order of a certain set of documents given a certain query. My idea was basically to use a shared embedding layer for both the query and the documents, then merge the two "branches" using a cosine similarity between each document and the query (using a custom lambda). The loss function would then compute the difference between the expected position and the predicted similarity.
My question is: Is there a way to create Embeddings for a set of textual features (provided that they have the same length)?
I can properly transform my query into a "doc2vec-like embedding" by applying Embedding + Convolution1D + GlobalMaxPooling1D, but I had no luck using the same strategy on the sets of documents (and reshaping + 2D convolutions don't really make sense to me given that I am working with textual data).
Note that a constraint I have is that I need to use the same Embedding layer for both my query and the set of documents (I am using the Keras' functional apis to do so).
[EDIT, adding sample code]
Q = Input(shape=(5,))      # each query is made of 5 words
T = Input(shape=(50, 50))  # each search result is made of 50 docs, 50 words each

emb = Embedding(
    max_val,
    embedding_dims,
    dropout=embedding_dropout
)

left = emb(Q)
left = Convolution1D(nb_filter=5,
                     filter_length=5,
                     border_mode='valid',
                     activation='relu',
                     subsample_length=1)(left)
left = GlobalMaxPooling1D()(left)
print(left)

right = emb(T)  # <-- this is my problem, I don't really know what to do/apply here

def merger(vests):
    x, y = vests
    x = K.l2_normalize(x, axis=0)   # normalize rows
    y = K.l2_normalize(y, axis=-1)  # normalize the vector
    return tf.matmul(x, y)  # obviously throws an error because of mismatching matrix ranks

def cos_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (50, 1)

merger_f = Lambda(merger)

predictions = merge([left, right], output_shape=cos_dist_output_shape, mode=merger_f)

model = Model(input=[Q, T], output=predictions)

def custom_objective(y_true, y_pred):
    ordered_output = tf.cast(tf.nn.top_k(y_pred)[1], tf.float32)  # returns the indices of the top values
    return K.mean(K.square(ordered_output - y_true), axis=-1)

model.compile(optimizer='adam', loss=custom_objective)
[SOLUTION] Thanks to Nassim Ben: use TimeDistributed to apply a layer recurrently across the first dimension of a tensor, like this:
right = TimeDistributed(emb)(T)
right = TimeDistributed(Convolution1D(nb_filter=5,
                                      filter_length=5,
                                      border_mode='valid',
                                      activation='relu',
                                      subsample_length=1))(right)
right = TimeDistributed(GlobalMaxPooling1D())(right)
Alright. If I understand the situation correctly, you have 50 text snippets of length 50 that you want to embed.
After doing the word embeddings, you find yourself with a Tensor T of shape (50,50,emb_size).
What I would do is use an LSTM layer in a TimeDistributed wrapper, adding this line after emb(T):
right = TimeDistributed(LSTM(5))(right)
This will apply the same LSTM to each of the 50 documents and output a final state of length 5 at the end of each document processing. The shape of right after this step is (50,5). You have embedded each document in a length 5 vector.
The advantage of TimeDistributed is that the LSTM applied to each document will share the same weights so your documents will be 'treated' the same way. You can find documentation about LSTM here and about TimeDistributed here.
I hope this helps a bit.
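For completeness, once left has shape (5,) and right has shape (50, 5) per sample, the cosine-similarity merge itself can be written with backend ops instead of the merger/merge call from the question. This is an untested sketch reusing the names Q, T, left, right, and custom_objective from your code:
from keras import backend as K
from keras.layers import Lambda
from keras.models import Model

def cosine_merger(vests):
    query, docs = vests  # (batch, 5) and (batch, 50, 5)
    query = K.l2_normalize(query, axis=-1)
    docs = K.l2_normalize(docs, axis=-1)
    # batch_dot of (batch, 5) with (batch, 50, 5) over axes (1, 2) gives
    # (batch, 50): one cosine similarity per document.
    return K.batch_dot(query, docs, axes=(1, 2))

predictions = Lambda(cosine_merger, output_shape=(50,))([left, right])
model = Model(input=[Q, T], output=predictions)
model.compile(optimizer='adam', loss=custom_objective)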