Jacobian of decoder in VAE

Jacobian of decoder in VAE - python

I am new to both ML and this forum, so please be kind.
What I would like to compute is the jacobian of the decoder part of the VAE w.r.t. the latent vectors. I have found the jacobian_batch function in tensorflow.python.ops.parallel_for.gradients that in principle could do the job for me. However, I could not get this function to work.
More specifically, I tried:
xin = tf.constant(np.array(latent_vector[2]),dtype=tf.float32)
f_of_xin = tf.constant(decoder.predict(tf.Session().run(xin), batch_size = 1000000, verbose=1),dtype=tf.float32)
jac = batch_jacobian(f_of_xin,xin)
Which does not work (returns None shape), while:
f_of_xin = tf.sin(tf.sin(xin))
jac = batch_jacobian(f_of_xin,xin)
works (returns a 3x3 matrix (# of input & output dim = 3) with sensible numbers).
I tried:
f_of_xin = tf.sin(tf.sin(tf.constant(tf.Session().run(xin))))
as well, in which case the jacobian_batch function does not work anymore (which means that it returns a None shape). I guess it has to do with the data conversion, but I don't know how to feed a tf structure to the model.predict function. By the way, my model here (decoder) is just a simple neural network with 1 hidden layer. Could you help me?
Kind regards,
Melissa

Related

Finding patterns in time series with PyTorch

I started PyTorch with image recognition. Now I want to test (very basically) with pure NumPy arrays. I struggle with getting the setup to work, so basically I have vectors with values between 0 and 1 (normalized curves). Those vectors are always of length 1500 and I want to find e.g. "high values at the beginning" or "sine wave-like function", "convex", "concave" etc. stuff like that, so just shapes of those curves.
My training set consists of many vectors with their classes; I have chosen 7 classes. The net should be trained to classify a vector into one or more of those 7 classes (not one hot).
I'm struggling with multiple issues, but first my very basic Net
class Net(nn.Module):
def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
super(Net, self).__init__()
self.hidden_dim = hidden_dim
self.layer_dim = layer_dim
self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim)
self.fc = nn.Linear(self.hidden_dim, output_dim)
def forward(self, x):
h0 = torch.zeros(self.layer_dim, x.size(1), self.hidden_dim).requires_grad_()
out, h0 = self.rnn(x, h0.detach())
out = out[:, -1, :]
out = self.fc(out)
return out
network = Net(1500, 70, 20, 7)
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice? It is a time series, but then again it is an image recognition problem when plotting the curve.
Now, this here is an attempt to batch the data. The data object contains all training curves together with the correct classifiers.
def train(epoch):
network.train()
network.float()
batching = True
index = 0
# monitor the cummulative loss for an epoch
cummloss = []
# start batching some curves
while batching:
optimizer.zero_grad()
# here I start clustering come curves to a batch and normalize the curves
_input = []
batch_size = min(len(data)-1, index+batch_size_train) - index
for d in data[index:min(len(data)-1, index+batch_size_train)]:
y = np.array(d['data']['y'], dtype='d')
y = np.multiply(y, y.max())
y = y[0:1500]
y = np.pad(y, (0, max(1500-len(y), 0)), 'edge')
if len(_input) == 0:
_input = y
else:
_input = np.vstack((_input, y))
input = torch.from_numpy(_input).float()
input = torch.reshape(input, (1, batch_size, len(y)))
target = np.zeros((1,7))
# the correct classes have indizes, to I create a vector with 1 at the correct locations
for _index in np.array(d['classifier']):
target[0,_index-1] = 1
target = torch.from_numpy(target)
# get the result form the network
output = network(input)
# is this a good loss function?
loss = F.l1_loss(output, target)
loss.backward()
cummloss.append(loss.item())
optimizer.step()
index = index + batch_size_train
if index > len(data):
print(np.mean(cummloss))
batching = False
for e in range(1, n_epochs):
print('Epoch: ' + str(e))
train(0)
The problem I'm facing right now is, the loss doesn't change very little, even with hundreds of epochs.
Are there existing examples of this kind of problem? I didn't find any, just pure png/jpg image recognition. When I convert the curves to png then I have a little issue to train a net, I took densenet and it worked just fine but it seems to be super overkill for this simple task.

This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice?
In theory what model you choose does not matter as much as "How" you formulate your problem.
But in your case the most obvious limitation you're going to face is your sequence length: 1500. RNN store information across steps and typically runs into trouble over long sequence with vanishing or exploding gradient.
LSTM net have been developed to circumvent this limitations with memory cell, but even then in the case of long sequence it will still be limited by the amount of information stored in the cell.
You could try using a CNN network as well and think of it as an image.
Are there existing examples of this kind of problem?
I don't know but I might have some suggestions : If I understood your problem correctly, you're going from a (1500, 1) input to a (7,1) output, where 6 of the 7 positions are 0 except for the corresponding class where it's 1.
I don't see any activation function, usually when dealing with multi class you don't use the output of the dense layer to compute the loss you apply a normalizing function like softmax and then you can compute the loss.

From your description of features you have in the form of sin like structures, the closes thing that comes to mind is frequency domain. As such, if you have and input image, just transform it to the frequency domain by a Fourier transform and use that as your feature input.
Might be best to look for such projects on the internet, one such project that you might want to read the research paper or video from this group (they have some jupyter notebooks for you to try) or any similar works. They use the furrier features, that go though a multi layer perceptron (MLP).
I am not sure what exactly you want to do, but seems like a classification task, you would use RNN if you want your neural network to work with a sequence. To me it seems like the 1500 dimensions are independent, and as such can be just treated as input.
Regarding the last layer, for a classification problem it usually is a probability distribution obtained by applying softmax (if only the classification is distinct - i.e. probability sums up to 1), in which, given an input, the net gives a probability of it being from each class. If we are predicting multiple classes we are going to use sigmoid as the last layer of the neural network.
Regarding your loss, there are many losses you can try and see if they are better. Once again, for different features you have to know what exactly is the measurement of distance (a.k.a. how different 2 things are). Check out this website, or just any loss function explanations on the net.
So you should try a simple MLP on top of fourier features as a starting point, assuming that is your feature vector.

Image Recognition is different from Time-Series data. In the imaging domain your data-set might have more similarity with problems like Activity-Recognition, Video-Recognition which have temporal component. So, I'd recommend looking into some models for those.
As for the current model, I'd recommend using LSTM instead of RNN. And also for classification you need to use an activation function in your final layer. This should softmax with cross entropy based loss or sigmoid with MSE loss.
Keras has a Timedistributed model which makes it easy to handle time components. You can use a similar approach with Pytorch by applying linear layers followed by LSTM.
Look into these for better undertsanding ::
Activity Recognition : https://www.narayanacharya.com/vision/2019-12-30-Action-Recognition-Using-LSTM
https://discuss.pytorch.org/t/any-pytorch-function-can-work-as-keras-timedistributed/1346
How to implement time-distributed dense (TDD) layer in PyTorch
Activation Function ::
https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html

Batch inference of softmax does not sum to 1

I am working with REINFORCE algorithm with PyTorch. I noticed that the batch inference/predictions of my simple network with Softmax doesn’t sum to 1 (not even close to 1). I am attaching a minimum working code so that you can reproduce it. What am I missing here?
import numpy as np
import torch
obs_size = 9
HIDDEN_SIZE = 9
n_actions = 2
np.random.seed(0)
model = torch.nn.Sequential(
torch.nn.Linear(obs_size, HIDDEN_SIZE),
torch.nn.ReLU(),
torch.nn.Linear(HIDDEN_SIZE, n_actions),
torch.nn.Softmax(dim=0)
)
state_transitions = np.random.rand(3, obs_size)
state_batch = torch.Tensor(state_transitions)
pred_batch = model(state_batch) # WRONG PREDICTIONS!
print('wrong predictions:\n', *pred_batch.detach().numpy())
# [0.34072137 0.34721774] [0.30972624 0.30191955] [0.3495524 0.3508627]
# DOES NOT SUM TO 1 !!!
pred_batch = [model(s).detach().numpy() for s in state_batch] # CORRECT PREDICTIONS
print('correct predictions:\n', *pred_batch)
# [0.5955179 0.40448207] [0.6574412 0.34255883] [0.624833 0.37516695]
# DOES SUM TO 1 AS EXPECTED

Although PyTorch lets us get away with it, we don’t actually provide an input with the right dimensionality. We have a model that takes one input and produces one output, but PyTorch nn.Module and its subclasses are designed to do so on multiple samples at the same time. To accommodate multiple samples, modules expect the zeroth dimension of the input to be the number of samples in the batch.
Deep Learning with PyTorch
That your model works on each individual sample is an implementation nicety. You have incorrectly specified the dimension for the softmax (across batches instead of across the variables), and hence when given a batch dimension it is computing the softmax across samples instead of within samples:
nn.Softmax requires us to specify the dimension along which the softmax function is applied:
softmax = nn.Softmax(dim=1)
In this case, we have two input vectors in two rows (just like when we work with
batches), so we initialize nn.Softmax to operate along dimension 1.
Change torch.nn.Softmax(dim=0) to torch.nn.Softmax(dim=1) to get appropriate results.

Attention is all you need, keeping only the encoding part for video classification

I am trying to modify a code that could find in the following link in such a way that the proposed Transformer model that is related to the paper: all you need is attention would keep only the Encoder part of the whole Transformer model. Furthermore, I would like to modify the input of the Network, instead of being a sequence of text to be a sequence of images (or better-extracted features of images) coming from a video. In a sense, I would like to figure out which frames are related to each other from my input and encode that info in an output embedding in the same way that is happening to the Transformers model.
The project as it is in the link provided is mainly performing sequence-sequence transformation. The input is text from one language and the output is text in another language. The main formation of the model is happening in the lines 386-463. Where the model is initialized and the compile of the Model is happening. For me I would like to do something like:
#414-416
self.encoder = SelfAttention(d_model, d_inner_hid, n_head, layers, dropout)
#self.decoder = Decoder(d_model, d_inner_hid, n_head, layers, dropout)
#self.target_layer = TimeDistributed(Dense(o_tokens.num(), use_bias=False))
#434-436
enc_output = self.encoder(src_emb, src_seq, active_layers=active_layers)
#dec_output = self.decoder(tgt_emb, tgt_seq, src_seq, enc_output, active_layers=active_layers)
#final_output = self.target_layer(dec_output)
Furthermore, since I would like to combine the output of the Encoder which is the output of MultiHeadAttention and PositionwiseFeedForward using an LSTM and a Dense layer which will tune the whole Encoding procedure using classification optimization. Therefore, I add when I define my model the following layers:
self.lstm = LSTM(units = 256, input_shape = (None, 256), return_sequences = False, dropout = 0.5)
self.fc1 = Dense(64, activation='relu', name = "dense_one")
self.fc2 = Dense(6, activation='sigmoid', name = "dense_two")
and then pass the output of the encoder, in line 434 using the following code:
enc_output = self.lstm(enc_output)
enc_output = self.fc1(enc_output)
enc_output = self.fc2(enc_output)
Now the video data that I would like to replace the text data provided with the Github code, have the following dimensionality: Nx10x256 where N is the number of samples, 10 is the number of frames and 256 the number of features for each frame. I have some difficulties to understand some parts of the code, in order to successfully, modified it to my needs. I guess, that now the Embedding layer is not necessary for me anymore since it is related to text classification and NLP.
Furthermore, I need to modify the input to 419-420 to be sth like:
src_seq_input = Input(shape=(None, 256,), dtype='float32') # source input related to video
tgt_seq_input = Input(shape=(6,), dtype='int32') # the target classification size (since I have 6 classes)
What other parts of the code do I need to skip or modify? What is the usefulness of the PosEncodingLayer that is used in the following line:
self.pos_emb = PosEncodingLayer(len_limit, d_emb) if self.src_loc_info else None
Is it needed in my case? Can I skip it?
After my modification in the code I noticed that when I run the code, I can check the loss function from the def get_loss(y_pred, y_true), however, in my case it is crucial to define a loss for the classification task that returns also the accuracy. How can I do so, with the provided code?
Edit:
I have to add that I treat my input as the output of the Embedding layer from the initial NLP code. Therefore, for me (in the version of code that functioned for me):
src_seq_input = Input(shape=(None, 256,), dtype='float32')
tgt_seq_input = Input(shape=(6,), dtype='int32')
src_seq = src_seq_input
#src_emb_ = self.i_word_emb(src_seq)
src_emb = src_seq
enc_output = self.encoder(src_emb, src_emb, active_layers=active_layers)
I treat src_emb as my input and completely ignore src_seq.
Edit:
The way that the loss is calculated is using the following code:
def get_loss(y_pred, y_true):
y_true = tf.cast(y_true, 'int32')
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
mask = tf.cast(tf.not_equal(y_true, 0), 'float32')
loss = tf.reduce_sum(loss * mask, -1) / tf.reduce_sum(mask, -1)
loss = K.mean(loss)
return loss
loss = get_loss(enc_output, tgt_seq_input)
self.ppl = K.exp(loss)
Edit:
As it is the loss function (sparse_softmax_cross_entropy_with_logits) returns a loss score. Even if the whole procedure is about classification. How, can I further, tune my system to return also the accuracy?

I'm afraid this approach is not going to work.
Video data has massive dependence between adjacent frames, with each frame very similar to the last. There is also a weaker dependence on prior frames, because objects tend to continue to move relative to other objects in similar ways. Modern video formats use this redundancy to achieve high compression rates by modelling the motions.
This means that your network will have an extremely strong attention on the previous image. As you suggest, you could subsample frames several seconds apart to destroy much of the dependence on the previous frame, but if you did so I really wonder whether you would find structure at all in the result? Even if you feed it hand-coded features optimised for the purpose, there are are few general rules about which features will be in motion and which will not, so what structure can your attention network learn?
The problem of handling video is just radically different from handling sentences. Video has very complex elements (pictures) that are largely static over time and have locally predictable motions over a few frames in very simple ways. Text has simple elements (words) in a complex sentence structure with complex dependence extending over many words. These differences mean they require fundamentally different approaches.

Why my one-filter convolutional neural network is unable to learn a simple gaussian kernel?

I was surprised that the deep learning algorithms I had implemented did not work, and I decided to create a very simple example, to understand the functioning of CNN better. Here is my attempt of constructing a small CNN for a very simple task, which provides unexpected results.
I have implemented a simple CNN with only one layer of one filter. I have created a dataset of 5000 samples, the inputs x being 256x256 simulated images, and the outputs y being the corresponding blurred images (y = signal.convolvded2d(x,gaussian_kernel,boundary='fill',mode='same')).
Thus, I would like my CNN to learn the convolutional filter which would transform the original image into its blurred version. In other words, I would like my CNN to recover the gaussian filter I used to create the blurred images. Note: As I want to 'imitate' the convolution process such as it is described in the mathematical framework, I am using a gaussian filter which has the same size as my images: 256x256.
It seems to me quite an easy task, and nonetheless, the CNN is unable to provide the results I would expect. Please find below the code of my training function and the results.
# Parameters
size_image = 256
normalization = 1
sigma = 7
n_train = 4900
ind_samples_training =np.linspace(1, n_train, n_train).astype(int)
nb_epochs = 5
minibatch_size = 5
learning_rate = np.logspace(-3,-5,nb_epochs)
tf.reset_default_graph()
tf.set_random_seed(1)
seed = 3
n_train = len(ind_samples_training)
costs = []
# Create Placeholders of the correct shape
X = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'X')
Y_blur_true = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'Y_true')
learning_rate_placeholder = tf.placeholder(tf.float32, shape=[])
# parameters to learn --should be an approximation of the gaussian filter
filter_to_learn = tf.get_variable('filter_to_learn',\
shape = [size_image,size_image,1,1],\
dtype = tf.float64,\
initializer = tf.contrib.layers.xavier_initializer(seed = 0),\
trainable = True)
# Forward propagation: Build the forward propagation in the tensorflow graph
Y_blur_hat = tf.nn.conv2d(X, filter_to_learn, strides = [1,1,1,1], padding = 'SAME')
# Cost function: Add cost function to tensorflow graph
cost = tf.losses.mean_squared_error(Y_blur_true,Y_blur_hat,weights=1.0)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
opt_adam = tf.train.AdamOptimizer(learning_rate=learning_rate_placeholder)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
optimizer = opt_adam.minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
lr = learning_rate[0]
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
# Run the initialization
sess.run(init)
# Do the training loop
for epoch in range(nb_epochs):
minibatch_cost = 0.
seed = seed + 1
permutation = list(np.random.permutation(n_train))
shuffled_ind_samples = np.array(ind_samples_training)[permutation]
# Learning rate update
if learning_rate.shape[0]>1:
lr = learning_rate[epoch]
nb_minibatches = int(np.ceil(n_train/minibatch_size))
for num_minibatch in range(nb_minibatches):
# Minibatch indices
ind_minibatch = shuffled_ind_samples[num_minibatch*minibatch_size:(num_minibatch+1)*minibatch_size]
# Loading of the original image (X) and the blurred image (Y)
minibatch_X, minibatch_Y = load_dataset_blur(ind_minibatch,size_image, normalization, sigma)
_ , temp_cost, filter_learnt = sess.run([optimizer,cost,filter_to_learn],\
feed_dict = {X:minibatch_X, Y_blur_true:minibatch_Y, learning_rate_placeholder: lr})
I have run the training on 5 epochs of 4900 samples, with a batch size equal to 5. The gaussian kernel has a variance of 7^2=49.
I have tried to initialize the filter to be learnt both with the xavier initiliazer method provided by tensorflow, and with the true values of the gaussian kernel we actually would like to learn. In both cases, the filter that is learnt results too different from the true gaussian one as it can be seen on the two images available at https://github.com/megalinier/Helsinki-project.

By examining the photos it seems like the network is learning OK, as the predicted image is not so far off the true label - for better results you can tweak some hyperparams but that is not the case.
I think what you are missing is the fact that different kernels can get quite similar results since it is a convolution.
Think about it, you are multiplying some matrix with another, and then summing all the results to create a new pixel. Now if the true label sum is 10, it could be a results of 2.5 + 2.5 + 2.5 + 2.5 and -10 + 10 + 10 + 0.
What I am trying to say, is that your network could be learning just fine, but you will get a different values in the conv kernel than the filter.

I think this would better serve as a comment as it's somewhat speculative, but it's too long...
Hard to say what exactly is wrong but there could be multiple culprits here. For one, squared error provides a weak signal in the case that target and prediction are already quite similar -- and while the xavier-initalized filter looks quite bad, the predicted (filtered) image isn't too far off the target. You could experiment with other metrics such as absolute error (e.g. 1-norm instead of 2-norm).
Second, adding regularization should help, i.e. add a weight penalty to the loss function to encourage the filter values to become small where they are not needed. As it is, what I suppose happens is: The random values in the filter average out to about 0, leading to a similar "filtering" effect as if they were actually all 0. As such, the learning algorithm doesn't have much incentive to actually pull them to 0. By adding a weight penalty, you provide this incentive.
Third, it could just be Adam messing up. It is known to provide "strange" non-optimal solutions in some very simple (e.g. convex) problems. Maybe try default Gradient Descent with learning rate decay (and possibly momentum).

Keras, custom "how different is" loss function definition issue

I'm currently building a CNN-LSTM encoder-decoder anomaly detector by reconstruction on Keras, following the directions of of Malhotra et al but with the CNN encoder difference, still I'm aiming to use the loss(objective) function defined there as:
Where X is a sample of a time series of L steps, and x(i) is the i-th real vector and x'(i) the reconstructed. The total training set is that s_N.
I made the loss function, but I think it's not behaving well, so I appeal to you and your knowledge to see if is this my error source or I might need to find elsewhere:
def mirror_loss(y_true,y_pred):
diff = tf.square(tf.norm(tf.substract(y_true, y_pred), axis = 1))
return K.sum(diff, axis = -1)
It bothers me that I have to use both tensorflow and keras.backend because I wasn't able to find the "norm" on keras.backend.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.