Keras, custom "how different is" loss function definition issue

I'm currently building a CNN-LSTM encoder-decoder anomaly detector (by reconstruction) in Keras, following the directions of Malhotra et al. but with a CNN encoder instead. I am still aiming to use the loss (objective) function defined there as:
Where X is a sample of a time series of L steps, x(i) is the i-th real vector, and x'(i) is its reconstruction. The total training set is s_N.
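The formula itself did not survive in this copy of the question. Judging from the description above and from the code in the question, the objective is presumably the summed squared reconstruction error. The following is a reconstruction under that assumption, not a quote from the paper:

```latex
\sum_{X \in s_N} \sum_{i=1}^{L} \left\lVert x^{(i)} - x'^{(i)} \right\rVert^{2}
```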
I wrote the loss function, but I think it is not behaving well, so I appeal to you and your knowledge: is this the source of my error, or do I need to look elsewhere?
def mirror_loss(y_true, y_pred):
    diff = tf.square(tf.norm(tf.subtract(y_true, y_pred), axis=1))
    return K.sum(diff, axis=-1)
It bothers me that I have to use both tensorflow and keras.backend because I wasn't able to find the "norm" on keras.backend.
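One way around the missing norm is to note that a squared L2 norm is just a sum of squares, so no explicit norm call is needed. The NumPy sketch below checks that identity on random data; the array shape (batch, L, features) and the name mirror_loss_np are assumptions made for illustration.

```python
import numpy as np

def mirror_loss_np(y_true, y_pred):
    """Summed squared reconstruction error per sample.

    Assumes arrays of shape (batch, L, features).
    """
    diff = y_true - y_pred
    # ||v||^2 == sum(v**2), so no explicit norm (or square root) is needed.
    return np.sum(np.square(diff), axis=(1, 2))

rng = np.random.default_rng(0)
y_true = rng.normal(size=(4, 10, 3))
y_pred = rng.normal(size=(4, 10, 3))

# Reference: squared norm along axis 1, then summed over the last axis,
# which mirrors the structure of the tf.norm-based version.
reference = np.sum(np.square(np.linalg.norm(y_true - y_pred, axis=1)), axis=-1)
loss = mirror_loss_np(y_true, y_pred)
```

In Keras the same expression would presumably read K.sum(K.square(y_true - y_pred), axis=[1, 2]), which relies on keras.backend only.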

Related

Finding patterns in time series with PyTorch

I started PyTorch with image recognition. Now I want to test (very basically) with pure NumPy arrays. I struggle with getting the setup to work, so basically I have vectors with values between 0 and 1 (normalized curves). Those vectors are always of length 1500 and I want to find e.g. "high values at the beginning" or "sine wave-like function", "convex", "concave" etc. stuff like that, so just shapes of those curves.
My training set consists of many vectors with their classes; I have chosen 7 classes. The net should be trained to classify a vector into one or more of those 7 classes (not one hot).
I'm struggling with multiple issues, but first my very basic Net
class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(Net, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim)
        self.fc = nn.Linear(self.hidden_dim, output_dim)
    def forward(self, x):
        h0 = torch.zeros(self.layer_dim, x.size(1), self.hidden_dim).requires_grad_()
        out, h0 = self.rnn(x, h0.detach())
        out = out[:, -1, :]
        out = self.fc(out)
        return out
network = Net(1500, 70, 20, 7)
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice? It is a time series, but then again it is an image recognition problem when plotting the curve.
Now, this here is an attempt to batch the data. The data object contains all training curves together with the correct classifiers.
def train(epoch):
    network.train()
    network.float()
    batching = True
    index = 0
    # monitor the cumulative loss for an epoch
    cummloss = []
    # start batching some curves
    while batching:
        optimizer.zero_grad()
        # here I start clustering some curves into a batch and normalize the curves
        _input = []
        batch_size = min(len(data)-1, index+batch_size_train) - index
        for d in data[index:min(len(data)-1, index+batch_size_train)]:
            y = np.array(d['data']['y'], dtype='d')
            y = np.multiply(y, y.max())
            y = y[0:1500]
            y = np.pad(y, (0, max(1500-len(y), 0)), 'edge')
            if len(_input) == 0:
                _input = y
            else:
                _input = np.vstack((_input, y))
        input = torch.from_numpy(_input).float()
        input = torch.reshape(input, (1, batch_size, len(y)))
        target = np.zeros((1,7))
        # the correct classes have indices, so I create a vector with 1 at the correct locations
        for _index in np.array(d['classifier']):
            target[0,_index-1] = 1
        target = torch.from_numpy(target)
        # get the result from the network
        output = network(input)
        # is this a good loss function?
        loss = F.l1_loss(output, target)
        loss.backward()
        cummloss.append(loss.item())
        optimizer.step()
        index = index + batch_size_train
        if index > len(data):
            print(np.mean(cummloss))
            batching = False
for e in range(1, n_epochs):
    print('Epoch: ' + str(e))
    train(0)
The problem I'm facing right now is that the loss changes very little, even after hundreds of epochs.
Are there existing examples of this kind of problem? I didn't find any, just pure PNG/JPG image recognition. When I convert the curves to PNG, I have little trouble training a net: I took DenseNet and it worked just fine, but it seems to be super overkill for this simple task.

This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice?
In theory, which model you choose does not matter as much as how you formulate your problem.
But in your case the most obvious limitation you are going to face is your sequence length: 1500. An RNN stores information across steps and typically runs into trouble over long sequences because of vanishing or exploding gradients.
LSTM networks were developed to circumvent this limitation with a memory cell, but even then, for long sequences, they are still limited by the amount of information that can be stored in the cell.
You could also try a CNN and treat the curve as an image.
Are there existing examples of this kind of problem?
I don't know, but I have some suggestions. If I understood your problem correctly, you're going from a (1500, 1) input to a (7, 1) output, where 6 of the 7 positions are 0 and the position of the corresponding class is 1.
I don't see any activation function. Usually, when dealing with multiple classes, you don't compute the loss directly on the output of the dense layer; you first apply a normalizing function such as softmax and then compute the loss.

From your description, your features are sine-like structures, so the closest thing that comes to mind is the frequency domain. If you have an input image, just transform it to the frequency domain with a Fourier transform and use that as your feature input.
It might be best to look for such projects on the internet. For one such project, you might want to read the research paper or watch the video from this group (they have some Jupyter notebooks for you to try), or look at any similar work. They use Fourier features that go through a multilayer perceptron (MLP).
I am not sure what exactly you want to do, but it seems like a classification task. You would use an RNN if you want your neural network to work with a sequence. To me it seems like the 1500 dimensions are independent and can therefore just be treated as input.
Regarding the last layer: for a classification problem it is usually a probability distribution obtained by applying softmax (only if the classes are mutually exclusive, i.e. the probabilities sum to 1), so that, given an input, the net gives the probability of it belonging to each class. If we are predicting multiple classes at once, we use sigmoid as the last layer of the neural network.
Regarding your loss, there are many losses you can try to see if they work better. Once again, for different features you have to know what exactly the measure of distance is (i.e. how different two things are). Check out this website, or any explanation of loss functions on the net.
So you should try a simple MLP on top of Fourier features as a starting point, assuming that is your feature vector.
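As a rough illustration of that starting point, here is a minimal PyTorch sketch: Fourier magnitudes as features, a small MLP, and a sigmoid-based loss for the multi-label case. The data is random stand-in data, and the helper fourier_features, the number of frequencies and the layer sizes are all assumptions chosen for illustration.

```python
import numpy as np
import torch
from torch import nn

def fourier_features(curves, n_freq=64):
    """Scaled magnitudes of the lowest n_freq Fourier coefficients of each curve."""
    spectrum = np.fft.rfft(curves, axis=-1) / curves.shape[-1]
    return np.abs(spectrum[:, :n_freq]).astype(np.float32)

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Hypothetical stand-in data: 8 curves of length 1500 with values in [0, 1],
# and multi-hot targets over 7 classes (several classes may be active at once).
curves = rng.uniform(size=(8, 1500))
targets = torch.from_numpy(rng.integers(0, 2, size=(8, 7)).astype(np.float32))

features = torch.from_numpy(fourier_features(curves))

mlp = nn.Sequential(
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 7),  # raw logits; the sigmoid is applied inside the loss
)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)

# One training step on the whole stand-in batch.
logits = mlp(features)
loss = criterion(logits, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Per-class probabilities; each lies in [0, 1] and they need not sum to 1.
probs = torch.sigmoid(mlp(features))
```

BCEWithLogitsLoss combines the sigmoid and the binary cross-entropy in one call, which is why the network itself ends in a plain linear layer.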

Image recognition is different from time-series data. In the imaging domain, your dataset might have more in common with problems like activity recognition or video recognition, which have a temporal component. So I'd recommend looking into some models for those.
As for the current model, I'd recommend using an LSTM instead of a plain RNN. Also, for classification you need an activation function in your final layer. This should be softmax with a cross-entropy-based loss, or sigmoid with MSE loss.
Keras has a TimeDistributed wrapper which makes it easy to handle time components. You can use a similar approach with PyTorch by applying linear layers followed by an LSTM.
Look into these for a better understanding:
Activity Recognition : https://www.narayanacharya.com/vision/2019-12-30-Action-Recognition-Using-LSTM
https://discuss.pytorch.org/t/any-pytorch-function-can-work-as-keras-timedistributed/1346
How to implement time-distributed dense (TDD) layer in PyTorch
Activation Function ::
https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
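The "linear layers followed by an LSTM" idea mentioned above might look roughly like the sketch below. It relies on nn.Linear acting on the last dimension of its input, so it is applied to every time step independently. The class name, the layer sizes and the shortened sequence length are assumptions for illustration.

```python
import torch
from torch import nn

class TimeDistributedLSTM(nn.Module):
    """Linear layer applied at every time step, followed by an LSTM."""

    def __init__(self, input_dim, proj_dim, hidden_dim, n_classes):
        super().__init__()
        # nn.Linear acts on the last dimension, so each time step is
        # projected independently (the role TimeDistributed(Dense) plays in Keras).
        self.proj = nn.Linear(input_dim, proj_dim)
        self.lstm = nn.LSTM(proj_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):
        # x has shape (batch, time, input_dim)
        z = torch.relu(self.proj(x))
        out, _ = self.lstm(z)
        return self.head(out[:, -1, :])  # logits from the last time step

torch.manual_seed(0)
model = TimeDistributedLSTM(input_dim=1, proj_dim=8, hidden_dim=16, n_classes=7)
x = torch.rand(4, 150, 1)  # hypothetical batch: 4 curves, 150 steps, 1 value per step
logits = model(x)
```

The output is left as raw logits so that the activation can be chosen together with the loss.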

How can I predict the expected value and the variance simultaneously with a neural network?

I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network that has two outputs - the first output should approximate the deterministic part - the function, and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there was an example with Python for TensorFlow, but I'm also interested in general answers. I'm also not quite clear how I could write something like that in Python code: none of the examples I found so far show how to address individual outputs from the loss function.)

You can use dropout for that. With a dropout layer you can make several different predictions based on different settings of which nodes dropped out. Then you can simply count the outcomes and interpret the result as a measure for uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.
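A minimal NumPy sketch of the idea: keep dropout active at prediction time, run many stochastic forward passes, and summarize them. For a regression output the summary would be a mean and a variance rather than a count. The weights below are random stand-ins for a trained network, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, already "trained" weights of a tiny one-hidden-layer regressor.
W1 = rng.normal(size=(3, 32))
W2 = rng.normal(size=(32, 1)) / np.sqrt(32)

def stochastic_forward(x, drop_prob=0.5):
    """One forward pass with dropout kept active at prediction time."""
    hidden = np.maximum(x @ W1, 0.0)
    keep = rng.random(hidden.shape) >= drop_prob
    hidden = hidden * keep / (1.0 - drop_prob)  # inverted dropout scaling
    return hidden @ W2

x = rng.normal(size=(5, 3))  # 5 hypothetical input rows
samples = np.stack([stochastic_forward(x) for _ in range(200)])  # (200, 5, 1)

pred_mean = samples.mean(axis=0)  # point prediction
pred_var = samples.var(axis=0)    # spread across passes, read as uncertainty
```

Note that this spread reflects the model's own uncertainty; whether it also captures input-dependent noise in the targets depends on the setup.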

Since I've found nothing simple to implement, I wrote something myself, that models that explicitly: here is a custom loss function that tries to predict mean and variance. It seems to work but I'm not quite sure how well that works out in practice, and I'd appreciate feedback. This is my loss function:
# assumed imports; K is taken to be the Keras backend
import tensorflow as tf
from tensorflow.keras import backend as K
def meanAndVariance(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Loss function that has the values of the last axis in y_pred
    approximate the mean and variance of each value in the last axis of y_true."""
    y_pred = tf.convert_to_tensor(y_pred)
    y_true = tf.cast(y_true, y_pred.dtype)
    # even positions hold the predicted means, odd positions the predicted variances
    mean = y_pred[..., 0::2]
    variance = y_pred[..., 1::2]
    res = K.square(mean - y_true) + K.square(variance - K.square(mean - y_true))
    return K.mean(res, axis=-1)
The output dimension is twice the label dimension - mean and variance of each value in the label. The loss function consists of two parts: a mean squared error that has the mean approximate the mean of the label value, and the variance that approximates the difference of the value from the predicted mean.
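For comparison, a commonly used alternative is the Gaussian negative log-likelihood, which ties the two outputs together in a single term. The NumPy sketch below is not the loss defined above; it assumes the network predicts the log of the variance, and the function name is hypothetical.

```python
import numpy as np

def gaussian_nll(y_true, mean, log_var):
    """Mean Gaussian negative log-likelihood, dropping the constant term.

    log_var is the predicted log-variance; predicting the logarithm keeps
    the variance positive without any constraint on the network output.
    """
    return np.mean(0.5 * (log_var + np.square(y_true - mean) / np.exp(log_var)))

y_true = np.array([1.0, 2.0, 3.0])
mean = np.array([0.5, 2.5, 2.0])
residual_sq = np.square(y_true - mean)

# For a fixed mean, the loss is smallest when the predicted variance equals
# the squared residual; over- or under-estimating the variance costs more.
best = gaussian_nll(y_true, mean, np.log(residual_sq))
too_small = gaussian_nll(y_true, mean, np.log(residual_sq * 0.1))
too_large = gaussian_nll(y_true, mean, np.log(residual_sq * 10.0))
```

Averaged over many samples with the same input, the optimal variance is the expected squared residual, which is the quantity the question asks for.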

When using dropout (or any other stochastic regularization method) to estimate the uncertainty, make sure to also check out our recent work on providing a sampling-free approximation of Monte Carlo dropout.
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow your idea: treat the activations as random variables and then propagate the mean and variance to the output layer using error propagation. Consequently, we obtain two outputs, the mean and the variance.

Optimizing subgraph of large graph - slower than optimizing subgraph by itself

I have a very large tensorflow graph, and two sets of variables: A and B. I create two optimizers:
learning_rate = 1e-3
optimizer1 = tf.train.AdamOptimizer(learning_rate).minimize(loss_1, var_list=var_list_1)
optimizer2 = tf.train.AdamOptimizer(learning_rate).minimize(loss_2, var_list=var_list_2)
The goal here is to iteratively optimize variables 1 and variables 2. The weights from variables 2 are used in the computation of loss 1, but they're not trainable when optimizing loss 1. Meanwhile, the weights from variables 1 are not used in optimizing loss 2 (I would say this is a key asymmetry).
I am finding, weirdly, that this optimization for optimizer2 is much, much slower (2x) than if I were to just optimize that part of the graph by itself. I'm not running any summaries.
Why would this phenomenon happen? How could I fix it? I can provide more details if necessary.

I am guessing that this is a generative adversarial network given by the relation between the losses and the parameters. It seems that the first group of parameters are the generative model and the second group make up the detector model.
If my guesses are correct, then that would mean that the second model is using the output of the first model as its input. Admittedly, I am much more informed about PyTorch than TF. There is a comment which I believe is saying that the first model could be included in the second graph. I also think this is true. I would implement something similar to the following. The most important part is just creating a copy of the generated_tensor with no graph:
# An arbitrary label
label = torch.tensor(1.0)
# Treat GenerativeModel as the model with the first list of Variables/parameters
generated_tensor = GenerativeModel(random_input_tensor)
# Treat DetectorModel as the model with the second list of Variables/parameters
detector_prediction = DetectorModel(generated_tensor)
# detach() returns a tensor that shares the data but carries no graph
generated_tensor_copy = generated_tensor.detach()
detector_prediction_copy = DetectorModel(generated_tensor_copy)
# This is for optimizing the first model, but it has the second model in its graph,
# which is necessary.
loss1 = loss_func1(detector_prediction, label)
# This is for optimizing the second model. It will not have the first model in its graph
loss2 = loss_func2(detector_prediction_copy, label)
I hope this is helpful. If anyone knows how to do this in TF, that would be invaluable.
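To make the effect of the graph-free copy concrete, here is a small runnable PyTorch sketch. The two nn.Linear layers are hypothetical stand-ins for the generative and detector models, and the loss choice is arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two sub-models.
generator = nn.Linear(4, 3)
detector = nn.Linear(3, 1)

noise = torch.randn(8, 4)
label = torch.ones(8, 1)
loss_fn = nn.BCEWithLogitsLoss()

generated = generator(noise)

# detach() returns a tensor that shares the data but is cut off from the
# graph, so backpropagating loss2 never visits the generator.
loss2 = loss_fn(detector(generated.detach()), label)
loss2.backward()
generator_untouched = all(p.grad is None for p in generator.parameters())

# loss1 keeps the full graph, so gradients do reach the generator.
loss1 = loss_fn(detector(generated), label)
loss1.backward()
generator_reached = all(p.grad is not None for p in generator.parameters())
```

In TensorFlow, tf.stop_gradient plays a similar role; whether it removes the slowdown described in the question is not something this sketch can confirm.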

What is the state of the art way of doing regression with probability in pytorch

All the regression examples I find predict a real number, and unlike with classification you don't get the confidence the model had when predicting that number. In reinforcement learning I have done it another way: the output is instead the mean and std, and you then sample from that distribution. Then you know how confident the model is in predicting every value. Now I can't find how to do this with supervised learning in PyTorch. The problem is that I don't understand how to sample from the distribution to get the actual value while training, or what sort of loss function I should use; I'm not sure how, for example, MSE or L1Smooth would work.
Is there any example out there where this is done in PyTorch in a robust and state-of-the-art way?

The key point is that you do not need to sample from the NN-produced distribution. All you need is to optimize the likelihood of the target value under the NN distribution.
There is an example in the official PyTorch example on VAE (https://github.com/pytorch/examples/tree/master/vae), though for multidimensional Bernoulli distribution.
Since PyTorch 0.4, you can use torch.distributions: instantiate distribution distro with outputs of your NN and then optimize -distro.log_prob(target).
EDIT: As requested in a comment, a complete example of using the torch.distributions module.
First, we create a heteroscedastic dataset:
import numpy as np
import torch
X = np.random.uniform(size=300)
Y = X + 0.25*X*np.random.normal(size=X.shape[0])
We build a trivial model, which is perfectly able to match the generative process of our data:
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mean_coeff = torch.nn.Parameter(torch.Tensor([0]))
        self.var_coeff = torch.nn.Parameter(torch.Tensor([1]))
    def forward(self, x):
        return torch.distributions.Normal(self.mean_coeff * x, self.var_coeff * x)
mdl = Model()
optim = torch.optim.SGD(mdl.parameters(), lr=1e-3)
The initial parameters make the model produce a zero-mean normal whose standard deviation equals x, which is a poor fit for our data, so we train (note it is a very stupid batch training, but it demonstrates that you can output a set of distributions for your batch at once):
for _ in range(2000):  # epochs
    dist = mdl(torch.from_numpy(X).float())
    obj = -dist.log_prob(torch.from_numpy(Y).float()).mean()
    optim.zero_grad()
    obj.backward()
    optim.step()
Eventually, the learned parameters should match the values we used to construct the Y.
print(mdl.mean_coeff, mdl.var_coeff)
# tensor(1.0150) tensor(0.2597)

Backpropagation with Rectified Linear Units

I have written some code to implement backpropagation in a deep neural network with the logistic activation function and softmax output.
def backprop_deep(node_values, targets, weight_matrices):
    # output-layer error for softmax with cross-entropy
    delta_nodes = node_values[-1] - targets
    delta_weights = delta_nodes.T.dot(node_values[-2])
    weight_updates = [delta_weights]
    # walk backwards through the hidden layers
    for i in range(-2, -len(weight_matrices) - 1, -1):
        delta_nodes = dsigmoid(node_values[i][:,:-1]) * delta_nodes.dot(weight_matrices[i+1])[:,:-1]
        delta_weights = delta_nodes.T.dot(node_values[i-1])
        weight_updates.insert(0, delta_weights)
    return weight_updates
The code works well, but when I switched to ReLU as the activation function it stopped working. In the backprop routine I only change the derivative of the activation function:
def backprop_relu(node_values, targets, weight_matrices):
    delta_nodes = node_values[-1] - targets
    delta_weights = delta_nodes.T.dot(node_values[-2])
    weight_updates = [delta_weights]
    for i in range(-2, -len(weight_matrices) - 1, -1):
        # ReLU derivative: 1 where the node output is positive, 0 elsewhere
        delta_nodes = (node_values[i]>0)[:,:-1] * delta_nodes.dot(weight_matrices[i+1])[:,:-1]
        delta_weights = delta_nodes.T.dot(node_values[i-1])
        weight_updates.insert(0, delta_weights)
    return weight_updates
However, the network no longer learns, and the weights quickly go to zero and stay there. I am totally stumped.

Although I have determined the source of the problem, I'm going to leave this up in case it might be of benefit to someone else.
The problem was that I did not adjust the scale of the initial weights when I changed activation functions. While logistic networks learn very well when node inputs are near zero and the logistic function is approximately linear, ReLU networks learn well for moderately large inputs to nodes. The small weight initialization used in logistic networks is therefore not necessary, and in fact harmful. The behavior I was seeing was the ReLU network ignoring the features and attempting to learn the bias of the training set exclusively.
I am currently using initial weights distributed uniformly from -.5 to .5 on the MNIST dataset, and it is learning very quickly.
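To illustrate how much the initialization scale matters, the NumPy sketch below pushes random MNIST-sized inputs through an untrained ReLU network and compares the mean hidden activation for two uniform ranges. The range of 0.5 is the one mentioned above; the 0.01 range, the layer sizes and the helper names are assumptions for illustration.

```python
import numpy as np

def init_uniform(n_in, n_out, half_width, rng):
    """Weights drawn uniformly from [-half_width, half_width]."""
    return rng.uniform(-half_width, half_width, size=(n_in, n_out))

def mean_relu_activation(x, layer_sizes, half_width, seed=0):
    """Mean activation of the last hidden layer of a ReLU network (biases omitted)."""
    rng = np.random.default_rng(seed)
    a = x
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        a = np.maximum(a @ init_uniform(n_in, n_out, half_width, rng), 0.0)
    return a.mean()

# Hypothetical MNIST-sized input: 16 rows of 784 values in [0, 1].
x = np.random.default_rng(1).uniform(size=(16, 784))
sizes = [784, 100, 100]

small = mean_relu_activation(x, sizes, half_width=0.01)  # assumed "logistic-style" init
large = mean_relu_activation(x, sizes, half_width=0.5)   # range mentioned above
```

With the narrow range the hidden activations stay close to zero, which is consistent with the network initially relying on its biases rather than on the features.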
