I'm training Neural Networks for classification using TensorFlow/Keras, and I would like the weights in the output layer to have the following property:
Suppose the weight or kernel matrix is a 3 by 4 matrix W, and its elements are W_ij
I would like that, for each column j, there is one and only one nonzero W_ij, and that W_ij = 1.
What would be a good way to implement this requirement?
One possible solution I can think of is to impose the following constraints:
W_1j + W_2j + W_3j = 1 for all j = 1,2,3,4
and
W_ij * (1-W_ij) = 0, for all i, j
How do I implement these constraints? Or is there any better way to set this requirement?
Are you sure you want to control the weights? If you try to enforce that type of constraint on the weights, your NN will probably never learn anything.
It seems to me that you just want a softmax layer in the output.
A softmax gives you almost exactly what you are describing. Suppose you are classifying cats, dogs, and birds. A softmax output is a probability distribution over the classes; taking the argmax of that distribution gives a tensor where exactly one element is 1 (the most likely class). Example:
[1, 0, 0]  # cat
[0, 1, 0]  # dog
[0, 0, 1]  # bird
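For instance, a minimal sketch of a Keras classifier with a softmax output layer (the layer sizes, input dimension, and class count here are just assumptions for illustration):

import numpy as np
import tensorflow as tf

# Minimal sketch: a 3-class classifier (cat/dog/bird) with a softmax output.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),  # probability per class
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# After training, argmax turns the probability vector into a one-hot choice.
probs = model.predict(np.random.rand(1, 4))        # e.g. [[0.7, 0.2, 0.1]]
one_hot = tf.one_hot(tf.argmax(probs, axis=1), 3)  # e.g. [[1., 0., 0.]]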
I am trying to write a model that outputs a vector of length N consisting of the labels -1, 0, and 1. Each label represents one of three decisions for the system participants (wireless devices), so the vector describes a system state that is then passed to an optimization problem in the next step. Because of the fixed problem formulation that consumes the output vector, using 0, 1, and 2 instead is not possible.
After coming across this tanh function to supply the -1, 0, and 1 values:
1.5 * backend.tanh(alpha * x) + 0.5 * (backend.tanh(-(3 / alpha) * x))
from here, I was wondering how exactly the output layer and the penultimate layer can be built to supply this vector of labels {-1, 0, 1}. I tried using the above function in the output layer of a simple Iris classifier, but this resulted in terrible accuracy compared to what I achieved with 0, 1, 2 labels and a softmax output layer.
Thanks in advance,
with kind regards,
Yuka
It doesn't seem like the outputs are actually "numerically related", for lack of a better term. Meaning, the labels could just as well be "left", "right", "up". So I think your best bet is to have 3 output nodes in the final layer with a softmax activation function, with each of the three nodes representing one of the three labels, using a cross-entropy loss function.
If your training data currently has the target as -1/0/1, you should one-hot encode it so that each target is a vector of length 3, e.g. -1 → [1,0,0], 0 → [0,1,0], 1 → [0,0,1].
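A minimal sketch of that setup (the layer sizes, input dimension, and label-to-index mapping are assumptions for illustration):

import numpy as np
import tensorflow as tf

# Shift the raw labels {-1, 0, 1} to class indices {0, 1, 2} and one-hot encode.
raw_labels = np.array([-1, 0, 1, 1, -1])
targets = tf.keras.utils.to_categorical(raw_labels + 1, num_classes=3)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# At prediction time, map the argmax back to {-1, 0, 1}.
preds = model.predict(np.random.rand(2, 4)).argmax(axis=1) - 1

For a length-N output vector, the same idea generalizes to N independent 3-way softmax heads.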
I am training 2 autoencoders with 2 separate input paths jointly and I would like to randomly set one of the input paths to zero.
I use TensorFlow with the Keras backend (functional API).
I am computing a joint loss (sum of two losses) for backpropagation.
A -> A' & B -> B'
loss => l2(A, A') + l2(B, B')
The networks taking A and B are connected in the latent space.
I would like to randomly set A or B to zero and compute the loss only on the corresponding path, meaning that if input path A is set to zero, the loss should be computed using only the outputs of path B, and vice versa; e.g.:
0 -> A' & B -> B'
loss: l2(B,B')
How do I randomly set an input path to zero? How do I write a callback that does this?
Maybe try the following:
import random

def decision(probability):
    return random.random() < probability
Define a method that makes a random decision based on a certain probability and make your loss calculation depend on this decision:
if current_epoch == random.choice(epochs):
    keep_mask = tf.ones_like(A.input, dtype=tf.float32)
    throw_mask = tf.zeros_like(A.input, dtype=tf.float32)
    if decision(probability=0.5):
        total_loss = tf.reduce_sum(reconstruction_loss_a * keep_mask
                                   + reconstruction_loss_b * throw_mask)
    else:
        total_loss = tf.reduce_sum(reconstruction_loss_a * throw_mask
                                   + reconstruction_loss_b * keep_mask)
else:
    total_loss = tf.reduce_sum(reconstruction_loss_a + reconstruction_loss_b)
I assume that you do not want to set one of the paths to zero every time you update your model parameters, as then there is a risk that one or even both models will not be sufficiently trained. Also note that I use the input of A to create the zeros_like and ones_like tensors, as I assume that both inputs have the same shape; if this is not the case, it can easily be adjusted.
Depending on what your goal is, you may also consider replacing the input of A or B with a random tensor, e.g. tf.random.normal, based on a random decision. This creates noise in your model, which may be desirable, as your model would be forced to look into the latent space to try to reconstruct the original input. That is, you would still calculate your reconstruction loss with A.input and A.output, but in reality your model never received A.input, only the random tensor.
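A rough sketch of that idea (assuming eager execution; a_input is a hypothetical tensor feeding path A, and decision is the helper defined above):

import tensorflow as tf

def maybe_replace_with_noise(a_input, probability=0.5):
    # Randomly swap path A's input for Gaussian noise of the same shape.
    if decision(probability):
        return tf.random.normal(shape=tf.shape(a_input))
    return a_input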
Note that this answer serves as a simple conceptual example. A working example with Tensorflow can be found here.
You can set an input to 0 simply:
A = A * random.choice([0, 1])
This code can be used inside a loss function
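For example, a conceptual sketch of such a loss (the tensor names are hypothetical; note that in eager mode the choice is redrawn on every call, while in graph mode it would be frozen at trace time):

import random
import tensorflow as tf

def joint_loss(a, a_prime, b, b_prime):
    # Randomly keep exactly one of the two reconstruction terms.
    keep_a = random.choice([0, 1])
    loss_a = tf.reduce_mean(tf.square(a - a_prime)) * keep_a
    loss_b = tf.reduce_mean(tf.square(b - b_prime)) * (1 - keep_a)
    return loss_a + loss_b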
I am learning about neural networks.
Here is the complete piece of code:
https://github.com/udacity/deep-learning-v2-pytorch/blob/master/intro-to-pytorch/Part%201%20-%20Tensors%20in%20PyTorch%20(Exercises).ipynb
When I transpose features, I get the following output:
import torch

def activation(x):
    # Sigmoid activation function
    return 1 / (1 + torch.exp(-x))

### Generate some data
torch.manual_seed(7)  # Set the random seed so things are predictable

# Features are 5 random normal variables
features = torch.randn((1, 5))
# True weights for our data, random normal variables again
weights = torch.randn_like(features)
# and a true bias term
bias = torch.randn((1, 1))

product = features.t() * weights + bias
output = activation(product.sum())
tensor(0.9897)
However, if I transpose weights, I get a different output:
weights_prime = weights.view(5,1)
prod = torch.mm(features, weights_prime) + bias
y_hat = activation(prod.sum())
tensor(0.1595)
Why does this happen?
Update
I took a look at the solution:
https://github.com/udacity/deep-learning-v2-pytorch/blob/master/intro-to-pytorch/Part%201%20-%20Tensors%20in%20PyTorch%20(Solution).ipynb
And I saw this:
y = activation((features * weights).sum() + bias)
Why can a matrix features (1,5) multiply another matrix weights (1,5) without transposing weights first?
Update 2
After reading several posts, I realized that
matrixA * matrixB is different from torch.mm(matrixA, matrixB) and torch.matmul(matrixA, matrixB).
Could someone confirm my three understandings below?
So the * means element-wise multiplication, whereas torch.mm() and torch.matmul() are matrix-wise multiplication.
differences between torch.mm() and torch.matmul(): mm() is used specifically for 2 dimensions matrix, whereas matmul() can be used for more complicated cases.
In the neural network for the Udacity coding exercise mentioned in my link above, element-wise multiplication is needed.
Update 3
Just to bring in the video screenshot for anyone who has the same confusion:
And here is the video link: https://www.youtube.com/watch?time_continue=98&v=6Z7WntXays8&feature=emb_logo
Looking at https://pytorch.org/docs/master/generated/torch.nn.Linear.html
The typical linear (fully connected) layer in torch uses input features of shape (N, *, in_features) and weights of shape (out_features, in_features) to produce an output of shape (N, *, out_features). Here N is the batch size, and * is any number of other dimensions (possibly none).
The implementation is:
output = input.matmul(weight.t())
So, the answer is that neither of your formulas is correct according to convention; the standard formula is the one above.
You may use a non-standard shape since you're implementing things from scratch; as long as it's consistent it may work, but I don't recommend it for learning. It's unclear what 1 and 5 are in your code, but presumably you want 5 input features and one output feature, with a batch size of 1 as well. In that case the standard shapes should be input = torch.randn((1, 5)) for batch_size=1 and in_features=5, and weights = torch.randn((1, 5)) for out_features=1 and in_features=5.
There is no reason why weights should ever be the same shape as features; thus weights = torch.randn_like(features) doesn't make sense.
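To illustrate the conventional shapes, here is a small sketch (the values are arbitrary):

import torch

batch_size, in_features, out_features = 1, 5, 1
x = torch.randn((batch_size, in_features))    # input: (1, 5)
W = torch.randn((out_features, in_features))  # weights: (1, 5)
b = torch.randn(out_features)                 # bias: (1,)

out = x.matmul(W.t()) + b                     # output: (1, 1)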
Lastly, for your actual questions:
"Should I transpose features or weights in Neural network?" - in torch convention, you should transpose weights, but use matmul with the features first. Other frameworks may have a different convention; as long as in_features dimension of the weights is multiplied by the num_features dimension of the input, it would work.
"Why does this happen?" - these are two completely different calculations; there is no reason to think they would produce the same result.
"So the * means element-wise multiplication, whereas torch.mm() and torch.matmul() are matrix-wise multiplication." - Yes; mm is matrix-matrix only, matmul is vector-matrix or matrix-matrix, including batched versions of same - check the docs for everything matmul can do (which is kinda a lot).
"differences between torch.mm() and torch.matmul(): mm() is used specifically for 2 dimensions matrix, whereas matmul() can be used for more complicated cases." - Yes; the big difference is that matmul can broadcast. Use it when you specifically intend that; use mm to prevent unintentional broadcasting.
"In Neutral Network for this Udacity coding exercise mentioned in my above link, it needs element-wise multiplication." - I doubt it; it's probably an error in the Udacity code. This bit of code weights = torch.randn_like(features) looks like an error in any case; the dimensions of weights have a meaning different from the dimensions of features.
This line is taking an outer product between the two vectors.
product = features.t() * weights + bias
The resulting shape is 5x5.
If you change this to a dot product, then output will match y_hat.
product = torch.mm(weights, features.t()) + bias
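To see the shapes involved (a small sketch):

import torch

features = torch.randn((1, 5))
weights = torch.randn((1, 5))

outer = features.t() * weights         # (5, 1) * (1, 5) broadcasts to (5, 5)
dot = torch.mm(weights, features.t())  # (1, 5) @ (5, 1) -> (1, 1)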
At the moment I'm playing around with machine learning in Python based on this website (part two is about image recognition). I would like to train a network to recognize 4 specific points in an image, but my problem is:
The neural network is created by simply multiplying matrices together, calculating the delta between the given output and the recognized output, and recalculating the weights in the matrix. Now let's say I have a 600x800 pixel image as input. If I multiply this with my layer matrices, I can't get a 4x2 matrix as output (x, y for each point).
My second problem is: how many hidden layers should I have for this problem? Are more layers always better, just slower to compute? Can we guess how many hidden layers we need, or should we test some values and use the best one?
My current neural network code:
from os.path import isfile
import numpy as np

class NeuralNetwork:
    def __init__(self):
        np.random.seed(1)
        # 480000 flattened pixels -> 8 outputs (x, y for each of the 4 points)
        self.syn0 = 2 * np.random.random((480000, 8)) - 1

    @staticmethod
    def relu(x, deriv=False):
        if deriv:
            # Derivative of ReLU: 1 where x > 0, else 0
            return (x > 0).astype(x.dtype)
        return np.maximum(x, 0)

    def train(self, imgIn, out):
        l1 = NeuralNetwork.relu(np.dot(imgIn, self.syn0))
        l1_error = out - l1
        exp = NeuralNetwork.relu(l1, True)
        l1_delta = l1_error * exp
        self.syn0 += np.dot(imgIn.T, l1_delta)
        return l1  # np.abs(out - l1)

    def identify(self, img):
        return NeuralNetwork.relu(np.dot(img, self.syn0))
Problem 1. Input data.
You must flatten (serialize) the input. For example, if you have one 600*800 pixel image, the input must be 1*480000 (rows, cols).
Rows are the number of samples and columns are the dimension of the data.
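For example (a small NumPy sketch):

import numpy as np

img = np.random.rand(600, 800)  # one 600x800 grayscale image
flat = img.reshape(1, -1)       # shape (1, 480000): 1 sample, 480000 features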
Problem 2. Classification.
If you want to classify 4 different classes, you should use a (1,4) vector for the output. For example, suppose there are 4 classes ('Fish', 'Cat', 'Tiger', 'Car'); then the vector (1,0,0,0) means Fish.
Problem 3. Fully connected network.
I think the example on this homepage uses a fully connected network. It classifies using the whole image at once. If you want to classify with a subset of the image, you should use a convolutional neural network or another approach. I don't know much about this.
Problem 4. Hyperparameter
It depends on the data. You must test with various hyperparameters and then choose the best ones.
I'm confused about how to define weight matrices for an LSTM. As there are 8 weight matrices for the LSTM, I don't know how to initialize those weight matrices for an LSTM in TensorFlow.
But then I came across this implementation, which makes complete sense, as it has all 8 weight matrices, but it doesn't use the TensorFlow implementation of LSTM. It is consistent with the LSTM equations. But in the TensorFlow implementation of LSTM, I don't know how to define all those 8 weight matrices as they are defined in the first implementation above.
Could you please help me out?
First things first: if you take a close look, there aren't exactly 8 matrices but 14 parameters in total: 4 * 3 matrices of W, U (parameter matrices) and b (bias vector) for the input gate, forget gate, cell-state candidate, and output gate. In addition, there are two more parameters W and b for the dense layer.
Now, coming to the actual question, I presume that you would like to know how these matrices are initialized in TensorFlow.
I am answering with respect to TensorFlow v1.2.
Quick answer: use the TF API for LSTM, which has a parameter called initializer that is used for initializing the weight and projection matrices.
By default, the bias vector is initialized to the zero vector and the kernel weights are initialized randomly using a uniform distribution.
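For example, a sketch for TF 1.x (the orthogonal initializer is just an illustration; in some 1.x versions the cell class lives under tf.contrib.rnn instead of tf.nn.rnn_cell):

import tensorflow as tf

# Pass a custom initializer for the weight and projection matrices.
cell = tf.nn.rnn_cell.LSTMCell(
    num_units=128,
    initializer=tf.orthogonal_initializer())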
Now, to take a look at where W and b are used, you need to dig a little deeper inside the code. I will provide a few checkpoints for the same.
Method call to evaluate the multiplications: https://github.com/tensorflow/tensorflow/blob/3686ef0d51047d2806df3e2ff6c1aac727456c1d/tensorflow/python/ops/rnn_cell_impl.py#L576
lstm_matrix = _linear([inputs, m_prev], 4 * self._num_units, bias=True)
Actual computation: https://github.com/tensorflow/tensorflow/blob/3686ef0d51047d2806df3e2ff6c1aac727456c1d/tensorflow/python/ops/rnn_cell_impl.py#L1051
weights = vs.get_variable(
    _WEIGHTS_VARIABLE_NAME, [total_arg_size, output_size],
    dtype=dtype,
    initializer=kernel_initializer)
Here, instead of 4 separate multiplications, a single multiplication is performed, and then the resulting matrix is split into 4 parts.
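Conceptually, the split looks like this (a sketch; the tensors are stand-ins):

import tensorflow as tf

num_units = 128
lstm_matrix = tf.zeros((32, 4 * num_units))  # stand-in for the combined result
# One chunk per part: input gate, cell candidate, forget gate, output gate.
i, j, f, o = tf.split(lstm_matrix, num_or_size_splits=4, axis=1)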
So, in a nutshell, TensorFlow does the initialization automatically for you. In case you don't want to go with the default initializers, you have the flexibility to provide other options.