Neural network hidden layer vs. Convolutional hidden layer intuition - python

If there are 10 features and 1 output class (sigmoid activation) with a regression objective:
If I use only 5 neurons in my first dense hidden layer, will the first error be calculated solely based on half of the training feature set? Isn't it imperative to match the number of features with the neurons in hidden layer #1 so that the model can see all the features at once? Otherwise it's not getting the whole picture? The first forward propagation iteration would use 5 out of 10 features and get the error value (and train during backprop, assume batch gradient descent). Then the 2nd forward propagation iteration would see the remaining 5 out of 10 features with updated weights and hopefully arrive at a smaller error. BUT it's only seeing half the features at a time!
Conversely, if I have a convolutional 2D layer of 64 neurons. And my training shape is: (100, 28,28,1) (pictures of cats and dogs in greyscale), will each of the 64 neurons see a different 28x28 vector? No right, because it can only send one example through the forward propagation at a time? So then only a single picture (cat or dog) should be spanned across the 64 neurons? Why would you want that since each neuron in that layer has the same filter, stride, padding and activation function? When you define a Conv2D layer...the parameters of each neuron are the same. So is only a part of the training example going into each neuron? Why have 64 neurons, for example? Just have one neuron, use a filter on it and pass it along to a second hidden layer with another filter with different parameters!
Please explain the flaws in my logic. Thanks so much.
EDIT: I just realized that for Conv2D, you flatten the training data sets so it becomes a 1D vector, and so a 28x28 image would mean having an input conv2d layer of 784 neurons. But I am still confused about the dense neural network (paragraph #1 above).

What is your "first" layer?
Normally the first layer is an input layer, which does not contain any weights.
The shape of the input layer must match the shape of your feature data.
So if you train a model on 10 features but declare an input layer of shape (None, 5) (where None stands for the batch size), TensorFlow will raise an exception, because it needs data for every input in the correct shape.
So what you describe is not going to happen. Even if no exception were thrown and only 5 features were accepted, the remaining 5 features would not be fed into the net in the next iteration. The next sample would be sent to the model instead, and again only its first 5 features would be used.
What you can do instead is use an input layer with the correct shape for your features as the first layer. As the second layer you can then use any size you like: 1, 10 or 100 dense neurons, it's up to you (and what works well, of course). Each of those neurons is connected to all 10 inputs, so a layer of 5 neurons still sees every feature. The shape of the output must in turn match the shape of your label data.
I hope this makes it clearer.
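To make the point about the second layer concrete, here is a minimal NumPy sketch. It is not Keras code, and the sizes, the tanh activation and the variable names are assumptions chosen to match the question. A dense layer with 5 neurons on 10 features has a 10 x 5 weight matrix, so every neuron takes a weighted sum over all 10 features on every forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 10, 5
x = rng.normal(size=(1, n_features))      # one sample with all 10 features

# Each of the 5 columns holds one neuron's weights: one weight per input feature.
W = rng.normal(size=(n_features, n_hidden))
b = np.zeros(n_hidden)

hidden = np.tanh(x @ W + b)               # shape (1, 5)

# Perturb only the last feature and run the forward pass again.
x_changed = x.copy()
x_changed[0, 9] += 1.0
hidden_changed = np.tanh(x_changed @ W + b)

print(W.shape, hidden.shape)              # (10, 5) (1, 5)
print(np.any(hidden != hidden_changed))   # the hidden activations react to feature 10
```

If only half of the features reached the layer, changing the tenth feature could not affect the hidden activations.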


Adding softmax layer to LSTM network "freezes" output

I've been trying to teach myself the basics of RNNs with a personal project in PyTorch. I want to produce a simple network that is able to predict the next character in a sequence (idea mainly from this article but I wanted to do most of the stuff myself).
My idea is this: I take a batch of B input sequences of size n (np array of n integers), one-hot encode them and pass them through my network composed of several LSTM layers, one fully connected layer and one softmax unit.
I then compare the output to the target sequences which are the input sequences shifted one step ahead.
My issue is that when I include the softmax layer, the output is the same every single epoch for every single batch. When I don't include it, the network seems to learn appropriately. I can't figure out what's wrong.
My implementation is the following :
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, one_hot_length, dropout_prob, num_units, num_layers):
        super().__init__()  # must run before submodules are assigned
        self.num_units = num_units  # needed by view() in forward_pass
        self.LSTM = nn.LSTM(one_hot_length, num_units, num_layers, batch_first = True, dropout = dropout_prob)
        self.dropout = nn.Dropout(dropout_prob)
        self.fully_connected = nn.Linear(num_units, one_hot_length)
        self.softmax = nn.Softmax(dim = 1)
        # dim = 1 as the tensor is of shape (batch_size*seq_length, one_hot_length) when entering the softmax unit

    def forward_pass(self, input_seq, hc_states):
        output, hc_states = self.LSTM(input_seq, hc_states)
        output = output.view(-1, self.num_units)
        output = self.fully_connected(output)
        # I simply comment out the next line when I run the network without the softmax layer
        output = self.softmax(output)
        return output, hc_states
one_hot_length is the size of my character dictionary (~200, also the size of a one-hot encoded vector)
num_units is the number of hidden units in a LSTM cell, num_layers the number of LSTM layers in the network.
The inside of the training loop (simplified) goes as follows :
input, target = next_batches(data, batch_pointer)
input = nn.functional.one_hot(input_seq, num_classes = one_hot_length).float()
for state in hc_states:
output, states = net.forward_pass(input, hc_states)
loss = nn.CrossEntropyLoss()(output, target.reshape(-1))  # instantiate the loss, then call it; target flattened to (B*n,)
nn.utils.clip_grad_norm_(net.parameters(), MaxGradNorm)
Here hc_states is a tuple holding the hidden states tensor and the cell states tensor, input is a tensor of size (B,n,one_hot_length), and target is (B,n).
I'm training on a really small dataset (sentences in a .txt of ~400 KB) just to tune my code, and did 4 different runs with different parameters. Each time the outcome was the same: the network doesn't learn at all when it has the softmax layer, and trains somewhat appropriately without.
I don't think it is an issue with tensors shapes as I'm almost sure I checked everything.
My understanding of my problem is that I'm trying to do classification, and that the usual approach is to put a softmax unit at the end to get "probabilities" of each character appearing, but clearly this isn't right.
Any ideas to help me?
I'm also fairly new to PyTorch and RNNs, so I apologize in advance if my architecture/implementation is some kind of monstrosity to a knowledgeable person. Feel free to correct me, and thanks in advance.
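One likely cause worth checking (stated as an assumption, since the full training code is not shown): PyTorch's nn.CrossEntropyLoss expects raw, unnormalized scores and applies log-softmax internally, so a Softmax layer in front of it means softmax is effectively applied twice. The NumPy sketch below is not the poster's code and the helper names are made up. It only illustrates how narrow the loss range becomes in that situation for a vocabulary of 200 characters.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy_from_scores(scores, target):
    # Log-softmax followed by negative log-likelihood of the target class.
    return -log_softmax(scores)[target]

num_classes, target = 200, 3

# Case 1: the loss receives probabilities (a softmax was already applied).
uniform_probs = np.full(num_classes, 1.0 / num_classes)   # untrained model
perfect_probs = np.eye(num_classes)[target]               # best possible model
loss_untrained = cross_entropy_from_scores(uniform_probs, target)
loss_best_case = cross_entropy_from_scores(perfect_probs, target)

# Case 2: the loss receives raw scores, which are unbounded.
confident_scores = np.zeros(num_classes)
confident_scores[target] = 20.0
loss_from_scores = cross_entropy_from_scores(confident_scores, target)

print(loss_untrained, loss_best_case, loss_from_scores)
```

With probabilities as input, even a perfect prediction cannot push the loss below roughly 4.3 (against roughly 5.3 for an untrained model), so the loss barely moves and the gradients are small. With raw scores the loss can approach zero. If probabilities are needed at prediction time, a softmax can be applied to the scores outside the loss computation.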

How dropout works in tensorflow

How does dropout work for a 3D input?
For example, for tf.nn.dropout(x, 0.1),
if x is 3-dimensional with size 2 * 10 * 10,
is dropout applied channel-wise, meaning that it randomly ignores 10 values in each channel
(so that in this example it randomly ignores 10 values in the first channel and 10 in the second),
or is it applied to the whole input, meaning that it randomly ignores 20 values out of all 2 * 10 * 10 features?
Dropout is a regularization technique: instead of regularizing the whole layer, it randomly ignores a few features or neurons in a layer at each training step. Dropout can be applied to each layer if necessary. The main parameter to worry about is the rate. Dropout is used to counter overfitting (high variance). Here's how it works:
Consider a network with a single neuron that receives 5 features. Dropout is added here so that the neuron cannot rely on any single feature: on each iteration, dropout randomly selects some of the inputs, according to the given rate, and eliminates them for that iteration. Because a different random subset is eliminated each time, the neuron cannot rely on any one feature.
Now consider a deep neural network with 3 input features and 5 layers with hidden units [7, 7, 3, 2, 1], where the last layer is the output layer. We could add dropout to every layer, but the rate should vary. For the first layer, which has 7 units, we use rate = 0.3, which means that on average 30% of its 7 units are dropped at random. For the next layer, which also has 7 units, we use rate = 0.5: with 7 units feeding 7 units, this is the most densely connected part of the network and the most prone to overfitting, so we drop 50% of the units. For the third layer we should decrease the rate, because it has only 3 units fed by 7, so there are fewer connections. The remaining layers don't need dropout because they have only a few hidden units.
Dropout can also be applied to the input layer, but this is not done often. If it is, the drop rate is usually about 0.1 or less.
That is all about dropout. I took a deep learning course on Coursera, so I am sharing this information from that knowledge.
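As a direct illustration of the original question, the NumPy sketch below mimics element-wise dropout on a 2 x 10 x 10 input. It is not TensorFlow code: it assumes the second argument is the drop rate (as in TensorFlow 2; in TensorFlow 1 it was keep_prob) and that no noise_shape is passed, in which case each element is kept or dropped independently. The number of dropped values is therefore random, about 10% of the 200 values on average, and not exactly 10 per channel or exactly 20 overall.

```python
import numpy as np

def dropout(x, rate, rng):
    # Independent keep/drop decision for every element of x.
    keep_mask = rng.random(x.shape) >= rate
    # Survivors are scaled by 1 / (1 - rate) so the expected sum is unchanged.
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(42)
x = np.ones((2, 10, 10))
y = dropout(x, rate=0.1, rng=rng)

dropped_per_channel = (y == 0).sum(axis=(1, 2))
print("dropped in each channel:", dropped_per_channel)
print("dropped in total:", dropped_per_channel.sum())
print("value of kept elements:", y.max())
```

Running it with different seeds gives different counts per channel, which is the behaviour to expect from per-element dropout.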

Units in Dense layer in Keras

I am trying to understand a concept of ANN architecture in Keras. Number of input neurons in any NN should be equal to the number of features/attributes/columns. So, in the case of having matrix of (20000,100), my input shape should have 100 neurons. In the example on the Keras page, I saw a code:
model = Sequential([Dense(32, input_shape=(784,)),
, which pretty much means that the input has 784 columns and 32 is the dimensionality of the output space, so the second layer will have an input of 32. My understanding is that such a significant drop happens because some of the units are not activated due to an activation function. Is my understanding correct?
At the same time, another piece of code shows that the number of input neurons is higher than the number of features:
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
This example is not clear to me. How can the number of units be larger than the number of input dimensions?
The total number of neurons in a Dense layer is a topic that is still not agreed upon within the machine learning and data science community. There are many heuristics that are used to define this and I refer you to this post on Cross Validated that provides some more details:
In summary, the numbers of hidden units in both of the examples you cite most likely came from repeated experimentation and trial and error to achieve the best accuracy.
For more context: the 784 input neurons most likely come from the MNIST dataset, whose images are 28 x 28 = 784 pixels. I've seen implementations of neural networks where 32 neurons for the hidden layer work well. Think of each layer as a dimensionality transformation. Going down to 32 dimensions doesn't necessarily mean losing accuracy. Going from a lower-dimensional space to a higher-dimensional one is also common when you are trying to map your points to a new space where classification may be easier.
Finally, in Keras, that number specifies how many neurons the current layer has. Under the hood, it works out the shape of the weight matrix needed for forward propagation from the previous layer to the current layer. In that case the weights are 784 x 32 plus a bias vector of 32 values, which is equivalent to a 785 x 32 matrix with 1 extra input neuron for the bias unit.
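The bookkeeping can be checked with a little arithmetic. The helper below is hypothetical (it is not part of Keras) and assumes a plain fully connected layer with one bias value per unit. Counting the weights and the biases together gives the same total as a 785 x 32 matrix.

```python
def dense_param_count(n_inputs, n_units):
    # n_inputs * n_units weights, plus one bias value per unit.
    return n_inputs * n_units + n_units

# First example from the question: Dense(32, input_shape=(784,))
first_example = dense_param_count(784, 32)
print(first_example)                # 25120, the same as 785 * 32

# Second example from the question: 20 inputs -> 64 -> 64 -> 10
layer_sizes = [20, 64, 64, 10]
counts = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)                       # [1344, 4160, 650]
```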
Neural networks are basically matrix multiplications. The drop you are talking about in the first part is not due to an activation function; it happens only because of the nature of matrix multiplication:
The calculation here is: input * weights = output
so -> [BATCHSIZE, 784] * [784, 32] = [BATCHSIZE, 32] -> output dimension
With that logic we can easily explain how the input shape can be much smaller than the number of units. It gives this calculation:
-> [BATCHSIZE, 20] * [20, 64] = [BATCHSIZE, 64] -> output dimension
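The two shape calculations can be reproduced with NumPy. This is only a sketch: the batch size of 4 and the random values are arbitrary assumptions, and the bias and activation are left out because they do not change the shapes.

```python
import numpy as np

batch_size = 4
rng = np.random.default_rng(0)

# 784 inputs -> 32 units: the output gets narrower.
x_wide = rng.normal(size=(batch_size, 784))
w_down = rng.normal(size=(784, 32))
out_down = x_wide @ w_down
print(out_down.shape)     # (4, 32)

# 20 inputs -> 64 units: the output gets wider.
x_narrow = rng.normal(size=(batch_size, 20))
w_up = rng.normal(size=(20, 64))
out_up = x_narrow @ w_up
print(out_up.shape)       # (4, 64)
```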
Hope that helps!

The Diagram explanation of the LSTM Network

I am working with LSTM for my time series forecasting problem. I have the following network:
model = Sequential()
model.add(LSTM(units=300, activation=activation, input_shape=(20, 1)))
My forecasting problem is to forecast the next 20 time steps by looking back at the last 20 time steps. So, for each iteration, I have an input like (x_t-20...x_t) and forecast the next (x_t+1...x_t+20). For the hidden layer, I use 300 hidden units.
As an LSTM is different from a simple feed-forward neural network, I cannot understand how those 300 hidden units are used for the LSTM cells and how the output comes out. Are there 20 LSTM cells and 300 units for each cell? How is the output generated from these cells? As I describe above, I have 20 time steps to predict; are all of these steps generated from the last LSTM cell? I have no idea. Can someone give a diagram example of this kind of network structure?
Regarding these questions,
I cannot understand how those 300 hidden units are used for the LSTM cells and how the output comes out. Are there 20 LSTM cells and 300 units for each cell? How is the output generated from these cells?
It is simpler to consider the LSTM layer you have defined as a single block. This diagram is heavily borrowed from Francois Chollet's Deep Learning with Python book:
In your model, input shape is defined as (20,1), so you have 20 time-steps of size 1. For a moment, consider that the output Dense layer is not present.
model = Sequential()
model.add(LSTM(300, input_shape=(20,1)))
lstm_7 (LSTM) (None, 300) 362400
The output shape of the LSTM layer is (None, 300), which means that for each sample the layer outputs a single vector of size 300.
output = model.predict(np.zeros((1, 20, 1)))
(1, 300)
input (1,20,1) => batch size = 1, time-steps = 20, input-feature-size = 1.
output (1, 300) => batch size = 1, output-feature-size = 300
Keras recurrently ran the LSTM for 20 time-steps and generated an output of size (300). In the diagram above, this is Output t+19.
Now, if you add the Dense layer after LSTM, the output will be of size 20 which is straightforward.
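The parameter count of 362400 in the summary line can be reproduced by hand. The sketch below assumes the standard LSTM layout with four gate blocks (input, forget, cell candidate, output), each with an input kernel, a recurrent kernel and a bias. The function name is made up for illustration.

```python
def lstm_param_count(input_dim, units):
    # Each of the 4 gate blocks has:
    #   input kernel      input_dim x units
    #   recurrent kernel  units x units
    #   bias              units
    per_gate = input_dim * units + units * units + units
    return 4 * per_gate

n_params = lstm_param_count(input_dim=1, units=300)
print(n_params)   # 362400
```

The 20 time steps do not appear in the formula, because the same weights are reused at every step.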
To understand LSTMs, I'd recommend first spending a few minutes to understand 'plain vanilla' RNNs, as LSTMs are just a more complex version of that. I'll try to describe what's happening in your network if it was a basic RNN.
You are training a single set of weights that is reused at every time step (t-20,...,t). The first weight (let's say W1) is for the inputs. One by one, each of x_t-20,...,x_t is multiplied by W1.
The difference with RNNs is the presence of a separate 'state' (note: not a trained weight) that can start off random or zero and carries information about your sequence across time steps. There's another weight for the state (W2). So starting at the first time step t-20, the initial state is multiplied by W2.
So at timestep t-20 we have the output from W1 (on inputs) and W2 (on state). We combine these outputs at each timestep, apply a non-linear activation function (same as in any NN forward pass), and use the result as the state to pass to the next timestep, i.e. t-19. Because the state has to be calculated at each timestep and passed to the next, these calculations have to happen sequentially starting from t-20. To generate our desired output, we can take the output state at every timestep, or only take the output at the final timestep. As return_sequences=False by default in Keras, you are only using the output at the final timestep, which then goes into your dense layer.
W1 needs to have one dimension equal to the dimension of each timestep input x_t-20... for matrix multiplication to work. This dimension is 1 in your case, as each of the 20 inputs is a 1d vector (a single number), which is multiplied by W1. However, we're free to set the second dimension of the weights as we please - 300 in your case. So W1 is of size 1x300 and is applied 20 times, once for each timestep. W2 acts on the state, which has size 300, so it is of size 300x300.
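The flow described above can be sketched in a few lines of NumPy. This is a plain RNN and not an LSTM or the Keras implementation. The random weights, the tanh activation and the variable names are assumptions for illustration. Here W2 acts on the 300-dimensional state, so it is 300 x 300 in this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
time_steps, input_dim, units = 20, 1, 300

W1 = rng.normal(scale=0.1, size=(input_dim, units))   # input weights, 1 x 300
W2 = rng.normal(scale=0.1, size=(units, units))       # state weights, 300 x 300
b = np.zeros(units)

x = rng.normal(size=(time_steps, input_dim))          # 20 input values, one per time step
state = np.zeros(units)                               # initial state, not a trained weight

all_states = []
for x_step in x:                                      # steps must run in order
    # The same W1 and W2 are reused at every time step.
    state = np.tanh(x_step @ W1 + state @ W2 + b)
    all_states.append(state)

last_output = all_states[-1]                          # what return_sequences=False keeps
sequence_output = np.stack(all_states)                # what return_sequences=True keeps

print(last_output.shape, sequence_output.shape)       # (300,) (20, 300)
```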
This lecture will take you through the basic flow diagram of RNNs that I described above, all the way to more advanced stuff which you can skip. This is a famous explanation of LSTMs if you want to make the leap from basic RNNs to LSTMs, which you may not need to do - there are just more complicated weights and states.

LSTM example with MNIST

I am currently trying to understand the meaning of outputs and states of the tf.nn.rnn function in tensorflow:
outputs, states = tf.nn.rnn(lstm_cell, x, dtype=tf.float32)
through the LSTM MNIST tutorial.
Indeed, with respect to the following tutorial, Understanding LSTM, I am wondering what these variables correspond to.
In my opinion, outputs corresponds to the hidden state (denoted as h_t in the previous link), but I am not sure.
Thus, I understand that outputs is a list of time_steps tensors of shape (batch_size, n_hidden). But why is states a list of 2 tensors of shape (batch_size, n_hidden)? Is it just the cell state for the last time step?
This seems like an old question. Sorry if you figured it out already.
The outputs are the values coming out of the "top" of your LSTM units in the diagram, one per time step. Together they are the output of your LSTM layer.
As you suggest in your question, the states are the temporal states that move from one time step to the next: the C_t and h_t values that travel along the time axis as output/input of the unrolled LSTM layer. What the function returns is their value after the last time step, which is why you get 2 tensors, each of size (batch_size, n_hidden).
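The relationship between outputs and states can be seen in a bare-bones NumPy LSTM. This is a sketch and not the TensorFlow implementation: the weight layout, the function name run_lstm and the sizes are assumptions (28 time steps of 28 inputs, as when MNIST rows are fed one at a time, and an arbitrary n_hidden of 16).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_lstm(x_steps, n_hidden, rng):
    batch_size, n_input = x_steps[0].shape
    # One weight matrix for the four gates, acting on [x_t, h_{t-1}].
    W = rng.normal(scale=0.1, size=(n_input + n_hidden, 4 * n_hidden))
    b = np.zeros(4 * n_hidden)
    h = np.zeros((batch_size, n_hidden))
    c = np.zeros((batch_size, n_hidden))
    outputs = []
    for x_t in x_steps:
        gates = np.concatenate([x_t, h], axis=1) @ W + b
        i, f, g, o = np.split(gates, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell state C_t
        h = sigmoid(o) * np.tanh(c)                    # hidden state h_t
        outputs.append(h)                              # h_t is also this step's output
    return outputs, (c, h)                             # states after the last step

rng = np.random.default_rng(0)
time_steps, batch_size, n_input, n_hidden = 28, 5, 28, 16
x = [rng.normal(size=(batch_size, n_input)) for _ in range(time_steps)]

outputs, states = run_lstm(x, n_hidden, rng)
print(len(outputs), outputs[0].shape)            # 28 (5, 16)
print(len(states), states[0].shape)              # 2 (5, 16)
print(np.array_equal(outputs[-1], states[1]))    # True
```

In this sketch the last element of outputs is the same array as the h part of states, while the c part is only available through states.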

