I am trying to understand a concept of ANN architecture in Keras. The number of input neurons in any NN should equal the number of features/attributes/columns. So, in the case of a matrix of shape (20000, 100), my input shape should have 100 neurons. In the example on the Keras page, I saw this code:
model = Sequential([Dense(32, input_shape=(784,))])
This pretty much means that the input shape has 784 columns and 32 is the dimensionality of the output space, which in turn means that the second layer will receive an input of size 32. My understanding is that such a significant drop happens because some of the units are not activated due to the activation function. Is my understanding correct?
At the same time, another piece of code shows that the number of neurons in the first layer is higher than the number of input features:
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
This example is not clear to me. How can the number of units be larger than the number of input dimensions?
The number of neurons to use in a Dense layer is a topic that is still not agreed upon within the machine learning and data science community. There are many heuristics used to choose it; I refer you to this post on Cross Validated, which provides some more details: https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw.
In summary, the number of hidden units in both of the snippets you posted most likely came from repeated experimentation and trial and error in pursuit of the best accuracy.
For more context: the 784 input neurons most likely come from the MNIST dataset, whose images are 28 x 28 = 784 pixels. I've seen implementations where 32 neurons in the hidden layer work well. Think of each layer as a dimensionality transformation: going down to 32 dimensions doesn't necessarily mean losing accuracy, and going from a lower-dimensional space to a higher-dimensional one is common when you are trying to map your points into a new space that may be easier to classify in.
Finally, in Keras, that number specifies how many neurons the current layer has. Under the hood, Keras figures out the weight matrix needed for forward propagation from the previous layer to the current one. In this case it would be 785 x 32, counting one extra row for the bias unit.
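As a quick sanity check, here is a minimal sketch (using standalone Keras imports; the layer name shown in the summary will vary) confirming that parameter count:
from keras.models import Sequential
from keras.layers import Dense

# a single Dense layer mapping 784 input features to 32 units
model = Sequential([Dense(32, input_shape=(784,))])
model.summary()  # reports 784 * 32 + 32 = 25,120 trainable parameters: a 784 x 32 weight matrix plus one bias per unit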
Neural networks are basically matrix multiplications. The drop you are talking about in the first part is not due to an activation function; it happens simply because of the nature of matrix multiplication.
The calculation here is: input * weights = output
so -> [BATCHSIZE, 784] * [784, 32] = [BATCHSIZE, 32] -> output dimension
With the same logic we can easily explain how the input shape can be much smaller than the number of units; the calculation becomes:
-> [BATCHSIZE, 20] * [20, 64] = [BATCHSIZE, 64] -> output dimension
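A minimal NumPy sketch of those shapes (the batch size and the random values are made up purely for illustration):
import numpy as np

batch_size = 8
x = np.random.rand(batch_size, 20)   # 20 input features per sample
w = np.random.rand(20, 64)           # weight matrix of a Dense layer with 64 units
out = x @ w                          # matrix multiplication
print(out.shape)                     # (8, 64)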
Hope that helped!
To learn more:
https://en.wikipedia.org/wiki/Matrix_multiplication
If there are 10 features and 1 output class (sigmoid activation) with a regression objective:
If I use only 5 neurons in my first dense hidden layer, will the first error be calculated based solely on half of the training feature set? Isn't it imperative to match the number of features with the number of neurons in hidden layer #1 so that the model can see all the features at once? Otherwise it isn't getting the whole picture? The first forward-propagation iteration would use 5 out of 10 features and get the error value (and train during backprop, assuming batch gradient descent). Then the second forward-propagation iteration would see the remaining 5 of the 10 features with updated weights and hopefully arrive at a smaller error. BUT it's only seeing half the features at a time!
Conversely, suppose I have a convolutional 2D layer of 64 neurons and my training shape is (100, 28, 28, 1) (greyscale pictures of cats and dogs). Will each of the 64 neurons see a different 28x28 matrix? No, right, because it can only send one example through the forward propagation at a time? So then only a single picture (cat or dog) should be spanned across the 64 neurons? Why would you want that, since each neuron in that layer has the same filter, stride, padding and activation function? When you define a Conv2D layer... the parameters of each neuron are the same. So is only a part of the training example going into each neuron? Why have 64 neurons, for example? Just have one neuron, use a filter on it and pass it along to a second hidden layer with another filter with different parameters!
Please explain the flaws in my logic. Thanks so much.
EDIT: I just realized that for Conv2D you flatten the training data so it becomes a 1D vector, so a 28x28 image would mean an input layer of 784 neurons. But I am still confused about the dense neural network (paragraph #1 above).
What is your "first" layer?
Normally you have an input layer as the first layer, which does not contain any weights.
The shape of the input layer must match the shape of your feature data.
So basically, when you train a model with 10 features but only have an input layer of shape (None, 5) (where None stands for the batch size), TensorFlow will raise an exception, because it needs data for all inputs in the correct shape.
So what you describe is just not going to happen. If the input layer only accepts 5 features, the remaining 5 features won't be fed into the net in the next iteration; instead, the next sample will be sent to the model (assuming no exception is thrown), and of that next sample, again only the first 5 features would be used.
What you should do instead is use an input layer as the first layer, with a shape matching your features. Then as the second layer you can use any size you like: 1, 10, or 100 dense neurons, it's up to you (and what works well, of course). The shape of the output, in turn, must match the shape of your label data, as in the sketch below.
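A minimal sketch of that idea with made-up data (10 features, a 5-unit hidden layer, 1 output matching the label shape; the sizes are illustrative only):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(100, 10)   # 100 samples, 10 features each
y = np.random.rand(100, 1)    # one label per sample

model = Sequential()
model.add(Dense(5, activation='relu', input_shape=(10,)))  # 5 hidden units, yet every unit still sees all 10 features
model.add(Dense(1, activation='sigmoid'))                  # output shape matches the label shape
model.compile(loss='mse', optimizer='adam')
model.fit(X, y, epochs=1, batch_size=10, verbose=0)        # works; declaring input_shape=(5,) instead would raise an error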
I hope this makes it more clear
I am new to machine learning and to using neural networks with Keras. I am trying to use reinforcement learning with the help of a neural network, which should eventually predict the correct actions for a robot to take in a Monopoly game, if it were to play against humans.
For this I am trying to use a neural network which receives an array of 23 float numbers (defining the player's state) and outputs an array of 7 float numbers (the maximum number of possible actions that can be taken at a given time). My current NN is the following:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(150, input_dim=23, activation='relu'))
model.add(Dense(7, activation='sigmoid'))
model.compile(loss='mse', optimizer=Adam(lr=0.2))
My intention is to have a 3-layer NN with a 150-neuron hidden layer and 7 neurons in the output layer.
#An input example would be:
state = [0.35,0.65,0.35,3.53...] # array of 23 items, float numbers.
output = model.predict(state)
#I expect output to be:
[0.21,0.12,0.98,0.32,0.44,0.12,0.41] #array size of 7
#Then I could simply just use the index with the highest number as the action to take.
action = output.index(max(output))
I am not sure why, but I get this error instead:
ValueError: Error when checking input: expected dense_23_input to have shape (23,) but got array with shape (1,)
I'm sure it would be better if I could just have a single neuron in the last layer predicting integer numbers in a range, for instance 1 to 7. However, I do not know of any activation function which can do this. Please feel free to suggest better NN models for this purpose; I would highly appreciate it. I am aware that this might not be the best possible model for the task.
But essentially, the main question here is, how do I input a single array size 23, and output an array of size 7?
Thank you!!
I'm not so familiar with Keras, but in PyTorch everything is expected to work in batches, and that is likely why the dimensions aren't what you expect.
The input to your first linear (Dense) layer should have dimensions (batch_size, 23). If you want to see how a single example runs through the network, first reshape it with input.reshape(1, -1). The output will then have dims (1, 7). You should also change the last-layer activation to softmax.
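Translated to Keras, a minimal sketch of that fix (the compile settings are placeholders; only the reshape and the 7-way softmax output are the point here):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(150, input_dim=23, activation='relu'))
model.add(Dense(7, activation='softmax'))       # 7 outputs that form a probability distribution over actions
model.compile(loss='categorical_crossentropy', optimizer='adam')

state = np.random.rand(23)                      # a single 23-number state
output = model.predict(state.reshape(1, -1))    # add the batch dimension -> input shape (1, 23)
print(output.shape)                             # (1, 7)
action = int(np.argmax(output[0]))              # index of the highest-scoring action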
Thank you for your contribution! I managed to solve the issue. I finally ended up using a 3-layer NN with a single-neuron output and a sigmoid activation function, which looked like this:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(150, input_shape=(23,), activation='relu'))
model.add(Dense(1, input_shape=(7,), activation='sigmoid'))
model.compile(loss='mse', optimizer=Adam(lr=0.2))
#Required input would look something like this:
input =np.array([0.2,0.1,0.5,0.5,0.8,0.3,0.2,0.2,0.2,0.9,0.2,0.8,0.6,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.1,0.5,0.4])
input = np.reshape(input,(1,-1))
#The output would look something like this:
print(saved_model.predict(input))
#[[0.00249215 0.15893634 0.50805619 0.86176279 0.34417642 0.29258215
0.131994 ]]
From here, I would simply just get the index with the highest probability to determine the class of my input.
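Since model.predict returns a 2-D NumPy array rather than a Python list, the earlier output.index(max(output)) idiom won't work on it; a small sketch of that last step (reusing the names from the snippet above, and assuming the model outputs one score per action as in the printed example):
import numpy as np

scores = saved_model.predict(input)   # 2-D array, e.g. shape (1, 7) as printed above
action = int(np.argmax(scores[0]))    # index of the highest-scoring action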
I have medical longitudinal data on which I am doing research.
To start with, I am working with 4000 sample rows with 3 time steps (3 columns), corresponding to the size of a bone measured in 3 different months.
I am done with the basic model. Now I want to be sure if my understanding of the model is correct.
from keras.models import Sequential
from keras import layers

model = Sequential()
model.add(layers.SimpleRNN(units=10, input_shape=(3, 1), use_bias=True, bias_initializer='zeros', activation="relu", kernel_initializer="random_uniform"))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='sgd')
model.summary()
model.fit(trainX,train_op, epochs=100, batch_size=50, verbose=2)
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
Following are a few doubts I have about this model:
Here return_sequences is False, so shouldn't I get only the last output from the RNN layer? Why is the output of the RNN layer of shape (None, 10)? I assumed it would be (sample size, 1).
Also, my logic below is probably flawed, but I need to resolve it:
units corresponds to output units. Initially my guess was that since there are 3 time steps there had to be 3 output units, but I was surprised that the model worked even with units=128, 10, or 1. How and why is that happening? This question, along with the one above, confuses me the most.
input_shape corresponds to [sample size, number of time steps, features]. Here, I am measuring 1 bone size over 3 time periods. Is my understanding correct that the input shape is (sample size, 3, 1)? Moreover, I am confused about how NumPy represents a 3D array. It seems that to get the required dimensions I need to provide the input as (#features, observations/sample_size, timesteps). Do I have to reshape my inputs according to how NumPy represents 3D arrays, or should I leave them as they are?
Moreover, how can I build a model if I have different sets of features measured over different time frames, or varying numbers of time steps? How can I incorporate that into the above model?
Yes, you get the last output, which is a 10-dimensional vector, not a one-dimensional vector, so getting shape (samples, 10) is correct.
The number of units has nothing to do with the number of timesteps; the number of timesteps is how many times the neurons are applied recurrently, so it is orthogonal to the number of features or units.
Yes, the shape of your inputs should be (samples, 3, 1) and input_shape should be (3, 1); all of this is correct in your code. I am not sure what you mean by "how numpy represents 3d array": the shape is exactly what you specify, and NumPy does not modify input shapes.
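A minimal shape-check sketch under those assumptions (made-up data with 4000 samples, 3 time steps, 1 feature; training is omitted):
import numpy as np
from keras.models import Sequential
from keras import layers

trainX = np.random.rand(4000, 3, 1)   # (samples, timesteps, features)

model = Sequential()
model.add(layers.SimpleRNN(units=10, input_shape=(3, 1), activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

print(model.predict(trainX[:5]).shape)   # (5, 1): the RNN layer outputs (5, 10), the Dense layer reduces it to (5, 1)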
I am working with LSTM for my time series forecasting problem. I have the following network:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(units=300, activation=activation, input_shape=(20, 1)))  # 'activation' is a string set elsewhere, e.g. 'tanh'
model.add(Dense(20))
My forecasting problem is to forecast the next 20 time steps looking back at the last 20 time steps. So, for each iteration, I have an input like (x_t-20, ..., x_t) and forecast the next (x_t+1, ..., x_t+20). For the hidden layer, I use 300 hidden units.
As an LSTM is different from a simple feed-forward neural network, I cannot understand how those 300 hidden units are used for the LSTM cells and how the output comes out. Are there 20 LSTM cells with 300 units each? How is the output generated from these cells? As I describe above, I have 20 time steps to predict; are all of these steps generated from the last LSTM cell? I have no idea. Can someone give a general diagram of this kind of network structure?
Regarding these questions,
I cannot understand how those 300 hidden units are used for the LSTM cells and how the output comes out. Are there 20 LSTM cells with 300 units each? How is the output generated from these cells?
It is simpler to consider the LSTM layer you have defined as a single block. This diagram is heavily borrowed from Francois Chollet's Deep Learning with Python book:
In your model, input shape is defined as (20,1), so you have 20 time-steps of size 1. For a moment, consider that the output Dense layer is not present.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(300, input_shape=(20,1)))
model.summary()
lstm_7 (LSTM) (None, 300) 362400
The output shape of the LSTM layer is (None, 300), which means the output for each sample is a vector of size 300.
output = model.predict(np.zeros((1, 20, 1)))
print(output.shape)
(1, 300)
input (1,20,1) => batch size = 1, time-steps = 20, input-feature-size = 1.
output (1, 300) => batch size = 1, output-feature-size = 300
Keras ran the LSTM recurrently over the 20 time steps and generated an output of size 300. In the diagram above, this is Output t+19.
Now, if you add the Dense layer after the LSTM, the output will be of size 20, which is straightforward.
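Putting the Dense layer back in, a quick sketch of the resulting shapes (dummy all-zeros input, untrained weights):
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(300, input_shape=(20, 1)))   # last hidden state: a vector of size 300
model.add(Dense(20))                        # maps that vector to the 20 forecast steps

output = model.predict(np.zeros((1, 20, 1)))
print(output.shape)                         # (1, 20)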
To understand LSTMs, I'd recommend first spending a few minutes to understand 'plain vanilla' RNNs, as LSTMs are just a more complex version of that. I'll try to describe what's happening in your network if it was a basic RNN.
You are training a single set of weights that are repeatedly used for each time step (t-20,...,t). The first weight (let's say W1) is for inputs. One by one, each of x_t-20,...,x_t is multiplied by W1, then a non-linear activation function is applied - same as any NN forward pass.
The difference with RNNs is the presence of a separate 'state' (note: not a trained weight), that can start off random or zero, and carries information about your sequence across time steps. There's another weight for the state (W2). So starting at the first time step t-20, the initial state is multiplied by W2 and an activation function applied.
So at timestep t-20 we have the output from W1 (on inputs) and W2 (on state). We can combine these outputs at each timestep, and use it to generate the state to pass to the next timestep, i.e. t-19. Because the state has to be calculated at each timestep and passed to the next, these calculations have to happen sequentially starting from t-20. To generate our desired output, we can take each output state across all timesteps - or only take the output at the final timestep. As return_sequences=False by default in Keras, you are only using the output at the final timestep, which then goes into your dense layer.
The weights W1 and W2 need to have one dimension equal to the dimensions of each timestep input x_t-20... for matrix multiplication to work. This dimension is 1 in your case, as each of the 20 inputs are a 1d vector (or number), which is multiplied by W1. However, we're free to set the second dimension of the weights as we please - 300 in your case. So W1 is of size 1x300, and is multiplied 20 times, once for each timestep.
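A minimal NumPy sketch of that recurrence (a plain RNN rather than an LSTM; random weights, biases omitted for brevity):
import numpy as np

timesteps, input_dim, units = 20, 1, 300
x = np.random.rand(timesteps, input_dim)   # the 20 one-dimensional inputs x_t-20 ... x_t
W1 = np.random.rand(input_dim, units)      # input weights, shape 1 x 300
W2 = np.random.rand(units, units)          # state weights, shape 300 x 300

state = np.zeros(units)                    # initial state
for t in range(timesteps):
    state = np.tanh(x[t] @ W1 + state @ W2)   # combine the current input with the previous state

print(state.shape)                         # (300,): the final-timestep output that feeds the Dense layer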
This lecture will take you through the basic flow diagram of RNNs that I described above, all the way to more advanced stuff which you can skip. This is a famous explanation of LSTMs if you want to make the leap from basic RNNs to LSTMs, which you may not need to do - there are just more complicated weights and states.
I'm following this tutorial on using Keras to train a basic conv-net. I find a couple of things confusing though, and the Keras documentation doesn't go into much detail either.
Let's look at the first few layers of the network:
model = Sequential()
model.add(Convolution2D(32, 3, 3, activation='relu', input_shape=(1,28,28)))
model.add(Convolution2D(32, 3, 3, activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
My Questions:
The tutorial describes the first layer as the "input layer". However, the first line includes a Convolution2D function with an input_shape. Am I correct in assuming that this is actually the first hidden layer (a convolution layer), rather than just the input layer, the reason being that we don't need a separate model.add() statement just for the input?
In the Convolution2D() function, we're using 32 filters, each filter being 3x3 pixels. In my understanding, a filter is a small block of pixels which "scans" across the image. So for a 28x28 image, wouldn't we need 676 filters (26*26, since each filter is 3x3)? What does the 32 here mean?
The last line is a Dropout layer. From my understanding, Dropout is a regularization technique, and it's applied to the whole network. So does the Dropout(0.25) here apply a 25% dropout only to the previous layer? Or does it apply to all layers preceding it?
Thanks.
Whenever you call model.fit(), you pass the 28 x 28 images themselves; that is the input to the model, so there is no separate input layer to add. On top of that input, we then perform convolutions to generate feature maps.
A convolution is essentially a sliding dot product between the filter and each patch of the image. So the output of a single 3x3 filter on a 28x28 image in the first layer is a 26 x 26 matrix (a feature map), and we have 32 such matrices, one per filter. That is what the 32 means. Check this for further explanation: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
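A small shape sketch of that (written with the current Conv2D spelling and channels-last image ordering rather than the tutorial's older API):
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))

out = model.predict(np.zeros((1, 28, 28, 1)))
print(out.shape)   # (1, 26, 26, 32): 32 feature maps, each 26 x 26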
This applies dropout with probability 0.25 to the output of the layer immediately preceding it (the max-pooling layer), not to all the layers before it.