Keras model structure questions

I'm following this tutorial on using Keras to train a basic conv-net. I find a couple of things confusing though, and the Keras documentation doesn't go into much detail either.
Let's look at the first few layers of the network:
model = Sequential()
model.add(Convolution2D(32, 3, 3, activation='relu', input_shape=(1,28,28)))
model.add(Convolution2D(32, 3, 3, activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
My Questions:
The tutorial describes the first layer as the "input layer". However, the first line includes a Convolution2D function with an input_shape. Am I correct in assuming that this is actually the first hidden layer (a convolution layer), rather than just the input layer? The reason being that we don't need a separate model.add() statement just for the input?
In the Convolution2D() function, we're using 32 filters, each filter being 3x3 pixels. In my understanding, a filter is a small block of pixels which "scans" across the image. So for a 28x28 image, wouldn't we need 676 filters (26*26, since each filter is 3x3)? What does the 32 here mean?
The last line is a Dropout layer. From my understanding, Dropout is a regularization technique, and it's applied to the whole network. So does the Dropout(0.25) here apply a 25% dropout only to the previous layer? Or does it apply to all layers preceding it?
Thanks.

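Whenever you call model.fit(), you pass in the 28 x 28 images, so that is the input to the model. The first model.add() call therefore defines the first convolutional layer; input_shape only tells it what the input looks like.
On top of that input, the convolution generates feature maps.
A convolution slides a filter over the image and, at each position, multiplies the filter element-wise with the patch underneath it and sums the result. So the output of a single 3 x 3 filter in the first layer is a 26 x 26 matrix, and the same filter is reused at all 676 positions. The layer has 32 such filters and therefore produces 32 such matrices. That is what the 32 means. Check this for further explanation: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
Dropout(0.25) applies dropout with probability 0.25 only to the output of the layer directly before it (here the max-pooling layer), not to all preceding layers.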
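To make the shape arithmetic concrete, here is a small sketch in plain Python. The helper names are made up for illustration, and it assumes stride 1 with no padding, which is what the 26 x 26 figure implies. It shows that each 3 x 3 filter visits 676 positions, while 32 is the number of independent filters:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def conv_param_count(n_filters, kernel_h, kernel_w, in_channels):
    """Trainable parameters: one kernel plus one bias per filter."""
    return n_filters * (kernel_h * kernel_w * in_channels + 1)


side = conv_output_size(28, 3)           # size of one feature map side
positions = side * side                  # positions visited by each filter
params = conv_param_count(32, 3, 3, 1)   # weights and biases of the first layer
print(side, positions, params)
```

So the number of positions follows from the image and filter sizes, and only the number of filters is something you choose.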

Related

Layers to be used after using a pretrained model: When to add GlobalAveragePooling2D()

I am using pretrained models to classify image. My question is what kind of layers do I have to add after using the pretrained model structure in my model, resp. why these two implementations differ. To be specific:
Consider two examples, one using the cats and dogs dataset:
One implementation can be found here. The crucial point is that the base model:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
include_top=False,
weights='imagenet')
base_model.trainable = False
is frozen and a GlobalAveragePooling2D() is added, before a final tf.keras.layers.Dense(1) is added. So the model structure looks like:
model = tf.keras.Sequential([
base_model,
global_average_layer,
prediction_layer
])
which is equivalent to:
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1)
])
So they added not only a final dense(1) layer, but also a GlobalAveragePooling2D() layer before.
The other using the tf flowers dataset:
In this implementation it is different. A GlobalAveragePooling2D() is not added.
feature_extractor_url = "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2"
feature_extractor_layer = hub.KerasLayer(feature_extractor_url,
input_shape=(224,224,3))
feature_extractor_layer.trainable = False
model = tf.keras.Sequential([
feature_extractor_layer,
layers.Dense(image_data.num_classes)
])
Where image_data.num_classes is 5 representing the different flower classification. So in this example a GlobalAveragePooling2D() layer is not added.
I do not understand this. Why is this different? When to add a GlobalAveragePooling2D() or not? And what is better / should I do?
I am not sure if the reason is that in one case the dataset cats and dogs is binary classification and in the other it is a multiclass classification problem. Or the difference is that in one case tf.keras.applications.MobileNetV2 was used to load MobileNetV2 and in the other implementation hub.KerasLayer was used to get the feature_extractor. When I check the model in the first implementation:
I can see that the last layer is a relu activation layer.
When I check the feature_extractor:
model = tf.keras.Sequential([
feature_extractor,
tf.keras.layers.Dense(1)
])
model.summary()
I get the output:
So maybe reason is also that I do not understand the difference between tf.keras.applications.MobileNetV2 vs hub.KerasLayer. The hub.KerasLayer just gives me the feature extractor. I know this, but still I think I did not get the difference between these two methods.
I cannot check the layers of the feature_extractor itself. So feature_extractor.summary() or feature_extractor.layers does not work. How can I inspect the layers here? And how can I know I should add GlobalAveragePooling2D or not?

Summary
Why is this different? When to add a GlobalAveragePooling2D() or not? And what is better / should I do?
In the first case, the model outputs 4-dimensional tensors that are the raw outputs of the last convolutional layer. So you need to flatten them somehow, and in this example you are using GlobalAveragePooling2D (but you could use any other strategy). I can't tell which is better: it depends on your problem, and depending on how the hub.KerasLayer version implements the flattening, they could be exactly the same. That said, I'd just pick one of them and go on: I don't see huge differences between them.
Long answer: understanding Keras implementation
The difference is in the output of both base models: in your Keras example, outputs are of shape (bz, hh, ww, nf), where bz is the batch size, hh and ww are the height and width of the last convolutional layer in the model, and nf is the number of filters (or convolutions) applied in this last layer.
So: this outputs the output of the last convolutions (or filters) of the base model.
Hence, you need to convert those outputs (which you can think them as images) to vectors of shape (bz, n_feats), where n_feats is the number of features the base model is computing. Once this conversion is done, you can stack your classification layer (or as many layers as you want) because at this point you have vectors.
How to compute this conversion? Some common alternatives are taking the average or maximum among the convolutional output (which reduces the size), or you could just reshape them as a single row, or add more convolutional layers until you get a vector as an output (I strongly suggest to follow usual practices like average or maximum).
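As a rough illustration of these alternatives, the sketch below uses NumPy on a random array standing in for the base model output. The sizes (2, 7, 7, 1280) are chosen for illustration only and are not read from a real model:

```python
import numpy as np

# Stand-in for the base model output: (bz, hh, ww, nf).
features = np.random.rand(2, 7, 7, 1280)

avg_pooled = features.mean(axis=(1, 2))           # global average pooling
max_pooled = features.max(axis=(1, 2))            # global max pooling
flattened = features.reshape(len(features), -1)   # reshape as a single row

print(avg_pooled.shape, max_pooled.shape, flattened.shape)
```

Both pooling variants give one value per filter, while flattening keeps every spatial position and therefore produces a much longer vector.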
In your first example, when calling tf.keras.applications.MobileNetV2, you are using the default policy with respect to this last layer, and hence the last convolutional layer is left "as is": some convolutions. You can modify this behavior with the pooling parameter, as documented here:
pooling: Optional pooling mode for feature extraction when include_top is False.
None (default) means that the output of the model will be the 4D tensor output of the last convolutional block.
avg means that global average pooling will be applied to the output of the last convolutional block, and thus the output of the model will be a 2D tensor.
max means that global max pooling will be applied.
In summary, in your first example you build the base model without telling it explicitly what to do with the last layer, so the model keeps returning 4-dimensional tensors, which you immediately convert to vectors with average pooling. You can avoid this explicit average pooling if you tell Keras to do it:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
include_top=False,
pooling='avg', # Tell keras to average last layer
weights='imagenet')
base_model.trainable = False
model = tf.keras.Sequential([
base_model,
# global_average_layer, -> not needed any more
prediction_layer
])
TFHub implementation
Finally, when you use the TensorFlow Hub implementation, since you picked the feature_vector version of the model, it already implements some kind of pooling (I haven't found out how yet) to make sure the model outputs vectors rather than 4-dimensional tensors. So you don't need to explicitly add a layer to convert them, because it is already done.
In my opinion, I prefer Keras implementation since it gives you more freedom to pick the strategy you want (in fact you could keep stacking whatever you want).

Let's say there is a model taking [1, 208, 208, 3] images that has 6 pooling layers with kernels [2, 2, 2, 2, 2, 7], which would result in a feature column of [1, 1, 1, 2048] for an image, given 2048 filters in the last conv layer. Note how the last pooling layer accepts [1, 7, 7, 2048] inputs.
If we relax the constraints on the input image (which is typically the case for object detection models), then after the same set of pooling layers an image of size [1, 104, 208, 3] would produce a pre-last-pooling output of [1, 4, 7, 2048] and [1, 256, 408, 3] would yield [1, 8, 13, 2048]. These maps would carry about the same amount of information as the original [1, 7, 7, 2048], but the original pooling layer would not produce a feature column of shape [1, 1, 1, N]. That is why we switch to a global pooling layer.
In short, a global pooling layer is important if we don't have a strict restriction on the input image size (and don't resize the image as the first op in the model).
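The sizes quoted above can be reproduced with a few lines of plain Python. This is only a sketch: the helper is hypothetical, and it assumes each pooling layer uses a stride equal to its kernel and rounds sizes up, which is the convention that matches the numbers in this answer:

```python
import math


def pooled_size(size, kernels):
    """Apply pooling layers (stride == kernel, sizes rounded up) to one dimension."""
    for k in kernels:
        size = math.ceil(size / k)
    return size


first_five = [2, 2, 2, 2, 2]
for height, width in [(208, 208), (104, 208), (256, 408)]:
    h, w = pooled_size(height, first_five), pooled_size(width, first_five)
    # A fixed 7x7 pooling only gives 1x1 when the map is at most 7x7;
    # global pooling gives 1x1 for any h and w.
    print((height, width), "->", (h, w), "-> fixed 7x7 pool:",
          (math.ceil(h / 7), math.ceil(w / 7)))
```

For the largest input the fixed 7x7 pooling leaves more than one spatial position, which is the case where a global pooling layer is needed.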

I think the difference is in the output of the models.
"https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2" outputs a 1D vector per sample (batch_size of them), so you can't apply Conv2D to it.
The output of tf.keras.applications.MobileNetV2 is probably more complex, so you have more options for transforming it.

How to input an array of n items and output an array of size k in a neural network using keras?

I am new to machine learning and using neural networks with keras. I am trying to use reinforcement learning along with the help of a neural network, which may eventually predict the correct actions for a robot to take in a monopoly game, if it were to play against humans.
For this I am trying to use a neural network which receives an array of 23 float numbers (defining the players state), and outputs an array of 7 float numbers (the maximum number of possible actions that can be taken at a given time). My current NN is the following:
model = Sequential()
model.add(Dense(150, input_dim=23, activation='relu'))
model.add(Dense(7, activation='sigmoid'))
model.compile(loss='mse', optimizer=Adam(lr=0.2))
My intention is to have a 3 layer nn, with a 150 (neurons) hidden layer, and 7 neurons in the last layer.
#An input example would be:
state = [0.35,0.65,0.35,3.53...] # array of 23 items, float numbers.
output = model.predict(state)
#I expect output to be:
[0.21,0.12,0.98,0.32,0.44,0.12,0.41] #array size of 7
#Then I could simply just use the index with the highest number as the action to take.
action = output.index(max(output))
I am not sure why, but I get this error instead:
ValueError: Error when checking input: expected dense_23_input to have shape (23,) but got array with shape (1,)
I'm sure it would be better if I could just have a single last layer neuron predicting integer numbers in a range, for instance numbers 1 to 7. However, I do not know of any activation function which can do this. Please feel free to suggest better nn models for this purpose, I would highly appreciate. I am aware that this might not be the best possible model for this purpose.
But essentially, the main question here is, how do I input a single array size 23, and output an array of size 7?
Thank you!!

I'm not so familiar with Keras, but in PyTorch everything is expected to work in batches, and that's why you're getting more dimensions than you want.
The input to your first linear layer should have dimensions (batch_size, 23). If you want to see how a single example runs through the network, first reshape it with input.reshape(1, -1). The output will have dims (1, 7). You should also change the last layer's activation to softmax.
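A minimal NumPy sketch of the reshape follows. The feature values are made up, and the random scores array is only a stand-in for what model.predict would return for a 7-unit output layer:

```python
import numpy as np

state = np.array([0.35, 0.65, 0.35] + [0.5] * 20)  # 23 made-up features
batch = state.reshape(1, -1)                        # shape (1, 23): one sample

# Stand-in for model.predict(batch): random scores of shape (1, 7).
scores = np.random.rand(1, 7)
action = int(np.argmax(scores[0]))                  # index of the highest score
print(state.shape, batch.shape, action)
```

Note that np.argmax replaces output.index(max(output)) from the question, since predictions come back as NumPy arrays rather than lists.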

Thank you for your contribution! I managed to solve the issue. I finally ended up using a 3 layer nn with a single neuron output and a sigmoid activation function which looked like this:
model = Sequential()
model.add(Dense(150, input_shape=(23,), activation='relu'))
model.add(Dense(1, input_shape=(7,), activation='sigmoid'))
model.compile(loss='mse', optimizer=Adam(lr=0.2))
#Required input would look something like this:
input =np.array([0.2,0.1,0.5,0.5,0.8,0.3,0.2,0.2,0.2,0.9,0.2,0.8,0.6,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.1,0.5,0.4])
input = np.reshape(input,(1,-1))
#The output would look like something like this:
print(saved_model.predict(input))
#[[0.00249215 0.15893634 0.50805619 0.86176279 0.34417642 0.29258215
# 0.131994 ]]
From here, I would simply just get the index with the highest probability to determine the class of my input.

Units in Dense layer in Keras

I am trying to understand a concept of ANN architecture in Keras. Number of input neurons in any NN should be equal to the number of features/attributes/columns. So, in the case of having matrix of (20000,100), my input shape should have 100 neurons. In the example on the Keras page, I saw a code:
model = Sequential([Dense(32, input_shape=(784,)),
which pretty much means that the input shape has 784 columns and 32 is the dimensionality of the output space, which in turn means that the second layer will have an input of 32. My understanding is that such a significant drop happens because some of the units are not activated due to an activation function. Is my understanding correct?
At the same time, another piece of code, shows that number of input neurons is higher than number of features:
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
This example is not clear to me. How can it be that the number of units is larger than the number of input dimensions?

The total number of neurons in a Dense layer is a topic that is still not agreed upon within the machine learning and data science community. There are many heuristics that are used to define this and I refer you to this post on Cross Validated that provides some more details: https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw.
In summary, the number of hidden units between both methods as you have specified most likely originated from repeated experimentation and trial-and-error achieving the best accuracy.
However, for more context the answer to this as I mentioned is through experimentation. 784 for the input neurons most likely comes from the MNIST dataset, which are images that are 28 x 28 = 784. I've seen implementations of neural networks where 32 neurons for the hidden layer is good. Think of each layer as a dimensionality transformation. Even if you go down to 32 dimensions, that doesn't necessarily mean that it will lose accuracy. Also from going from a lower dimensional space to higher dimensional space, that's common if you are trying to map your points to a new space that may be easier for classification.
Finally, in Keras, that number specifies how many neurons the current layer has. Under the hood, it figures out the weight matrix needed for forward propagation from the previous layer to the current layer. It would be 785 x 32 in that case, with 1 extra neuron for the bias unit (Keras stores this as a 784 x 32 kernel plus a separate bias vector of length 32).

Neural networks are basically matrix multiplications. The drop you are talking about in the first part is not due to an activation function; it happens only because of the nature of matrix multiplication:
The calculation here is: input * weights = output
so -> [BATCHSIZE, 784] * [784, 32] = [BATCHSIZE, 32] -> output dimension
With that logic we can easily explain how the input shape can be much smaller than the number of units. It gives this calculation:
-> [BATCHSIZE, 20] * [20, 64] = [BATCHSIZE, 64] -> output dimension
Hope that helps!
To learn more :
https://en.wikipedia.org/wiki/Matrix_multiplication
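The two shape calculations can be checked with NumPy. This is a sketch with random weights and a hypothetical helper name, not how Keras initializes or stores a Dense layer internally:

```python
import numpy as np


def dense_forward(x, n_units, rng):
    """One Dense layer without activation: x @ W + b (random weights for illustration)."""
    weights = rng.normal(size=(x.shape[1], n_units))
    bias = np.zeros(n_units)
    return x @ weights + bias, weights.size + bias.size


rng = np.random.default_rng(0)
out_a, params_a = dense_forward(np.ones((16, 784)), 32, rng)  # fewer units than inputs
out_b, params_b = dense_forward(np.ones((16, 20)), 64, rng)   # more units than inputs
print(out_a.shape, params_a)
print(out_b.shape, params_b)
```

In both cases the output width is simply the number of units, whatever the input width is.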

How to handle variable sized input in CNN with Keras?

I am trying to perform the usual classification on the MNIST database but with randomly cropped digits.
Images are cropped the following way: the first and/or last row and/or column is randomly removed.
I would like to use a Convolutional Neural Network using Keras (and Tensorflow backend) to perform convolution and then the usual classification.
Inputs are of variable size and I can't manage to get it to work.
Here is how I cropped digits
import numpy as np
from keras.utils import to_categorical
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.images
X = np.expand_dims(X, axis=3)
X_crop = list()
for index in range(len(X)):
    X_crop.append(X[index, np.random.randint(0,2):np.random.randint(7,9), np.random.randint(0,2):np.random.randint(7,9), :])
X_crop = np.array(X_crop)
y = to_categorical(digits.target)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_crop, y, train_size=0.8, test_size=0.2)
And here is the architecture of the model I want to use
from keras.layers import Dense, Dropout
from keras.layers.convolutional import Conv2D
from keras.models import Sequential
model = Sequential()
model.add(Conv2D(filters=10,
kernel_size=(3,3),
input_shape=(None, None, 1),
data_format='channels_last'))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train, epochs=100, batch_size=16, validation_data=(X_test, y_test))
Does someone have an idea on how to handle variable sized input in my neural network?
And how to perform classification?

TL/DR - go to point 4
So - before we get to the point - let's fix some problems with your network:
Your network will not work because of activation: with categorical_crossentropy you need to have a softmax activation:
model.add(Dense(10, activation='softmax'))
Vectorize spatial tensors: as Daniel mentioned - you need to, at some stage, switch your tensors from spatial (images) to vectorized (vectors). Currently - applying Dense to the output of a Conv2D is equivalent to a (1, 1) convolution. So basically - the output of your network is spatial, not vectorized, which causes a dimensionality mismatch (you can check that by running your network or checking model.summary()). In order to change that you need to use either GlobalMaxPooling2D or GlobalAveragePooling2D. E.g.:
model.add(Conv2D(filters=10,
kernel_size=(3, 3),
input_shape=(None, None, 1),
padding="same",
data_format='channels_last'))
model.add(GlobalMaxPooling2D())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
Concatenated numpy arrays need to have the same shape: if you check the shape of X_crop you'll see that it's not a spatial matrix. That's because you concatenated matrices with different shapes. Sadly - it's impossible to overcome this issue, as a numpy.array needs to have a fixed shape.
How to make your network train on examples of different shape: The most important thing in doing this is to understand two things. First - is that in a single batch every image should have the same size. Second - is that calling fit multiple times is a bad idea - as you reset inner model states. So here is what needs to be done:
a. Write a function which crops a single batch - e.g. a get_cropped_batches_generator which given a matrix cuts a batch out of it and crops it randomly.
b. Use train_on_batch method. Here is an example code:
from six import next

# nb_of_epochs and nb_of_batches are assumed to be defined by you
batches_generator = get_cropped_batches_generator(X, batch_size=16)
losses = list()
for epoch_nb in range(nb_of_epochs):
    epoch_losses = list()
    for batch_nb in range(nb_of_batches):
        # cropped_x has a different shape for different batches (in general)
        cropped_x, cropped_y = next(batches_generator)
        # a single loss value, provided the model was compiled without metrics
        current_loss = model.train_on_batch(cropped_x, cropped_y)
        epoch_losses.append(current_loss)
    # mean loss of the epoch (a plain list has no .sum() method)
    losses.append(sum(epoch_losses) / len(epoch_losses))
final_loss = sum(losses) / len(losses)
So - a few comments on the code above: First, train_on_batch doesn't use the nice Keras progress bar. It returns a single loss value (for a given batch) - that's why I added logic to compute the loss. You could also use the Progbar callback for that. Second - you need to implement get_cropped_batches_generator - I haven't written that code, to keep my answer a little clearer. You could ask another question on how to implement it. Last thing - I use six to keep compatibility between Python 2 and Python 3.
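For orientation, one way such a generator might look is sketched below. This is not the answerer's implementation: the signature (which also takes the labels y), the max_crop and seed parameters, and the choice to crop up to one row or column per side are all assumptions made for illustration.

```python
import numpy as np


def get_cropped_batches_generator(X, y, batch_size=16, max_crop=1, seed=0):
    """Yield (cropped_x, cropped_y) forever; all images in a batch share one crop."""
    rng = np.random.default_rng(seed)
    n, height, width = X.shape[0], X.shape[1], X.shape[2]
    while True:
        # pick the samples of this batch (requires batch_size <= n)
        idx = rng.choice(n, size=batch_size, replace=False)
        # draw one random crop window and apply it to the whole batch
        top, left = rng.integers(0, max_crop + 1, size=2)
        bottom = height - rng.integers(0, max_crop + 1)
        right = width - rng.integers(0, max_crop + 1)
        yield X[idx, top:bottom, left:right, :], y[idx]
```

Because the crop is drawn once per batch, every image inside a batch has the same size, while different batches can differ.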

Usually, a model containing Dense layers cannot have variable size inputs, unless the outputs are also variable. But see the workaround and also the other answer using GlobalMaxPooling2D - the workaround is equivalent to GlobalAveragePooling2D. These are layers that can eliminate the variable size before a Dense layer and suppress the spatial dimensions.
For an image classification case, you may want to resize the images outside the model.
When my images are in numpy format, I resize them like this:
import numpy as np
from PIL import Image

# imgNumpy is your image array; newSize is the target (width, height)
im = Image.fromarray(imgNumpy)
im = im.resize(newSize, Image.LANCZOS)  # you can use options other than LANCZOS as well
imgNumpy = np.asarray(im)
Why?
A convolutional layer has its weights as filters. There is a static filter size, and the same filter is applied to the image over and over.
But a dense layer has its weights based on the input. If there is 1 input, there is a set of weights. If there are 2 inputs, you've got twice as many weights. But weights must be trained, and changing the number of weights will definitely change the result of the model.
As @Marcin commented, what I've said is true when your input shape for Dense layers has two dimensions: (batchSize,inputFeatures).
But actually keras dense layers can accept inputs with more dimensions. These additional dimensions (which come out of the convolutional layers) can vary in size. But this would make the output of these dense layers also variable in size.
Nonetheless, at the end you will need a fixed size for classification: 10 classes and that's it. For reducing the dimensions, people often use Flatten layers, and the error will appear here.
A possible fishy workaround (not tested):
At the end of the convolutional part of the model, use a lambda layer to condense all the values in a fixed size tensor, probably taking a mean of the side dimensions and keeping the channels (channels are not variable)
Suppose the last convolutional layer is:
model.add(Conv2D(filters,kernel_size,...))
#so its output shape is (None,None,None,filters) = (batchSize,side1,side2,filters)
Let's add a lambda layer to condense the spatial dimensions and keep only the filters dimension:
import keras.backend as K

def collapseSides(x):
    # channels_last (default): x has shape (batchSize, side1, side2, filters)
    axis = 1
    # axis = -1  # use this instead for channels_first: (batchSize, filters, side1, side2)
    step1 = K.mean(x, axis=axis)     # mean of side1
    return K.mean(step1, axis=axis)  # mean of side2
    # this will result in a tensor shape of (batchSize, filters)
Since the amount of filters is fixed (you have kicked out the None dimensions), the dense layers should probably work:
model.add(Lambda(collapseSides,output_shape=(filters,)))
model.add(Dense.......)
.....
In order for this to possibly work, I suggest that the number of filters in the last convolutional layer be at least 10.
With this, you can make input_shape=(None,None,1)
If you're doing this, remember that you can only pass input data with a fixed size per batch. So you have to separate your entire data in smaller batches, each batch having images all of the same size. See here: Keras misinterprets training data shape
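One possible way to form such same-size batches is to group the images by shape first. The helper below is a hypothetical sketch using NumPy and is not taken from the linked question:

```python
from collections import defaultdict

import numpy as np


def batches_by_shape(images, labels, batch_size):
    """Yield (x, y) batches in which every image has the same shape."""
    groups = defaultdict(list)
    for index, image in enumerate(images):
        groups[image.shape].append(index)
    for indices in groups.values():
        for start in range(0, len(indices), batch_size):
            chunk = indices[start:start + batch_size]
            x = np.stack([images[i] for i in chunk])
            y = np.stack([labels[i] for i in chunk])
            yield x, y


images = [np.zeros((8, 8, 1)), np.zeros((7, 8, 1)), np.zeros((8, 8, 1))]
labels = [np.eye(10)[3], np.eye(10)[5], np.eye(10)[1]]
for x, y in batches_by_shape(images, labels, batch_size=2):
    print(x.shape, y.shape)
```

Each yielded pair could then be passed to something like train_on_batch, since every array it contains has a single fixed shape.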

Keras: reshape to connect lstm and conv

This question exists as a GitHub issue, too.
I would like to build a neural network in Keras which contains both 2D convolutions and an LSTM layer.
The network should classify MNIST.
The training data in MNIST are 60000 grey-scale images of handwritten digits from 0 to 9. Each image is 28x28 pixels.
I've split the images into four parts (left/right, up/down) and rearranged them in four orders to get sequences for the LSTM.
|     |      |1 | 2|
|image|  ->  -------  -> 4 sequences: |1|2|3|4|, |4|3|2|1|, |1|3|2|4|, |4|2|3|1|
|     |      |3 | 4|
One of the small sub-images has the dimension 14 x 14. The four sequences are stacked together along the width (shouldn't matter whether width or height).
This creates a vector with the shape [60000, 4, 1, 56, 14] where:
60000 is the number of samples
4 is the number of elements in a sequence (# of timesteps)
1 is the depth of colors (greyscale)
56 and 14 are width and height
Now this should be given to a Keras model.
The problem is to change the input dimensions between the CNN and the LSTM.
I searched online and found this question: Python keras how to change the size of input after convolution layer into lstm layer
The solution seems to be a Reshape layer which flattens the image but retains the timesteps (as opposed to a Flatten layer which would collapse everything but the batch_size).
Here's my code so far:
nb_filters=32
kernel_size=(3,3)
pool_size=(2,2)
nb_classes=10
batch_size=64
model=Sequential()
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
border_mode="valid", input_shape=[1,56,14]))
model.add(Activation("relu"))
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Reshape((56*14,)))
model.add(Dropout(0.25))
model.add(LSTM(5))
model.add(Dense(50))
model.add(Dense(nb_classes))
model.add(Activation("softmax"))
This code creates an error message:
ValueError: total size of new array must be unchanged
Apparently the input to the Reshape layer is incorrect. As an alternative, I tried to pass the timesteps to the Reshape layer, too:
model.add(Reshape((4,56*14)))
This doesn't feel right and in any case, the error stays the same.
Am I doing this the right way ?
Is a Reshape layer the proper tool to connect CNN and LSTM ?
There are rather complex approaches to this problem.
Such as this:
https://github.com/fchollet/keras/pull/1456
A TimeDistributed Layer which seems to hide the timestep dimension from following layers.
Or this: https://github.com/anayebi/keras-extra
A set of special layers for combining CNNs and LSTMs.
Why are there such complicated solutions (at least they seem complicated to me), if a simple Reshape does the trick?
UPDATE:
Embarrassingly, I forgot that the dimensions will be changed by the pooling and (for lack of padding) the convolutions, too.
kgrm advised me to use model.summary() to check the dimensions.
The output of the layer before the Reshape layer is (None, 32, 26, 5),
I changed the reshape to: model.add(Reshape((32*26*5,))).
Now the ValueError is gone, instead the LSTM complains:
Exception: Input 0 is incompatible with layer lstm_5: expected ndim=3, found ndim=2
It seems like I need to pass the timestep dimension through the entire network. How can I do that ? If I add it to the input_shape of the Convolution, it complains, too: Convolution2D(nb_filters, kernel_size[0], kernel_size[1], border_mode="valid", input_shape=[4, 1, 56,14])
Exception: Input 0 is incompatible with layer convolution2d_44: expected ndim=4, found ndim=5

According to the Convolution2D definition, your input must be 4-dimensional with dimensions (samples, channels, rows, cols). This is the direct reason why you are getting the error.
To resolve that you must use the TimeDistributed wrapper. This allows you to apply static (non-recurrent) layers across time.
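To see what shapes this implies, here is a plain-Python trace (shape arithmetic only, not Keras code). It assumes the layers from the question, two 3x3 convolutions without padding followed by 2x2 max pooling, are applied to each of the 4 timesteps separately, which is what TimeDistributed does with the layer it wraps:

```python
def conv_valid(size, kernel=3):
    """Output size of a stride-1 convolution without padding."""
    return size - kernel + 1


def pool(size, pool_size=2):
    """Output size of non-overlapping pooling."""
    return size // pool_size


timesteps, channels, rows, cols = 4, 1, 56, 14
n_filters = 32

rows = pool(conv_valid(conv_valid(rows)))   # 56 -> 54 -> 52 -> 26
cols = pool(conv_valid(conv_valid(cols)))   # 14 -> 12 -> 10 -> 5
per_step_features = n_filters * rows * cols

# Flattening the CNN output per timestep gives the 3D input
# (samples, timesteps, features) that an LSTM expects.
lstm_input_shape = (None, timesteps, per_step_features)
print((n_filters, rows, cols), lstm_input_shape)
```

The per-timestep shape agrees with the (None, 32, 26, 5) reported by model.summary() in the question, and keeping the timestep axis is what gives the LSTM the ndim=3 input it asked for.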
