This may be a silly question, but I wanted to use a convolutional neural network in my deep reinforcement learning project and I ran into a problem I don't understand.
In my project I want to feed the network a 6x7 matrix, which should be equivalent to a black-and-white picture of size 6x7 (42 pixels), right?
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential()
        self.model.add_module("conv_1", torch.nn.Conv2d(in_channels=1, out_channels=16, kernel_size=4, stride=1))
        self.model.add_module("relu_1", torch.nn.ReLU())
        self.model.add_module("max_pool", torch.nn.MaxPool2d(2))
        self.model.add_module("conv_2", torch.nn.Conv2d(in_channels=16, out_channels=16, kernel_size=4, stride=1))
        self.model.add_module("relu_2", torch.nn.ReLU())
        self.model.add_module("flatten", torch.nn.Flatten())
        self.model.add_module("linear", torch.nn.Linear(in_features=16*16*16, out_features=7))

    def forward(self, x):
        x = self.model(x)
        return x
In conv_1, in_channels=1 because I have only one matrix (in image recognition that would mean one color channel). The other in_channels and out_channels values are more or less arbitrary up to the linear layer. I have no idea where I should specify the size of the matrix, but the final output should have size 7, which I set in the linear layer.
The error I get is:
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [6, 7]
There are a few problems with your code. First, you're getting that error message because Conv2d expects a tensor with shape (N, Cin, Hin, Win), where:
N is the batch size
Cin is the number of input channels
Hin is the input image pixel height
Win is the input image pixel width
You're only providing the width and height dimensions. You need to explicitly add a channels and batch dimension, even if the value of those dimensions is only 1:
model = CNN()
example_input = torch.randn(size=(6, 7))  # this is your input image
print(example_input.shape)  # torch.Size([6, 7])
# output = model(example_input)  # uncommenting this reproduces your original error
example_input = example_input.unsqueeze(0).unsqueeze(0)  # adds batch and channel dimensions
print(example_input.shape)  # now torch.Size([1, 1, 6, 7])
output = model(example_input)  # the dimension error is gone
You'll note, however, that you get a different error now:
RuntimeError: Calculated padded input size per channel: (1 x 2). Kernel size: (4 x 4). Kernel size can't be greater than actual input size
This is because after the first conv layer and the max-pooling layer your data has shape 1x2, but the kernel size of the second conv layer is 4, which makes the operation impossible. A 6x7 input image is quite small; either reduce the kernel size to something that works, or use larger images.
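The 1x2 shape can be checked by hand with the usual output-size formula for an unpadded convolution or pooling layer, floor((size - kernel) / stride) + 1. Below is a minimal sketch in plain Python; the helper name out_size is made up for illustration and is not part of PyTorch.

```python
def out_size(size, kernel, stride=1):
    """Output length along one axis for an unpadded conv or pooling layer."""
    return (size - kernel) // stride + 1


# Trace the 6x7 input through conv_1 (kernel 4) and max_pool (kernel 2, stride 2).
h, w = 6, 7
h, w = out_size(h, 4), out_size(w, 4)
after_conv1 = (h, w)
h, w = out_size(h, 2, 2), out_size(w, 2, 2)
after_pool = (h, w)

print(after_conv1, after_pool)  # (3, 4) (1, 2)

# conv_2 needs a feature map of at least 4 x 4, so 1 x 2 cannot be convolved.
fits_conv2 = h >= 4 and w >= 4
print(fits_conv2)  # False
```

The same helper with kernel size 2 gives 5x6, then 2x3 after pooling, then 1x2 after the second convolution.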
Here's a working example:
import torch
from torch import nn


class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential()
        self.model.add_module(
            "conv_1",
            torch.nn.Conv2d(in_channels=1, out_channels=16, kernel_size=2, stride=1),
        )
        self.model.add_module("relu_1", torch.nn.ReLU())
        self.model.add_module("max_pool", torch.nn.MaxPool2d(2))
        self.model.add_module(
            "conv_2",
            torch.nn.Conv2d(in_channels=16, out_channels=16, kernel_size=2, stride=1),
        )
        self.model.add_module("relu_2", torch.nn.ReLU())
        self.model.add_module("flatten", torch.nn.Flatten())
        self.model.add_module("linear", torch.nn.Linear(in_features=32, out_features=7))

    def forward(self, x):
        x = self.model(x)
        return x


model = CNN()
x = torch.randn(size=(6, 7))
x = x.unsqueeze(0).unsqueeze(0)
output = model(x)
print(output.shape)  # has shape (1, 7)
Note that I changed kernel_size to 2, and the final linear layer now has an input size of 32 (16 channels x 1 x 2 after the second conv layer). The output has shape (1, 7); the 1 is the batch size, which in our case was only 1. If you want just the 7 output features, use output = torch.squeeze(output).
Related
I'm using a convolutional neural network (CNN) to preprocess my input for a Long Short-Term Memory (LSTM). I have the following input dimensions: 128 x 10 x 3 x 32 x 32 (batch size, sequence length, color channel, height, width) and would like to obtain the following output dimensions: 128 x 10 x 480 (batch size, sequence length, output CNN/ input LSTM size). Is the code below maintaining the sequence dimension correctly? Am I processing multiple images independently here or are the dimensions getting mixed up? Inputs and Outputs are shaped as they should be, but I'm uncertain about the intermedia steps.
class CNN(nn.Module):
    def __init__(self):
        super().__init__()  # the original called super(CNN_coords, self), which does not match the class name
        self.conv1 = nn.Conv2d(3, 10, 5)
        self.conv2 = nn.Conv2d(10, 20, 5)
        self.conv3 = nn.Conv2d(20, 30, 5)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, i):
        x = i.reshape(-1, i.shape[2], i.shape[3], i.shape[4])
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(i.shape[0], i.shape[1], -1)
        return x
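One way to check that frames are not being mixed is to compare the batched output with the output for a single frame run entirely on its own. The sketch below assumes the layers from the question; the class name FrameCNN and the small batch size of 4 (instead of 128) are chosen here purely for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F


class FrameCNN(nn.Module):  # hypothetical name; layers copied from the question
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, 5)
        self.conv2 = nn.Conv2d(10, 20, 5)
        self.conv3 = nn.Conv2d(20, 30, 5)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, i):
        # Fold batch and sequence into one axis: (B, T, C, H, W) -> (B*T, C, H, W)
        x = i.reshape(-1, i.shape[2], i.shape[3], i.shape[4])
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        # Unfold again: (B*T, 30, 4, 4) -> (B, T, 480)
        return x.view(i.shape[0], i.shape[1], -1)


torch.manual_seed(0)
model = FrameCNN().eval()
batch = torch.randn(4, 10, 3, 32, 32)  # assumed small batch for a quick check

with torch.no_grad():
    out = model(batch)
    # Run frame 7 of sequence 2 on its own, as a (1, 1, 3, 32, 32) input
    single = model(batch[2:3, 7:8])

print(out.shape)  # torch.Size([4, 10, 480])

# If frames were mixed up, these two feature vectors would differ
matches = torch.allclose(out[2, 7], single[0, 0], atol=1e-4)
print(matches)
```

Because reshape and view both keep the row-major element order, frame t of sequence b lands at index b * T + t after folding and returns to position [b, t] after unfolding.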
The goal is to leave the time dimension for the LSTM. The architecture looks like this:
How to solve this error?
Preprocessing of image:
def PreprocessData(img, mask, target_shape_img, target_shape_mask, path1, path2):
"""
Processes the images and mask present in the shared list and path
Returns a NumPy dataset with images as 3-D arrays of desired size
"""
# Pull the relevant dimensions for image and mask
m = len(img) # number of images
i_h,i_w,i_c = target_shape_img # pull height, width, and channels of image
m_h,m_w,m_c = target_shape_mask # pull height, width, and channels of mask
# Define X and Y as number of images along with shape of one image
X = np.zeros((m,i_h,i_w,1), dtype=np.float32)
y = np.zeros((m,m_h,m_w,1), dtype=np.int32)
# RGBA image has 4 channels.
#255 will make the pixel completely opaque,
#value 0 fully transparent,
#values in between will make the pixels partly transparent
# Resize images and masks
for file in img:
# convert image into an array of desired shape (3 channels)
index = img.index(file)
path = os.path.join(path1, file)
single_img = np.asarray(Image.open(path).resize((i_h,i_w))) # (0.21, 0.75, 0.04)
#single_img = np.reshape(single_img,(i_h,i_w,i_c))
single_img = single_img/255.
X[index] = single_img[..., None] #X (dims: # images, img height, img width, img channels)
# convert mask into an array of desired shape 4 channel
single_mask_ind = mask[index]
path = os.path.join(path2, single_mask_ind)
single_mask = np.asarray(Image.open(path).resize((i_h,i_w)))
single_mask = single_mask > 0 #binarizing of targets
# single_mask = single_mask - 1 ### single_mask = single_mask/256???
y[index] = single_mask[..., None] #y (dims: # masks, mask height, mask width, mask channels)
return X, y
Encoder:
def EncoderMiniBlock(inputs, n_filters=32, dropout_prob=0.3, max_pooling=True):
"""
This block uses multiple convolution layers, max pool, relu activation to create an architecture for learning.
Dropout can be added for regularization to prevent overfitting.
The block returns the activation values for next layer along with a skip connection which will be used in the decoder
"""
# Add 2 Conv Layers with relu activation and HeNormal initialization using TensorFlow
# Proper initialization prevents from the problem of exploding and vanishing gradients
# 'Same' padding will pad the input to conv layer such that the output has the same height and width (hence, is not reduced in size)
conv = Conv2D(n_filters,
3, # Kernel size
activation='relu',
padding='same',
kernel_initializer='HeNormal')(inputs)
conv = Conv2D(n_filters,
3, # Kernel size
activation='relu',
padding='same',
kernel_initializer='HeNormal')(conv)
# Batch Normalization will normalize the output of the last layer based on the batch's mean and standard deviation
conv = BatchNormalization()(conv, training=False)
# In case of overfitting, dropout will regularize the loss and gradient computation to shrink the influence of weights on output
if dropout_prob > 0:
conv = tf.keras.layers.Dropout(dropout_prob)(conv)
# Pooling reduces the size of the image while keeping the number of channels same
# Pooling has been kept as optional as the last encoder layer does not use pooling (hence, makes the encoder block flexible to use)
# Below, Max pooling considers the maximum of the input slice for output computation and uses stride of 2 to traverse across input image
if max_pooling:
next_layer = tf.keras.layers.MaxPooling2D(pool_size = (2,2))(conv)
else:
next_layer = conv
# skip connection (without max pooling) will be input to the decoder layer to prevent information loss during transpose convolutions
skip_connection = conv
return next_layer, skip_connection
Decoder:
def DecoderMiniBlock(prev_layer_input, skip_layer_input, n_filters=32):
"""
Decoder Block first uses transpose convolution to upscale the image to a bigger size and then,
merges the result with skip layer results from encoder block
Adding 2 convolutions with 'same' padding helps further increase the depth of the network for better predictions
The function returns the decoded layer output
"""
# Start with a transpose convolution layer to first increase the size of the image
up = Conv2DTranspose(
n_filters,
(3,3), # Kernel size
strides=(2,2),
padding='same')(prev_layer_input)
# Merge the skip connection from previous block to prevent information loss
merge = concatenate([up, skip_layer_input], axis=3)
# Add 2 Conv Layers with relu activation and HeNormal initialization for further processing
# The parameters for the function are similar to encoder
conv = Conv2D(n_filters,
3, # Kernel size
activation='relu',
padding='same',
kernel_initializer='HeNormal')(merge)
conv = Conv2D(n_filters,
3, # Kernel size
activation='relu',
padding='same',
kernel_initializer='HeNormal')(conv)
return conv
U-Net compilation
def UNetCompiled(input_size=(128, 128, 3), n_filters=32, n_classes=3):
"""
Combine both encoder and decoder blocks according to the U-Net research paper
Return the model as output
"""
# Input size represent the size of 1 image (the size used for pre-processing)
inputs = Input(input_size)
# Encoder includes multiple convolutional mini blocks with different maxpooling, dropout and filter parameters
# Observe that the filters are increasing as we go deeper into the network which will increase the # channels of the image
cblock1 = EncoderMiniBlock(inputs, n_filters,dropout_prob=0, max_pooling=True)
cblock2 = EncoderMiniBlock(cblock1[0],n_filters*2,dropout_prob=0, max_pooling=True)
cblock3 = EncoderMiniBlock(cblock2[0], n_filters*4,dropout_prob=0, max_pooling=True)
cblock4 = EncoderMiniBlock(cblock3[0], n_filters*8,dropout_prob=0.3, max_pooling=True)
cblock5 = EncoderMiniBlock(cblock4[0], n_filters*16, dropout_prob=0.3, max_pooling=False)
# Decoder includes multiple mini blocks with decreasing number of filters
# Observe the skip connections from the encoder are given as input to the decoder
# Recall the 2nd output of encoder block was skip connection, hence cblockn[1] is used
ublock6 = DecoderMiniBlock(cblock5[0], cblock4[1], n_filters * 8)
ublock7 = DecoderMiniBlock(ublock6, cblock3[1], n_filters * 4)
ublock8 = DecoderMiniBlock(ublock7, cblock2[1], n_filters * 2)
ublock9 = DecoderMiniBlock(ublock8, cblock1[1], n_filters)
# Complete the model with 1 3x3 convolution layer (Same as the prev Conv Layers)
# Followed by a 1x1 Conv layer to get the image to the desired size.
# Observe the number of channels will be equal to number of output classes
conv9 = Conv2D(n_filters,
3,
activation='relu',
padding='same',
kernel_initializer='he_normal')(ublock9)
conv10 = Conv2D(n_classes, 1, padding='same')(conv9)
# Define the model
model = tf.keras.Model(inputs=inputs, outputs=conv10)
return model
Define the desired shape
target_shape_img = [128, 128, 3]
target_shape_mask = [128, 128,1]
Process data using apt helper function
X, y = PreprocessData(img, mask, target_shape_img, target_shape_mask, path1, path2)
I am not able to understand what is wrong because I am getting this error:
ValueError: in user code:
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py",
line 1021, in train_function *
return step_function(self, iterator)
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py",
line 1010, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py",
line 1000, in run_step **
outputs = model.train_step(data)
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py",
line 859, in train_step
y_pred = self(x, training=True)
File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py",
line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.7/dist-packages/keras/engine/input_spec.py",
line 249, in assert_input_compatibility
f'Input {input_index} of layer "{layer_name}" is '
ValueError: Exception encountered when calling layer "model" (type Functional).
Input 0 of layer "conv2d" is incompatible with the layer: expected axis -1 of input shape to have value 3, but received input with shape (None, 128, 128, 1)
Call arguments received:
• inputs=tf.Tensor(shape=(None, 128, 128, 1), dtype=float32)
• training=True
• mask=None
You seem to have defined a model that takes inputs of shape (128,128,3) and are inputting shape (128,128,1). If you change the input shape when you define the UNetCompiled function, it should solve the issue.
def UNetCompiled(input_size=(128, 128, 1), n_filters=32, n_classes=3):
Or, if the images are colour rather than greyscale, you could change the array shape in the PreprocessData function instead.
You have defined the images as having 1 channel:
# Define X and Y as number of images along with shape of one image
X = np.zeros((m,i_h,i_w,1), dtype=np.float32)
y = np.zeros((m,m_h,m_w,1), dtype=np.int32)
but the comment on the next line says # RGBA image has 4 channels.
If your input images have 4 channels, both the image arrays and the model's input_size need to reflect this.
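A mismatch like this can be caught before training by comparing the per-sample shape of the data with the model's input size. Here is a minimal sketch using plain tuples; check_input_shape is a hypothetical helper, not a Keras function, and the sample count of 10 is an arbitrary assumption.

```python
def check_input_shape(data_shape, model_input_size):
    """Raise early if the per-sample data shape differs from the model input size.

    data_shape: shape of the whole array, e.g. X.shape == (m, H, W, C)
    model_input_size: the tuple passed to Input(...), e.g. (H, W, C)
    """
    per_sample = tuple(data_shape[1:])
    expected = tuple(model_input_size)
    if per_sample != expected:
        raise ValueError(
            f"data has per-sample shape {per_sample}, "
            f"but the model expects {expected}"
        )


# Shapes from the question: X is (m, 128, 128, 1), the model default is (128, 128, 3).
try:
    check_input_shape((10, 128, 128, 1), (128, 128, 3))
except ValueError as err:
    print(err)

# With input_size=(128, 128, 1) the check passes silently.
check_input_shape((10, 128, 128, 1), (128, 128, 1))
```

In the real script the first argument would be X.shape and the second the input_size given to UNetCompiled.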
I have input data for my 2D CNN model, say X_train with shape torch.Size([716, 50, 50]).
My model is:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=4, stride=1, padding=1)
        self.mp1 = nn.MaxPool2d(kernel_size=4, stride=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=1)
        self.mp2 = nn.MaxPool2d(kernel_size=4, stride=2)
        self.fc1 = nn.Linear(2304, 256)
        self.dp1 = nn.Dropout(p=0.2)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        in_size = x.size(0)
        x = F.relu(self.mp1(self.conv1(x)))
        x = F.relu(self.mp2(self.conv2(x)))
        x = x.view(in_size, -1)
        x = F.relu(self.fc1(x))
        x = self.dp1(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
but when I run the model, I always get this error:
---> x = F.relu(self.mp1(self.conv1(x)))
RuntimeError: Expected 4-dimensional input for 4-dimensional weight [32, 1, 4, 4], but got 3-dimensional input of size [64, 50, 50] instead
I understand that my input to the model has size 64 (the batch size) by 50x50 (the size of each input, in this case a picture of a signal).
But I don't understand why it still requires 4-dimensional input when I set in_channels for nn.Conv2d to 1.
How can I solve this input dimension problem, or change the dimension requirement of the model input?
Whether in_channels is 1 or 42 does not matter: it is still an added dimension. It is useful to read the documentation in this respect.
Input and output both have the form (N, C, H, W):
N: batch size
C: channels
H: height in pixels
W: width in pixels
So you need to add the dimension in your case:
# Add a dimension at index 1
x = x.unsqueeze(1)
That's the problem.
You've set in_channels=1, but that doesn't mean the channel dimension doesn't exist.
Expanding your data to shape [64, 1, 50, 50] should solve your problem.
Use .view() on the input tensor, for example x = x.view(-1, 1, 50, 50).
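The suggestions in these answers all produce the same tensor. A short sketch, assuming a random batch with the shapes from the question:

```python
import torch

x = torch.randn(64, 50, 50)   # (batch, height, width), no channel axis yet

a = x.unsqueeze(1)            # insert a size-1 axis at index 1
b = x.view(-1, 1, 50, 50)     # same result via an explicit reshape
c = x[:, None]                # indexing with None also adds an axis

print(a.shape)  # torch.Size([64, 1, 50, 50])

same = torch.equal(a, b) and torch.equal(a, c)
print(same)  # True
```

unsqueeze(1) has the advantage that it does not hard-code the image size.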
Given a pytorch input dataset with dimensions:
dat.shape = torch.Size([128, 3, 64, 64])
This is a supervised learning problem: we have a separate labels.txt file containing one of C classes for each input observation. The value of C is the number of distinct values in the labels file and is presently in the single digits.
I could use assistance on how to connect the layers of a simple network that mixes convolutional and linear layers and performs multiclass classification. The intent is to pass through:
two cnn layers with maxpooling after each
a linear "readout" layer
softmax activation before the output/labels
Here is the core of my (faulty/broken) network. I am unable to determine the proper size/shape required of:
Output of Convolutional layer -> Input of Linear [Readout] layer
class CNNClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3)
        self.maxpool = nn.MaxPool2d(kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3)
        self.linear1 = nn.Linear(32*16*16, C)
        self.softmax1 = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.conv1(x)
        x = self.maxpool(F.leaky_relu(x))
        x = self.conv2(x)
        x = self.maxpool(F.leaky_relu(x))
        x = self.linear1(x)  # Size mismatch error HERE
        x = self.softmax1(x)
        return x
Training of the model is started by :
Xout = model(dat)
This results in :
RuntimeError: size mismatch, m1: [128 x 1568], m2: [8192 x 6]
at the linear1 input. What is needed here ? Note I have seen uses of wildcard input sizes e.g via a view:
..
x = x.view(x.size(0), -1)
x = self.linear1(x) # Size mismatch error HERE
If that is included then the error changes to
RuntimeError: size mismatch, m1: [28672 x 7], m2: [8192 x 6]
Some pointers on how to think about and calculate the cnn layer / linear layer input/output sizes would be much appreciated.
The error
You have miscalculated the output size from convolutional stack. It is actually [batch, 32, 7, 7] instead of [batch, 32, 16, 16].
You have to use reshape (or view) because the output of Conv2d has 4 dimensions ([batch, channels, height, width]), while the input to nn.Linear is required to have 2 dimensions ([batch, features]).
Use this for nn.Linear:
self.linear1 = nn.Linear(32 * 7 * 7, C)
And this in forward:
x = self.linear1(x.view(x.shape[0], -1))
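The 7 x 7 figure follows from the output-size formula floor((size + 2 * padding - kernel) / stride) + 1 applied layer by layer; note that nn.MaxPool2d uses a stride equal to its kernel size when none is given. A minimal sketch in plain Python, where out_size is a helper written for this illustration rather than a PyTorch function:

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length along one axis of a conv or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 64
trace = [size]
size = out_size(size, kernel=3)                       # conv1
trace.append(size)
size = out_size(size, kernel=3, stride=3, padding=1)  # maxpool (stride defaults to 3)
trace.append(size)
size = out_size(size, kernel=3)                       # conv2
trace.append(size)
size = out_size(size, kernel=3, stride=3, padding=1)  # maxpool again
trace.append(size)

features = 32 * size * size  # channels * height * width
print(trace)     # [64, 62, 21, 19, 7]
print(features)  # 1568
```

The result, 1568, is the second dimension of m1 in the [128 x 1568] error message.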
Other possibilities
Many recent architectures pool over the whole spatial extent of each channel (usually called global pooling). In PyTorch this is available as torch.nn.AdaptiveAvgPool2d (or AdaptiveMaxPool2d for max pooling). This approach lets the height and width of your input image vary, because only one value per channel is passed to nn.Linear. This is how it looks:
class CNNClassifier(torch.nn.Module):
    def __init__(self, C=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3)
        self.maxpool = nn.MaxPool2d(kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3)
        self.pooling = torch.nn.AdaptiveAvgPool2d(output_size=1)
        self.linear1 = nn.Linear(32, C)
        self.softmax1 = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.conv1(x)
        x = self.maxpool(F.leaky_relu(x))
        x = self.conv2(x)
        x = self.maxpool(F.leaky_relu(x))
        x = self.linear1(self.pooling(x).view(x.shape[0], -1))
        x = self.softmax1(x)
        return x
So now images of torch.Size([128, 3, 64, 64]) and torch.Size([128, 3, 128, 128]) can be passed to the network.
So the issue is with the way you defined nn.Linear. You set its input size to 32*16*16, but 16*16 is not the spatial size of the feature map that reaches it: in your code 16 and 32 are the channel counts that the Conv2d layers expect as input and produce as output.
If you add print(x.shape) just before the fully connected layer you will get:
torch.Size([Batch, 32, 7, 7])
So your calculation should have been 7*7*32:
self.linear1 = nn.Linear(32*7*7, C)
And then using:
x = x.view(x.size(0), -1)
x = self.linear1(x)
will work perfectly fine. You can read about what view does in: How does the "view" method work in PyTorch?
I'm trying to make a CNN (still a beginner). When trying to fit the model I am getting this error:
ValueError: A target array with shape (10000, 10) was passed for output of shape (None, 6, 6, 10) while using as loss categorical_crossentropy. This loss expects targets to have the same shape as the output.
The shape of labels = (10000, 10)
the shape of the image data = (10000, 32, 32, 3)
Code:
import pickle
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense, Dropout, Activation, Flatten,
Conv2D, MaxPooling2D)
from tensorflow.keras.callbacks import TensorBoard
from keras.utils import to_categorical
import numpy as np
import time
MODEL_NAME = f"_________{int(time.time())}"
BATCH_SIZE = 64
class ConvolutionalNetwork():
'''
A convolutional neural network to be used to classify images
from the CIFAR-10 dataset.
'''
def __init__(self):
'''
self.training_images -- a 10000x3072 numpy array of uint8s. Each
a row of the array stores a 32x32 colour image.
The first 1024 entries contain the red channel
values, the next 1024 the green, and the final
1024 the blue. The image is stored in row-major
order, so that the first 32 entries of the array are the red channel values of the first row of the image.
self.training_labels -- a list of 10000 numbers in the range 0-9.
The number at index I indicates the label
of the ith image in the array data.
'''
# List of image categories
self.label_names = (self.unpickle("cifar-10-batches-py/batches.meta",
encoding='utf-8')['label_names'])
self.training_data = self.unpickle("cifar-10-batches-py/data_batch_1")
self.training_images = self.training_data[b'data']
self.training_labels = self.training_data[b'labels']
# Reshaping the images + scaling
self.shape_images()
# Converts labels to one-hot
self.training_labels = np.array(to_categorical(self.training_labels))
self.create_model()
self.tensorboard = TensorBoard(log_dir=f'logs/{MODEL_NAME}')
def unpickle(self, file, encoding='bytes'):
'''
Unpickles the dataset files.
'''
with open(file, 'rb') as fo:
training_dict = pickle.load(fo, encoding=encoding)
return training_dict
def shape_images(self):
'''
Reshapes the images and scales by 255.
'''
images = list()
for d in self.training_images:
image = np.zeros((32,32,3), dtype=np.uint8)
image[...,0] = np.reshape(d[:1024], (32,32)) # Red channel
image[...,1] = np.reshape(d[1024:2048], (32,32)) # Green channel
image[...,2] = np.reshape(d[2048:], (32,32)) # Blue channel
images.append(image)
for i in range(len(images)):
images[i] = images[i]/255
images = np.array(images)
self.training_images = images
print(self.training_images.shape)
def create_model(self):
'''
Creating the ConvNet model.
'''
self.model = Sequential()
self.model.add(Conv2D(64, (3, 3), input_shape=self.training_images.shape[1:]))
self.model.add(Activation("relu"))
self.model.add(MaxPooling2D(pool_size=(2,2)))
self.model.add(Conv2D(64, (3,3)))
self.model.add(Activation("relu"))
self.model.add(MaxPooling2D(pool_size=(2,2)))
# self.model.add(Flatten())
# self.model.add(Dense(64))
# self.model.add(Activation('relu'))
self.model.add(Dense(10))
self.model.add(Activation(activation='softmax'))
self.model.compile(loss="categorical_crossentropy", optimizer="adam",
metrics=['accuracy'])
def train(self):
'''
Fits the model.
'''
print(self.training_images.shape)
print(self.training_labels.shape)
self.model.fit(self.training_images, self.training_labels, batch_size=BATCH_SIZE,
validation_split=0.1, epochs=5, callbacks=[self.tensorboard])
network = ConvolutionalNetwork()
network.train()
Would appreciate the help, have been trying to fix for an hour.
You need to uncomment the Flatten layer when creating your model. Essentially what this layer does is that it takes a 4D input (batch_size, height, width, num_filters) and unrolls it into a 2D one (batch_size, height * width * num_filters). This is needed to get the output shape you want.
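Uncomment the Flatten layer in create_model(self) and place it right before your output layer. Conv layers don't work on flattened tensors, so Flatten() has to come after the last conv/pooling layer; and a Dense layer applied to a 4D tensor only acts on the last axis, so without Flatten() the output keeps its spatial dimensions. To get an output of the right shape, add the Flatten() layer like this: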
def create_model(self):
    '''
    Creating the ConvNet model.
    '''
    self.model = Sequential()
    # activation is an argument of Conv2D, not of model.add()
    self.model.add(Conv2D(64, (3, 3), activation='relu',
                          input_shape=self.training_images.shape[1:]))
    self.model.add(MaxPooling2D(pool_size=(2, 2)))
    self.model.add(Conv2D(64, (3, 3), activation='relu'))
    self.model.add(MaxPooling2D(pool_size=(2, 2)))
    # self.model.add(Dense(64))
    # self.model.add(Activation('relu'))
    self.model.add(Flatten())
    self.model.add(Dense(10, activation='softmax'))
    self.model.compile(loss="categorical_crossentropy", optimizer="adam",
                       metrics=['accuracy'])
    # prints out the output shape of your model
    print('model output shape:', self.model.output_shape)
The code above will give you a model with an output shape of (None, 10).
Also please use activation as a layer parameter in the future.
Use model.summary() to inspect the output shapes of your model. Without the commented out Flatten() layer the shapes of your layers retain the original dimensions of the image and the shape of the output layer is (None, 6, 6, 10).
What you want to do here is roughly:
1. start with a shape of (batch_size, img height, img width, channels)
2. use convolutions to detect patterns throughout the image by applying filters
3. reduce the img width and height with max pooling
4. then Flatten() the dimensions of the image so that instead of (height, width, features) you end up with just a set of features
5. match against your classes
The commented-out code does step 4; when you remove the Flatten() layer you end up with the wrong set of dimensions at the end.
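The shapes in these steps can be traced by hand. With the default 'valid' padding, each Conv2D or MaxPooling2D layer maps a side length to floor((size - kernel) / stride) + 1, and MaxPooling2D uses its pool size as the stride by default. A minimal sketch in plain Python; valid_out is a helper name invented for this illustration, not a Keras function.

```python
def valid_out(size, kernel, stride=1):
    """Side length after a conv or pooling layer with 'valid' padding."""
    return (size - kernel) // stride + 1


side = 32                      # CIFAR-10 images are 32 x 32 x 3
side = valid_out(side, 3)      # Conv2D(64, (3, 3))      -> 30 x 30 x 64
side = valid_out(side, 2, 2)   # MaxPooling2D((2, 2))    -> 15 x 15 x 64
side = valid_out(side, 3)      # Conv2D(64, (3, 3))      -> 13 x 13 x 64
side = valid_out(side, 2, 2)   # MaxPooling2D((2, 2))    ->  6 x  6 x 64
channels = 64

# Dense(10) only transforms the last axis, so without Flatten the
# two spatial axes survive all the way to the output.
without_flatten = (None, side, side, 10)

# Flatten first merges height, width and channels into one feature axis.
flattened_features = side * side * channels
with_flatten = (None, 10)

print(without_flatten)     # (None, 6, 6, 10)
print(flattened_features)  # 2304
print(with_flatten)        # (None, 10)
```

The first tuple is the output shape reported in the error message, and the last one matches the (10000, 10) label array apart from the batch axis.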
You have to get your model output into the same shape as your labels.
Perhaps the simplest solution would be to ensure the model ends with these layers:
model.add(Flatten())
## possibly an extra dense layer or 2 with 'relu' activation
model.add(Dense(10, activation='softmax'))
This is amongst the most common 'endings' to a categorisation model and is arguably the most straightforward to understand.
It's not clear why you commented out this section:
# self.model.add(Flatten())
# self.model.add(Dense(64))
# self.model.add(Activation('relu'))
which would appear to give you the required output shape?