ANN implementation with Python OpenCV for handwriting recognition - python

There are 350 samples for each of 50 letters. The neural network has 3 layers: an input layer of 400 units (20*20 images), a hidden layer of 200 and an output layer of 50. The training parameters I've used are:
max_steps = 1000
max_err = 0.000001
criteria = (condition, max_steps, max_err)
train_params = dict(term_crit = criteria,
                    bp_dw_scale = 0.1,
                    bp_moment_scale = 0.1)
What are the optimal values I can use for this situation?

I fear you'll have to choose them manually by trial & error.
These values depend on lots of factors and, as far as I know, there's no formula to compute them. When I start training a new ANN, I just run it over and over again changing these parameters slightly each time.
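The trial-and-error loop described above can be organised as a small grid search. This is only a sketch: `train_and_score` is a hypothetical stand-in for "train the ANN with these parameters and measure accuracy on held-out samples", and the candidate values are arbitrary.

```python
import itertools

def train_and_score(bp_dw_scale, bp_moment_scale):
    """Hypothetical stand-in for training the network and returning a
    validation score. This toy version peaks at (0.05, 0.1) so that the
    example runs without any data."""
    return 1.0 - abs(bp_dw_scale - 0.05) - abs(bp_moment_scale - 0.1)

def grid_search(dw_scales, moment_scales):
    """Try every combination and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for dw, mom in itertools.product(dw_scales, moment_scales):
        score = train_and_score(dw, mom)
        if score > best_score:
            best_params, best_score = (dw, mom), score
    return best_params, best_score

best_params, best_score = grid_search([0.01, 0.05, 0.1], [0.0, 0.1, 0.5])
print(best_params, best_score)
```

In practice the score should come from samples the network was not trained on, otherwise the search simply rewards overfitting.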


Finding patterns in time series with PyTorch

I started PyTorch with image recognition. Now I want to test (very basically) with pure NumPy arrays. I struggle with getting the setup to work, so basically I have vectors with values between 0 and 1 (normalized curves). Those vectors are always of length 1500 and I want to find e.g. "high values at the beginning" or "sine wave-like function", "convex", "concave" etc. stuff like that, so just shapes of those curves.
My training set consists of many vectors with their classes; I have chosen 7 classes. The net should be trained to classify a vector into one or more of those 7 classes (not one hot).
I'm struggling with multiple issues, but first my very basic Net
class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(Net, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim)
        self.fc = nn.Linear(self.hidden_dim, output_dim)
    def forward(self, x):
        h0 = torch.zeros(self.layer_dim, x.size(1), self.hidden_dim).requires_grad_()
        out, h0 = self.rnn(x, h0.detach())
        out = out[:, -1, :]
        out = self.fc(out)
        return out
network = Net(1500, 70, 20, 7)
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice? It is a time series, but then again it is an image recognition problem when plotting the curve.
Now, this here is an attempt to batch the data. The data object contains all training curves together with the correct classifiers.
def train(epoch):
    batching = True
    index = 0
    # monitor the cumulative loss for an epoch
    cummloss = []
    # start batching some curves
    while batching:
        # here I start clustering some curves to a batch and normalize the curves
        _input = []
        batch_size = min(len(data)-1, index+batch_size_train) - index
        for d in data[index:min(len(data)-1, index+batch_size_train)]:
            y = np.array(d['data']['y'], dtype='d')
            y = np.multiply(y, y.max())
            y = y[0:1500]
            y = np.pad(y, (0, max(1500-len(y), 0)), 'edge')
            if len(_input) == 0:
                _input = y
            _input = np.vstack((_input, y))
        input = torch.from_numpy(_input).float()
        input = torch.reshape(input, (1, batch_size, len(y)))
        target = np.zeros((1,7))
        # the correct classes have indices, so I create a vector with 1 at the correct locations
        for _index in np.array(d['classifier']):
            target[0,_index-1] = 1
        target = torch.from_numpy(target)
        # get the result from the network
        output = network(input)
        # is this a good loss function?
        loss = F.l1_loss(output, target)
        index = index + batch_size_train
        if index > len(data):
            batching = False
for e in range(1, n_epochs):
    print('Epoch: ' + str(e))
The problem I'm facing right now is that the loss changes very little, even with hundreds of epochs.
Are there existing examples of this kind of problem? I didn't find any, just pure png/jpg image recognition. When I convert the curves to png I have little trouble training a net: I took DenseNet and it worked just fine, but it seems to be super overkill for this simple task.
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice?
In theory what model you choose does not matter as much as "How" you formulate your problem.
But in your case the most obvious limitation you're going to face is your sequence length: 1500. An RNN stores information across steps and typically runs into trouble over long sequences, with vanishing or exploding gradients.
LSTM networks were developed to circumvent this limitation with a memory cell, but even then, for long sequences, they are still limited by the amount of information that can be stored in the cell.
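A rough way to see why 1500 steps is a problem: backpropagating through time multiplies the gradient by some factor at every step. The sketch below assumes a single constant scalar factor per step, which is a strong simplification of what happens inside a real RNN, but it shows the scale of the effect.

```python
def gradient_scale(per_step_factor, n_steps):
    """Overall scaling of a gradient after n_steps repeated multiplications."""
    return per_step_factor ** n_steps

# A factor slightly below 1 vanishes, a factor slightly above 1 explodes.
shrinking = gradient_scale(0.99, 1500)
growing = gradient_scale(1.01, 1500)
print(shrinking, growing)
```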
You could try using a CNN network as well and think of it as an image.
Are there existing examples of this kind of problem?
I don't know, but I have some suggestions. If I understood your problem correctly, you're going from a (1500, 1) input to a (7, 1) output, where 6 of the 7 positions are 0 and the position of the corresponding class is 1.
I don't see any activation function. Usually, when dealing with multiple classes, you don't compute the loss directly on the output of the dense layer; you first apply a normalizing function such as softmax and then compute the loss.
From your description of the features, which take the form of sine-like structures, the closest thing that comes to mind is the frequency domain. As such, if you have an input image, just transform it to the frequency domain with a Fourier transform and use that as your feature input.
It might be best to look for such projects on the internet. One such project you might want to look at is the research paper or video from this group (they have some Jupyter notebooks for you to try), or any similar work. They use Fourier features that go through a multi-layer perceptron (MLP).
I am not sure what exactly you want to do, but it seems like a classification task. You would use an RNN if you want your neural network to work with a sequence. To me it seems like the 1500 dimensions are independent and as such can just be treated as input.
Regarding the last layer, for a classification problem it is usually a probability distribution obtained by applying softmax (only if the classes are mutually exclusive, i.e. the probabilities sum to 1), in which, given an input, the net gives the probability of it belonging to each class. If we are predicting multiple classes at once, we use sigmoid as the last layer of the neural network.
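The difference between the two output layers can be illustrated without any framework. In this sketch the logits are made-up numbers standing in for the output of the final linear layer, and `multi_hot` is a hypothetical helper that builds the kind of target vector the question describes (1-based class indices, 7 classes).

```python
import math

def softmax(logits):
    """Probabilities over mutually exclusive classes; they sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    """Independent probability for one class."""
    return 1.0 / (1.0 + math.exp(-v))

def multi_hot(class_indices, n_classes=7):
    """Target vector with a 1 at every class that applies (1-based indices)."""
    target = [0.0] * n_classes
    for idx in class_indices:
        target[idx - 1] = 1.0
    return target

logits = [2.0, -1.0, 0.5, 3.0, -2.0, -0.5, 1.0]  # made-up network outputs
p_single = softmax(logits)                # one class wins
p_multi = [sigmoid(v) for v in logits]    # each class decided on its own
predicted = [i + 1 for i, p in enumerate(p_multi) if p >= 0.5]
print(predicted, multi_hot([1, 4]))
```

With sigmoid outputs, several classes can exceed the 0.5 threshold at once, which matches the "one or more of those 7 classes" requirement.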
Regarding your loss, there are many losses you can try and see if they are better. Once again, for different features you have to know what exactly is the measurement of distance (a.k.a. how different 2 things are). Check out this website, or just any loss function explanations on the net.
So you should try a simple MLP on top of Fourier features as a starting point, assuming that is your feature vector.
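As a starting point for the Fourier-feature idea, the sketch below turns one curve into a short vector of spectrum magnitudes with NumPy. The function name `fourier_features`, the choice of 32 coefficients and the synthetic sine curve are all assumptions made for illustration; they do not come from the project mentioned above.

```python
import numpy as np

def fourier_features(curve, n_coeffs=32):
    """Magnitudes of the first n_coeffs real-FFT coefficients of a 1-D curve,
    scaled by the curve length so that index 0 is the mean value."""
    spectrum = np.fft.rfft(np.asarray(curve, dtype=float))
    return np.abs(spectrum[:n_coeffs]) / len(curve)

# Synthetic example: 5 full sine periods over 1500 samples, values in [0, 1].
t = np.linspace(0.0, 1.0, 1500, endpoint=False)
sine_curve = 0.5 + 0.5 * np.sin(2 * np.pi * 5 * t)
features = fourier_features(sine_curve)
print(features.shape, int(np.argmax(features[1:])) + 1)
```

A vector like `features` is much shorter than the raw 1500 samples and could be fed to a small MLP in place of the raw curve.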
Image recognition is different from time-series data. In the imaging domain, your data set might have more similarity with problems like activity recognition or video recognition, which have a temporal component. So I'd recommend looking into some models for those.
As for the current model, I'd recommend using an LSTM instead of an RNN. Also, for classification you need to use an activation function in your final layer. This should be softmax with a cross-entropy based loss, or sigmoid with MSE loss.
Keras has a TimeDistributed layer which makes it easy to handle time components. You can use a similar approach with PyTorch by applying linear layers followed by an LSTM.
Look into these for better understanding:
Activity Recognition :
How to implement time-distributed dense (TDD) layer in PyTorch
Activation Function ::

PyTorch Binary Classification - same network structure, 'simpler' data, but worse performance?

To get to grips with PyTorch (and deep learning in general) I started by working through some basic classification examples. One such example was classifying a non-linear dataset created using sklearn (full code available as notebook here)
n_pts = 500
X, y = datasets.make_circles(n_samples=n_pts, random_state=123, noise=0.1, factor=0.2)
x_data = torch.FloatTensor(X)
y_data = torch.FloatTensor(y.reshape(500, 1))
This is then accurately classified using a pretty basic neural net
class Model(nn.Module):
    def __init__(self, input_size, H1, output_size):
        super().__init__()
        self.linear = nn.Linear(input_size, H1)
        self.linear2 = nn.Linear(H1, output_size)
    def forward(self, x):
        x = torch.sigmoid(self.linear(x))
        x = torch.sigmoid(self.linear2(x))
        return x
    def predict(self, x):
        pred = self.forward(x)
        if pred >= 0.5:
            return 1
        return 0
As I have an interest in health data, I then decided to try the same network structure to classify a basic real-world dataset. I took heart rate data for one patient from here, and altered it so all values > 91 would be labelled as anomalies (i.e. a 1, with everything <= 91 labelled a 0). This is completely arbitrary, but I just wanted to see how the classification would work. The complete notebook for this example is here.
What is not intuitive to me is why the first example reaches a loss of 0.0016 after 1,000 epochs, whereas the second example only reaches a loss of 0.4296 after 10,000 epochs.
Perhaps I am being naive in thinking that the heart rate example would be much easier to classify. Any insights to help me understand why this is not what I am seeing would be great!
Your input data is not normalized.
Use x_data = (x_data - x_data.mean()) / x_data.std()
Increase the learning rate: optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
You'll get convergence in only 1000 iterations.
More details
The key difference between the two examples you have is that the data x in the first example is centered around (0, 0) and has very low variance.
On the other hand, the data in the second example is centered around 92 and has relatively large variance.
This initial bias in the data is not taken into account when you randomly initialize the weights which is done based on the assumption that the inputs are roughly normally distributed around zero.
It is almost impossible for the optimization process to compensate for this gross deviation - thus the model gets stuck in a sub-optimal solution.
Once you normalize the inputs, by subtracting the mean and dividing by the std, the optimization process becomes stable again and rapidly converges to a good solution.
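One way to see why the un-normalized input stalls training is to look at the slope of the sigmoid at the values it actually receives. The sketch below assumes a single weight of 0.5 and no bias, which is an illustrative choice and not the initialization used in the notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid, s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

w = 0.5  # assumed weight magnitude, for illustration only

# Raw heart rate around 92: the unit is saturated and passes almost no gradient.
grad_raw = sigmoid_grad(w * 92.0)
# Normalized input around 0: the unit sits where the sigmoid is steepest.
grad_normalized = sigmoid_grad(w * 0.0)
print(grad_raw, grad_normalized)
```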
For more details about input normalization and weights initialization, you can read section 2.2 in He et al Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (ICCV 2015).
What if I cannot normalize the data?
If, for some reason, you cannot compute mean and std data in advance, you can still use nn.BatchNorm1d to estimate and normalize the data as part of the training process. For example
class Model(nn.Module):
    def __init__(self, input_size, H1, output_size):
        super().__init__()
        self.bn = nn.BatchNorm1d(input_size)  # adding batchnorm
        self.linear = nn.Linear(input_size, H1)
        self.linear2 = nn.Linear(H1, output_size)
    def forward(self, x):
        x = torch.sigmoid(self.linear(self.bn(x)))  # batchnorm the input x
        x = torch.sigmoid(self.linear2(x))
        return x
This modification, without any change to the input data, yields similar convergence after only 1000 epochs.
A minor comment
For numerical stability, it is better to use nn.BCEWithLogitsLoss instead of nn.BCELoss. To this end, you need to remove the torch.sigmoid from the forward() output; the sigmoid will be computed inside the loss.
See, for example, this thread regarding the related sigmoid + cross entropy loss for binary predictions.
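The stability point can be reproduced in plain Python. The sketch below compares a naive "sigmoid, then log" binary cross-entropy with a commonly used rearrangement that works directly on the logit. It illustrates the idea only and is not the code PyTorch runs internally.

```python
import math

def bce_naive(logit, target):
    """Sigmoid followed by log: breaks once the sigmoid rounds to 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """Same loss written as max(x, 0) - x*z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For moderate logits both versions agree.
naive_ok = bce_naive(2.0, 1.0)
stable_ok = bce_with_logits(2.0, 1.0)

# For a confident wrong prediction the naive version hits log(0).
try:
    naive_saturated = bce_naive(40.0, 0.0)
except ValueError:
    naive_saturated = None  # math.log(0.0) is a domain error
stable_saturated = bce_with_logits(40.0, 0.0)
print(naive_ok, stable_ok, naive_saturated, stable_saturated)
```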
Let's start by understanding how neural networks work: neural networks observe patterns, hence the necessity for large datasets. In the case of example two, the pattern you intend to find is: if HR < 91, label = 0. This if-condition can be represented by the formula sigmoid((HR-91) * 1); if you plug various values into the formula, you can see that all values < 91 get label 0 and the others get label 1. I have inferred this formula, and it could be anything as long as it gives the correct values.
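The formula can be checked by plugging in a few heart rates. In this sketch the 0.5 cut-off and the helper name `predict_label` are assumptions; note that a value of exactly 91 lands on the boundary, where the sigmoid returns 0.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(hr, w=1.0, threshold=91.0):
    """Label from sigmoid((HR - 91) * w), cut at 0.5."""
    return 1 if sigmoid((hr - threshold) * w) >= 0.5 else 0

for hr in (80, 90, 92, 100):
    print(hr, round(sigmoid(hr - 91.0), 4), predict_label(hr))
```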
Basically, we apply the formula wx+b, where x is our input data and we learn the values for w and b. Initially the values are all random, so getting the b value from 1030131190 (a random value) to maybe 98 is fast: since the loss is large, the learning rate allows the values to jump fast. But once you reach 98, your loss is decreasing, and when you apply the learning rate it takes more time to get closer to 91, hence the slow decrease in loss. As the values get closer, the steps taken become even smaller.
This can be confirmed via the loss values: they are constantly decreasing. Initially the decrease is large, but then it becomes smaller. Your network is still learning, but slowly.
Hence, in deep learning, you use a method called a stepped learning rate, where you decrease your learning rate as the number of epochs increases, so that learning is fast at the start and takes finer steps later.
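A stepped schedule of this kind can be written in a few lines. The drop factor and the number of epochs between drops below are assumed values that would need tuning for the heart rate example.

```python
def step_decay(initial_lr, epoch, drop_factor=0.5, epochs_per_drop=1000):
    """Multiply the learning rate by drop_factor every epochs_per_drop epochs."""
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)

for epoch in (0, 999, 1000, 2500):
    print(epoch, step_decay(0.01, epoch))
```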

Why my one-filter convolutional neural network is unable to learn a simple gaussian kernel?

I was surprised that the deep learning algorithms I had implemented did not work, and I decided to create a very simple example, to understand the functioning of CNN better. Here is my attempt of constructing a small CNN for a very simple task, which provides unexpected results.
I have implemented a simple CNN with only one layer of one filter. I have created a dataset of 5000 samples, the inputs x being 256x256 simulated images, and the outputs y being the corresponding blurred images (y = signal.convolve2d(x,gaussian_kernel,boundary='fill',mode='same')).
Thus, I would like my CNN to learn the convolutional filter which would transform the original image into its blurred version. In other words, I would like my CNN to recover the gaussian filter I used to create the blurred images. Note: As I want to 'imitate' the convolution process such as it is described in the mathematical framework, I am using a gaussian filter which has the same size as my images: 256x256.
It seems to me quite an easy task, and nonetheless, the CNN is unable to provide the results I would expect. Please find below the code of my training function and the results.
# Parameters
size_image = 256
normalization = 1
sigma = 7
n_train = 4900
ind_samples_training =np.linspace(1, n_train, n_train).astype(int)
nb_epochs = 5
minibatch_size = 5
learning_rate = np.logspace(-3,-5,nb_epochs)
seed = 3
n_train = len(ind_samples_training)
costs = []
# Create Placeholders of the correct shape
X = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'X')
Y_blur_true = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'Y_true')
learning_rate_placeholder = tf.placeholder(tf.float32, shape=[])
# parameters to learn --should be an approximation of the gaussian filter
filter_to_learn = tf.get_variable('filter_to_learn',\
                                  shape = [size_image,size_image,1,1],\
                                  dtype = tf.float64,\
                                  initializer = tf.contrib.layers.xavier_initializer(seed = 0),\
                                  trainable = True)
# Forward propagation: Build the forward propagation in the tensorflow graph
Y_blur_hat = tf.nn.conv2d(X, filter_to_learn, strides = [1,1,1,1], padding = 'SAME')
# Cost function: Add cost function to tensorflow graph
cost = tf.losses.mean_squared_error(Y_blur_true,Y_blur_hat,weights=1.0)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
opt_adam = tf.train.AdamOptimizer(learning_rate=learning_rate_placeholder)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = opt_adam.minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
lr = learning_rate[0]
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
    # Run the initialization
    sess.run(init)
    # Do the training loop
    for epoch in range(nb_epochs):
        minibatch_cost = 0.
        seed = seed + 1
        permutation = list(np.random.permutation(n_train))
        shuffled_ind_samples = np.array(ind_samples_training)[permutation]
        # Learning rate update
        if learning_rate.shape[0]>1:
            lr = learning_rate[epoch]
        nb_minibatches = int(np.ceil(n_train/minibatch_size))
        for num_minibatch in range(nb_minibatches):
            # Minibatch indices
            ind_minibatch = shuffled_ind_samples[num_minibatch*minibatch_size:(num_minibatch+1)*minibatch_size]
            # Loading of the original image (X) and the blurred image (Y)
            minibatch_X, minibatch_Y = load_dataset_blur(ind_minibatch,size_image, normalization, sigma)
            _ , temp_cost, filter_learnt = sess.run([optimizer,cost,filter_to_learn],\
                feed_dict = {X:minibatch_X, Y_blur_true:minibatch_Y, learning_rate_placeholder: lr})
I have run the training on 5 epochs of 4900 samples, with a batch size equal to 5. The gaussian kernel has a variance of 7^2=49.
I have tried to initialize the filter to be learnt both with the Xavier initializer method provided by TensorFlow, and with the true values of the Gaussian kernel we would actually like to learn. In both cases, the filter that is learnt is too different from the true Gaussian one, as can be seen in the two images available at
By examining the photos, it seems like the network is learning OK, as the predicted image is not so far off the true label. For better results you can tweak some hyperparameters, but that is not the issue here.
I think what you are missing is the fact that different kernels can get quite similar results since it is a convolution.
Think about it: you are multiplying some matrix with another, and then summing all the results to create a new pixel. Now, if the true label sum is 10, it could be the result of 2.5 + 2.5 + 2.5 + 2.5 or of -10 + 10 + 10 + 0.
What I am trying to say is that your network could be learning just fine, but you will get different values in the conv kernel than in the original filter.
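The arithmetic in this argument can be replayed on a single patch. The sketch assumes a 4-pixel patch of ones, chosen only so that the two sums from the text appear directly.

```python
def apply_kernel(patch, kernel):
    """One output pixel of a convolution: elementwise product, then sum."""
    return sum(p * k for p, k in zip(patch, kernel))

patch = [1.0, 1.0, 1.0, 1.0]          # assumed flat image patch
kernel_a = [2.5, 2.5, 2.5, 2.5]       # smooth kernel
kernel_b = [-10.0, 10.0, 10.0, 0.0]   # very different kernel

out_a = apply_kernel(patch, kernel_a)
out_b = apply_kernel(patch, kernel_b)
print(out_a, out_b)
```

On this patch both kernels produce the same pixel, so a loss computed on that pixel alone cannot tell them apart.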
I think this would better serve as a comment as it's somewhat speculative, but it's too long...
Hard to say what exactly is wrong, but there could be multiple culprits here. For one, squared error provides a weak signal in the case that target and prediction are already quite similar -- and while the Xavier-initialized filter looks quite bad, the predicted (filtered) image isn't too far off the target. You could experiment with other metrics such as absolute error (e.g. 1-norm instead of 2-norm).
Second, adding regularization should help, i.e. add a weight penalty to the loss function to encourage the filter values to become small where they are not needed. As it is, what I suppose happens is: The random values in the filter average out to about 0, leading to a similar "filtering" effect as if they were actually all 0. As such, the learning algorithm doesn't have much incentive to actually pull them to 0. By adding a weight penalty, you provide this incentive.
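A weight penalty of the kind suggested here can be sketched in plain Python. The penalty strength `lam` is an assumed value that would need tuning; in the question's TensorFlow graph the same term would be added to `cost`, but that wiring is not shown here.

```python
def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def l2_penalty(weights, lam):
    """lam times the sum of squared filter values."""
    return lam * sum(w * w for w in weights)

def regularized_loss(pred, target, weights, lam=0.1):
    """Data term plus penalty: large filter values now cost something."""
    return mse(pred, target) + l2_penalty(weights, lam)

loss_value = regularized_loss([1.0, 2.0], [1.0, 4.0], weights=[1.0, -2.0])
print(loss_value)
```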
Third, it could just be Adam messing up. It is known to provide "strange" non-optimal solutions in some very simple (e.g. convex) problems. Maybe try default Gradient Descent with learning rate decay (and possibly momentum).

High Train Set Accuracy, Low Test Set Accuracy, Tensorflow, Regularization, Dropout Tried

I am trying to build a model using Tensorflow NN.
Input Matrix Size: [6699, m] -> 'm' examples;
Output Matrix Size: [11, m] -> 11 output layer nodes with softmax implementation
I am consistently getting very high Train accuracy (>95%) and quite low Test accuracy (~20-30%). Some of the things I tried:
Increased training set size from around m = 1200 to m = 13000
Added Regularization (lambda = 0.5 & 0.7)
Added Dropouts (keep_prob = 0.5)
Gradient Descent and then Adam optimizer (which is useful only for speeding up the convergence)
Tried various values of learning rates
Tried changing the number of layers and the number of neurons in each layer, from a shallow network with 1 hidden layer up to 6 hidden layers.
Currently using 1200 num_epochs, but the cost pretty much stabilizes after 500-600 epochs.
Tried some options on mini_batch size as well.
I have tried all of these over the last week or so. Still, I am unable to work out what is causing such a large gap between training set and test set accuracy, and I get this gap in pretty much every scenario I described above. It's a clear case of overfitting, and I have pretty much exhausted my options.
Please suggest what more can I do.

Python neural network accuracy - correct implementation?

I wrote a simple neural net/MLP and I'm getting some strange accuracy values and wanted to double check things.
This is my intended setup: a features matrix with 913 samples and 192 features (913,192). I'm classifying 2 outcomes, so my labels are binary and have shape (913,1). 1 hidden layer with 100 units (for now). All activations will use tanh and all losses use L2 regularization, optimized with SGD.
The code is below. It was written in Python with the Keras framework (but my question isn't specific to Keras).
input_size = 192
hidden_size = 100
output_size = 1
lambda_reg = 0.01
learning_rate = 0.01
num_epochs = 100
batch_size = 10
model = Sequential()
model.add(Dense(input_size, hidden_size, W_regularizer=l2(lambda_reg), init='uniform'))
model.add(Dense(hidden_size, output_size, W_regularizer=l2(lambda_reg), init='uniform'))
sgd = SGD(lr=learning_rate, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd, class_mode="binary")
history = History()
model.fit(features_all, labels_all, batch_size=batch_size, nb_epoch=num_epochs, show_accuracy=True, verbose=2, validation_split=0.2, callbacks=[history])
score = model.evaluate(features_all, labels_all, show_accuracy=True, verbose=1)
I have 2 questions:
This is my first time using Keras, so I want to double check that the code I wrote is actually correct for what I want it to do in terms of my choice of parameters and their values etc.
Using the code above, I get training and test set accuracy hovering around 50-60%. Maybe I'm just using bad features, but I wanted to test to see what might be wrong, so I manually set all the labels and features to something that should be predictable:
labels_all[:500] = 1
labels_all[500:] = 0
features_all[:500] = np.ones(192)*500
features_all[500:] = np.ones(192)
So I set the first 500 samples to have a label of 1, everything else is labelled 0. I set all the features manually to 500 for each of the first 500 samples, and all other features (for the rest of the samples) get a 1
When I run this, I get training accuracy of around 65%, and validation accuracy around 0%. I was expecting both accuracies to be extremely high/almost perfect - is this incorrect? My thinking was that the features with extremely high values all have the same label (1), while the features with low values get a 0 label
Mostly I'm just wondering if my code/model is incorrect or whether my logic is wrong
I don't know that library, so I can't tell you if this is correctly implemented, but it looks legit.
I think your problem lies with the activation function: tanh(500)=1 and tanh(1)=0.76. This difference seems too small to me. Try using -1 instead of 500 for testing purposes, and normalize your real data to something around [-2, 2]. If you need the full range of real numbers, try using a linear activation function. If you only care about the positive half of the real numbers, I propose softplus or ReLU. I've checked and all those functions are provided with Keras.
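The saturation argument is easy to check with the standard library. The variable names below are only for illustration.

```python
import math

high = math.tanh(500)   # saturated: indistinguishable from 1.0
low = math.tanh(1)      # about 0.76
gap_original = high - low

# Using -1 instead of 500, as suggested, spreads the two classes much further apart.
gap_rescaled = math.tanh(1) - math.tanh(-1)
print(high, round(low, 4), round(gap_original, 4), round(gap_rescaled, 4))
```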
You can try thresholding your output too: an answer of 0.75 when expecting 1 and 0.25 when expecting 0 are both valid, but they may impact your accuracy.
Also, try tweaking your parameters. I can propose (basing on my own experience) that you'd use:
learning rate = 0.1
lambda in L2 = 0.2
number of epochs = 250 and bigger
batch size around 20-30
momentum = 0.1
learning rate decay about 10e-2 or 10e-3
I'd say that learning rate, number of epochs, momentum and lambda are the most important factors here - in order from most to least important.
PS. I've just spotted that you're initializing your weights uniformly (is that even a word? I'm not a native speaker...). I can't tell you why, but my intuition tells me that this is a bad idea. I'd go with random initial weights.

