Training letter images to a neural network with full-batch training - python

Following this tutorial (Pure Python with NumPy), I want to build a simple neural network (a perceptron, kept at the simplest level for learning purposes) that can be trained to recognize the letter "A". In the example proposed in that tutorial, they build a network that learns the logical "AND" operator. In that case, we have some inputs (a 4*3 matrix) and one output (a 4*1 matrix):
Each time we subtract the output matrix from the input matrix, calculate the error and the update rate, and so on.
Now I want to give an image as the input. In this case, what will my output be? How can I define that an image is an "A"? One solution is to define "1" as "A" and "0" as "not A", but if my output is a scalar, how can I subtract it from the hidden layer, calculate the error, and update the weights? The tutorial uses "full-batch" training and multiplies the whole input matrix by the weight matrix; I want to keep that method. The final goal is to design a neural net that can recognize the letter "A" in the simplest possible form. I have no idea how to do this.

First off: great that you are trying to understand neural networks by programming them from scratch instead of starting off with some complex library. Let me try to clear things up: your understanding here:
Each time we subtract the output matrix from the input matrix, calculate the error and the update rate, and so on.
is not really correct. In your example, the input matrix X is what you present to the input of your neural network. The output Y is what you want the network to produce for X: the first element Y[0] is the desired output for the first row of X, and so on. We often call this the "target vector". Now, to calculate the loss function (i.e. the error), we compare the output of the network (L2 in the linked example code) to the target vector Y. In words, we compare what we want the network to do (Y) to what it really does (L2). Then we take one step in a direction that brings L2 closer to Y.
Now, if you want to use an image as the input, you should think of each pixel in the image as one input variable. Previously, we had two input variables, A and B, for which we wanted the network to compute the target Y = A ∧ B.
Example:
If we take an 8-by-8 pixel image, we have 8*8 = 64 input variables. Thus, our input matrix X should have 65 columns (the 64 pixels of the image plus 1 bias input, which is constantly 1) and one row per training example. E.g. if you have one image of each of the 26 letters, the matrix will contain 26 rows.
The output (target) vector Y should have one element per row of X, i.e. 26 elements in the previous example. Each element of Y is 1 if the corresponding input row is an "A", and 0 if it is another letter. In our example, Y[0] would be 1 and Y[1:] would be 0.
Now you can use the same code as before: the output L2 will be a vector containing the network's predictions, which you can then compare to Y as before.
tl;dr The key idea is to forget that an image is 2D, and store each input image as a vector.
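For concreteness, here is a minimal full-batch sketch in the style of the linked tutorial. The array images, the random pixel data, and all other names are purely illustrative assumptions (26 binary 8-by-8 images, one per letter, with index 0 being the "A"):

import numpy as np

# hypothetical data: 26 images of 8x8 pixels, one per letter; index 0 is the "A"
images = np.random.randint(0, 2, size=(26, 8, 8)).astype(float)

# flatten each image into a 64-element row and append a constant bias input of 1
X = np.hstack([images.reshape(26, -1), np.ones((26, 1))])   # shape (26, 65)
Y = np.zeros((26, 1))
Y[0] = 1                                                     # 1 = "A", 0 = "not A"

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(1)
W = 2 * np.random.random((65, 1)) - 1                        # one weight per input column

for _ in range(10000):                                       # full-batch training
    L2 = sigmoid(X.dot(W))                                   # network output, shape (26, 1)
    error = Y - L2                                           # compare output to target
    W += X.T.dot(error * L2 * (1 - L2))                      # update all weights at once

print(sigmoid(X.dot(W))[0])                                  # close to 1 for the "A" image

After training, L2 is a 26-element column of predictions; thresholding it at 0.5 gives the "A"/"not A" decision for each image.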


Multiple linear regression: appending an array of ones to the matrix of features (Python)

I'm currently learning the basics of Data Science online. In one of the sessions on Multiple Linear Regression using Python, the tutor executed the step below to add an array of ones to the matrix of features; I did not understand why it is being added. Online forums mention that it is added so that the model (equation) has a constant offset. But why 1 and not any other value? Does the number of independent variables (3) have any impact on this value?
X -> matrix of features; number of rows in the data set: 50; number of independent variables: 3
X = np.append(arr = np.ones([50,1]).astype(int), values = X,axis=1)
To better explain, let's imagine you have only 1 feature stored, and let's say 3 training examples.
Then, your parameters are:
Theta = [theta_0, theta_1]
and your input variables are:
x^(1), x^(2), x^(3)  (one scalar feature per training example)
If you want to realize a linear classification, you must compute the cost function for each training example i:
J(x^(i)) = theta_0 + theta_1 * x^(i)
And if you want to vectorize the calculation (for efficiency and code readability), you compute the following matrix product:
X * Theta
However, by definition of the matrix product, the number of columns of matrix X must equal the number of rows of matrix Theta. Thus, to make the product defined while leaving the result unchanged, you add a column of ones to the left of matrix X:
X = [1  x^(1)
     1  x^(2)
     1  x^(3)]
Then, the result for each sample i is the following:
(X * Theta)[i] = 1 * theta_0 + x^(i) * theta_1 = theta_0 + theta_1 * x^(i)
TLDR: You need to append a column of ones to X for the matrix product X*Theta to be defined. If you were adding any other constant c instead of 1, then your constant offset theta_0 would be multiplied by that coefficient c.
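To make this concrete, here is a small NumPy sketch using a 50-row feature matrix as in the question; the data and the theta values are made up for illustration:

import numpy as np

# hypothetical feature matrix: 50 samples, 3 independent variables
X = np.random.rand(50, 3)

# prepend a column of ones so the intercept theta_0 takes part in the matrix product
X = np.append(arr=np.ones([50, 1]).astype(int), values=X, axis=1)   # shape (50, 4)

# one parameter per column of X: [theta_0, theta_1, theta_2, theta_3]
theta = np.array([0.5, 1.0, -2.0, 3.0])

# vectorized hypothesis for all samples: row i gives
# 1*theta_0 + x_i1*theta_1 + x_i2*theta_2 + x_i3*theta_3
predictions = X.dot(theta)                                          # shape (50,)
print(predictions[:5])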
I think the cost function is the summation of errors between the predicted label and the actual label which we want to minimise. The J function given above is the hypothesis function.

Scaling of backpropagation

I am following this tutorial on NN and backpropagation.
I am new to python and I am trying to convert the code to MATLAB.
Can someone kindly explain the following code line (from the tutorial) :
delta3[range(num_examples), y] -= 1
In short, and if I am not mistaken, delta3 and y are vectors and num_examples is an integer.
It is my understanding that delta3 = probs - y, as in this Mathematics Stack Exchange entry (thank you @rayryeng). Why and when should I subtract 1?
Otherwise, can anybody direct me to an online site where I can simply run and follow the code? I was getting errors everywhere I tried to run it (including on my home PC):
"NameError: name 'sklearn' is not defined" (probably an import I am missing)
This line: delta3[range(num_examples), y] -= 1 is part of calculating the gradient of the softmax loss function. I refer you to this nice link that gives you more information on how this loss function is formulated and the intuition behind it: http://peterroelants.github.io/posts/neural_network_implementation_intermezzo02/.
In addition, I refer you to this post on Mathematics Stack Exchange that shows you how the gradient of the softmax loss is derived: https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function. Consider the first link as a deep dive whereas the second link is a tl;dr of the first link.
The gradient of the softmax loss function is the gradient of the output layer which you would need to propagate backwards into the layer before the output layer to continue with the backpropagation algorithm.
Summarizing the post linked above: if you calculate the gradient of the softmax loss for a training example, then for each class the gradient is simply the softmax value evaluated for that class, except that for the class the example actually belongs to you additionally subtract 1. Remember that the gradient of an example for a class i is equal to p_i - y_i, where p_i is the softmax score of class i for the example and y_i is the classification label under a one-hot encoding scheme. Specifically, y_i = 0 if i is not the true class of the example and y_i = 1 if it is. delta3 contains the gradient of the softmax loss function per example in your mini-batch. Specifically, it is a 2D matrix where the number of rows equals the number of training examples, num_examples, and the number of columns equals the number of classes.
First we calculate the softmax scores for each training example and each class. Next, for each row of the gradient, we determine the column that corresponds to the true class of that example and subtract 1 from that score. range(num_examples) generates a list from 0 up to num_examples - 1, and y contains the true class label per example. Therefore, each pair taken from range(num_examples) and y accesses the right row and column location from which to subtract 1, finalizing the gradient of the loss function.
Now, in the Mathematics Stack Exchange post, as well as in your understanding, the gradient is delta3 = probs - y. This assumes that y is a one-hot encoded matrix, meaning that y has the same size as probs and, for each row of y, every entry is zero except for the column index of the correct class, which is set to 1. Therefore, if you generate a matrix y where, for each row, all columns are zero except for the class that example belongs to, subtracting it is equivalent to simply accessing the right column in each row and subtracting 1 from the score.
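Here is a small NumPy sketch of that equivalence (the numbers are made up; this is not the tutorial's exact code):

import numpy as np

num_examples, num_classes = 4, 3
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.5, 0.3]])              # softmax scores per example
y = np.array([0, 1, 2, 1])                       # true class label per example

# method 1: in-place fancy indexing, as in the tutorial
delta3 = probs.copy()
delta3[range(num_examples), y] -= 1

# method 2: build a one-hot matrix and subtract it
y_onehot = np.zeros((num_examples, num_classes))
y_onehot[range(num_examples), y] = 1
delta3_alt = probs - y_onehot

print(np.allclose(delta3, delta3_alt))           # True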
In MATLAB you actually need to create linear indices to facilitate this subtraction. Specifically, you use sub2ind to convert the row and column locations to linear indices, and then you can access the gradient matrix and subtract 1 from those values.
Therefore:
ind = sub2ind(size(delta3), 1 : num_examples, y + 1);
delta3(ind) = delta3(ind) - 1;
In the Python tutorial you have linked, the class labels are assumed to run from 0 up to N-1, where N is the total number of classes. Be careful in MATLAB, where array indexing starts at 1, so I have added 1 to y in the above code to ensure that your labels start at 1 instead of 0. ind contains the linear indices of the row and column locations we need to access, and we complete the subtraction using those indices.
If you were to formulate this using the knowledge that you gained from your edit, you would do this instead:
ymatrix = full(sparse(1 : num_examples, y + 1, 1, size(delta3, 1), size(delta3, 2)));
delta3 = probs - ymatrix;
ymatrix is the matrix I talked about, where each row corresponds to an example and is all zeros except for the column pertaining to the class the example belongs to, which is 1. What you may not have seen before are the sparse and full functions. sparse lets you create a zero matrix for which you specify the row and column locations that are non-zero, as well as the values those locations take on. In this case, I access exactly one element per row, use the class ID of the example to pick the column, and set each of those locations to 1. Again, I add 1 because I'm assuming your class IDs start from 0. Because this is a sparse matrix, I then convert it to full to give you a numeric matrix rather than the sparse representation. This code is therefore equivalent in operation to the previous snippet, but the first way is more efficient: it modifies the gradient in place instead of creating an additional matrix to facilitate the gradient computation.
As a side note, sklearn is the scikit-learn Python machine learning package, and the NameError means you don't have that package installed. To install it, use pip or easy_install; in your command line, it's as simple as:
pip install sklearn
or:
easy_install sklearn
However, scikit-learn should not be required to run the above subtraction code. You do need NumPy, though, so make sure you have that package installed.
For pip:
pip install numpy
... and for easy_install:
easy_install numpy

python sklearn: what is the difference between "sklearn.preprocessing.normalize(X, norm='l2')" and "sklearn.svm.LinearSVC(penalty='l2')"

Here are two methods involving "l2":
1: this one is used in data pre-processing: sklearn.preprocessing.normalize(X, norm='l2')
2: the other is used in the classifier: sklearn.svm.LinearSVC(penalty='l2')
I want to know: what is the difference between them? Must both steps be used in a complete model, or is using just one of them enough?
These two are different things, and you normally need both of them to make a good SVC model.
1) The first one means that, to scale (normalize) the data matrix X, each sample (row) is divided by its L2 norm, i.e. sqrt(sum(abs(X[i,:]).^2)) for row i (this is the default axis=1 behaviour of sklearn.preprocessing.normalize; pass axis=0 to normalize feature columns instead). This keeps the values from becoming too big, which makes it hard for some algorithms to converge.
2) Irrespective of how well scaled (and small in value) your data is, there may still be outliers, or some features (j) may be far too dominant, and your algorithm (LinearSVC()) may trust them more than it should. This is where L2 regularization comes into play: in addition to the loss the algorithm minimizes, a cost is applied to the coefficients so that they do not become too big. In other words, the model coefficients become an extra term in the SVC cost function, a penalty of the form sum over j of beta[j]^2. How heavily that penalty counts is governed by the regularization setting; in LinearSVC this is the C parameter (C scales the data-fit term, so a smaller C means stronger regularization).
To sum up: the first one decides what each sample of the X matrix is divided by; the second decides how much a large coefficient burdens the cost function.
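A short, illustrative sketch of how the two typically appear together (the data here is random and all names are arbitrary):

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# hypothetical data: 100 samples, 5 features, binary labels
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# step 1: pre-processing - scale each sample to unit L2 norm
X_scaled = normalize(X, norm='l2')

# step 2: classification - LinearSVC with an L2 penalty on the coefficients;
# C controls how strongly large coefficients are discouraged (smaller C = stronger regularization)
clf = LinearSVC(penalty='l2', C=1.0)
clf.fit(X_scaled, y)
print(clf.coef_)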

How to change parameters of a scikit learn function dynamically i.e. find best parameter

I am trying to do dimensionality reduction using PCA function of sklearn, specifically
from sklearn.decomposition import PCA
def mypca(X,comp):
pca = PCA(n_components=comp)
pca.fit(X)
PCA(copy=True, n_components=comp, whiten=False)
Xpca = pca.fit_transform(X)
return Xpca
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
I am calling the mypca function from a loop with different values for comp. I am doing this in order to find the best value of comp for the problem I am trying to solve. But the mypca function always returns the same value, i.e. Xpca, irrespective of the value of comp.
The value it returns is correct for the first value of comp I send from the loop, i.e. the Xpca value it sends each time is correct for comp = 10 in my case.
What should I do in order to find the best value of comp?
You use PCA to reduce the dimension.
From your code:
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
Your input dataset X is only a 2-dimensional array, and the minimum n_comp is 10, so PCA tries to find the 10 best dimensions for you. Since 10 > 2, you will always get the same answer. :)
It looks like you're trying to pass different values for number of components, and re-fit with each. A great thing about PCA is that it's actually not necessary to do this. You can fit the full number of components (even as many components as dimensions in your dataset), then simply discard the components you don't want (i.e. those with small variance). This is equivalent to re-fitting the entire model with fewer components. Saves a lot of computation.
How to do it:
# x = input data, size(<points>, <dimensions>)
# fit the full model
max_components = x.shape[1] # as many components as input dimensions
pca = PCA(n_components=max_components)
pca.fit(x)
# transform the data (contains all components)
y_all = pca.transform(x)
# keep only the top k components (with greatest variance)
k = 2
y = y_all[:, 0:k]
In terms of how to select the number of components, it depends what you want to do. One standard way of choosing the number of components k is to look at the fraction of variance explained (R^2) by each choice of k. If your data is distributed near a low-dimensional linear subspace, then when you plot R^2 vs. k, the curve will have an 'elbow' shape. The elbow will be located at the dimensionality of the subspace. It's good practice to look at this curve because it helps understand the data. Even if there's no clean elbow, it's common to choose a threshold value for R^2, e.g. to preserve 95% of the variance.
Here's how to do it (this should be done on the model with max_components components):
# Calculate fraction of variance explained
# for each choice of number of components
r2 = pca.explained_variance_.cumsum() / x.var(0).sum()
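For example, to keep just enough components to preserve 95% of the variance (a sketch that reuses the r2 and y_all arrays computed above):

import numpy as np

# index of the first k at which the cumulative fraction of variance reaches 0.95
k = int(np.searchsorted(r2, 0.95)) + 1
y = y_all[:, 0:k]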
Another way you might want to proceed is to take the PCA-transformed data and feed it to a downstream algorithm (e.g. classifier/regression), then select your number of components based on the performance (e.g. using cross validation).
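A sketch of that approach with a recent scikit-learn, using a hypothetical classifier and made-up data (the parameter grid is purely illustrative):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

x = np.random.rand(100, 30)                  # hypothetical data: 100 points, 30 dimensions
labels = np.random.randint(0, 2, size=100)   # hypothetical binary labels

# try several numbers of components and keep the one with the best cross-validated accuracy
pipe = Pipeline([('pca', PCA()), ('clf', LogisticRegression())])
grid = GridSearchCV(pipe, param_grid={'pca__n_components': [2, 5, 10, 20]}, cv=5)
grid.fit(x, labels)
print(grid.best_params_)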
Side note: Maybe just a formatting issue, but your code block in mypca() should be indented, or it won't be interpreted as part of the function.

Indexing a tensor in the 3rd dimension

I have a batch of N sequences of integers of length L, which is embedded into an N*L*d tensor. This sequence is auto-encoded by my network architecture. So, I have:
from theano import tensor as T
X = T.imatrix('X') # N*L elements in [0,C]
EMB = T.tensor3('Embedding') # N*L*d
... # some code goes here :-)
PY = T.tensor3('PY') # N*L*C probability of the predicted class in [0,C]
cost = -T.log(PY[X])
As far as I can tell, this kind of indexing only works on the first dimension of the tensor, so I had to use theano.scan. Is there a way to index the tensor directly?
Sounds like you want a 3 dimensional version of theano.tensor.nnet.categorical_crossentropy?
If so, then I think you could simply flatten the matrix of true class label indexes into a vector, reshape the 3D tensor of predicted class probabilities into a matrix, and then use the built-in function:
cost = T.nnet.categorical_crossentropy(
    PY.reshape((PY.shape[0] * PY.shape[1], PY.shape[2])),
    X.flatten())
The order of entries in PY may need to be adjusted first (e.g. via a dimshuffle) to make sure the entries in the reshaped matrix and the flattened vector being compared correspond to each other.
Here we assume, as the question suggests, that the sequences are not padded -- they are all exactly L elements in length. If the sequences are actually padded then you may need to do something much more complicated to avoid computing cost elements inside the padding regions.
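To illustrate what the flattening achieves, here is a NumPy sketch that computes the same per-position cost in two ways (toy shapes and random data; NumPy's advanced indexing plays the role of the reshape/flatten above):

import numpy as np

N, L, C = 2, 3, 4                            # toy batch size, sequence length, class count
PY = np.random.rand(N, L, C)
PY /= PY.sum(axis=2, keepdims=True)          # per-position class probabilities
X = np.random.randint(0, C, size=(N, L))     # true class index per position

# direct 3D indexing: pick the probability of the true class at every (n, l)
cost_direct = -np.log(PY[np.arange(N)[:, None], np.arange(L)[None, :], X])

# flattened version, mirroring the categorical_crossentropy approach above
PY_flat = PY.reshape(N * L, C)
X_flat = X.flatten()
cost_flat = -np.log(PY_flat[np.arange(N * L), X_flat]).reshape(N, L)

print(np.allclose(cost_direct, cost_flat))   # True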
