Scaling of backpropagation - python

I am following this tutorial on NN and backpropagation.
I am new to Python and I am trying to convert the code to MATLAB.
Can someone kindly explain the following code line (from the tutorial):
delta3[range(num_examples), y] -= 1
In short, and if I am not mistaken, delta3 and y are vectors and num_examples is an integer.
It is my understanding that delta3 = probs - y, as in this Mathematics Stack Exchange entry (thank you #rayryeng). Why and when should I subtract 1?
Otherwise, can anybody direct me to an online site where I can simply run and follow the code? I was getting errors everywhere I tried to run it (including on my home PC):
"NameError: name 'sklearn' is not defined" (probably an import I am missing)

This line: delta3[range(num_examples), y] -= 1 is part of calculating the gradient of the softmax loss function. I refer you to this nice link that gives you more information on how this loss function is formulated and the intuition behind it: http://peterroelants.github.io/posts/neural_network_implementation_intermezzo02/.
In addition, I refer you to this post on Mathematics Stack Exchange that shows you how the gradient of the softmax loss is derived: https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function. Consider the first link as a deep dive whereas the second link is a tl;dr of the first link.
The gradient of the softmax loss function is the gradient of the output layer which you would need to propagate backwards into the layer before the output layer to continue with the backpropagation algorithm.
Summarizing the post linked above: if you calculate the gradient of the softmax loss for a training example, then for each class the gradient is simply the softmax value evaluated for that class, and you additionally need to subtract 1 from that value for the class the training example actually belongs to. Remember that the gradient of an example for a class i is equal to p_i - y_i, where p_i is the softmax score of class i for the example and y_i is the classification label using a one-hot encoding scheme. Specifically, y_i = 0 if i is not the true class of the example and y_i = 1 if it is. delta3 contains the gradient of the softmax loss function per example in your mini-batch. Specifically, it is a 2D matrix where the number of rows is the number of training examples, num_examples, and the number of columns is the total number of classes.
First we calculate the softmax scores for each training example and each class. Next, for each row of the gradient, we determine the column that corresponds to the true class of that example and subtract 1 from the score there. range(num_examples) generates a list from 0 up to num_examples - 1, and y contains the true class label per example. Therefore, each pair taken from range(num_examples) and y accesses the right row and column location from which to subtract 1, finalizing the gradient of the loss function.
Now, in the Mathematics Stack Exchange post, as well as in your understanding, the gradient is delta3 = probs - y. This assumes that y is a one-hot encoded matrix, meaning that y has the same size as probs and each row of y is all zeros except for the column index of the correct class, which is set to 1. Therefore, if you generated such a matrix y, subtracting it from probs is equivalent to simply accessing the right column in each row and subtracting 1 from the score.
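To make that equivalence concrete, here is a minimal NumPy sketch (the probabilities and labels are made up for illustration; probs, y and num_examples mirror the tutorial's names) showing that the fancy-indexing subtraction and the explicit one-hot subtraction give the same delta3:

import numpy as np

# Toy values: 4 examples, 3 classes (made-up numbers for illustration only)
num_examples = 4
probs = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.80, 0.10],
                  [0.30, 0.30, 0.40],
                  [0.25, 0.50, 0.25]])
y = np.array([0, 1, 2, 1])  # true class index per example

# Approach 1: fancy indexing, as in the tutorial
delta3 = probs.copy()
delta3[range(num_examples), y] -= 1

# Approach 2: build the one-hot matrix explicitly and subtract it
y_onehot = np.zeros_like(probs)
y_onehot[range(num_examples), y] = 1
delta3_alt = probs - y_onehot

print(np.allclose(delta3, delta3_alt))  # True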
In MATLAB you need to create linear indices to facilitate this subtraction. Specifically, use sub2ind to convert the row and column locations to linear indices, then access the gradient matrix and subtract 1 from those values.
Therefore:
ind = sub2ind(size(delta3), 1 : num_examples, y + 1);
delta3(ind) = delta3(ind) - 1;
In the Python tutorial you have linked, the class labels are assumed to run from 0 up to N-1, where N is the total number of classes. Be careful in MATLAB, where array indexing starts at 1; I have therefore added 1 to y in the above code to ensure that your labels start at 1 instead of 0. ind contains the linear indices of the row and column locations that we need to access, and we complete the subtraction using those indices.
If you were to formulate this using the knowledge that you gained from your edit, you would do this instead:
ymatrix = full(sparse(1 : num_examples, y + 1, 1, size(delta3, 1), size(delta3, 2)));
delta3 = probs - ymatrix;
ymatrix is the matrix I talked about: each row corresponds to an example and is all zeros except for the column of the class the example belongs to, which is 1. What you may not have seen before are the sparse and full functions. sparse lets you create a zero matrix and specify the row and column locations that are non-zero, as well as the values those locations take on. In this case, I access exactly one element per row, use the class ID of the example for the column, and set each of these locations to 1. Also remember that I add 1 because I assume your class IDs start from 0. Because this is a sparse matrix, I then convert it to full to give you a dense numeric matrix. This code is therefore equivalent in operation to the previous snippet, but the first way is more efficient, as you are not creating an additional matrix to facilitate the gradient computation; you modify the gradient in place instead.
As a sidenote, sklearn is the scikit-learn Python machine learning package, and the NameError means that the package is not installed. To install it, use pip or easy_install from your command line (the package is imported as sklearn but published under the name scikit-learn), so it's as simple as:
pip install scikit-learn
or:
easy_install scikit-learn
However, scikit-learn should not be required for you to run the above subtraction code. You do need NumPy though so make sure you have that package installed.
For pip:
pip install numpy
... and for easy_install:
easy_install numpy

How do I properly use shap decision plots and force plots with multiple regression targets?

I have a Keras neural network with 26 features and 100 targets I want to explain with the SHAP python library.
In order to plot the force plot, for instance, I do:
shap.force_plot(exp.expected_value[i], shap_values[j][k], x_val.columns)
Where:
exp.expected_value is a list of size 100 with the base values for each of my targets (at least that is my understanding). The index i refers to the i-th target, I assume.
shap_values refers to the Shapley values of all the features for each of the targets in each validation case. Therefore, j runs from 0 to 99 (i.e. the size of my targets) and k runs from 0 to the total number of validation cases.
What I find confusing is that i and j can actually be different and I do get a plot that looks OK. However, shouldn't they always be the same index? Shouldn't the i-th baseline target always be compared to the shap values of the i-th target?
Am I understanding the indices wrong?
i and j should be the same, because you're plotting how the i-th target is affected by the features, from the base value to the prediction:
shap.force_plot(exp.expected_value[i], shap_values[i][k], x_val.columns)
where:
i stands for the i-th target class
k stands for the k-th sample to be explained.
The reason is that exp.expected_value has shape [num_targets] and holds the base values that the SHAP values are added to, while shap_values, if converted to a numpy array, has shape [num_classes, num_samples, num_features].
So, e.g., to reconstruct the model output for the k-th datapoint in raw space, one would do:
shap_values[:,k,:].sum(1) + base_values
and for models using softmax to get to probability space one would do:
softmax(shap_values[:,k,:].sum(1) + base_values)
Note, this is assuming shap_values are of numpy array type.
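As a concrete illustration of that additivity, here is a small NumPy sketch with made-up arrays in the shapes described above (shap_values, base_values and the softmax helper are stand-ins for this example, not the SHAP API itself):

import numpy as np

# Made-up arrays with the shapes discussed above:
# shap_values -> [num_classes, num_samples, num_features], base_values -> [num_classes]
rng = np.random.default_rng(0)
num_classes, num_samples, num_features = 3, 5, 26
shap_values = rng.normal(size=(num_classes, num_samples, num_features))
base_values = rng.normal(size=num_classes)

k = 2  # sample to explain

# Raw-space reconstruction for sample k: one value per target/class
raw_k = shap_values[:, k, :].sum(axis=1) + base_values

def softmax(z):
    # Map raw values to probability space (for softmax models)
    e = np.exp(z - z.max())
    return e / e.sum()

print(raw_k)
print(softmax(raw_k))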
Please ask if something is not clear.

Multiple linear regression: appending an array of ones to the matrix of features (Python)

I'm currently learning the basics of data science online. In one of the sessions on multiple linear regression using Python, the tutor executed the step below to add an array of ones to the matrix of features; I did not understand why it is being added. Online forums mention that it is added so that the model (equation) has a constant offset. But why 1 and not some other value? Does the number of independent variables (3) have any impact on this value?
X -> matrix of features; number of rows in the data set: 50; number of independent variables: 3
X = np.append(arr = np.ones([50,1]).astype(int), values = X,axis=1)
To better explain, let's imagine you have only 1 feature and, say, 3 training examples.
Then, your parameters are:
Theta = [theta_0; theta_1]
And your input variables are:
x_1, x_2, x_3 (one value per training example)
If you want to realize a linear regression, you must compute the prediction for each training example i:
theta_0 + theta_1 * x_i
And if you need to vectorize the calculus (for efficiency and code readability), you want to compute the following matricial product:
X * Theta
However, by definition of the matricial product, the number of columns of matrix X should be the same as the number of rows of matrix Theta. Thus, to compute the product but leave the result unchanged, you add a column of ones to the left of matrix X:
X = [1 x_1; 1 x_2; 1 x_3]
Then, the result for each sample i is the following:
theta_0 * 1 + theta_1 * x_i = theta_0 + theta_1 * x_i
TLDR: You need to append a column of ones to X for the matricial product X*Theta to be defined. If you were adding any other coefficient c instead of 1, then your constant offset theta_0 would be multiplied by your coefficient c.
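Here is a minimal NumPy sketch of that idea with made-up numbers (x and theta are illustrative, not from the course): the column of ones makes X * Theta well defined and multiplies the intercept theta_0 by exactly 1 for every example.

import numpy as np

# Toy data: 3 training examples, 1 feature (made-up values)
x = np.array([[2.0], [3.0], [5.0]])   # shape (3, 1)
theta = np.array([1.5, 0.5])          # [theta_0 (intercept), theta_1]

# Prepend a column of ones, as in the course snippet
X = np.append(arr=np.ones([3, 1]).astype(int), values=x, axis=1)  # shape (3, 2)

predictions = X @ theta   # each row gives theta_0 * 1 + theta_1 * x_i
print(predictions)        # [2.5, 3.0, 4.0]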
Note that the cost function is the summation of errors between the predicted label and the actual label, which we want to minimise; the expression given above (theta_0 + theta_1 * x_i) is the hypothesis function, not the cost.

Training letter images to a neural network with full-batch training

Following this tutorial (pure Python with NumPy), I want to build a simple neural network (a perceptron, at the simplest level, for learning purposes) that can be trained to recognize the letter "A". In the tutorial's example, they build a network that learns the logical "AND" operator. In that case, we have some inputs (a 4*3 matrix) and one output (a 4*1 matrix):
Each time we subtract output matrix with input matrix and calculate the error and updating rate and so on.
Now I want to give an image as the input. In this case, what will my output be? How can I define that the image is an "A"? One solution is to define "1" as "A" and "0" as "not A", but if my output is a scalar, how can I subtract it from the hidden layer, calculate the error and update the weights? This tutorial uses "full-batch" training and multiplies the whole input matrix with the weight matrix, and I want to stick with this method. The final goal is to design a neural net that can recognize the letter "A" in the simplest possible form. I have no idea how to do this.
First off: great that you are trying to understand neural networks by programming them from scratch, instead of starting off with some complex library. Let me try to clear things up. Your understanding here:
Each time we subtract output matrix with input matrix and calculate the error and updating rate and so on.
is not really correct. In your example, the input matrix X is what you present to the input of your neural network. The output Y is what you want the network to produce for X: the first element Y[0] is the desired output for the first row of X, and so on. We often call this the "target vector". Now, to calculate the loss function (i.e. the error), we compare the output of the network (L2 in the linked example code) to the target vector Y. In words, we compare what we want the network to do (Y) to what it really does (L2). Then we take one step in a direction that brings the output closer to Y.
Now, if you want to use an image as the input, you should think of each pixel in the image as one input variable. Previously, we had two input variables, A and B, for which we wanted to calculate the target Y = A ∧ B.
Example:
If we take an 8-by-8 pixel image, we have 8*8 = 64 input variables. Thus, our input matrix X should be a matrix with 65 columns (the 64 pixels of the image plus 1 input as bias term, which is constantly 1) and one row per training example you have. E.g. if you have one image of each of the 26 letters, the matrix will contain 26 rows.
The output (target) vector Y should have one entry per row of X, i.e. 26 in the previous example. Each element of Y is 1 if the corresponding input row is an "A", and 0 if it is another letter. In our example, Y[0] would be 1 and Y[1:] would be 0.
Now, you can use the same code as before: the output L2 will be a vector containing the network's predictions, which you can then compare to Y as before.
tl;dr The key idea is to forget that an image is 2D, and store each input image as a vector.
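As a sketch of that reshaping (the pixel values and letter ordering are placeholders, assuming 8-by-8 images and one example per letter):

import numpy as np

# Suppose we have 26 images, one per letter, each 8x8 pixels.
# Random values stand in for real pixel data here.
images = np.random.rand(26, 8, 8)

# Forget the 2D structure: flatten each image into a row of 64 values,
# then append the bias input that is constantly 1.
X = np.hstack([images.reshape(26, 64), np.ones((26, 1))])   # shape (26, 65)

# Target vector: 1 for the "A" image (row 0 here), 0 for every other letter.
Y = np.zeros((26, 1))
Y[0] = 1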

python sklearn: what is the difference between "sklearn.preprocessing.normalize(X, norm='l2')" and "sklearn.svm.LinearSVC(penalty='l2')"

Here are two methods of normalization:
1: this one is used in data pre-processing: sklearn.preprocessing.normalize(X, norm='l2')
2: the other one is used in the classifier: sklearn.svm.LinearSVC(penalty='l2')
I want to know: what is the difference between them? Must both steps be used in a complete model, or is using just one of them enough?
These 2 are different things and you normally need them both in order to make a good SVC model.
1) The first one means that in order to scale (normalize) the data matrix X, each sample (row i, since axis=1 is the default) is divided by its L2 norm, which is just sqrt(sum(abs(X[i, :])**2)). This keeps the values of every sample on a comparable, bounded scale, without which some algorithms find it tough to converge.
2) Irrespective of how scaled (and small in values) your data is, there may still be outliers or some features (j) that are far too dominant, and your algorithm (LinearSVC()) may put too much trust in them when it shouldn't. This is where L2 regularization comes into play: apart from the loss the algorithm minimizes, a cost is applied to the coefficients so that they do not become too big. In other words, the squared coefficients beta[j]^2 become an additional cost in the SVC objective. How much that cost matters is controlled through C: in LinearSVC, C multiplies the data (hinge-loss) term, so a smaller C means relatively stronger regularization of the coefficients.
To sum up, the first one tells you by what value to scale each sample of the X matrix; the second one controls how much the coefficients themselves are allowed to burden the cost function.
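A minimal sketch combining both, on synthetic data (the numbers are made up; the manual row-norm line just demonstrates what normalize does with its default axis=1):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, normalize
from sklearn.svm import LinearSVC

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.integers(0, 2, size=100)

# Step 1: preprocessing.normalize scales each sample (row) to unit L2 norm
X_scaled = normalize(X, norm='l2')
manual = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))
print(np.allclose(X_scaled, manual))   # True

# Step 2: LinearSVC(penalty='l2') additionally penalizes large coefficients while fitting;
# a pipeline applies the same scaling at both train and predict time
model = make_pipeline(Normalizer(norm='l2'), LinearSVC(penalty='l2', C=1.0))
model.fit(X, y)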

Calculate Hits At metric in Theano

I am using Keras to build a recommender model. Because the item set is quite large, I'd like to calculate the Hits@N metric as a measure of accuracy. That is, if the observed item is in the top N predicted, it counts as a relevant recommendation.
I was able to build the hits-at-N function using NumPy. But as I try to port it into a custom loss function for Keras, I'm having problems with the tensors. Specifically, enumerating over a tensor works differently, and when I looked into the syntax to find something equivalent, I started to question the whole approach. It's sloppy and slow, reflective of my general Python familiarity.
import numpy as np

def hits_at(y_true, y_pred):  # NumPy version
    a = y_pred.argsort(axis=1)            # ascending sort by row, returning indices
    a = np.fliplr(a)                      # reverse to get descending order
    a = a[:, 0:10]                        # keep only the first 10 columns of each row
    Ybool = []                            # initialize 2D array of recommendation indicators
    for t, idx in enumerate(a):
        ybool = np.zeros(num_items + 1)   # zero fill; index 0 is reserved
        ybool[idx] = 1                    # flip the recommended items from 0 to 1
        Ybool.append(ybool)
    A = np.array(Ybool)                   # stack into a matrix (the original map() breaks in Python 3)
    right_sum = (A * y_true).max(axis=1)  # element-wise multiplication, then row-wise max
    right_sum = right_sum.sum()           # how many times did we score a hit?
    return right_sum / len(y_true)        # fraction of observations where we scored a hit
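A small usage example with made-up data (num_items here is the module-level item count the function relies on, as in the question):

import numpy as np

num_items = 50                              # global used inside hits_at
y_pred = np.random.rand(3, num_items + 1)   # made-up scores for 3 observations
y_true = np.zeros((3, num_items + 1))
y_true[0, 2] = 1                            # the item each observation actually contains
y_true[1, 40] = 1
y_true[2, 7] = 1

print(hits_at(y_true, y_pred))  # fraction of observations whose item lands in the top 10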
How should I approach this in a more compact, and tensor-friendly way?
Update:
I was able to get a version of Top 1 working. I based it loosely on the GRU4Rec description
import theano
import theano.tensor as T

def custom_objective(y_true, y_pred):
    y_pred_idx_sort = T.argsort(-y_pred, axis=1)[:, 0]  # index of the largest value in each row
    y_act_idx = T.argmax(y_true, axis=1)                # index of the top value in each row of the targets
    return T.cast(-T.mean(T.nnet.sigmoid(T.eq(y_pred_idx_sort, y_act_idx))), theano.config.floatX)
I just had to compare the array of top 1 predictions to the array of the actuals element-wise. And Theano has an eq() function to do that.
Independent of N, the number of possible values of your loss function is finite. Therefore it can't be differentiated in a sensible tensor way and you cannot use it as a loss function in Keras / Theano. You may try to use a Theano log loss restricted to the top-N entries instead.
UPDATE :
In Keras you may write your own loss functions. They have a declaration of the form:
def loss_function(y_pred, y_true):
Inside it, y_true and y_pred behave like arrays (symbolic tensors), so you can obtain a vector v which is 1 when a given example is in the top 500 and 0 otherwise, transform it into a Theano tensor constant vector, and apply it in this way:
return theano.tensor.nnet.binary_crossentropy(y_pred * v, y_true * v)
This should work correctly.
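For concreteness, a minimal sketch of that masking idea, assuming the 0/1 mask v is known up front and stored as a shared constant (the clipping is an addition to keep log() finite once the mask zeroes entries out; this is an illustration, not code from the original answer):

import numpy as np
import theano
import theano.tensor as T

# 0/1 mask over the item columns we care about (e.g. the top 500); made up here
v = theano.shared(np.array([1, 0, 1, 1, 0], dtype=theano.config.floatX))

def masked_log_loss(y_true, y_pred):
    # Restrict both prediction and target to the masked columns,
    # clip to avoid log(0), then take the mean binary crossentropy
    y_pred_masked = T.clip(y_pred * v, 1e-7, 1.0 - 1e-7)
    return T.nnet.binary_crossentropy(y_pred_masked, y_true * v).mean()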
UPDATE 2:
Log loss is the same thing as binary_crossentropy.
