I am using Keras to build a recommender model. Because the item set is quite large, I'd like to calculate the Hits @ N metric as a measure of accuracy; that is, if the observed item is in the top N predicted, it counts as a relevant recommendation.
I was able to build the hits-at-N function using numpy, but as I try to port it into a custom loss function for Keras, I'm having problems with the tensors. Specifically, enumerating over a tensor doesn't work the same way, and when I looked into the syntax for an equivalent, I started to question the whole approach. My version is sloppy and slow, reflective of my general Python familiarity.
def hits_at(y_true, y_pred): #numpy version
    a = y_pred.argsort(axis=1) #ascending, sort by row, return index
    a = np.fliplr(a) #reverse to get descending
    a = a[:, 0:10] #keep only the first 10 columns of each row
    Ybool = [] #initialize 2D array
    for t, idx in enumerate(a):
        ybool = np.zeros(num_items + 1) #zero fill; 0 index is reserved
        ybool[idx] = 1 #flip the recommended items from 0 to 1
        Ybool.append(ybool)
    A = np.array(Ybool) #stack the rows into a 2D array
    right_sum = (A * y_true).max(axis=1) #element-wise multiplication, then find the max per row
    right_sum = right_sum.sum() #how many times did we score a hit?
    return right_sum / len(y_true) #fraction of observations where we scored a hit
How should I approach this in a more compact, and tensor-friendly way?
Update:
I was able to get a version of Top 1 working. I based it loosely on the GRU4Rec description
def custom_objective(y_true, y_pred):
    y_pred_idx_sort = T.argsort(-y_pred, axis=1)[:, 0] #index of the largest predicted value in each row
    y_act_idx = T.argmax(y_true, axis=1) #index of the top value in each row of the actuals
    return T.cast(-T.mean(T.nnet.sigmoid(T.eq(y_pred_idx_sort, y_act_idx))), theano.config.floatX)
I just had to compare the array of top-1 predictions to the array of actuals element-wise, and Theano has an eq() function to do that.
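Extending the same idea to top N looks possible with slicing and a broadcasted comparison. Here is a sketch along those lines (my own, untested; it assumes the same T and theano imports as above):
def hits_at_n(y_true, y_pred, n=10):
    top_n = T.argsort(-y_pred, axis=1)[:, :n] #indices of the n largest predictions per row
    y_act_idx = T.argmax(y_true, axis=1).dimshuffle(0, 'x') #true index as a column vector
    hits = T.any(T.eq(top_n, y_act_idx), axis=1) #1 where the true item appears in the top n
    return T.mean(T.cast(hits, theano.config.floatX))
Note that, like the top-1 version, this is really a metric rather than something with useful gradients, which is exactly the issue raised in the answer below.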
Independent of N, the number of possible values of your loss function is finite. Therefore it can't be differentiable in any sensible tensor way and you cannot use it as a loss function in Keras / Theano. You may instead try a Theano log loss restricted to the top N items.
UPDATE :
In Keras you may write your own loss functions. They have a declaration of the form:
def loss_function(y_true, y_pred):
Both y_true and y_pred are tensors, so you may easily obtain a vector v which is 1 when a given example is in the top 500 and 0 otherwise, and apply it in this way:
return theano.tensor.nnet.binary_crossentropy(y_pred * v, y_true * v)
This should work correctly.
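A minimal sketch of that idea, assuming a Theano backend (the function name and the double-argsort ranking trick are my additions, untested):
import theano
import theano.tensor as T

def top_n_crossentropy(y_true, y_pred, n=500):
    #double argsort turns scores into ranks: rank 0 is the highest score
    ranks = T.argsort(T.argsort(-y_pred, axis=1), axis=1)
    v = T.cast(ranks < n, theano.config.floatX) #1 where an item is in the top n
    p = T.clip(y_pred * v, 1e-7, 1 - 1e-7) #keep the log inside binary_crossentropy finite
    return T.nnet.binary_crossentropy(p, y_true * v).mean()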
UPDATE 2:
Log loss is the same thing as binary_crossentropy.
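For a single prediction p with true label y in {0, 1}, both names refer to the quantity -(y*log(p) + (1 - y)*log(1 - p)), averaged over the examples.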
Related
I am writing a program that uses NumPy to calculate the accuracy between testing and training points, but I am not sure how to use vectorized functions in place of the for loops I have in my code.
Here is my code (is there a way to simplify it so that I do not need any loops?):
#command to import NumPy package
import numpy as np
iris_train=np.genfromtxt("iris-train-data.csv",delimiter=',',usecols=(0,1,2,3),dtype=float)
iris_test=np.genfromtxt("iris-test-data.csv",delimiter=',',usecols=(0,1,2,3),dtype=float)
train_cat=np.genfromtxt("iris-training-data.csv",delimiter=',',usecols=(4),dtype=str)
test_cat=np.genfromtxt("iris-testing-data.csv",delimiter=',',usecols=(4),dtype=str)
correct = 0
for i in range(len(iris_test)):
    n = 0
    old_distance = float('inf')
    while n < len(iris_train):
        #finding the difference between test and train point
        iris_diff = (abs(iris_test[i] - iris_train[n])**2)
        #summing up the calculated differences
        iris_sum = sum(iris_diff)
        new_distance = float(np.sqrt(iris_sum))
        #if statement to update distance
        if new_distance < old_distance:
            index = n
            old_distance = new_distance
        n += 1
    print(i + 1, test_cat[i], train_cat[index])
    if test_cat[i] == train_cat[index]:
        correct += 1
accuracy = ((correct)/float((len(iris_test)))*100)
print(f"Accuracy:{accuracy: .2f}%")
The trick with computing the distances is to insert extra dimensions using numpy.newaxis and use broadcasting to compute a matrix with the distance from every testing sample to every training sample in one vectorized operation. Using numpy's broadcasting rules, diff has shape (num_test_samples, num_train_samples, num_features), and distance has shape (num_test_samples, num_train_samples) since we summed along the last axis in the call to numpy.sum.
Then you can use numpy.argmin to find the index of the closest training sample for every testing sample. index has shape (num_test_samples, ) since we did the reduction operation along the last axis of distance.
Finally, you can use index to select the training classification closest to each testing classification. The == operator builds a boolean array marking where the two agree, and since True is cast to 1 and False to 0, summing this boolean array gives the number of correct classifications.
# Compute the distance from every training sample to every testing sample
# Note that `np.sqrt` is not necessary since sqrt is a monotonically
# increasing function -- removing it doesn't change the answer
diff = iris_test[:, np.newaxis] - iris_train[np.newaxis, :]
distance = np.sqrt(np.sum(np.square(diff), axis=-1))
# Compute the index of the closest training sample to the testing sample
index = np.argmin(distance, axis=-1)
# Check if class of the closest training sample matches the class
# of the testing sample
correct = (test_cat == train_cat[index]).sum()
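To report the same accuracy figure as the original code, divide by the number of test samples:
accuracy = correct / len(iris_test) * 100
print(f"Accuracy:{accuracy: .2f}%")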
If I understand correctly what you are doing (though I don't really need to, to answer the question): for each vector of iris_test, you are searching for the closest one in iris_train, closest here in the sense of Euclidean distance.
So you have three nested loops (pseudo-Python):
for u in iris_test:
    for v in iris_train:
        s = 0
        for i in range(dimensionOfVectors):
            s += (u[i] - v[i])**2
        dist = sqrt(s)
You are right to try to get rid of Python loops, and the most important one to eliminate is the inner one. You already got rid of that one, since the inner loop of my pseudo-code is, in your code, implicit in:
iris_diff = (abs(iris_test[i] - iris_train[n])**2)
and
iris_sum = sum(iris_diff)
Both of those lines iterate over all dimensions of your vectors, but they do it in numpy's internal code rather than in Python, so it is fast.
One may object that you don't really need the abs when you follow it with **2, and that you could have called the np.linalg.norm function, which does all those operations in one call:
new_distance = np.linalg.norm(iris_test[i] - iris_train[n])
which is faster than your code. But at least, in your code, the loop over all components of the vectors is already vectorized.
The next stage is to vectorize the middle loop.
That also can be accomplished. Instead of computing one by one
new_distance = np.linalg.norm(iris_test[i]-iris_train[n])
you could compute, in one call, all len(iris_train) distances between iris_test[i] and every row of iris_train:
new_distances = np.linalg.norm(iris_test[i]-iris_train, axis=1)
The trick here lies in numpy broadcasting and the axis parameter:
broadcasting means that you can compute the difference between a 1D vector of length W and a 2D n×W array (iris_test[i] is a 1D vector, and iris_train is a 2D array whose number of columns equals the length of iris_test[i]). In such a case, numpy broadcasts the first operand and returns a 2D n×W array as the result, whose row k is iris_test[i] - iris_train[k].
Calling np.linalg.norm on that n×W 2D array would return a single float (the norm of the whole matrix), unless you restrict the norm to the second axis (axis=1), in which case it returns n floats, each of them the norm of one row.
In other words, after the previous line of code, new_distances[k] is the distance between iris_test[i] and iris_train[k].
Once that is done, you can easily find the k for which this distance is smallest, using np.argmin.
np.argmin(new_distances) is the index of the smallest of the distances.
So, all together, your code could be rewritten as:
correct = 0
for i in range(len(iris_test)):
    new_distances = np.linalg.norm(iris_test[i] - iris_train, axis=1)
    index = np.argmin(new_distances)
    #printing out classifications
    print(i + 1, test_cat[i], train_cat[index])
    if test_cat[i] == train_cat[index]:
        correct += 1
The custom loss function we are designing contains a few non-differentiable operations, e.g. histogram creation and counting values greater than a threshold. We were hoping that PyTorch Autograd could automatically generate approximate derivatives for these operations.
Let me give an overview of our attempts at both, and ask for insights into whether these workarounds have any chance of working with PyTorch Autograd or any other PyTorch-compatible library.
Operation 1: Counting values greater than a threshold.
We need the count of values greater than a threshold, per row. We tried to come up with a workaround, but we suspect it may not work, based on your reply.
The workaround :
#In = [[1, 2, 3, 4, 5, 6, 7],
#      [1, 5, 4, 2, 6, 11, 2],
#      [0, 0, 3, 4, 8, 7, 11]]
#th = 5
#then,
#Out = [[2], [2], [3]] # Required
Diff = 1 - In
Required_Indexes = In > th
NotRequired_Indexes = In <= th
In[NotRequired_Indexes] = 0 #force the values at or below the threshold to 0; gradients won't flow for these values, but that might be ok if the gradients of the other values can still flow
Diff[NotRequired_Indexes] = 0
In = In + Diff #every surviving value becomes exactly 1
Out = torch.sum(In, 1) # Desired Output
Should this work? Is there a better and cleaner way to do it? Or are such workarounds doomed to fail with PyTorch Autograd, at least?
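If an exact count is not required, a common alternative (my suggestion, not part of the attempt above; the function name and temperature parameter are mine) is to replace the hard comparison with a steep sigmoid, which keeps gradients alive for every entry:
import torch

def soft_count_gt(x, th, temperature=0.1):
    #smooth surrogate for the per-row count of entries greater than th;
    #as temperature approaches 0, this approaches the exact count
    return torch.sigmoid((x - th) / temperature).sum(dim=1)

x = torch.tensor([[1., 2., 3., 4., 5., 6., 7.]], requires_grad=True)
out = soft_count_gt(x, th=5.0)
out.sum().backward() #gradients flow, unlike with a hard comparison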
Operation 2: Histogram.
We need to take a weighted sum of the histogram of a tensor. As the histogram is not differentiable, the workaround we have used is:
We already know the range of values in the tensor (named OutTensor below):
New_empty_A # Tensor of same size as OutTensor
New_empty_B # Tensor of same size as OutTensor
Weights     # Contains the weight of each bin at its location
# For example, if OutTensor can have only 3 values: 0, 1, 2 then:
OutTensor = OutTensor + 1 #shift the values by 1 first to get rid of 0s
Weights = [0, weight_0/1, weight_1/2, weight_2/3]
for UniqueValue in OutTensor: # run a loop through all possible values of OutTensor
    New_empty_A[OutTensor == UniqueValue] = Weights[UniqueValue]
New_empty_B = New_empty_A * OutTensor # we have already divided each weight by its corresponding value, so after this operation New_empty_B holds the actual weights and we can simply sum it
Final_Out = torch.sum(New_empty_B)
Does this workaround make any sense with the mechanics of Autograd?
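A differentiable alternative that avoids the loop entirely (my suggestion, not the workaround above; the function name, sigma, bin_centers, and bin_weights are assumptions) is a "soft histogram": give each value a Gaussian membership in every bin, then take the weighted sum of the soft counts:
import torch

def soft_histogram_weighted_sum(x, bin_centers, bin_weights, sigma=0.5):
    #memberships[i, j]: how strongly x[i] belongs to bin j (fully differentiable)
    memberships = torch.exp(-((x.unsqueeze(1) - bin_centers) ** 2) / (2 * sigma ** 2))
    memberships = memberships / memberships.sum(dim=1, keepdim=True)
    soft_counts = memberships.sum(dim=0) #differentiable stand-in for the histogram
    return (soft_counts * bin_weights).sum() #weighted sum of the soft histogram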
Good night,
I would like to use a Nearest Neighbors model for regression with non-uniform weights. I saw in the User Guide that I can use weights='distance' in the declaration of the model, which makes the weights inversely proportional to the distance, but the results I got were not what I wanted.
I saw in the documentation that I can pass a function for the weights (given the distances) used in the prediction, so I created the following function:
from sklearn.neighbors import KNeighborsRegressor
import numpy
nparray = numpy.array
def customized_weights(distances: nparray)->nparray:
    for distance in distances:
        if (distance >= 100 or distance <= -100):
            yield 0
        yield (1 - abs(distance)/100)
And have declared the method like this:
knn: KNeighborsRegressor = KNeighborsRegressor(n_neighbors=50, weights=customized_weights).fit(X_train, y_train)
Up to that point, everything works fine. But when I try to predict with the model, I get the error:
File "knn_with_weights.py", line 14, in customized_weights
if (distance >= 100 or distance <= -100):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I do not understand what I did wrong. The documentation says my function should take an array of distances as its parameter and return the equivalent weights. What have I done wrong?
Thanks in advance.
I don't know much about this type of regression, but it is certainly possible that the distances passed in are a 2-dimensional data structure, which would make sense for all pairwise distances.
Why don't you put a little teaser print statement into your custom function to print both distances and distances.shape
@Jeff H's tip pointed me to the answer.
The input parameter of this function is a two dimensional numpy array distances with shape (predictions, neighbors), where:
predictions is the number of desired predictions (the number of query points passed when you call knn.predict);
neighbors, the number of neighbors used (in my case, n_neighbors=50).
Each element distances[i, j] holds the distance for the i-th prediction from its j-th nearest neighbor (the smaller j is, the smaller the distance).
The function must return an array with the same dimensions as the input array, with the weight corresponding to each distance.
I do not know if it is the fastest way, but I came up with this solution:
def customized_weights(distances: nparray)->nparray:
    # create a new array 'weights' with the same dimensions as 'distances',
    # filled with zeros
    weights: nparray = nparray(numpy.full(distances.shape, 0), dtype='float')
    for i in range(distances.shape[0]): # for each prediction:
        if distances[i, 0] >= 100: # if the smallest distance is greater than 100,
                                   # give the nearest neighbor weight 1 and
                                   # leave the other neighbors' weights at zero,
            weights[i, 0] = 1
            # then continue to the next prediction
            continue
        for j in range(distances.shape[1]): # apply the weight function to each distance
            if distances[i, j] >= 100:
                continue
            weights[i, j] = 1 - distances[i, j]/100
    return weights
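For what it's worth, the same logic can also be written without explicit loops. A sketch that should be equivalent (the function name is mine, and it relies on the fact that each row of distances is sorted ascending):
def customized_weights_vectorized(distances: nparray) -> nparray:
    weights = numpy.clip(1 - distances/100, 0, None) # linear falloff, 0 at or beyond 100
    # if even the nearest neighbor is out of range, the whole row is already zero;
    # fall back to giving the nearest neighbor weight 1, as in the loop version
    too_far = distances[:, 0] >= 100
    weights[too_far, 0] = 1
    return weights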
I am having trouble understanding the output of my function implementing multiple ridge regression. I am doing this from scratch in Python, using the closed form of the method: w = (XᵀX + λI)⁻¹ Xᵀy.
I have a training set X that is 100 rows x 10 columns and a vector y that is 100x1.
My attempt is as follows:
def ridgeRegression(xMatrix, yVector, lambdaRange):
    wList = []
    for i in range(1, lambdaRange+1):
        lambVal = i
        # compute the inner values (X.T X + lambda I)
        xTranspose = np.transpose(xMatrix)
        xTx = xTranspose @ xMatrix
        lamb_I = lambVal * np.eye(xTx.shape[0])
        # invert inner, e.g. (inner)**(-1)
        inner_matInv = np.linalg.inv(xTx + lamb_I)
        # compute outer (X.T y)
        outer_xTy = np.dot(xTranspose, yVector)
        # multiply together
        w = inner_matInv @ outer_xTy
        wList.append(w)
    print(wList)
    return wList
For testing, I am running it with the first 5 lambda values.
wList becomes 5 numpy.arrays each of length 10 (I'm assuming for the 10 coefficients).
Here is the first of those 5 arrays:
array([ 0.29686755, 1.48420319, 0.36388528, 0.70324668, -0.51604451,
2.39045735, 1.45295857, 2.21437745, 0.98222546, 0.86124358])
My question, and clarification:
Shouldn't there be 11 coefficients, (1 for the y-intercept + 10 slopes)?
How do I get the Minimum Square Error from this computation?
What comes next if I wanted to plot this line?
I think I am just really confused as to what I'm looking at, since I'm still working on my linear algebra.
Thanks!
First, I would modify your ridge regression to look like the following:
import numpy as np

def ridgeRegression(X, y, lambdaRange):
    wList = []
    # Get normal form of `X`
    A = X.T @ X
    # Get identity matrix
    I = np.eye(A.shape[0])
    # Get right hand side
    c = X.T @ y
    for lambVal in range(1, lambdaRange+1):
        # Set up equations Bw = c
        lamb_I = lambVal * I
        B = A + lamb_I
        # Solve for w
        w = np.linalg.solve(B, c)
        wList.append(w)
    return wList
Notice that I replaced your inv call to compute the matrix inverse with an implicit solve. This is much more numerically stable, which is an important consideration for these types of problems especially.
I've also moved the A = X.T @ X computation, the identity matrix I generation, and the right-hand-side vector c = X.T @ y out of the loop; these don't change within the loop and are relatively expensive to compute.
As was pointed out by @qwr, the number of columns of X will determine the number of coefficients you have. You have not described your model, so it's not clear how the underlying domain, x, is structured into X.
Traditionally, one might use polynomial regression, in which case X is the Vandermonde Matrix. In that case, the first coefficient would be associated with the y-intercept. However, based on the context of your question, you seem to be interested in multivariate linear regression. In any case, the model needs to be clearly defined. Once it is, then the returned weights may be used to further analyze your data.
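For instance, if the model were a cubic polynomial over a 1D domain x, the design matrix could be built directly (a hypothetical example, since your data appear to be multivariate):
X = np.vander(x, N=4, increasing=True) # columns: 1, x, x**2, x**3; the first weight is the y-intercept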
Typically, to make the notation more compact, the matrix X contains a column of ones for the intercept, so if you have p predictors the matrix has dimensions n by p+1. See the Wikipedia article on linear regression for an example.
To compute in-sample MSE, use the definition for MSE: the average of squared residuals. To compute generalization error, you need cross-validation.
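Concretely, a sketch using the ridgeRegression helper defined earlier in this thread (X1 and the lambda choice are just for illustration; note this also penalizes the intercept, which one often avoids in practice):
X1 = np.hstack([np.ones((X.shape[0], 1)), X]) # n x (p+1) design matrix with intercept
w = ridgeRegression(X1, y, 5)[0]              # weights for lambda = 1
mse = np.mean((y - X1 @ w) ** 2)              # in-sample mean squared error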
Also, you shouldn't take lambVal as an integer. It can be small (close to 0) if the aim is just to avoid numerical error when xTx is ill-conditioned.
I would advise you to use a logarithmic range instead of a linear one, starting from 0.001 and going up to 100 or more if you want to. For instance, you can change your code to this:
powerMin = -3
powerMax = 3
for i in range(powerMin, powerMax):
    lambVal = 10**i
    print(lambVal)
And then you can try a smaller range or a linear range once you figure out what is the correct order of lambVal with your data from cross-validation.
I am following this tutorial on NN and backpropagation.
I am new to Python and I am trying to convert the code to MATLAB.
Can someone kindly explain the following code line (from the tutorial):
delta3[range(num_examples), y] -= 1
In short, and if I am not mistaken, delta3 and y are vectors and num_examples is an integer.
It is my understanding that delta3 = probs - y, as in this Math Stack Exchange entry (thank you @rayryeng). Why and when should I subtract 1?
Otherwise, can anybody direct me to an online site where I can simply run and follow the code? I got errors everywhere I tried to run it (including on my home PC):
"NameError: name 'sklearn' is not defined" (probably an import I am missing)
This line: delta3[range(num_examples), y] -= 1 is part of calculating the gradient of the softmax loss function. I refer you to this nice link that gives you more information on how this loss function is formulated and the intuition behind it: http://peterroelants.github.io/posts/neural_network_implementation_intermezzo02/.
In addition, I refer you to this post on Mathematics Stack Exchange that shows you how the gradient of the softmax loss is derived: https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function. Consider the first link as a deep dive whereas the second link is a tl;dr of the first link.
The gradient of the softmax loss function is the gradient of the output layer which you would need to propagate backwards into the layer before the output layer to continue with the backpropagation algorithm.
Summarizing the post linked above: if you calculate the gradient of the softmax loss for a training example, then for each class the gradient of the loss is simply the softmax value evaluated for that class, and you additionally subtract 1 from the value for the class the training example actually belongs to. Remember that the gradient of an example for a class i is equal to p_i - y_i, where p_i is the softmax score of class i for the example and y_i is the classification label using a one-hot encoding scheme; specifically, y_i = 0 if i is not the true class of the example and y_i = 1 if it is. delta3 contains the gradient of the softmax loss function per example in your mini-batch. Specifically, it is a 2D matrix where the number of rows is the number of training examples, num_examples, and the number of columns is the total number of classes.
First we calculate the softmax scores for each training example and each class. Next, for each row of the gradient, we determine the column location that corresponds to the true class the example belongs to and subtract 1 from that score. range(num_examples) generates a list from 0 up to num_examples - 1, and y contains the true class label per example. Therefore, each pair drawn from range(num_examples) and y accesses the right row and column location at which to subtract 1, finalizing the gradient of the loss function.
Now, in the Mathematics Stack Exchange post, as well as in your understanding, the gradient is delta3 = probs - y. This assumes that y is a one-hot encoded matrix, meaning that y has the same size as probs and, for each row of y, everything is zero except for the column index of the correct class, which is set to 1. If you generate such a matrix, subtracting it is equivalent to simply accessing the right column for each row and subtracting 1 from the score.
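A tiny NumPy demonstration of that equivalence (made-up numbers; three examples, four classes):
import numpy as np
probs = np.array([[0.1 , 0.6 , 0.2 , 0.1 ],
                  [0.3 , 0.3 , 0.2 , 0.2 ],
                  [0.25, 0.25, 0.25, 0.25]])
y = np.array([1, 0, 3])      # true class of each example
delta3 = probs.copy()
delta3[range(3), y] -= 1     # the line in question
y_onehot = np.eye(4)[y]      # one-hot encoded labels
assert np.allclose(delta3, probs - y_onehot)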
In MATLAB you actually need to create the linear indices in order to facilitate this subtraction. Specifically, you need to use sub2ind to convert these row and column locations to linear indices, then we can access the gradient matrix and subtract the values by 1.
Therefore:
ind = sub2ind(size(delta3), 1 : num_examples, y + 1);
delta3(ind) = delta3(ind) - 1;
In the Python tutorial you have linked, the class labels are assumed to be from 0 up to N-1 where N is the total number of classes. You must be careful in MATLAB where we start indexing arrays starting at 1, so I have added 1 to y in the above code to ensure that your labels start at 1 instead of 0. ind contains the linear indices of the row and column locations that we need to access and we thus complete the subtraction using those indices.
If you were to formulate this using the knowledge that you gained from your edit, you would do this instead:
ymatrix = full(sparse(1 : num_examples, y + 1, 1, size(delta3, 1), size(delta3, 2)));
delta3 = probs - ymatrix;
ymatrix contains the matrix I talked about, where each row corresponds to an example and is all zeros except for the column pertaining to the class the example belongs to, which is 1. What you may not have seen before are the sparse and full functions. sparse allows you to create a zero matrix in which you specify the row and column locations that are non-zero, as well as the values those locations take on. In this case, I access exactly one element per row, use the class ID of the example to choose the column, and set each of these locations to 1. Also remember that I add 1, since I'm assuming your class IDs start from 0. Because this is a sparse matrix, I then convert it to full to give you a numeric matrix rather than a sparse representation. This code is therefore equivalent in operation to the previous snippet, but the first way is more efficient: it does not create an additional matrix to facilitate the gradient computation, and it modifies the gradient in place instead.
As a sidenote, sklearn is the scikit-learn Python machine learning package, and the NameError is in reference to you not having the actual package installed. To install it, use pip or easy_install to install the Python package to your computer.... so in your command line, it's as simple as:
pip install scikit-learn
or:
easy_install scikit-learn
However, scikit-learn should not be required for you to run the above subtraction code. You do need NumPy though so make sure you have that package installed.
For pip:
pip install numpy
... and for easy_install:
easy_install numpy