Nearest Neighbor using customized weights on Python scikit-learn

Good evening,
I would like to use a Nearest Neighbors model for regression with non-uniform weights. I saw in the User Guide that I can pass weights='distance' when declaring the model, which makes the weights inversely proportional to the distance, but the results I got were not what I wanted.
I saw in the documentation that I can instead pass a function that takes the distances and returns the weights used in the prediction, so I created the following function:
from sklearn.neighbors import KNeighborsRegressor
import numpy
nparray = numpy.array

def customized_weights(distances: nparray) -> nparray:
    for distance in distances:
        if (distance >= 100 or distance <= -100):
            yield 0
        yield (1 - abs(distance)/100)
And I declared the model like this:
knn: KNeighborsRegressor = KNeighborsRegressor(n_neighbors=50, weights=customized_weights).fit(X_train, y_train)
Up to that point, everything works fine. But when I try to predict with the model, I get this error:
File "knn_with_weights.py", line 14, in customized_weights
if (distance >= 100 or distance <= -100):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I do not understand what I did wrong. The documentation says my function should take an array of distances as its parameter and return the equivalent weights. What have I done wrong?
Thanks in advance.

I don't know much about this type of regression, but it is certainly possible that the distances passed into this function are a two-dimensional data structure, which would make sense for all pairwise distances.
Why don't you put a little diagnostic print statement into your custom function to print both distances and distances.shape?
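For example, a throwaway diagnostic version (just a sketch; whatever your function body was stays below the prints):

def customized_weights(distances):
    # temporary diagnostics: see exactly what scikit-learn passes in
    print(distances)
    print(distances.shape)
    ...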

@Jeff H's tip pointed me to the answer.
The input parameter of this function is a two-dimensional numpy array distances with shape (predictions, neighbors), where:
predictions is the number of requested predictions (the number of samples you pass to knn.predict);
neighbors is the number of neighbors used (in my case, n_neighbors=50).
Each element distances[i, j] is the distance from the i-th prediction sample to its j-th nearest neighbor (the smaller j, the smaller the distance).
The function must return an array with the same dimensions as the input array, containing the weight corresponding to each distance.
I do not know if it is the fastest way, but I came up with this solution:
def customized_weights(distances: nparray) -> nparray:
    # create a new array 'weights' with the same dimensions as 'distances',
    # filled with zeros
    weights: nparray = numpy.full(distances.shape, 0, dtype='float')
    for i in range(distances.shape[0]):  # for each prediction:
        if distances[i, 0] >= 100:  # if the smallest distance is greater than 100,
            # consider the nearest neighbor's weight as 1
            # and the other neighbors' weights will stay zero
            weights[i, 0] = 1
            # then continue to the next prediction
            continue
        for j in range(distances.shape[1]):  # apply the weight function to each distance
            if distances[i, j] >= 100:
                continue
            weights[i, j] = 1 - distances[i, j]/100
    return weights
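For reference, the same weighting scheme can be written without the Python loops. This is only a sketch, assuming (as above) non-negative distances, a cutoff of 100, and the nearest-neighbor fallback:

def customized_weights_vectorized(distances: nparray) -> nparray:
    # linear falloff 1 - d/100, floored at 0 for distances >= 100
    weights = numpy.clip(1 - distances/100, 0, None)
    # rows whose nearest neighbor is already 100 or more away:
    # give that nearest neighbor weight 1 and leave the rest at zero
    far = distances[:, 0] >= 100
    weights[far] = 0
    weights[far, 0] = 1
    return weights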

Related

Writing Code using NumPy without any loops

I am writing a program that uses NumPy to calculate the accuracy between testing and training points, but I am not sure how to use vectorized functions instead of the for loops in my code.
Here is my code (is there a way to simplify it so that I do not need any loops?):
# command to import NumPy package
import numpy as np
iris_train = np.genfromtxt("iris-train-data.csv", delimiter=',', usecols=(0,1,2,3), dtype=float)
iris_test = np.genfromtxt("iris-test-data.csv", delimiter=',', usecols=(0,1,2,3), dtype=float)
train_cat = np.genfromtxt("iris-training-data.csv", delimiter=',', usecols=(4), dtype=str)
test_cat = np.genfromtxt("iris-testing-data.csv", delimiter=',', usecols=(4), dtype=str)
correct = 0
for i in range(len(iris_test)):
    n = 0
    old_distance = float('inf')
    while n < len(iris_train):
        # finding the difference between test and train point
        iris_diff = (abs(iris_test[i] - iris_train[n])**2)
        # summing up the calculated differences
        iris_sum = sum(iris_diff)
        new_distance = float(np.sqrt(iris_sum))
        # if statement to update distance
        if new_distance < old_distance:
            index = n
            old_distance = new_distance
        n += 1
    print(i + 1, test_cat[i], train_cat[index])
    if test_cat[i] == train_cat[index]:
        correct += 1
accuracy = ((correct)/float((len(iris_test)))*100)
print(f"Accuracy:{accuracy: .2f}%")
The trick with computing the distances is to insert extra dimensions using numpy.newaxis and use broadcasting to compute a matrix with the distance from every testing sample to every training sample in one vectorized operation. Using numpy's broadcasting rules, diff has shape (num_test_samples, num_train_samples, num_features), and distance has shape (num_test_samples, num_train_samples) since we summed along the last axis in the call to numpy.sum.
Then you can use numpy.argmin to find the index of the closest training sample for every testing sample. index has shape (num_test_samples, ) since we did the reduction operation along the last axis of distance.
Finally, you can use index to select the training classification closest to the testing classification. We can construct a boolean array representing the equality between the testing classification and the closest training classification using the == operator. The number of correct classifications is then the sum of the True elements of this boolean array; since True is cast to 1 and False to 0, we can simply sum the boolean array to get the number of correct classifications.
# Compute the distance from every training sample to every testing sample
# Note that `np.sqrt` is not necessary since sqrt is a monotonically
# increasing function -- removing it doesn't change the answer
diff = iris_test[:, np.newaxis] - iris_train[np.newaxis, :]
distance = np.sqrt(np.sum(np.square(diff), axis=-1))
# Compute the index of the closest training sample to the testing sample
index = np.argmin(distance, axis=-1)
# Check if class of the closest training sample matches the class
# of the testing sample
correct = (test_cat == train_cat[index]).sum()
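To recover the printed accuracy from the original code, one extra line suffices (a sketch reusing the names above):

accuracy = correct / len(iris_test) * 100
print(f"Accuracy:{accuracy: .2f}%")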
If I understand correctly what you are doing (though I don't really need to, to answer the question): for each vector of iris_test, you are searching for the closest one in iris_train, closest here in the sense of Euclidean distance.
So you have three nested loops (pseudo-Python):
for u in iris_test:
    for v in iris_train:
        s = 0
        for i in range(dimensionOfVectors):
            s += (u[i] - v[i])**2
        dist = sqrt(s)
You are right to try to get rid of Python loops, and the most important one to get rid of is the innermost one. You already got rid of that one, since the inner loop of my pseudo-code is implicit in:
iris_diff = (abs(iris_test[i] - iris_train[n])**2)
and
iris_sum = sum(iris_diff)
Both of those lines iterate through all dimensions of your vectors, but not in Python: in internal numpy code, so it is fast.
One may object that you don't really need abs before a **2, and that you could have called the np.linalg.norm function, which does all those operations in one call:
new_distance = np.linalg.norm(iris_test[i]-iris_train[n])
which is faster than your code. But at least, in your code, that loop over all components of the vectors is already vectorized.
The next stage is to vectorize the middle loop.
That also can be accomplished. Instead of computing one by one
new_distance = np.linalg.norm(iris_test[i]-iris_train[n])
You could compute in one call all the len(iris_train) distances between iris_test[i] and all iris_train[n].
new_distances = np.linalg.norm(iris_test[i]-iris_train, axis=1)
The trick here lies in numpy broadcasting and the axis parameter:
broadcasting means that you can compute the difference between a 1D, length-W vector and a 2D n×W array (iris_test[0] is a 1D vector, and iris_train is a 2D array whose number of columns is the same as the length of iris_test[0]). In such a case, numpy broadcasts the first operand and returns a 2D n×W array as the result, whose line k is iris_test[0] - iris_train[k].
Calling np.linalg.norm on that n×W 2D matrix would return a single float (the norm of the whole matrix), unless you restrict the norm to the 2nd axis (axis=1), in which case it returns n floats, each of them being the norm of one row.
In other words, after the previous line of code, new_distances[k] is the distance between iris_test[i] and iris_train[k].
Once that is done, you can easily find the k for which this distance is the smallest, using np.argmin:
np.argmin(new_distances) is the index of the smallest of the distances.
So, all together, your code could be rewritten as:
correct = 0
for i in range(len(iris_test)):
    new_distances = np.linalg.norm(iris_test[i] - iris_train, axis=1)
    index = np.argmin(new_distances)
    # printing out classifications
    print(i + 1, test_cat[i], train_cat[index])
    if test_cat[i] == train_cat[index]:
        correct += 1
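If you also want to get rid of the outer loop, SciPy can compute all pairwise distances in one call. A sketch using scipy.spatial.distance.cdist (Euclidean by default), reusing the arrays above:

from scipy.spatial.distance import cdist

# shape (len(iris_test), len(iris_train)): distance from every test point to every train point
all_distances = cdist(iris_test, iris_train)
index = np.argmin(all_distances, axis=1)  # closest training sample per test sample
correct = (test_cat == train_cat[index]).sum()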

Multiplying subarrays of tensor

I am trying to implement a multivariate Gaussian Mixture Model and am trying to calculate the probability distribution function using tensors. There are n data points, k clusters, and d dimensions. So far, I have two tensors: one is an (n,k,d) tensor of centered data points, and the other is a k×d×d tensor of covariance matrices. I can compute an n×k matrix of probabilities by doing
centered = np.repeat(points[:, np.newaxis, :], K, axis=1) - mu[np.newaxis, :]  # N x K x D
prob = np.zeros((n, k))
constant = 1/2/np.power(np.pi, d/2)
for n in range(centered.shape[0]):
    for k in range(centered.shape[1]):
        p = centered[n, k, :][np.newaxis]  # 1 x D
        power = -1/2*(p @ np.linalg.inv(sigma[k, :, :]) @ p.T)
        prob[n, k] = constant * np.linalg.det(sigma[k, :, :]) * np.exp(power)
where sigma is the triangularized kxdxd matrix of covariances and centered are mypoints. What is a more pythonic way of doing this using numpy's tensor capabilites?
Just a couple of quick observations:
I don't see you using p in the loop; is this a mistake? Using n instead?
The T in centered[n,k,:].T does nothing; with that index the array is 1d.
I'm not sure whether np.linalg.inv can handle batches of arrays, allowing np.linalg.inv(sigma).
@ allows batches, just so long as the last 2 dimensions are the ones entering into the dot (with the usual last-of-A, second-to-last-of-B rule); einsum can also be used.
Again, does np.linalg.det handle batches?
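For what it's worth, np.linalg.inv and np.linalg.det do operate on stacked (batched) matrices: they treat the last two dimensions as the matrices and broadcast over the rest. Here is a sketch of a fully vectorized probability computation, assuming centered has shape (n, k, d) and sigma holds full covariance matrices of shape (k, d, d); note it uses the standard multivariate normal density, which has sqrt(det) in the denominator, unlike the code above:

import numpy as np

inv = np.linalg.inv(sigma)  # batched inverse, shape (k, d, d)
det = np.linalg.det(sigma)  # batched determinant, shape (k,)
# quadratic form c^T Sigma^{-1} c for every (point, cluster) pair at once
power = -0.5 * np.einsum('nkd,kde,nke->nk', centered, inv, centered)  # (n, k)
prob = np.exp(power) / np.sqrt(det * (2*np.pi)**d)[np.newaxis, :]     # (n, k)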

Should np.linalg.norm be squared when implementing k-means clustering algorithm?

The k-means clustering algorithm's objective is to find the assignment S = {S_1, ..., S_k} that achieves
argmin_S sum_{i=1}^{k} sum_{x in S_i} ||x - mu_i||^2
where mu_i is the mean of the points in cluster S_i.
I looked at several implementations of it in python, and in some of them the norm is not squared.
For example (taken from here):
def form_clusters(labelled_data, unlabelled_centroids):
    """
    given some data and centroids for the data, allocate each
    datapoint to its closest centroid. This forms clusters.
    """
    # enumerate because centroids are arrays which are unhashable
    centroids_indices = range(len(unlabelled_centroids))
    # initialize an empty list for each centroid. The list will
    # contain all the datapoints that are closer to that centroid
    # than to any other. That list is the cluster of that centroid.
    clusters = {c: [] for c in centroids_indices}
    for (label, Xi) in labelled_data:
        # for each datapoint, pick the closest centroid.
        smallest_distance = float("inf")
        for cj_index in centroids_indices:
            cj = unlabelled_centroids[cj_index]
            distance = np.linalg.norm(Xi - cj)
            if distance < smallest_distance:
                closest_centroid_index = cj_index
                smallest_distance = distance
        # allocate that datapoint to the cluster of that centroid.
        clusters[closest_centroid_index].append((label, Xi))
    return clusters.values()
And, in contrast, the expected implementation (taken from here; this is just the distance calculation):
import numpy as np
from numpy.linalg import norm

def compute_distance(self, X, centroids):
    distance = np.zeros((X.shape[0], self.n_clusters))
    for k in range(self.n_clusters):
        row_norm = norm(X - centroids[k, :], axis=1)
        distance[:, k] = np.square(row_norm)
    return distance
Now, I know there are several ways to calculate the norm/distance, but I looked only at implementations that used np.linalg.norm with ord=None or ord=2, and, as I said, in some of them the norm is not squared, yet they cluster correctly.
Why?
In my experience, using the norm or the squared norm as the objective function of an optimization algorithm yields similar results. The minimum value of the objective function will change, but the parameters obtained will be the same. My intuition is that the inner product generates a quadratic function, and taking the root of that product only changes the magnitude, not the topology of the objective function. A more detailed answer can be found here: https://math.stackexchange.com/questions/2253443/difference-between-least-squares-and-minimum-norm-solution
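A quick numerical check of that invariance: since squaring is monotonic on non-negative values, the argmin over centroids is unchanged whether or not the norm is squared (a sketch):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 2))          # 10 datapoints
centroids = rng.random((3, 2))   # 3 centroids

# distance from every point to every centroid, shape (10, 3)
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
# same cluster assignments whether or not we square
assert (np.argmin(d, axis=1) == np.argmin(d**2, axis=1)).all()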
Hope it helps.

Calculating Mean Squared Error through Matrix Arithmetic on Numpy Matrices of Binary Images

I have 2 binary images, one is a ground truth, and one is an image segmentation that I produced.
I am trying to calculate the mean squared distance ... between the two images using the following algorithm.
Let G = {g1, g2, . . . , gN} be the points in the ground truth boundary.
Let B = {b1, b2, . . . , bM} be the points in the segmented boundary.
Define d(p, p′) to be a measure of distance between points p and p′ (e.g. Euclidean, city block, etc.).
def MSD(A, G):
    '''
    Takes a thresholded binary image, and a ground truth img (binary), and computes the mean squared absolute difference
    :param A: The thresholded binary image
    :param G: The ground truth img
    :return:
    '''
    sim = np.bitwise_xor(A, G)
    sum = 0
    for i in range(0, sim.shape[0]):
        for j in range(0, sim.shape[1]):
            if (sim[i, j] == True):
                min = 9999999
                for k in range(0, sim.shape[0]):
                    for l in range(0, sim.shape[1]):
                        if (sim[k, l] == True):
                            e = abs(i-k) + abs(j-l)
                            if e < min:
                                min = e
                                mink = k
                                minl = l
                sum += min
    return sum/(sim.shape[0]*sim.shape[1])
This algorithm is too slow, though, and never completes.
This example and this example (Answer 3) might show a method for computing the mean squared error with matrix arithmetic, but I do not understand how these examples make sense or why they work.
So if I understand your formula and code correctly, you have one (binary) image B and a (ground truth) image G. "Points" are defined by the pixel positions where either image has a True (or at least nonzero) value. From your bitwise_xor I deduce that both images have the same shape (M,N).
So the quantity d^2(b,g) is at worst an (M*N, M*N)-sized array, relating each pixel of B to each pixel of G. It's even better: we only need a shape (m,n) if there are m nonzeros in B and n nonzeros in G. Unless your images are huge, we can get away with keeping track of this large quantity. This will cost memory, but we will win a lot of CPU time by vectorization. Then we only have to find, for each of the m pixels of B, the minimum of this distance over all n pixels of G, and sum up those minima. Note that the solution below uses extreme vectorization, and it can easily eat up your memory if the images are large.
Assuming Manhattan distance (with the square in d^2 which seems to be missing from your code):
import numpy as np

# generate dummy data
M, N = 100, 100
B = np.random.rand(M, N) > 0.5
G = np.random.rand(M, N) > 0.5

def MSD(B, G):
    # get indices of nonzero pixels
    nnz_B = B.nonzero()  # (x_inds, y_inds) tuple, x_inds and y_inds are shape (m,)
    nnz_G = G.nonzero()  # (x_inds', y_inds') each with shape (n,)
    # np.array(nnz_B) has shape (2,m)
    # compute squared Manhattan distance
    dist2 = abs(np.array(nnz_B)[..., None] - np.array(nnz_G)[:, None, :]).sum(axis=0)**2  # shape (m,n)
    # alternatively: Euclidean for comparison:
    # dist2 = ((np.array(nnz_B)[..., None] - np.array(nnz_G)[:, None, :])**2).sum(axis=0)
    mindist2 = dist2.min(axis=-1)  # shape (m,) of minimum square distances
    return mindist2.mean()  # sum divided by m, i.e. the MSD itself

print(MSD(B, G))
If the above uses too much memory we can introduce a loop over the elements of nnz_B, and only vectorize in the elements of nnz_G. This will take more CPU power and less memory. This trade-off is typical for vectorization.
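Such a middle-ground version could look like this (a sketch, using the same squared-Manhattan convention as above):

def MSD_lowmem(B, G):
    nnz_B = np.transpose(B.nonzero())  # (m, 2) array of pixel coordinates
    nnz_G = np.transpose(G.nonzero())  # (n, 2)
    total = 0.0
    for b in nnz_B:
        # squared Manhattan distance from this one pixel of B to every pixel of G
        d2 = np.abs(b - nnz_G).sum(axis=1)**2  # shape (n,)
        total += d2.min()
    return total / len(nnz_B)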
An efficient method for calculating this distance is using the Distance Transform. SciPy has an implementation in the ndimage package: scipy.ndimage.morphology.distance_transform_edt.
The idea is to compute a distance transform for the background of the ground-truth image G. This leads to a new image D that is 0 for each pixel that is nonzero in G, and that, for each zero pixel in G, contains the distance to the nearest nonzero pixel.
Next, for each nonzero pixel in B (or A in the code that you posted), you look at the corresponding pixel in D. This is the distance to G for that pixel. So, simply average the squared values of D over the pixels where B is nonzero to obtain your result.
import numpy as np
import scipy.ndimage as nd
import matplotlib.pyplot as pp
# Create some test data
img = pp.imread('erika.tif') # a random image
G = img > 120 # the ground truth
img = img + np.random.normal(0, 20, img.shape)
B = img > 120 # the other image
D = nd.morphology.distance_transform_edt(~G)
msd = np.mean(D[B]**2)

Calculate Hits At metric in Theano

I am using Keras to build a recommender model. Because the item set is quite large, I'd like to calculate the Hits @ N metric as a measure of accuracy. That is, if the observed item is in the top N predictions, it counts as a relevant recommendation.
I was able to build the hits-at-N function using numpy. But as I try to port it into a custom loss function for Keras, I'm having problems with the tensors. Specifically, enumerating over a tensor is different, and when I looked into the syntax to find something equivalent, I started to question the whole approach. It's sloppy and slow, reflective of my general Python familiarity.
def hits_at(y_true, y_pred):  # numpy version
    a = y_pred.argsort(axis=1)  # ascending, sort by row, return index
    a = np.fliplr(a)  # reverse to get descending
    a = a[:, 0:10]  # return only the first 10 columns of each row
    Ybool = []  # initialize 2D array
    for t, idx in enumerate(a):
        ybool = np.zeros(num_items + 1)  # zero fill; 0 index is reserved
        ybool[idx] = 1  # flip the recommended items from 0 to 1
        Ybool.append(ybool)
    A = map(lambda t: list(t), Ybool)
    right_sum = (A * y_true).max(axis=1)  # element-wise multiplication, then find the max
    right_sum = right_sum.sum()  # how many times did we score a hit?
    return right_sum/len(y_true)  # fraction of observations where we scored a hit
How should I approach this in a more compact, tensor-friendly way?
Update:
I was able to get a version of Top 1 working. I based it loosely on the GRU4Rec description:
def custom_objective(y_true, y_pred):
    y_pred_idx_sort = T.argsort(-y_pred, axis=1)[:, 0]  # returns the first element, i.e. the index of the row's largest value
    y_act_idx = T.argmax(y_true, axis=1)  # returns an array of indexes of the top values
    return T.cast(-T.mean(T.nnet.sigmoid(T.eq(y_pred_idx_sort, y_act_idx))), theano.config.floatX)
I just had to compare the array of top-1 predictions to the array of actuals, element-wise. Theano has an eq() function to do that.
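For completeness, here is a sketch of the same idea generalized to Hits @ N, assuming y_true is one-hot; since it is not differentiable, it would serve as a monitoring metric rather than a loss:

import theano
import theano.tensor as T

def hits_at_n(y_true, y_pred, n=10):
    top_n = T.argsort(-y_pred, axis=1)[:, :n]             # indices of the n largest predictions per row
    actual = T.argmax(y_true, axis=1).dimshuffle(0, 'x')  # (batch, 1) index of the true item
    hits = T.any(T.eq(top_n, actual), axis=1)             # 1 where the true item is in the top n
    return T.mean(T.cast(hits, theano.config.floatX))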
Independent of N, the number of possible values of your loss function is finite, so it can't be differentiable in a sensible tensor way, and you cannot use it as a loss function in Keras / Theano. You may try to use a Theano log loss restricted to the top N entries.
UPDATE:
In Keras you may write your own loss functions. They have a declaration of the form:
def loss_function(y_pred, y_true):
Both y_true and y_pred are numpy arrays, so you may easily obtain a vector v which is 1 when an example is in the top 500 and 0 otherwise. Then you may transform it into a Theano tensor constant vector and apply it like this:
return theano.tensor.nnet.binary_crossentropy(y_pred * v, y_true * v)
This should work correctly.
UPDATE 2:
Log loss is the same thing as binary_crossentropy.
