Writing Code using NumPy without any loops - python

I am writing a program that utilizes NumPy to calculate accuracy between testing and training points, but I am not sure how to utilize the vectorized functions as opposed to the for loops I have used in my code.
Here is my code. (Is there a way to simplify it so that I do not need any loops?)
# command to import NumPy package
import numpy as np

iris_train = np.genfromtxt("iris-training-data.csv", delimiter=',', usecols=(0, 1, 2, 3), dtype=float)
iris_test = np.genfromtxt("iris-testing-data.csv", delimiter=',', usecols=(0, 1, 2, 3), dtype=float)
train_cat = np.genfromtxt("iris-training-data.csv", delimiter=',', usecols=(4), dtype=str)
test_cat = np.genfromtxt("iris-testing-data.csv", delimiter=',', usecols=(4), dtype=str)

correct = 0
for i in range(len(iris_test)):
    n = 0
    old_distance = float('inf')
    while n < len(iris_train):
        # finding the difference between test and train point
        iris_diff = abs(iris_test[i] - iris_train[n])**2
        # summing up the calculated differences
        iris_sum = sum(iris_diff)
        new_distance = float(np.sqrt(iris_sum))
        # if statement to update distance
        if new_distance < old_distance:
            index = n
            old_distance = new_distance
        n += 1
    print(i + 1, test_cat[i], train_cat[index])
    if test_cat[i] == train_cat[index]:
        correct += 1
accuracy = correct / float(len(iris_test)) * 100
print(f"Accuracy:{accuracy: .2f}%")

The trick with computing the distances is to insert extra dimensions using numpy.newaxis and use broadcasting to compute a matrix with the distance from every testing sample to every training sample in one vectorized operation. Using numpy's broadcasting rules, diff has shape (num_test_samples, num_train_samples, num_features), and distance has shape (num_test_samples, num_train_samples) since we summed along the last axis in the call to numpy.sum.
Then you can use numpy.argmin to find the index of the closest training sample for every testing sample. index has shape (num_test_samples, ) since we did the reduction operation along the last axis of distance.
Finally, you can use index to select the training classification closest to each testing sample. We can construct a boolean array that represents the equality between the testing classifications and the closest training classifications using the == operator. Since True is cast to 1 and False is cast to 0, the number of correct classifications is then simply the sum of this boolean array.
# Compute the distance from every training sample to every testing sample
# Note that `np.sqrt` is not necessary since sqrt is a monotonically
# increasing function -- removing it doesn't change the answer
diff = iris_test[:, np.newaxis] - iris_train[np.newaxis, :]
distance = np.sqrt(np.sum(np.square(diff), axis=-1))
# Compute the index of the closest training sample to the testing sample
index = np.argmin(distance, axis=-1)
# Check if class of the closest training sample matches the class
# of the testing sample
correct = (test_cat == train_cat[index]).sum()
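For completeness, the accuracy figure from the original question then follows in one more step (mirroring the question's final two lines):
accuracy = correct / len(iris_test) * 100
print(f"Accuracy:{accuracy: .2f}%")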

If I understand correctly what you are doing (but I don't really need to, to answer the question), for each vector of iris_test you are searching for the closest one in iris_train, closest being here in the sense of Euclidean distance.
So you have 3 nested loops (pseudo-Python):
for u in iris_test:
    for v in iris_train:
        s = 0
        for i in range(dimensionOfVectors):
            s += (u[i] - v[i])**2
        dist = sqrt(s)
You are right to try to get rid of Python loops, and the most important one to get rid of is the inner one. You already got rid of that one, since the inner loop of my pseudo-code is, in your code, implicitly in:
iris_diff = (abs(iris_test[i] - iris_train[n])**2)
and
iris_sum = sum(iris_diff)
Both those lines iterate through all dimensions of your vectors, but do it not in Python but in internal numpy code, so it is fast.
One may object that you don't really need the abs when squaring afterwards, and that you could have called the np.linalg.norm function, which does all those operations in one call:
new_distance = np.linalg.norm(iris_test[i]-iris_train[n])
which is faster than your code. But at least, in your code, that loop over all components of the vectors is already vectorized.
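As a quick sanity check of that equivalence (toy vectors with made-up values, just for illustration):
import numpy as np

u = np.array([5.1, 3.5, 1.4, 0.2])
v = np.array([4.9, 3.0, 1.4, 0.2])
print(float(np.sqrt(sum(abs(u - v)**2))))  # the original per-pair computation
print(np.linalg.norm(u - v))               # one call computes the same, ~0.5385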
The next stage is to vectorize the middle loop.
That also can be accomplished. Instead of computing one by one
new_distance = np.linalg.norm(iris_test[i]-iris_train[n])
You could compute in one call all the len(iris_train) distances between iris_test[i] and all iris_train[n].
new_distances = np.linalg.norm(iris_test[i]-iris_train, axis=1)
The trick here lies in numpy broadcasting and the axis parameter:
broadcasting means that you can compute the difference between a 1D, length-W vector and a 2D n×W array (iris_test[0] is a 1D vector, and iris_train is a 2D array whose number of columns is the same as the length of iris_test[0]). In such a case, numpy broadcasts the first operand and returns a 2D n×W array as the result, whose each row k is iris_test[0] - iris_train[k].
Calling np.linalg.norm on that n×W 2D array would return a single float (the norm of the whole matrix), unless you restrict the norm to the 2nd axis (axis=1), in which case it returns n floats, each of them being the norm of one row.
In other words, after the previous line of code, new_distances[k] is the distance between iris_test[i] and iris_train[k].
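As a quick illustration of those two behaviours (a toy example with made-up numbers: five "training" rows of width four):
import numpy as np

train_toy = np.arange(20.0).reshape(5, 4)  # hypothetical 5x4 training array
v = np.ones(4)                             # one length-4 test vector
diff = v - train_toy                       # broadcasting: diff has shape (5, 4)
print(np.linalg.norm(diff))                # a single float: norm of the whole 2D array
print(np.linalg.norm(diff, axis=1))        # shape (5,): one norm per row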
Once that is done, you can easily find the k for which this distance is the smallest, using np.argmin:
np.argmin(new_distances) is the index of the smallest of the distances.
So, all together, your code could be rewritten as:
correct = 0
for i in range(len(iris_test)):
    new_distances = np.linalg.norm(iris_test[i] - iris_train, axis=1)
    index = np.argmin(new_distances)
    # printing out classifications
    print(i + 1, test_cat[i], train_cat[index])
    if test_cat[i] == train_cat[index]:
        correct += 1

Related

Function for calculating the determinant of a matrix

I want my function to calculate the determinant of input Matrix A using row reduction to convert A to echelon form, after which the determinant should just be the product of the diagonal of A.
I can assume that A is an n x n np.array
This is the code that I already have:
def determinant(A):
    A = np.matrix.copy(A)
    row_switches = 0
    # Reduce A to echelon form
    for col in range(A.shape[1]):
        pivot = find_non_zero(A, col)
        if pivot != col:
            # Switch rows
            A[[pivot, col], :] = A[[col, pivot], :]
            row_switches += 1
        # Make all 0's below "pivot"
        for row in range(col + 1, A.shape[0]):
            factor = A[row, col] / A[col, col]
            A[row, :] = A[row, :] - factor * A[col, :]
    return A.diagonal().prod() * (-1) ** row_switches

# Find first non-zero value starting from diagonal element
def find_non_zero(A, n):
    row = n
    while row < A.shape[0] and A[row, n] == 0:
        row += 1
    return row
I then compare my results with np.linalg.det(A). The difference is manageable for random matrices of floats below 50x50 (about 2.8e-08 difference), but after 70x70 the difference is between 1,000 and 10,000 on average.
What could be the cause of this?
The other problem I have with my code is that for a matrix of ints, A = np.random.randint(low=-1000, high=1000, size=(25, 25)), the difference is even more insane:
1820098560 (mine) vs 1.0853429659737294e+81 (numpy)
There are two issues with integer arrays, and you can address both by changing the first line of your function to A = np.array(A, dtype=float), which copies A and converts it to floating point in one go.
You risk overflowing and throwing off your results completely.
>>> np.arange(1, 10).prod() # correct
362880
>>> np.arange(1, 20).prod() # incorrect (overflows when the default integer is 32-bit, e.g. on Windows)
109641728
>>> np.arange(1, 20, dtype=float).prod() # correct
1.21645100408832e+17
Whatever the result of the rhs in the line A[row, :] = A[row, :] - factor * A[col, :] may be, it will be cast back to an integer when it is assigned into the integer array.
>>> a = np.zeros((3,), dtype=int)
>>> a[0] = 2.4
>>> a
array([2, 0, 0])
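Putting both fixes together, a quick check against the question's own 25x25 integer example (a sketch; it reuses the determinant and find_non_zero functions from the question, with the first line changed as suggested above):
A = np.random.randint(low=-1000, high=1000, size=(25, 25))
print(determinant(A))     # now a float on the order of 1e81, instead of an overflowed int
print(np.linalg.det(A))   # should agree with the above up to relative rounding error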
As for the inaccuracies with float arrays, you have to live with them because of floating-point arithmetic's limited precision. When the product of the diagonal gives you a number like 6.59842495617676e+17 and numpy gives 6.598424956176783e+17, you can see the results are very close. But a float can only represent so many digits, and when the number is very large, a difference in the last couple of digits really means a difference in the 1000s. This will only get worse the bigger your matrices, and as a result the bigger your numbers. But in terms of relative difference, i.e. (your_method - numpy) / numpy, it's fairly good regardless of the magnitude of the numbers you work with.
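To see this in relative terms, using the two numbers quoted above:
a = 6.59842495617676e+17   # product of the diagonal
b = 6.598424956176783e+17  # numpy's result
print(abs(a - b))          # absolute difference: in the thousands
print(abs(a - b) / b)      # relative difference: around 3.5e-15, close to machine epsilon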
Stability of the algorithm
A point from Wikipedia about your factor value when the pivot is very small:
One possible problem is numerical instability, caused by the possibility of dividing by very small numbers. If, for example, the leading coefficient of one of the rows is very close to zero, then to row-reduce the matrix, one would need to divide by that number. This means that any error that existed for the number that was close to zero would be amplified. Gaussian elimination is numerically stable for diagonally dominant or positive-definite matrices. For general matrices, Gaussian elimination is usually considered to be stable, when using partial pivoting, even though there are examples of stable matrices for which it is unstable.[11]
[snip]
This algorithm differs slightly from the one discussed earlier, by choosing a pivot with largest absolute value. Such a partial pivoting may be required if, at the pivot place, the entry of the matrix is zero. In any case, choosing the largest possible absolute value of the pivot improves the numerical stability of the algorithm, when floating point is used for representing numbers.
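To make the partial-pivoting idea concrete, here is a hedged sketch of the question's function with the largest-absolute-value pivot choice swapped in for find_non_zero (determinant_pp is a hypothetical name; it also folds in the float conversion discussed above):
import numpy as np

def determinant_pp(A):
    # Determinant via Gaussian elimination with partial pivoting (a sketch)
    A = np.array(A, dtype=float)  # copy and work in floating point
    row_switches = 0
    for col in range(A.shape[1]):
        # choose the row with the largest absolute value in this column as pivot
        pivot = col + np.argmax(np.abs(A[col:, col]))
        if A[pivot, col] == 0:
            return 0.0  # the whole column below the diagonal is zero: singular
        if pivot != col:
            A[[pivot, col], :] = A[[col, pivot], :]
            row_switches += 1
        for row in range(col + 1, A.shape[0]):
            factor = A[row, col] / A[col, col]
            A[row, :] -= factor * A[col, :]
    return A.diagonal().prod() * (-1) ** row_switches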
If it matters, numpy uses LAPACK's LU decomposition algorithm which implements an iterative version of Sivan Toledo's recursive LU algorithm.

Nearest Neighbor using customized weights on Python scikit-learn

Good night,
I would like to use the Nearest Neighbor model for regression with non-uniform weights. I saw in the User Guide that I can use weights='distance' in the declaration of the model, so that the weights are inversely proportional to the distance, but the results I got were not what I wanted.
I saw in the documentation that I can pass a function for the weights (given the distances) used in the prediction, so I created the following function:
from sklearn.neighbors import KNeighborsRegressor
import numpy
nparray = numpy.array
def customized_weights(distances: nparray) -> nparray:
    for distance in distances:
        if (distance >= 100 or distance <= -100):
            yield 0
        yield (1 - abs(distance)/100)
And have declared the method like this:
knn: KNeighborsRegressor = KNeighborsRegressor(n_neighbors=50, weights=customized_weights ).fit(X_train, y_train)
Up to that point, everything works fine. But when I try to predict with the model, I get the error:
File "knn_with_weights.py", line 14, in customized_weights
if (distance >= 100 or distance <= -100):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I did not understand what I did wrong. In the documentation it is written that my function should take an array of distances as a parameter and return the equivalent weights. What have I done wrong?
Thanks in advance.
I don't know much about this type of regression, but it is certainly possible that the distances passed into this function are a 2-dimensional data structure, which would make sense for all pairwise distances.
Why don't you put a little print statement into your custom function to print both distances and distances.shape?
@Jeff H's tip directed me to the answer.
The input parameter of this function is a two-dimensional numpy array distances with shape (predictions, neighbors), where:
predictions is the number of desired predictions (the number of samples you pass to knn.predict);
neighbors is the number of neighbors used (in my case, n_neighbors=50).
Each element distances[i, j] represents the distance for the i-th prediction from its j-th nearest neighbor (the smaller j, the smaller the distance).
The function must return an array with the same dimensions as the input array, containing the weight corresponding to each distance.
I do not know if it is the fastest way, but I came up with this solution:
def customized_weights(distances: nparray) -> nparray:
    # create a new array 'weights' with the same dimensions as 'distances',
    # filled with zeros
    weights: nparray = numpy.zeros(distances.shape, dtype=float)
    for i in range(distances.shape[0]):  # for each prediction:
        if distances[i, 0] >= 100:  # if the smallest distance is greater than 100,
            # consider the nearest neighbor's weight as 1
            # and the other neighbors' weights will stay zero
            weights[i, 0] = 1
            # then continue to the next prediction
            continue
        for j in range(distances.shape[1]):  # apply the weight function to each distance
            if distances[i, j] >= 100:
                continue
            weights[i, j] = 1 - distances[i, j]/100
    return weights
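Since the answer wonders whether this is the fastest way: the same logic can also be written without any Python loops (a sketch; it relies on scikit-learn passing the distances sorted in ascending order along axis 1, so that if distances[i, 0] >= 100 the whole row is >= 100):
def customized_weights_vectorized(distances: nparray) -> nparray:
    # linear falloff below 100, zero weight at or beyond 100
    weights = numpy.where(distances < 100, 1 - distances / 100, 0.0)
    # rows where even the nearest neighbor is >= 100 away:
    # give that single neighbor full weight (the rest are already zero)
    all_far = distances[:, 0] >= 100
    weights[all_far, 0] = 1.0
    return weights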

subsetting numpy array to rows within a d-dimensional hypercube

I have a numpy array of shape n x d. Each row represents a point in R^d. I want to filter this array to only rows within a given distance on each axis of a single point--a d-dimensional hypercube, as it were.
In 1 dimension, this could be:
array[np.which(array < lmax and array > lmin)]
where lmax and lmin are the max and min relevant to the point ± distance. But I want to do this in d dimensions. d is not fixed, so hard-coding it doesn't work. I checked to see if the above works where lmax and lmin are d-length vectors, but it just flattens the array.
I know I could plug the matrix and the point into a distance calculator like scipy.spatial.distance and get some sort of distance metric, but that's likely slower than some simple filtering (if it exists) would be.
The fact that I have to do this calculation potentially millions of times means that, ideally, I'd like a fast solution.
You can try this.
def test(array):
    large = array > lmin
    small = array < lmax
    return array[[i for i in range(array.shape[0])
                  if np.all(large[i]) and np.all(small[i])]]
For every i, array[i] is a vector. All the elements of a vector should be in range [lmin, lmax], and this process of calculation can be vectorized.
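That remaining Python-level loop can itself be vectorized with np.all along the row axis (a sketch; lmin and lmax may be scalars or length-d vectors, since either broadcasts against the n x d array):
import numpy as np

def test_vectorized(array, lmin, lmax):
    # one boolean per row: True if every coordinate lies strictly inside the box
    mask = np.all((array > lmin) & (array < lmax), axis=1)
    return array[mask]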

Efficient Euclidean distance for coordinates in different size lists in python

I've got two large point lists. One holds points that represent the edges of a rectangle (edge_points), with xy coordinates. The other list holds points within the rectangle (all_point), with xyz coordinates. From the second list, I want to remove any points that are within an xy distance of m of any of the edge points (list 1).
I have functioning code, but it is very slow because of the nested loops. I've seen threads that suggest cdist, but that won't work for my scenario, where I want to compare each point in the rectangle to a single edge point. hypot is faster than using sqrt, but still doesn't get me where I want to be.
How do I increase the efficiency of this code?
all_point = colpoint + rowpoint
all_points = []
for n in range(0, len(all_point)):
    # Calculate xy distance between inflection point and edge points
    test_point = all_point[n]
    dist = []
    for k in range(0, len(edge_points)):
        test_edge = edge_points[k]
        dist_edge = math.hypot(test_point[1] - test_edge[1], test_point[0] - test_edge[0])
        dist.append(dist_edge)
    if all(i >= m for i in dist):
        all_points.append(test_point)
    else:
        continue
Vectorise, vectorise also applies here:
import numpy as np
all_point, edge_points = np.asanyarray(all_point), np.asanyarray(edge_points)
squared_dists = ((all_point[:, None, :2] - edge_points[None, :, :])**2).sum(axis=-1)
mask = np.all(squared_dists > m**2, axis=-1)
all_points = all_point[mask, :]
Observe that at the Python level there are no more loops. Vectorisation moves these loops to compiled code which executes several-fold faster.
Specifically, we have two arrays: all_point of size N1x3, of which only the first two (xy) columns are used, and edge_points of size N2x2. The xy slices were then reshaped to sizes N1x1x2 and 1xN2x2 (by the Nones in the indexing; None is equivalent to np.newaxis here).
When taking the difference this triggers broadcasting of the first two axes, such that the resulting array has shape N1xN2x2 and contains all the pairwise differences between coordinates of all_point and edge_points. The subsequent squaring applies to all N1xN2x2 elements in one go. The sum, as specified by the axis parameter is taken along the last axis, i.e. over x and y to yield an N1xN2 array of squared pairwise distances.
The next lines demonstrate the vectorised equivalent of if statements. To be able to perform them in one go one creates truth masks. The comparison with m**2 is done elementwise, therefore we get a truth value for each of the N1xN2 elements of squared_dists. np.all is similar to Python all but can do multiple groups simultaneously. This is controlled by the axis parameter. Here it specifies that all should be applied row-wise yielding N1 truth values.
This mask matches all_point in shape and can be used to extract all the coordinate pairs that fulfil the criterion.
To summarise, broadcasting permits eliminating a nested loop at the Python level and replacing it with a few vectorised operations. As long as memory is not an issue, this comes with a massive speedup.
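As an aside, the cdist function mentioned in the question computes exactly this N1xN2 pairwise-distance matrix, so it could be used here after all (a sketch, assuming scipy is available):
import numpy as np
from scipy.spatial.distance import cdist

dists = cdist(all_point[:, :2], edge_points)  # N1 x N2 Euclidean xy distances
mask = np.all(dists > m, axis=-1)             # keep points farther than m from every edge point
all_points = all_point[mask, :]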
If memory is an issue, here is a memory-saving variant that uses tensordot. It expands the squared distance as |x - e|^2 = |x|^2 - 2 x·e + |e|^2, so the big N1xN2x2 intermediate array of differences is never built:
all_point, edge_points = np.asanyarray(all_point), np.asanyarray(edge_points)
mixed = np.tensordot(all_point[:, :2], -2 * edge_points, (1, 1))
mixed += (all_point[:, :2]**2).sum(axis=-1)[:, None]
mixed += (edge_points**2).sum(axis=-1)[None, :]
mask = np.all(mixed > m**2, axis=-1)
all_points = all_point[mask, :]
If that's still too big, we'll have to chop up all_point or edge_points into manageable bits.
all_point, edge_points = np.asanyarray(all_point), np.asanyarray(edge_points)
mask = np.ones((len(all_point),), dtype=bool)
mm = m*m - (all_point[:, :2]**2).sum(axis=-1)[:, None]
chunksize = <choose according to your memory and data size>
for i in range(0, len(edge_points), chunksize):
    mixed = np.tensordot(all_point[mask, :2], -2 * edge_points[i:i+chunksize, :], (1, 1))
    mixed += (edge_points[None, i:i+chunksize, :]**2).sum(axis=-1)
    mask[mask] &= np.all(mixed > mm[mask], axis=-1)
all_points = all_point[mask, :]

Vectorized Evaluation of a Function, Broadcasting and Element Wise Operations

Given this formula (it was shown as an image in the original post; from the code below it is apparently F(x_i) = (1/alpha_i) * sum over j != i of alpha_j * R(x_j - x_i), for i = 1, ..., n)...
I have to explain what this code does, knowing that it performs the vectorized evaluation of F, using broadcasting and elementwise operations concepts...
def F(x_pos, alpha):
    D = x_pos.reshape(1,-1) - x_pos.reshape(-1,1)
    return (1./alpha) * (alpha.reshape(1,-1) * R(D)).sum(axis=1)
My explanation is:
In the first line, the function F receives x_pos and alpha as parameters (both numpy arrays). In the second line, the matrix D is calculated by means of broadcasting: basic operations such as addition on numpy arrays are performed elementwise, i.e. element by element, but they are also possible with arrays of different sizes if numpy can transform them into others of the same size, and this conversion is called broadcasting. Subtracting an array of order 1xN from another of order Nx1 results in the matrix D of order NxN, containing differences like x_j - x_1, x_j - x_2, etc. as its elements. Finally, in the last line, the reciprocal of alpha is calculated (which is clearly an array), and each of its elements is multiplied by the sum along the rows (due to axis=1 in the argument) of R evaluated at each cell of the matrix D, times alpha_j.
Questions:
Considering I'm new to Python, is my explanation OK?
Does the code have an error or not? Because I don't see the condition that "j must be different from i" (for i = 1, 2, ..., n) in each sum taken into consideration in the code... and if it's in fact wrong, how can I fix the code so it does exactly the same thing as stated in the image?
A few comments/improvements/fixes could be suggested here.
1] The first step could be alternatively done with just introducing a new axis and subtracting with itself, like so -
D = x_pos[:,None] - x_pos
In my opinion, this is a cleaner option; the performance benefit might be just marginal. (Note that this D is the transpose of the original one, since it computes x_pos[i] - x_pos[j]; for an even function R, R(D) is unchanged.)
2] In the second line, I think it needs a fix, as we need to avoid computations for the diagonal elements of R(D). So, if I got that correctly, the corrected code would be -
vals = R(D)
np.fill_diagonal(vals,0)
out = (1./alpha) * (alpha.reshape(1,-1) * vals).sum(axis=1)
Now, let's make the code a bit more idiomatic/cleaner.
At that line, we could write : (alpha * vals) instead of alpha.reshape(1,-1) * vals. This is because the shapes are already aligned for broadcasting as shown in a schematic diagram below -
alpha : n
vals : n x n
Thus, alpha would be automatically extended to 2D, with its elements broadcast along the first axis for the length of vals, and then the elementwise multiplication is performed. Again, this is meant as cleaner code.
There's a further performance improvement possible here: (alpha.reshape(1,-1) * vals).sum(axis=1) can be replaced with a matrix multiplication using np.dot, as vals.dot(alpha) (an elementwise multiply followed by a sum along axis=1 is exactly a matrix-vector product). The benefit on performance should be noticeable with this step.
So, the second step reduces to -
out = (1./alpha) * vals.dot(alpha)
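Putting the suggested pieces together (a sketch; R is assumed to be the elementwise function defined in the original exercise, and numpy is imported as np):
def F(x_pos, alpha):
    D = x_pos[:, None] - x_pos   # D[i, j] = x_pos[i] - x_pos[j]
    vals = R(D)
    np.fill_diagonal(vals, 0)    # exclude the j == i terms from each sum
    return (1. / alpha) * vals.dot(alpha)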
