subsetting numpy array to rows within a d-dimensional hypercube - python

I have a numpy array of shape n x d. Each row represents a point in R^d. I want to filter this array to only rows within a given distance on each axis of a single point--a d-dimensional hypercube, as it were.
In 1 dimension, this could be:
array[np.which(array < lmax and array > lmin)]
where lmax and lmin are the max and min relevant to the point+-distance. But I want to do this in d dimensions. d is not fixed, so hard-coding it out doesn't work. I checked to see if the above works where lmax and lmin are d-length vectors, but it just flattens the array.
I know I could plug the matrix and the point into a distance calculator like scipy.spatial.distance and get some sort of distance metric, but that's likely slower than some simple filtering (if it exists) would be.
The fact I have to do this calculation potentially millions of times means Ideally I'd like a fast solution.

You can try this.
def test(array):
large = array > lmin
small = array < lmax
return array[[i for i in range(array.shape[0])
if np.all(large[i]) and np.all(small[i])]]
For every i, array[i] is a vector. All the elements of a vector should be in range [lmin, lmax], and this process of calculation can be vectorized.

Related

Writing Code using NumPy without any loops

I am writing a program that utilizes NumPy to calculate accuracy between testing and training points, but I am not sure how to utilize the vectorized functions as opposed to the for loops I have used in my code.
Here is my code(Is there a way to simply the code so that I do not need any loops?)
ty#command to import NumPy package
import numpy as np
iris_train=np.genfromtxt("iris-train-data.csv",delimiter=',',usecols=(0,1,2,3),dtype=float)
iris_test=np.genfromtxt("iris-test-data.csv",delimiter=',',usecols=(0,1,2,3),dtype=float)
train_cat=np.genfromtxt("iris-training-data.csv",delimiter=',',usecols=(4),dtype=str)
test_cat=np.genfromtxt("iris-testing-data.csv",delimiter=',',usecols=(4),dtype=str)
correct = 0
for i in range(len(iris_test)):
n = 0
old_distance = float('inf')
while n < len(iris_train):
#finding the difference between test and train point
iris_diff = (abs(iris_test[i] - iris_train[n])**2)
#summing up the calculated differences
iris_sum = sum(iris_diff)
new_distance = float(np.sqrt(iris_sum))
#if statement to update distance
if new_distance < old_distance:
index = n
old_distance = new_distance
n += 1
print(i + 1, test_cat[i], train_cat[index])
if test_cat[i] == train_cat[index]:
correct += 1
accuracy = ((correct)/float((len(iris_test)))*100)
print(f"Accuracy:{accuracy: .2f}%")pe here
:
The trick with computing the distances is to insert extra dimensions using numpy.newaxis and use broadcasting to compute a matrix with the distance from every testing sample to every training sample in one vectorized operation. Using numpy's broadcasting rules, diff has shape (num_test_samples, num_train_samples, num_features), and distance has shape (num_test_samples, num_train_samples) since we summed along the last axis in the call to numpy.sum.
Then you can use numpy.argmin to find the index of the closest training sample for every testing sample. index has shape (num_test_samples, ) since we did the reduction operation along the last axis of distance.
Finally, you can use index to select the training classification closest
to the testing classification. We can construct a boolean array that represents the equality between the testing classification and the closest training classification using the == operator. The number of correct classifications is then the sum of the True elements of this boolean array. Since True is casted to 1 and False is casted to 0 we can simply sum this boolean array to get the number of correct classifications.
# Compute the distance from every training sample to every testing sample
# Note that `np.sqrt` is not necessary since sqrt is a monotonically
# increasing function -- removing it doesn't change the answer
diff = iris_test[:, np.newaxis] - iris_train[np.newaxis, :]
distance = np.sqrt(np.sum(np.square(diff), axis=-1))
# Compute the index of the closest training sample to the testing sample
index = np.argmin(distance, axis=-1)
# Check if class of the closest training sample matches the class
# of the testing sample
correct = (test_cat == train_cat[index]).sum()
If I get correctly what you are doing (but I don't really need to, to answer the question), for each vector of iris_test, you are searching for the closest one in isis_train. Closest being here in the sense of euclidean distance.
So you have 3 embedded loop (pseudo-python)
for u in iris_test:
for v in iris_train:
s=0
for i in range(dimensionOfVectors):
s+=(iris_test[i]-iris_train[i])**2
dist=sqrt(s)
You are right to try to get rid of python loops. And the most important one to get rid of is the inner one. And you already got rid of this one. Since the inner loop of my pseudo code is, in your code, implicitly in:
iris_diff = (abs(iris_test[i] - iris_train[n])**2)
and
iris_sum = sum(iris_diff)
Both those line iterates through all dimensions of your vectors. But do it not in python but in internal numpy code, so it is fast.
One may object that you don't really need abs after a **2, that you could have called the np.linalg.norm function that does all those operations in one call
new_distance = np.linalg.norm(iris_test[i]-iris_train[n])
which is faster than your code. But at least, in your code, that loop over all components of the vectors is already vectorized.
The next stage is to vectorize the middle loop.
That also can be accomplished. Instead of computing one by one
new_distance = np.linalg.norm(iris_test[i]-iris_train[n])
You could compute in one call all the len(iris_train) distances between iris_test[i] and all iris_train[n].
new_distances = np.linalg.norm(iris_test[i]-iris_train, axis=1)
The trick here lies in numpy broadcasting and axis parameter
broadcasting means that you can compute the difference between a 1D, length W vector, and a 2D n×W array (iris_test[0] is a 1D vector, and iris_train is 2D-array whose number of columns is the same as the length of iris_test[0]). Because in such case, numpy broadcasts the 1st operator, and returns a 2D n×W array as result, whose each line k is iris_test[0] - iris_train[k].
Calling np.linalg.norm on that n×W 2D matrix would return a single float (the norm of the whole matrix). Unless you restrict the norm to the 2nd axis (axis=1). In which case, it returns n floats, each of them being the norm of one row.
In other words, after the previous line of code, new_distances[k] is the distance between iris_test[i] and iris_train[k].
Once that done, you can easily find k such as this distance is the smallest, using np.argmin.
np.argmin(new_distances) is the index of the smallest of the distances.
So, all together, your code could be rewritten as:
correct = 0
for i in range(len(iris_test)):
new_distances = np.linalg.norm(iris_test[i]-iris_train, axis=1)
index=np.argmin(new_distances)
#printing out classifications
print(i + 1, test_cat[i], train_cat[index])
if test_cat[i] == train_cat[index]:
correct += 1

How to take the average value of respective elements in arrays

I have a chunk of code that runs 1000 times and produces 1000 covariance matrices. How do I calculate the average value for each element in the matrices and then print that average matrix?
params_avg1=[]
pcov1avg=[]
i=1000
for n in range(i):
y3=y2+np.random.normal(loc=0.0,scale=.1*y2)
popt1,pcov1=optimize.curve_fit(fluxmeasureMW,bands,y3)
params_avg1.append(popt1)
pcov1avg.append(pcov1) #returns an array of 1000 3x3 covariance matrices
As you already appended all your matrices into a single array, transform it to a 3D numpy array and then average on the correct axis:
np.array(pcov1avg).mean(axis=0) # or equivalently np.mean(pcov1avg, 0)
And just a bit about naming - i usually denotes the current index of the iteration rather than the end value, usually denoted with n

Numpy: Perform Multiplication-like Addition

I wanted to define my own addition operator that takes an Nx1 vector (call it A) and a 1xN vector (B) such that the element in the i^th row and j^th column is the sum of the i^th element in A and the j^th element in B. An example is illustrated here.
I was able to write the following code for the function (and it is correct as far as I know).
def test_fn(a, b):
a_len = a.shape[0]
b_len = b.shape[1]
prod = np.array([[0]*a_len]*b_len)
for i in range(a_len):
for j in range(b_len):
prod[i, j] = a[i, 0] + b[0, j]
return prod
However, the vectors I am working with contain thousands of elements, and the function above is quite slow. I was wondering if there was a better way to approach this problem, or if there was a numpy function that could be of use. Any help would be appreciated.
According to numpy's broadcasting rules, you can use a+b to implement your own defined operator.
The first rule of broadcasting is that if all input arrays do not have the same number of dimensions, a “1” will be repeatedly prepended to the shapes of the smaller arrays until all the arrays have the same number of dimensions.
The second rule of broadcasting ensures that arrays with a size of 1 along a particular dimension act as if they had the size of the array with the largest shape along that dimension. The value of the array element is assumed to be the same along that dimension for the “broadcast” array.

Efficient NumPy rows rotation over variable distances

Given a 2D M x N NumPy array and a list of rotation distances, I want to rotate all M rows over the distances in the list. This is what I currently have:
import numpy as np
M = 6
N = 8
dists = [2,0,2,1,4,2] # for example
matrix = np.random.randint(0,2,(M,N))
for i in range(M):
matrix[i] = np.roll(matrix[i], -dists[i])
The last two lines are actually part of an inner loop that gets executed hundreds of thousands of times and it is bottlenecking my performance as measured by cProfile. Is it possible to, for instance, avoid the for-loop and to do this more efficiently?
We can simulate the rolling behaviour with modulus operation after adding dists with a range(0...N) array to give us column indices for each row from where elements are to be picked and shuffled in the same row. We can vectorize this process across all rows with the help of broadcasting. Thus, we would have an implementation like so -
M,N = matrix.shape # Store matrix shape
# Get column indices for all elems for a rolled version with modulus operation
col_idx = np.mod(np.arange(N) + dists[:,None],N)
# Index into matrix with ranged row indices and col indices to get final o/p
out = matrix[np.arange(M)[:,None],col_idx]

Null matrix with constant diagonal, with same shape as another matrix

I'm wondering if there is a simple way to multiply a numpy matrix by a scalar. Essentially I want all values to be multiplied by the constant 40. This would be an nxn matrix with 40's on the diagonal, but I'm wondering if there is a simpler function to use to scale this matrix. Or how would I go about making a matrix with the same shape as my other matrix and fill in its diagonal?
Sorry if this seems a bit basic, but for some reason I couldn't find this in the doc.
If you want a matrix with 40 on the diagonal and zeros everywhere else, you can use NumPy's function fill_diagonal() on a matrix of zeros. You can thus directly do:
N = 100; value = 40
b = np.zeros((N, N))
np.fill_diagonal(b, value)
This involves only setting elements to a certain value, and is therefore likely to be faster than code involving multiplying all the elements of a matrix by a constant. This approach also has the advantage of showing explicitly that you fill the diagonal with a specific value.
If you want the diagonal matrix b to be of the same size as another matrix a, you can use the following shortcut (no need for an explicit size N):
b = np.zeros_like(a)
np.fill_diagonal(b, value)
Easy:
N = 100
a = np.eye(N) # Diagonal Identity 100x100 array
b = 40*a # Multiply by a scalar
If you actually want a numpy matrix vs an array, you can do a = np.asmatrix(np.eye(N)) instead. But in general * is element-wise multiplication in numpy.

Categories

Resources