Efficient two dimensional numpy array statistics - python

I have many 100x100 grids. Is there an efficient way to use numpy to calculate the median at every grid point and return a single 100x100 grid of median values? At present, I use a for loop to run through each grid point, calculate the median, and combine the results into one grid at the end. I'm sure there's a better way to do this using numpy. Any help would be appreciated! Thanks!

Create a 100x100xN array (or stack the grids together if that's not possible) and use np.median with the correct axis to do it in one go:
import numpy as np
a = np.random.rand(100,100)
b = np.random.rand(100,100)
c = np.random.rand(100,100)
d = np.dstack((a,b,c))
result = np.median(d,axis=2)

How many grids are there?
One option would be to create a 3D array that is 100x100xnumGrids and compute the median across the 3rd dimension.
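As a rough sketch of that idea, assuming the grids are already held in a Python list (the names grids, stacked and median_grid are made up for illustration), np.stack can build the 3D array and np.median can reduce it along the stacking axis:

```python
import numpy as np

# Hypothetical input: a list of N grids, each 100x100.
rng = np.random.default_rng(0)
grids = [rng.random((100, 100)) for _ in range(7)]

# Stack along a new leading axis: shape (N, 100, 100).
stacked = np.stack(grids, axis=0)

# Median over the stacking axis gives one 100x100 grid.
median_grid = np.median(stacked, axis=0)

# Spot check one grid point against the explicit per-point calculation.
i, j = 3, 42
assert np.isclose(median_grid[i, j], np.median([g[i, j] for g in grids]))
```

Stacking on axis 0 instead of axis 2 is only a layout choice; the result is the same as long as the median is taken over the axis that indexes the grids.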

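Use the axis parameter of np.median:
import numpy as np
data = np.random.rand(100, 5, 5)
# Median over the first axis: one value per position in the 5x5 grid
print(np.median(data, axis=0))
# Spot checks for individual grid points
print(np.median(data[:, 0, 0]))
print(np.median(data[:, 1, 0]))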

Related

Get nearest coordinate in a 2D numpy array

I've seen many posts on how to get the closest value in a numpy array, how to get the closest coordinate in a 2D array, and so on, but none of them seem to solve what I am looking for.
The problem is, I have a 2D numpy array as such:
[[77.62881735 12.91172607]
[77.6464534 12.9230648]
[77.65330961 12.92020244]
[77.63142413 12.90909731]]
And I have one numpy array like this:
[77.64000112 12.91602265]
Now I want to find the coordinate in the 2D numpy array that is closest to the coordinates in the 1D array.
That said, I am a beginner at this, so any input is appreciated.
I assume you mean euclidean distance. Try this:
import numpy as np
a = np.array([[77.62881735, 12.91172607],
              [77.6464534, 12.9230648],
              [77.65330961, 12.92020244],
              [77.63142413, 12.90909731]])
b = np.array([77.64000112, 12.91602265])
# Squared Euclidean distance from b to every row; argmin picks the nearest row
idx_min = np.sum((a - b)**2, axis=1, keepdims=True).argmin(axis=0)
idx_min, a[idx_min]
Output:
(array([1], dtype=int64), array([[77.6464534, 12.9230648]]))
You need to implement your own distance function.
My example implements Euclidean distance for simplicity:
import numpy as np
import math
def compute_distance(coord1, coord2):
    # Euclidean distance between two 2D points
    return math.sqrt(pow(coord1[0] - coord2[0], 2) + pow(coord1[1] - coord2[1], 2))
gallery = np.asarray([[77.62881735, 12.91172607],
[77.6464534, 12.9230648],
[77.65330961, 12.92020244],
[77.63142413, 12.90909731]])
query = np.asarray([77.64000112, 12.91602265])
distances = [compute_distance(i, query) for i in gallery]
min_coord = gallery[np.argmin(distances)]
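The same lookup can also be written without a Python loop. The following is a small sketch using np.linalg.norm over the rows; the variable names distances, nearest_index and nearest_point are chosen here for illustration:

```python
import numpy as np

a = np.array([[77.62881735, 12.91172607],
              [77.6464534, 12.9230648],
              [77.65330961, 12.92020244],
              [77.63142413, 12.90909731]])
b = np.array([77.64000112, 12.91602265])

# Broadcasting subtracts b from every row; the norm is taken per row.
distances = np.linalg.norm(a - b, axis=1)

# Index of the smallest distance, and the matching coordinate.
nearest_index = int(np.argmin(distances))
nearest_point = a[nearest_index]
```

Note that this treats longitude and latitude as plain Cartesian coordinates, as the answers above do; for points far apart on the globe a great-circle distance would be more appropriate.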

Generate bootstrap sample from ndarray

Is there a way to generate a bootstrap sample on an N-dimensional array? I am limited to using numpy==1.19.4
I have already tried using a for loop on the other dimensions to no avail, but the following works for 1-dimensional arrays.
import numpy as np
# Set random state and number of resamples
random_state = 42  # example seed value (assumed)
np.random.seed(random_state)
n_resamples = 9999
# Generate data
data_1d = np.arange(2, 3, 0.1)
data_nd = np.random.default_rng(42).random((2,3,2))
data = data_1d.copy()
# Resample the data with replacement, computing the test statistic for each set of resamples
bs_samples = [np.std(np.random.choice(data, size=len(data))) for _ in range(n_resamples)]
If I understand your problem correctly, this is the method I usually apply.
Suppose you have this multi-dimensional array:
data_nd = np.random.rand(100, 3, 2)
data_nd.shape #(100, 3, 2)
You can draw bootstrap samples in this way:
n_resamples = 99
data_nd[np.random.randint(len(data_nd), size=len(data_nd)*n_resamples)].reshape(n_resamples, *data_nd.shape).shape
What I'm doing is randomly drawing indices (randint) with replacement and then reshaping the result to obtain 99 bootstrapped datasets with the same dimensions as the original one.
Note that this procedure treats the arrays along the first axis as the "elements", so each element you are sampling has shape (3, 2).
I hope that is clear, but if you have any doubts please let me know.
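To connect this back to the statistic in the question, here is a hedged sketch that resamples along the first axis and computes the standard deviation for each resample. It uses np.random.default_rng, which is available in numpy 1.19.4; the names idx, samples and bs_std are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(42)
data_nd = rng.random((100, 3, 2))
n_resamples = 99
n = data_nd.shape[0]

# One row of indices per resample, drawn with replacement.
idx = rng.integers(0, n, size=(n_resamples, n))

# Fancy indexing on the first axis: shape (n_resamples, n, 3, 2).
samples = data_nd[idx]

# Statistic over the resampled axis: one (3, 2) result per resample.
bs_std = samples.std(axis=1)
```

Drawing the indices as a 2D array avoids the explicit reshape, but it holds all resamples in memory at once, so for a large n_resamples it may be preferable to process them in chunks.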

Apply bincount to each row of a 2D numpy array

Is there a way to apply bincount with "axis = 1"? The desired result would be the same as the list comprehension:
import numpy as np
A = np.array([[1,0],[0,0]])
np.array([np.bincount(r,minlength = np.max(A) + 1) for r in A])
# array([[1, 1],
#        [2, 0]])
np.bincount does not operate along an axis of a 2D array. To get the desired result with a single vectorized call to np.bincount, create an array of IDs in which each row is offset so that equal values in different rows get different IDs. This keeps elements from different rows from being counted in the same bin. Such an ID array can be built with linear indexing in mind, like so:
N = A.max() + 1
ids = A + (N * np.arange(A.shape[0]))[:, None]
Then feed the IDs to np.bincount and reshape the result back to 2D:
np.bincount(ids.ravel(), minlength=N * A.shape[0]).reshape(-1, N)
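As a quick check of the offset approach, the sketch below runs it on a slightly larger example array (chosen here for illustration, not taken from the question) and compares it with the row-by-row list comprehension:

```python
import numpy as np

# Illustrative input with non-negative integers.
A = np.array([[1, 0, 3],
              [0, 0, 2],
              [3, 3, 3]])

n_bins = A.max() + 1

# Shift row i by i * n_bins so that each row owns its own block of bins.
offsets = n_bins * np.arange(A.shape[0])[:, None]
counts = np.bincount((A + offsets).ravel(),
                     minlength=n_bins * A.shape[0]).reshape(-1, n_bins)

# Reference result from the per-row loop.
expected = np.array([np.bincount(r, minlength=n_bins) for r in A])
assert np.array_equal(counts, expected)
```

The approach relies on every value being a non-negative integer smaller than n_bins, which is also what np.bincount itself requires.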
If the data is too large for this to be efficient, the issue is more likely to be the memory usage of the dense matrix than the numerical operations themselves. Here is an example of using sklearn's HashingVectorizer on a matrix that is too large for the bincount method (the result is a sparse matrix):
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
h = HashingVectorizer()
A = np.random.randint(100,size=(1000,100))*10000
A_str = [" ".join([str(v) for v in i]) for i in A]
%timeit h.fit_transform(A_str)
#10 loops, best of 3: 110 ms per loop
You can use np.apply_along_axis. Here is an example:
import numpy as np
test_array = np.array([[0, 0, 1], [0, 0, 1]])
print(test_array)
np.apply_along_axis(np.bincount, axis=1, arr= test_array,
minlength = np.max(test_array) +1)
Note that the final shape of this array depends on the number of bins. You can also pass other arguments through apply_along_axis.

Python - sparse vectors/distance calculation

I'm looking for dynamically growing vectors in Python, since I don't know their length in advance. In addition, I would like to calculate distances between these sparse vectors, preferably using the distance functions in scipy.spatial.distance (although any other suggestions are welcome). Any ideas how to do this? (Initially, it doesn't need to be efficient.)
Thanks a lot in advance!
You can use regular python lists (which are dynamic) as vectors. Trivial example follows.
from scipy.spatial.distance import sqeuclidean
a = [1,2,3]
b = [0,0,0]
print(sqeuclidean(a, b)) # 14
As per aganders3's suggestion, do note that you can also use numpy arrays if needed:
import numpy
a = numpy.array([1,2,3])
If the sparse part of your question is crucial, I'd use scipy for that, since it has support for sparse matrices. You can define a 1xn matrix and use it as a vector. This works (the parameter is the size of the matrix, filled with zeroes by default):
sqeuclidean(scipy.sparse.coo_matrix((1,3)),scipy.sparse.coo_matrix((1,3))) # 0
There are many kinds of sparse matrices, some dictionary based (see comment). You can define a row sparse matrix from a list like this:
scipy.sparse.csr_matrix([1,2,3])
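Depending on the SciPy version, the functions in scipy.spatial.distance may expect dense 1-D inputs rather than sparse matrices. The sketch below shows two options under that assumption: converting the 1xn sparse rows to dense vectors first, or computing the squared distance directly with sparse operations. The variable names are illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import sqeuclidean

u = csr_matrix([1, 2, 3])   # 1x3 sparse row
v = csr_matrix((1, 3))      # 1x3 sparse row of zeros

# Option 1: densify, then use scipy.spatial.distance.
u_dense = u.toarray().ravel()
v_dense = v.toarray().ravel()
dist_dense = sqeuclidean(u_dense, v_dense)

# Option 2: stay sparse; element-wise square of the difference, then sum.
diff = u - v
dist_sparse = diff.multiply(diff).sum()
```

Option 1 is the simplest when the vectors are short; option 2 avoids building dense arrays, which matters only when the vectors are very long and mostly zero.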
Here is how you can do it in numpy:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([0, 0, 0])
c = np.sum(((a - b) ** 2)) # 14

Mean subtraction of patches in python numpy scipy

I have a 3-dimensional numpy array; it's a grid of patches of 8x8 images.
What is the best way to subtract from each patch its average? In other words, each patch has a unique mean and I want to subtract it. I tried the following with no success, obviously because the two arrays are not equal in shape:
patches -= patches.mean(axis=2).mean(axis=1)
I thought of using the repeat function, something like:
patches -= np.repeat(np.repeat(patches.mean(axis=2).mean(axis=1).reshape((n_patches, 8, 8)), 1, 1))
But I think that following this route would lead to an inefficient solution. Any thoughts or solutions?
import numpy as np
a = np.random.rand(10,8,8)
mean = a.mean(axis=2).mean(axis=1)
b = a - mean[:, np.newaxis, np.newaxis] # reshape the mean as (10, 1, 1)
I think you are looking for broadcasting:
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
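A compact variant of the broadcasting approach, sketched here with illustrative names (patches, patch_means, centered), is to take the mean over both patch axes at once and keep them as length-1 axes with keepdims=True so the subtraction broadcasts directly:

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.random((10, 8, 8))  # 10 patches of 8x8

# Mean over both patch axes; keepdims=True leaves shape (10, 1, 1).
patch_means = patches.mean(axis=(1, 2), keepdims=True)

# (10, 8, 8) - (10, 1, 1) broadcasts the per-patch mean over each patch.
centered = patches - patch_means

# Each patch should now have a mean of (numerically) zero.
assert np.allclose(centered.mean(axis=(1, 2)), 0.0)
```

This gives the same result as indexing the mean with np.newaxis twice; keepdims simply avoids reshaping by hand.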
