Efficient NumPy rows rotation over variable distances - python

Given a 2D M x N NumPy array and a list of rotation distances, I want to rotate all M rows over the distances in the list. This is what I currently have:
import numpy as np
M = 6
N = 8
dists = [2,0,2,1,4,2] # for example
matrix = np.random.randint(0,2,(M,N))
for i in range(M):
    matrix[i] = np.roll(matrix[i], -dists[i])
The last two lines are actually part of an inner loop that is executed hundreds of thousands of times, and cProfile shows it is the bottleneck in my program. Is it possible to avoid the for-loop and do this more efficiently?

We can simulate the rolling behaviour with a modulus operation: adding dists to a range(0, N) array gives, for each row, the column indices from which elements are to be picked within that same row. We can vectorize this process across all rows with the help of broadcasting. Thus, we would have an implementation like so -
M, N = matrix.shape  # Store matrix shape
# Get column indices for all elements of a rolled version with a modulus operation
# (dists is converted to an array so it can be broadcast against np.arange(N))
col_idx = np.mod(np.arange(N) + np.asarray(dists)[:, None], N)
# Index into matrix with ranged row indices and col indices to get the final output
out = matrix[np.arange(M)[:, None], col_idx]
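As a quick sanity check, the broadcasting result can be compared against the original np.roll loop on the example data from the question; the two should match exactly:

import numpy as np

M, N = 6, 8
dists = np.array([2, 0, 2, 1, 4, 2])
matrix = np.random.randint(0, 2, (M, N))

# Reference: the original per-row np.roll loop
expected = matrix.copy()
for i in range(M):
    expected[i] = np.roll(expected[i], -dists[i])

# Vectorized version from above
col_idx = np.mod(np.arange(N) + dists[:, None], N)
out = matrix[np.arange(M)[:, None], col_idx]

assert np.array_equal(out, expected)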

Related

How can I create a numpy matrix from another?

I have a numpy matrix with 10 columns and 4 rows filled only with 0 and 1.
I want to modify only the first row using slices.
mat = np.array([[0,1,1,1,0,0,0,0,0,1],
                [0,0,0,0,0,0,0,0,0,0],
                [1,1,1,1,1,1,1,1,1,1],
                [0,0,0,0,0,0,0,0,0,0]])
I want to split the row at a point in the middle and swap the two halves, so the row becomes
[0,0,0,0,1,0,1,1,1,0]
and the rest of the matrix is equal to the previous one.
I tried to use these slices but I don't know how to put them together.
mat1 = mat[0,0:5]
mat2 = mat[0,5:10]
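One way to put those two slices back together: the target row is just the second half followed by the first half, so the two slices can simply be concatenated (or, equivalently, the row can be rolled by 5). A minimal sketch, assuming mat is the NumPy array defined above:

import numpy as np

mat = np.array([[0,1,1,1,0,0,0,0,0,1],
                [0,0,0,0,0,0,0,0,0,0],
                [1,1,1,1,1,1,1,1,1,1],
                [0,0,0,0,0,0,0,0,0,0]])

# Swap the two halves of the first row; the other rows are untouched
mat[0] = np.concatenate((mat[0, 5:10], mat[0, 0:5]))
# mat[0] is now [0,0,0,0,1,0,1,1,1,0]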

How to take the average value of respective elements in arrays

I have a chunk of code that runs 1000 times and produces 1000 covariance matrices. How do I calculate the average value for each element in the matrices and then print that average matrix?
params_avg1 = []
pcov1avg = []
i = 1000
for n in range(i):
    y3 = y2 + np.random.normal(loc=0.0, scale=.1*y2)
    popt1, pcov1 = optimize.curve_fit(fluxmeasureMW, bands, y3)
    params_avg1.append(popt1)
    pcov1avg.append(pcov1)  # builds a list of 1000 3x3 covariance matrices
Since you already appended all your matrices to a single list, transform it into a 3D NumPy array and then average over the correct axis:
np.array(pcov1avg).mean(axis=0) # or equivalently np.mean(pcov1avg, 0)
And just a bit about naming: i usually denotes the current index of the iteration rather than the end value, which is usually denoted by n.
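A quick illustration of the same idea with dummy data (no curve fitting involved), just to show the shapes:

import numpy as np

# Stand-ins for the 1000 3x3 covariance matrices collected in the loop
pcov1avg = [np.random.rand(3, 3) for _ in range(1000)]

stacked = np.array(pcov1avg)      # shape (1000, 3, 3)
pcov_mean = stacked.mean(axis=0)  # shape (3, 3): element-wise average matrix
print(pcov_mean)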

subsetting numpy array to rows within a d-dimensional hypercube

I have a numpy array of shape n x d. Each row represents a point in R^d. I want to filter this array to only rows within a given distance on each axis of a single point--a d-dimensional hypercube, as it were.
In 1 dimension, this could be:
array[(array > lmin) & (array < lmax)]
where lmax and lmin are the max and min relevant to the point+-distance. But I want to do this in d dimensions. d is not fixed, so hard-coding it out doesn't work. I checked to see if the above works where lmax and lmin are d-length vectors, but it just flattens the array.
I know I could plug the matrix and the point into a distance calculator like scipy.spatial.distance and get some sort of distance metric, but that's likely slower than some simple filtering (if it exists) would be.
The fact that I have to do this calculation potentially millions of times means that, ideally, I'd like a fast solution.
You can try this.
def test(array):
    # lmin and lmax are assumed to be defined as the per-axis lower/upper bounds
    large = array > lmin
    small = array < lmax
    return array[[i for i in range(array.shape[0])
                  if np.all(large[i]) and np.all(small[i])]]
For every i, array[i] is a vector. All the elements of that vector should lie in the range [lmin, lmax], and this whole calculation can be vectorized, as sketched below.
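A fully vectorized version of the same check avoids the Python-level list comprehension entirely; a minimal sketch, assuming point and dist (the centre and per-axis half-width of the hypercube) are what define lmin and lmax:

import numpy as np

def filter_hypercube(array, point, dist):
    # Keep rows whose coordinates lie within dist of point on every axis
    lmin = point - dist
    lmax = point + dist
    mask = np.all((array > lmin) & (array < lmax), axis=1)
    return array[mask]

# Example: 5 random points in R^3, hypercube of half-width 0.5 around the origin
pts = np.random.randn(5, 3)
print(filter_hypercube(pts, np.zeros(3), 0.5))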

How can I find the percentage of times that a point in a 2D numpy array is greater than the corresponding point in a reference array, for all points?

Python newbie here. Given a list (mylist) of 2D numpy arrays and a reference 2D numpy array (refarr), I'm looking for a way to create an array (percentage_array) in which each point (i,j) is the percentage of corresponding (i,j) points in the mylist arrays that are greater than [i,j] in refarr. I could do this by looping through all the points in the array and through the list, e.g.:
percentage_array = numpy.empty(refarr.shape)
for i in range(refarr.shape[0]):
    for j in range(refarr.shape[1]):
        t = 0
        f = 0
        for arr in mylist:
            if arr[i,j] > refarr[i,j]:
                t += 1  # counting the number of arrays for which [i,j] is true
            elif arr[i,j] <= refarr[i,j]:
                f += 1  # counting the number of arrays for which [i,j] is false
        percentage_array[i,j] = t/(t+f)  # fraction of arrays which have
                                         # arr[i,j] > refarr[i,j]
...but this is neither quick nor elegant (I'm dealing with large amounts of data). Are there better ways to do this?
You can create a 3D array with
a = np.array(mylist)
Then you can compare this array to your reference array using broadcasting:
a > refarr  # gives a 3D array of booleans because refarr is broadcast to a 3D array
and to count the fraction of values where the condition is met, you average over the first axis:
(a > refarr[None, :, :]).mean(axis=0)
The main drawback of this approach is that you have to create the array a, which may be big. Otherwise I'd consider treating the arrays one by one, as sketched below.
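A minimal sketch of that one-by-one alternative (mylist and refarr as defined in the question): it accumulates, for each position, how many arrays exceed refarr, so the full 3D array is never materialized:

count = np.zeros(refarr.shape)
for arr in mylist:
    count += (arr > refarr)  # boolean mask added as 0/1 per element
percentage_array = count / len(mylist)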

Optimize Scipy Sparse Matrix

I have a sparse matrix where I'm currently enumerating over each row and performing some calculations based on the information from each row. Each row is completely independent of the others. However, for large matrices, this code is extremely slow (takes about 2 hours) and I can't convert the matrix to a dense one either (limited to 8GB RAM).
import scipy.sparse
import numpy as np

def process_row(a, b):
    """
    a - contains the row indices for a sparse matrix
    b - contains the column indices for a sparse matrix
    Returns a new vector of len(a)
    """
    return

def assess(mat):
    """
    Iterate over the rows of mat and apply process_row to each one.
    """
    mat_csr = mat.tocsr()
    nrows, ncols = mat_csr.shape
    a = np.arange(ncols, dtype=np.int32)
    b = np.empty(ncols, dtype=np.int32)
    result = []
    for i, row in enumerate(mat_csr):
        # Process one row at a time
        b.fill(i)
        result.append(process_row(b, a))
    return result

if __name__ == '__main__':
    row = np.array([8,2,7,4])
    col = np.array([1,3,2,1])
    data = np.array([1,1,1,1])
    mat = scipy.sparse.coo_matrix((data, (row, col)))
    print(assess(mat))
I am looking to see if there's any way to design this better so that it performs much faster. Essentially, the process_row function takes (row, col) index pairs (from a, b) and does some math using another sparse matrix and returns a result. I don't have the option to change this function but it can actually process different row/col pairs and is not restricted to processing everything from the same row.
Your problem looks similar to this other recent SO question:
Calculate the euclidean distance in scipy csr matrix
In my answer I sketched a way of iterating over the rows of a sparse matrix. I think it is faster to convert the matrix to LIL format and construct the dense rows directly from its sublists (see the sketch below). This avoids the overhead of creating a new sparse matrix for each row. But I haven't done time tests.
https://stackoverflow.com/a/36559702/901925
Maybe this applies to your case.
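A minimal sketch of that LIL-based iteration, under the assumption that each row is wanted as a plain dense vector (dense_rows is a hypothetical helper, not part of SciPy):

import numpy as np
import scipy.sparse

def dense_rows(mat):
    # A LIL matrix stores, per row, a list of column indices (.rows)
    # and a list of values (.data); build each dense row from those sublists
    # without creating a new sparse matrix per row.
    mat_lil = mat.tolil()
    nrows, ncols = mat_lil.shape
    for cols, vals in zip(mat_lil.rows, mat_lil.data):
        row = np.zeros(ncols, dtype=mat_lil.dtype)
        row[cols] = vals
        yield row

# Usage with the example matrix from the question
mat = scipy.sparse.coo_matrix(([1,1,1,1], ([8,2,7,4], [1,3,2,1])))
for i, r in enumerate(dense_rows(mat)):
    pass  # process row i here as a plain NumPy vector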
