>>> import numpy as np
>>> X = np.arange(27).reshape(3, 3, 3)
>>> x = [0, 1]
>>> X[x, x, :]
array([[ 0,  1,  2],
       [12, 13, 14]])
I need to sum it along the 0 dimension but in the real world the matrix is huge and I would prefer to be summing it along -1 dimension which is faster due to memory layout. Hence I would like the result to be transposed:
array([[ 0, 12],
       [ 1, 13],
       [ 2, 14]])
How do I do that? I would like the result of numpy's "advanced indexing" to be implicitly transposed. Transposing it explicitly with .T at the end is even slower and is not an option.
Update1: in the real world advanced indexing is unavoidable and the subscripts are not guaranteed to be the same.
>>> x = [0, 0, 1]
>>> y = [0, 1, 1]
>>> X[x, y, :]
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [12, 13, 14]])
Update2: To clarify that this is not an XY problem, here is the actual problem:
I have a large matrix X which contains elements x coming from some probability distribution. The probability distribution of each element depends on the neighbourhood of that element. This distribution is unknown, so I follow the Gibbs sampling procedure to build a matrix which has elements from this distribution. In a nutshell, I make some initial guess for matrix X and then keep iterating over the elements of X, updating each element x with a formula that depends on the neighbouring values of x. So, for any element of the matrix I need to get its neighbours (advanced indexing) and perform some operation on them (summation in my example). Using line_profiler I found that the line which takes most of the time in my code is the one that sums an array along dimension 0 rather than -1. Hence I would like to know if there is a way to produce an already-transposed matrix as a result of advanced indexing.
I would like to sum it along the 0 dimension but in the real world the matrix is huge and I would prefer to be summing it along -1 dimension which is faster due to memory layout.
I'm not totally sure what you mean by this. If the underlying array is row-major (the default, i.e. X.flags.c_contiguous == True), then it may be slightly faster to sum it along the 0th dimension. Simply transposing an array using .T or np.transpose() does not, in itself, change how the array is laid out in memory.
For example:
# X is row-major
print(X.flags.c_contiguous)
# True
# Y is just a transposed view of X
Y = X.T
# the indices of the elements in Y are transposed, but their layout in memory
# is the same as in X, therefore Y is column-major rather than row-major
print(Y.flags.c_contiguous)
# False
You can convert from row-major to column-major, for example by using np.asfortranarray(X), but there is no way to perform this conversion without making a full copy of X in memory. Unless you're going to be performing lots of operations over the columns of X then it almost certainly won't be worthwhile doing the conversion.
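For example, a minimal sketch of the conversion (reusing the small X from the question purely for illustration):
import numpy as np
X = np.arange(27).reshape(3, 3, 3)  # row-major (C-contiguous) by default
Y = np.asfortranarray(X)            # full copy with column-major (Fortran) layout
print(X.flags.c_contiguous, Y.flags.f_contiguous)
# True True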
If you want to store the result of your summation in a column-major array, you could use the out= kwarg to X.sum(), e.g.:
result = np.empty((3, 3), order='F') # Fortran-order, i.e. column-major
X.sum(0, out=result)
In your case the difference between summing over rows vs columns is likely to be very minimal, though - since you are already going to be indexing non-adjacent elements in X you will already be losing the benefit of spatial locality of reference that would normally make summing over rows slightly faster.
For example:
X = np.random.randn(100, 100, 100)
# summing over whole rows is slightly faster than summing over whole columns
%timeit X.sum(0)
# 1000 loops, best of 3: 438 µs per loop
%timeit X.T.sum(0)
# 1000 loops, best of 3: 486 µs per loop
# however, the locality advantage disappears when you are addressing
# non-adjacent elements using fancy indexing
%timeit X[[0, 0, 1], [0, 1, 1], :].sum()
# 100000 loops, best of 3: 4.72 µs per loop
%timeit X.T[[0, 0, 1], [0, 1, 1], :].sum()
# 100000 loops, best of 3: 4.63 µs per loop
Update
@senderle has mentioned in the comments that using numpy v1.6.2 he sees the opposite order for the timings, i.e. X.sum(-1) is faster than X.sum(0) for a row-major array. This seems to be related to the version of numpy he is using - with v1.6.2 I can reproduce the order that he observes, but with two newer versions (v1.8.2 and 1.10.0.dev-8bcb756) I observe the opposite (i.e. X.sum(0) is faster than X.sum(-1) by a small margin). Either way, I don't think changing the memory order of the array is likely to help much for the OP's case.
Let's suppose I have two arrays that represent pixels in pictures.
I want to build an array of tensordot products of pixels of a smaller picture with a bigger picture as it "scans" the latter. By "scanning" I mean iteration over rows and columns while creating overlays with the original picture.
For instance, a 2x2 picture can be overlaid on top of 3x3 in four different ways, so I want to produce a four-element array that contains tensordot products of matching pixels.
Tensordot is calculated by multiplying a[i,j] with b[i,j] element-wise and summing the terms.
Please examine this code:
import numpy as np

a = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8]])

b = np.array([[0, 1],
              [2, 3]])

shape_diff = (a.shape[0] - b.shape[0] + 1,
              a.shape[1] - b.shape[1] + 1)

def compute_pixel(x, y):
    sub_matrix = a[x : x + b.shape[0],
                   y : y + b.shape[1]]
    return np.tensordot(sub_matrix, b, axes=2)

def process():
    arr = np.zeros(shape_diff)
    for i in range(shape_diff[0]):
        for j in range(shape_diff[1]):
            arr[i, j] = compute_pixel(i, j)
    return arr

print(process())
Computing a single pixel is very easy: all I need is the starting location coordinates within a. From there I match the size of b and do a tensordot product.
However, because I need to do this all over again for each x and y location as I iterate over rows and columns, I've had to use a loop, which is of course suboptimal.
In the next piece of code I have tried to utilize a handy feature of tensordot, which also accepts tensors as arguments. In other words, I can feed it an array of arrays for different combinations of a, while keeping b the same.
Although in order to create an array of said combinations, I couldn't think of anything better than using another loop, which kind of sounds silly in this case.
def try_vector():
    tensor = np.zeros(shape_diff + b.shape)
    for i in range(shape_diff[0]):
        for j in range(shape_diff[1]):
            tensor[i, j] = a[i : i + b.shape[0],
                             j : j + b.shape[1]]
    return np.tensordot(tensor, b, axes=2)

print(try_vector())
Note: the shape of tensor is the concatenation of two tuples (shape_diff + b.shape), which in this case gives (2, 2, 2, 2).
Yet regardless, even if I produced such an array, it would be prohibitively large to be of any practical use. Doing this for a 1000x1000 picture could probably consume all the available memory.
So, are there any other ways to avoid loops in this problem?
In [111]: process()
Out[111]:
array([[19., 25.],
       [37., 43.]])
tensordot with axes=2 is the same as an element-wise multiply and sum:
In [116]: np.tensordot(a[0:2,0:2],b, axes=2)
Out[116]: array(19)
In [126]: (a[0:2,0:2]*b).sum()
Out[126]: 19
A lower-memory way of generating your tensor is:
In [121]: np.lib.stride_tricks.sliding_window_view(a,(2,2))
Out[121]:
array([[[[0, 1],
         [3, 4]],
        [[1, 2],
         [4, 5]]],
       [[[3, 4],
         [6, 7]],
        [[4, 5],
         [7, 8]]]])
We can do a broadcasted multiply, and sum on the last 2 axes:
In [129]: (Out[121]*b).sum((2,3))
Out[129]:
array([[19, 25],
       [37, 43]])
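If you prefer to keep the tensordot formulation, the same window view can be passed to tensordot directly (a short sketch, assuming the same a and b as above; axes=2 contracts the last two axes of the windows with b):
from numpy.lib.stride_tricks import sliding_window_view
windows = sliding_window_view(a, b.shape)  # same (2, 2, 2, 2) view as Out[121]
np.tensordot(windows, b, axes=2)           # array([[19, 25], [37, 43]])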
I have been reading in multiple places (e.g. here) that numpy.append() should never be used.
For example, if one wants to stack multiple arrays together, it is much better to do so via an intermediate Python list:
import numpy as np
def stacker(arrs):
    result = arrs[0][None, ...]
    for arr in arrs[1:]:
        result = np.append(result, arr[None, ...], 0)
    return result
n = 1000
shape = (100, 100)
x = [np.random.randint(0, n, shape) for _ in range(n)]
%timeit np.array(x)
# 100 loops, best of 3: 17.6 ms per loop
%timeit np.concatenate([arr[None, ...] for arr in x])
# 100 loops, best of 3: 17.7 ms per loop
%timeit np.stack(x)
# 100 loops, best of 3: 18.3 ms per loop
%timeit stacker(x)
# 1 loop, best of 3: 12.5 s per loop
I understand that np.append() creates a copy of both its NumPy array inputs and this is much more inefficient than list.append() or list.extend() in this use-case. However, I find it hard to believe that NumPy developers just added a useless function.
So, what is the use-case for numpy.append()?
Look at its code:
arr = asanyarray(arr)
if axis is None:
    if arr.ndim != 1:
        arr = arr.ravel()
    values = ravel(values)
    axis = arr.ndim - 1
return concatenate((arr, values), axis=axis)
It's just a simple interface to concatenate. With axis it's a direct call to concatenate. Without it, it ravels the inputs, which often causes a problem. And it converts scalars to arrays.
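As a quick sketch of that gotcha (the arrays here are made up just for illustration):
import numpy as np
a = np.zeros((2, 3))
b = np.ones((1, 3))
np.append(a, b)          # no axis: both inputs are raveled, result is 1-D with 9 elements
np.append(a, b, axis=0)  # with axis: a plain concatenate, result has shape (3, 3)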
If you have a 1d array, then it is an easy way to add one value:
In [8]: np.append(np.arange(3), 10)
Out[8]: array([ 0, 1, 2, 10])
but hstack is just as nice:
In [10]: np.hstack([np.arange(3), 10])
Out[10]: array([ 0, 1, 2, 10])
People write functions that seem to be a good idea at the time, usually with a specific use in mind. But the actual use (and misuses) may be different than anticipated.
np.stack is a more recent, and useful addition.
For a while there was a note in the docs urging us to use concatenate and stack and avoid all the other *stack functions, but that's been toned down. Now they just have:
This function makes most sense for arrays with up to 3 dimensions. For instance, for pixel-data with a height (first axis), width (second axis), and r/g/b channels (third axis). The functions concatenate, stack and block provide more general stacking and concatenation operations.
I have a problem vectorizing some code in pytorch.
A numpy solution would also help, but a pytorch solution would be better.
I'm going to use array and Tensor interchangeably.
The problem I am facing is this:
Given a 2D float array X of size (n, x), and a boolean 2D array A of size (n, n), compute the mean over rows in X indexed by rows in A.
The problem is that the rows in A contain a variable number of True indices.
Example (numpy):
import numpy as np

A = np.array([[0, 1, 0, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 1, 1, 1, 0, 0]])
X = np.arange(6 * 3, dtype=np.float32).reshape(6, 3)

# Compute the mean in numpy with a for loop
means_np = np.array([X[A.astype(bool)[i]].mean(axis=0) for i in np.arange(len(A))])
So this example works, but this formulation has 3 problems:
The for loop is slow for larger A and X. I need to loop over a few tens of thousands of indices.
It can happen that A[i] contains no True indices. This results in np.mean(np.array([])), which is NaN. I want this to be 0 instead.
Implementing it this way in pytorch results in SIGFPE (floating point error) during the backwards pass of backpropagation through this function. The cause is the case where nothing is selected.
The workaround that I am using now is (also see code below):
Set the diagonal elements of A to True so that there is always at least one element to select
Take the sum of all selected elements, subtract the values in X from that sum (the diagonal is guaranteed to be False in the beginning), and divide by the number of True elements minus 1, clamped to at least 1, in each row.
This works, is differentiable in pytorch and does not produce NaN, but I still need a loop over all indices.
How can I get rid of this loop?
This is my current pytorch code:
import torch

A = torch.from_numpy(A).byte()
X = torch.from_numpy(X)
A[np.diag_indices(len(A))] = 1  # Set the diagonal to 1

means = [(X[A[i]].sum(dim=0) - X[i]) / torch.clamp(A[i].sum() - 1, min=1.)  # Compute the mean safely
         for i in range(len(A))]                                            # Get rid of the loop somehow
means = torch.stack(means)
I don't mind if your version looks completely different, as long as it is differentiable and produces the same result.
We can leverage matrix-multiplication -
c = A.sum(1,keepdims=True)
means_np = np.where(c==0,0,A.dot(X)/c)
We can optimize it further by converting A to float32 dtype if it's not already so and if the loss of precision is okay there, as shown below -
In [57]: np.random.seed(0)
In [58]: A = np.random.randint(0,2,(1000,1000))
In [59]: X = np.random.rand(1000,1000).astype(np.float32)
In [60]: %timeit A.dot(X)
10 loops, best of 3: 27 ms per loop
In [61]: %timeit A.astype(np.float32).dot(X)
100 loops, best of 3: 10.2 ms per loop
In [62]: np.allclose(A.dot(X), A.astype(np.float32).dot(X))
Out[62]: True
Thus, use A.astype(np.float32).dot(X) to replace A.dot(X).
Alternatively, to handle the case where the row-sum is zero (which is what forces us to use np.where), we could assign any non-zero value, say 1, into c and then simply divide by it, like so -
c = A.sum(1,keepdims=True)
c[c==0] = 1
means_np = A.dot(X)/c
This would also avoid the warning that we would otherwise get from np.where in those zero row sum cases.
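Since a pytorch version was also requested, here is a minimal sketch of the same matrix-multiplication idea in torch (assuming A and X are the arrays from the question; the clamp plays the role of np.where for empty rows, and the whole thing stays differentiable):
import torch
A_t = torch.from_numpy(A).float()
X_t = torch.from_numpy(X)
c = A_t.sum(dim=1, keepdim=True)                    # number of selected rows per output row
means = A_t.matmul(X_t) / torch.clamp(c, min=1.0)   # rows with no True entries give 0 instead of NaN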
I'd like to know how I might be able to transform this problem to reduce the overhead of the np.sum() function calls in my code.
I have an input matrix, say of shape=(1000, 36). Each row represents a node in a graph. I have an operation that I am doing, which is iterating over each row and doing an element-wise addition to a variable number of other rows. Those "other" rows are defined in a dictionary nodes_nbrs that records, for each row, a list of rows that must be summed together. An example is as such:
nodes_nbrs = {0: [0, 1],
1: [1, 0, 2],
2: [2, 1],
...}
Here, node 0 would be transformed into the sum of nodes 0 and 1. Node 1 would be transformed into the sum of nodes 1, 0, and 2. And so on for the rest of the nodes.
The current (and naive) way I have implemented it is as follows. I first instantiate a zero array of the final shape that I want, and then iterate over each key-value pair in the nodes_nbrs dictionary:
output = np.zeros(shape=input.shape)
for k, v in nodes_nbrs.items():
    output[k] = np.sum(input[v], axis=0)
This code is all cool and fine in small tests (shape=(1000, 36)), but on larger tests (shape=(~1E(5-6), 36)), it takes ~2-3 seconds to complete. I end up having to do this operation thousands of times, so I'm trying to see if there's a more optimized way of doing this.
After doing line profiling, I noticed that the key killer here is calling the np.sum function over and over, which takes about 50% of the total time. Is there a way I can eliminate this overhead? Or is there another way I can optimize this?
Apart from that, here is a list of things I have done, and (very briefly) their results:
A cython version: eliminates the for loop type checking overhead, 30% reduction in time taken. With the cython version, np.sum takes about 80% of the total wall clock time, rather than 50%.
Pre-declare np.sum as a variable npsum, and then call npsum inside the for loop. No difference with original.
Replace np.sum with np.add.reduce, and assign that to the variable npsum, and then call npsum inside the for loop. ~10% reduction in wall clock time, but then incompatible with autograd (explanation below in sparse matrices bullet point).
numba JIT-ing: did not attempt more than adding decorator. No improvement, but didn't try harder.
Convert the nodes_nbrs dictionary into a dense numpy binary array (1s and 0s), and then do a single np.dot operation. Good in theory, bad in practice because it would require a square matrix of shape=(10^n, 10^n), which is quadratic in memory usage.
Things I have not tried, but am hesitant to do so:
scipy sparse matrices: I am using autograd, which does not support automatic differentiation of the dot operation for scipy sparse matrices.
For those who are curious, this is essentially a convolution operation on graph-structured data. Kinda fun developing this for grad school, but also somewhat frustrating being at the cutting edge of knowledge.
If scipy.sparse is not an option, one way you might approach this would be to massage your data so that you can use vectorized functions to do everything in the compiled layer. If you change your neighbors dictionary into a two-dimensional array with appropriate flags for missing values, you can use np.take to extract the data you want and then do a single sum() call.
Here's an example of what I have in mind:
import numpy as np

def make_data(N=100):
    X = np.random.randint(1, 20, (N, 36))
    connections = np.random.randint(2, 5, N)
    nbrs = {i: list(np.random.choice(N, c))
            for i, c in enumerate(connections)}
    return X, nbrs

def original_solution(X, nbrs):
    output = np.zeros(shape=X.shape)
    for k, v in nbrs.items():
        output[k] = np.sum(X[v], axis=0)
    return output

def vectorized_solution(X, nbrs):
    # Make neighbors all the same length, filling with -1
    new_nbrs = np.full((X.shape[0], max(map(len, nbrs.values()))), -1, dtype=int)
    for i, v in nbrs.items():
        new_nbrs[i, :len(v)] = v
    # add a row of zeros to X
    new_X = np.vstack([X, 0 * X[0]])
    # compute the sums
    return new_X.take(new_nbrs, 0).sum(1)
Now we can confirm that the results match:
>>> X, nbrs = make_data(100)
>>> np.allclose(original_solution(X, nbrs),
...             vectorized_solution(X, nbrs))
True
And we can time things to see the speedup:
X, nbrs = make_data(1000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
# 100 loops, best of 3: 13.7 ms per loop
# 100 loops, best of 3: 1.89 ms per loop
Going up to larger sizes:
X, nbrs = make_data(100000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
1 loop, best of 3: 1.42 s per loop
1 loop, best of 3: 249 ms per loop
It's about a factor of 5-10 faster, which may be good enough for your purposes (though this will heavily depend on the exact characteristics of your nbrs dictionary).
Edit: Just for fun, I tried a couple other approaches, one using numpy.add.reduceat, one using pandas.groupby, and one using scipy.sparse. It seems that the vectorized approach I originally proposed above is probably the best bet. Here they are for reference:
from itertools import chain

def reduceat_solution(X, nbrs):
    ind, j = np.transpose([[i, len(v)] for i, v in nbrs.items()])
    i = list(chain(*(nbrs[i] for i in ind)))
    j = np.concatenate([[0], np.cumsum(j)[:-1]])
    return np.add.reduceat(X[i], j)[ind]

np.allclose(original_solution(X, nbrs),
            reduceat_solution(X, nbrs))
# True
-
import pandas as pd

def groupby_solution(X, nbrs):
    i, j = np.transpose([[k, vi] for k, v in nbrs.items() for vi in v])
    return pd.DataFrame(X[j]).groupby(i).sum().values

np.allclose(original_solution(X, nbrs),
            groupby_solution(X, nbrs))
# True
-
from scipy.sparse import csr_matrix
from itertools import chain

def sparse_solution(X, nbrs):
    items = (([i]*len(col), col, [1]*len(col)) for i, col in nbrs.items())
    rows, cols, data = (np.array(list(chain(*a))) for a in zip(*items))
    M = csr_matrix((data, (rows, cols)))
    return M.dot(X)

np.allclose(original_solution(X, nbrs),
            sparse_solution(X, nbrs))
# True
And all the timings together:
X, nbrs = make_data(100000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
%timeit reduceat_solution(X, nbrs)
%timeit groupby_solution(X, nbrs)
%timeit sparse_solution(X, nbrs)
# 1 loop, best of 3: 1.46 s per loop
# 1 loop, best of 3: 268 ms per loop
# 1 loop, best of 3: 416 ms per loop
# 1 loop, best of 3: 657 ms per loop
# 1 loop, best of 3: 282 ms per loop
Based on work on recent sparse questions, e.g. Extremely slow sum row operation in Sparse LIL matrix in Python
here's how your sort of problem could be solved with sparse matrices. The method might apply just as well to dense ones. The idea is that a sparse sum can be implemented as a matrix product with a row (or column) of 1s. Indexing of sparse matrices is slow, but the matrix product is good C code.
In this case I'm going to build a multiplication matrix that has 1s for the rows that I want to sum - a different set of 1s for each entry in the dictionary.
A sample matrix:
In [302]: A=np.arange(8*3).reshape(8,3)
In [303]: M=sparse.csr_matrix(A)
selection dictionary:
In [304]: dict={0:[0,1],1:[1,0,2],2:[2,1],3:[3,4,7]}
build a sparse matrix from this dictionary. This might not be the most efficient way of constructing such a matrix, but it is enough to demonstrate the idea.
In [305]: r,c,d=[],[],[]
In [306]: for i,col in dict.items():
              c.extend(col)
              r.extend([i]*len(col))
              d.extend([1]*len(col))
In [307]: r,c,d
Out[307]:
([0, 0, 1, 1, 1, 2, 2, 3, 3, 3],
[0, 1, 1, 0, 2, 2, 1, 3, 4, 7],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
In [308]: idx=sparse.csr_matrix((d,(r,c)),shape=(len(dict),M.shape[0]))
Perform the sum and look at the result (as a dense array):
In [310]: (idx*M).A
Out[310]:
array([[ 3,  5,  7],
       [ 9, 12, 15],
       [ 9, 11, 13],
       [42, 45, 48]], dtype=int32)
Here's the original for comparison.
In [312]: M.A
Out[312]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23]], dtype=int32)
I have a large 2D numpy matrix that needs to be made smaller (ex: convert from 100x100 to 10x10).
My goal is essentially: break the nxn matrix into smaller mxm matrices, average the cells in these mxm slices, and then construct a new (smaller) matrix out of these mxm slices.
I'm thinking about using something like matrix[a::b, c::d] to extract the smaller matrices, and then averaging those values, but this seems overly complex. Is there a better way to accomplish this?
You could split your array into blocks with the view_as_blocks function (in scikit-image).
For a 2D array, this returns a 4D array with the blocks ordered row-wise:
>>> import skimage.util as ski
>>> import numpy as np
>>> a = np.arange(16).reshape(4,4) # 4x4 array
>>> ski.view_as_blocks(a, (2,2))
array([[[[ 0,  1],
         [ 4,  5]],
        [[ 2,  3],
         [ 6,  7]]],
       [[[ 8,  9],
         [12, 13]],
        [[10, 11],
         [14, 15]]]])
Taking the mean along the last two axes returns a 2D array with the mean in each block:
>>> ski.view_as_blocks(a, (2,2)).mean(axis=(2,3))
array([[ 2.5,  4.5],
       [10.5, 12.5]])
Note: view_as_blocks returns a view of the array by modifying the strides (it also works with arrays with more than two dimensions). It is implemented purely in NumPy using as_strided, so if you don't have access to the scikit-image library you can copy the code from here.
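If you do need to roll it yourself, here is a rough sketch of the 2D case built directly on as_strided (blocks_2d is just an illustrative name, and it assumes the array shape divides evenly by the block shape):
import numpy as np
from numpy.lib.stride_tricks import as_strided

def blocks_2d(a, block):
    h, w = block
    n, m = a.shape[0] // h, a.shape[1] // w
    # each (h, w) block becomes one entry of an (n, m, h, w) view; no copy is made
    return as_strided(a,
                      shape=(n, m, h, w),
                      strides=(a.strides[0] * h, a.strides[1] * w,
                               a.strides[0], a.strides[1]))

a = np.arange(16).reshape(4, 4)
print(blocks_2d(a, (2, 2)).mean(axis=(2, 3)))
# [[ 2.5  4.5]
#  [10.5 12.5]]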
Without scikit-image, you can simply reshape and take the appropriate mean.
M=np.arange(10000).reshape(100,100)
M1=M.reshape(10,10,10,10)
M2=M1.mean(axis=(1,3))
quick check to see if I got the right axes
In [127]: M2[0,0]
Out[127]: 454.5
In [128]: M[:10,:10].mean()
Out[128]: 454.5
In [131]: M[-10:,-10:].mean()
Out[131]: 9544.5
In [132]: M2[-1,-1]
Out[132]: 9544.5
Adding .transpose([0,2,1,3]) puts the 2 averaging dimensions at the end, as view_as_blocks does.
For this (100,100) case, the reshape approach is 2x faster than the as_strided approach, but both are quite fast.
However, the direct strided solution isn't much slower than reshaping:
as_strided(M,shape=(10,10,10,10),strides=(8000,80,800,8)).mean((2,3))
as_strided(M,shape=(10,10,10,10),strides=(8000,800,80,8)).mean((1,3))
I'm coming in late but I'd recommend scipy.ndimage.zoom() as an off-the-shelf solution for this. It does down-sizing (or upsizing) using spline interpolations of arbitrary order from 0 to 5. Sounds like order 0 would be sufficient for you based on your question.
from scipy import ndimage as ndi
import numpy as np
M=np.arange(1000000).reshape(1000,1000)
shrinkby=10
Mfilt = ndi.filters.uniform_filter(input=M, size=shrinkby)
Msmall = ndi.interpolation.zoom(input=Mfilt, zoom=1./shrinkby, order=0)
That's all you need. It's perhaps slightly less convenient to specify a zoom rather than a desired output size, but at least for order=0 this method is very fast.
The output size is 10% of the input in each dimension, i.e.
print(M.shape, Msmall.shape)
gives (1000, 1000) (100, 100) and the speed you can get from
%timeit Mfilt = ndi.filters.uniform_filter(input=M, size=shrinkby)
%timeit Msmall = ndi.interpolation.zoom(input=Mfilt, zoom=1./shrinkby, order=0)
which on my machine gave 10 loops, best of 3: 20.5 ms per loop for the uniform_filter call and 1000 loops, best of 3: 1.67 ms per loop for the zoom call.