Code optimization - number of function calls in Python

I'd like to know how I might be able to transform this problem to reduce the overhead of the np.sum() function calls in my code.
I have an input matrix, say of shape=(1000, 36). Each row represents a node in a graph. The operation I am doing iterates over each row and does an element-wise sum over a variable number of other rows. Those "other" rows are defined in a dictionary nodes_nbrs that records, for each row, a list of rows that must be summed together. An example is as such:
nodes_nbrs = {0: [0, 1],
              1: [1, 0, 2],
              2: [2, 1],
              ...}
Here, node 0 would be transformed into the sum of nodes 0 and 1. Node 1 would be transformed into the sum of nodes 1, 0, and 2. And so on for the rest of the nodes.
The current (and naive) way I have implemented this is as follows. I first instantiate a zero array of the final shape that I want, and then iterate over each key-value pair in the nodes_nbrs dictionary:
output = np.zeros(shape=input.shape)
for k, v in nodes_nbrs.items():
    output[k] = np.sum(input[v], axis=0)
This code is all cool and fine in small tests (shape=(1000, 36)), but on larger tests (shape of roughly (1E5 to 1E6, 36)), it takes ~2-3 seconds to complete. I end up having to do this operation thousands of times, so I'm trying to see if there's a more optimized way of doing this.
After doing line profiling, I noticed that the key killer here is calling the np.sum function over and over, which takes about 50% of the total time. Is there a way I can eliminate this overhead? Or is there another way I can optimize this?
Apart from that, here is a list of things I have done, and (very briefly) their results:
A cython version: eliminates the for loop type checking overhead, 30% reduction in time taken. With the cython version, np.sum takes about 80% of the total wall clock time, rather than 50%.
Pre-declare np.sum as a variable npsum, and then call npsum inside the for loop. No difference with original.
Replace np.sum with np.add.reduce, and assign that to the variable npsum, and then call npsum inside the for loop. ~10% reduction in wall clock time, but then incompatible with autograd (explanation below in sparse matrices bullet point).
numba JIT-ing: did not attempt more than adding decorator. No improvement, but didn't try harder.
Convert the nodes_nbrs dictionary into a dense numpy binary array (1s and 0s), and then do a single np.dot operation. Good in theory, bad in practice because it would require a square matrix of shape=(10^n, 10^n), which is quadratic in memory usage.
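A minimal sketch of what I mean by that dense np.dot idea (illustrative code, not what I actually ran):
import numpy as np

def dense_dot_solution(X, nodes_nbrs):
    # Build a full (N, N) 0/1 matrix: row k has 1s at the columns listed in nodes_nbrs[k].
    N = X.shape[0]
    adj = np.zeros((N, N), dtype=X.dtype)
    for k, v in nodes_nbrs.items():
        adj[k, v] = 1
    # One matrix product replaces all the per-row np.sum calls,
    # but the (N, N) matrix is exactly what makes this infeasible for N ~ 1E5-1E6.
    return adj.dot(X)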
Things I have not tried, but am hesitant to do so:
scipy sparse matrices: I am using autograd, which does not support automatic differentiation of the dot operation for scipy sparse matrices.
For those who are curious, this is essentially a convolution operation on graph-structured data. Kinda fun developing this for grad school, but also somewhat frustrating being at the cutting edge of knowledge.

If scipy.sparse is not an option, one way you might approach this would be to massage your data so that you can use vectorized functions to do everything in the compiled layer. If you change your neighbors dictionary into a two-dimensional array with appropriate flags for missing values, you can use np.take to extract the data you want and then do a single sum() call.
Here's an example of what I have in mind:
import numpy as np

def make_data(N=100):
    X = np.random.randint(1, 20, (N, 36))
    connections = np.random.randint(2, 5, N)
    nbrs = {i: list(np.random.choice(N, c))
            for i, c in enumerate(connections)}
    return X, nbrs

def original_solution(X, nbrs):
    output = np.zeros(shape=X.shape)
    for k, v in nbrs.items():
        output[k] = np.sum(X[v], axis=0)
    return output

def vectorized_solution(X, nbrs):
    # Make neighbors all the same length, filling with -1
    new_nbrs = np.full((X.shape[0], max(map(len, nbrs.values()))), -1, dtype=int)
    for i, v in nbrs.items():
        new_nbrs[i, :len(v)] = v
    # add a row of zeros to X
    new_X = np.vstack([X, 0 * X[0]])
    # compute the sums
    return new_X.take(new_nbrs, 0).sum(1)
Now we can confirm that the results match (the -1 fill values index the appended row of zeros, so the padding contributes nothing to the sums):
>>> X, nbrs = make_data(100)
>>> np.allclose(original_solution(X, nbrs),
                vectorized_solution(X, nbrs))
True
And we can time things to see the speedup:
X, nbrs = make_data(1000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
# 100 loops, best of 3: 13.7 ms per loop
# 100 loops, best of 3: 1.89 ms per loop
Going up to larger sizes:
X, nbrs = make_data(100000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
1 loop, best of 3: 1.42 s per loop
1 loop, best of 3: 249 ms per loop
It's about a factor of 5-10 faster, which may be good enough for your purposes (though this will heavily depend on the exact characteristics of your nbrs dictionary).
Edit: Just for fun, I tried a couple other approaches, one using numpy.add.reduceat, one using pandas.groupby, and one using scipy.sparse. It seems that the vectorized approach I originally proposed above is probably the best bet. Here they are for reference:
from itertools import chain

def reduceat_solution(X, nbrs):
    ind, j = np.transpose([[i, len(v)] for i, v in nbrs.items()])
    i = list(chain(*(nbrs[i] for i in ind)))
    j = np.concatenate([[0], np.cumsum(j)[:-1]])
    return np.add.reduceat(X[i], j)[ind]

np.allclose(original_solution(X, nbrs),
            reduceat_solution(X, nbrs))
# True
-
import pandas as pd

def groupby_solution(X, nbrs):
    i, j = np.transpose([[k, vi] for k, v in nbrs.items() for vi in v])
    return pd.DataFrame(X[j]).groupby(i).sum().values

np.allclose(original_solution(X, nbrs),
            groupby_solution(X, nbrs))
# True
-
from scipy.sparse import csr_matrix
from itertools import chain

def sparse_solution(X, nbrs):
    items = (([i]*len(col), col, [1]*len(col)) for i, col in nbrs.items())
    rows, cols, data = (np.array(list(chain(*a))) for a in zip(*items))
    M = csr_matrix((data, (rows, cols)))
    return M.dot(X)

np.allclose(original_solution(X, nbrs),
            sparse_solution(X, nbrs))
# True
And all the timings together:
X, nbrs = make_data(100000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
%timeit reduceat_solution(X, nbrs)
%timeit groupby_solution(X, nbrs)
%timeit sparse_solution(X, nbrs)
# 1 loop, best of 3: 1.46 s per loop
# 1 loop, best of 3: 268 ms per loop
# 1 loop, best of 3: 416 ms per loop
# 1 loop, best of 3: 657 ms per loop
# 1 loop, best of 3: 282 ms per loop

Based on work on recent sparse questions, e.g. Extremely slow sum row operation in Sparse LIL matrix in Python, here's how your sort of problem could be solved with sparse matrices. The method might apply just as well to dense ones. The idea is that a sparse sum can be implemented as a matrix product with a row (or column) of 1s. Indexing of sparse matrices is slow, but the matrix product is good C code.
In this case I'm going to build a multiplication matrix that has 1s for the rows that I want to sum - a different set of 1s for each entry in the dictionary.
A sample matrix:
In [302]: A=np.arange(8*3).reshape(8,3)
In [303]: M=sparse.csr_matrix(A)
selection dictionary:
In [304]: dict={0:[0,1],1:[1,0,2],2:[2,1],3:[3,4,7]}
build a sparse matrix from this dictionary. This might not be the most efficient way of constructing such a matrix, but it is enough to demonstrate the idea.
In [305]: r,c,d=[],[],[]
In [306]: for i,col in dict.items():
     ...:     c.extend(col)
     ...:     r.extend([i]*len(col))
     ...:     d.extend([1]*len(col))
In [307]: r,c,d
Out[307]:
([0, 0, 1, 1, 1, 2, 2, 3, 3, 3],
[0, 1, 1, 0, 2, 2, 1, 3, 4, 7],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
In [308]: idx=sparse.csr_matrix((d,(r,c)),shape=(len(dict),M.shape[0]))
Perform the sum and look at the result (as a dense array):
In [310]: (idx*M).A
Out[310]:
array([[ 3,  5,  7],
       [ 9, 12, 15],
       [ 9, 11, 13],
       [42, 45, 48]], dtype=int32)
Here's the original for comparison.
In [312]: M.A
Out[312]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23]], dtype=int32)

Related

When to use `numpy.append()`?

I have been reading in multiple places (e.g. here) that numpy.append() should never be used.
For example, if one wants to stack multiple arrays together, it is much better to collect them in an intermediate Python list and stack once than to grow the result with numpy.append(), as in the following:
import numpy as np
def stacker(arrs):
    result = arrs[0][None, ...]
    for arr in arrs[1:]:
        result = np.append(result, arr[None, ...], 0)
    return result
n = 1000
shape = (100, 100)
x = [np.random.randint(0, n, shape) for _ in range(n)]
%timeit np.array(x)
# 100 loops, best of 3: 17.6 ms per loop
%timeit np.concatenate([arr[None, ...] for arr in x])
# 100 loops, best of 3: 17.7 ms per loop
%timeit np.stack(x)
# 100 loops, best of 3: 18.3 ms per loop
%timeit stacker(x)
# 1 loop, best of 3: 12.5 s per loop
I understand that np.append() creates a copy of both its NumPy array inputs and this is much more inefficient than list.append() or list.extend() in this use-case. However, I find it hard to believe that NumPy developers just added a useless function.
So, what is the use-case for numpy.append()?
Look at its code:
arr = asanyarray(arr)
if axis is None:
    if arr.ndim != 1:
        arr = arr.ravel()
    values = ravel(values)
    axis = arr.ndim-1
return concatenate((arr, values), axis=axis)
It's just a simple interface to concatenate. With axis it's a direct call to concatenate. Without it, it ravels the inputs, which often causes a problem. And it converts scalars to arrays.
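A quick illustration of that ravel behaviour with 2-d inputs (a small example added for clarity):
import numpy as np

a = np.zeros((2, 2))
b = np.ones((2, 2))

np.append(a, b)           # no axis: both inputs are ravelled first
# array([0., 0., 0., 0., 1., 1., 1., 1.])

np.append(a, b, axis=0)   # with axis: a direct concatenate, shape (4, 2)
# array([[0., 0.],
#        [0., 0.],
#        [1., 1.],
#        [1., 1.]])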
If you have a 1d array, then it is an easy way to add one value:
In [8]: np.append(np.arange(3), 10)
Out[8]: array([ 0, 1, 2, 10])
but hstack is just as nice:
In [10]: np.hstack([np.arange(3), 10])
Out[10]: array([ 0, 1, 2, 10])
People write functions that seem to be a good idea at the time, usually with a specific use in mind. But the actual use (and misuses) may be different than anticipated.
np.stack is a more recent, and useful addition.
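A small illustration of the difference (my own example): concatenate joins along an existing axis, while stack adds a new one, which is usually what you want when combining equally-shaped arrays.
import numpy as np

arrs = [np.arange(3), np.arange(3)]

np.concatenate(arrs)   # joins along the existing axis -> shape (6,)
# array([0, 1, 2, 0, 1, 2])

np.stack(arrs)         # adds a new leading axis -> shape (2, 3)
# array([[0, 1, 2],
#        [0, 1, 2]])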
For a while there was a note in the docs urging us to use concatenate and stack and avoid all the other stacks, but that's been toned down. Now they just have:
This function makes most sense for arrays with up to 3 dimensions. For
instance, for pixel-data with a height (first axis), width (second axis),
and r/g/b channels (third axis). The functions concatenate, stack and
block provide more general stacking and concatenation operations.

How to vectorize computation of mean over specific set of indices given as matrix rows?

I have a problem vectorizing some code in pytorch.
A numpy solution would also help, but a pytorch solution would be better.
I'm going to use array and Tensor interchangeably.
The problem I am facing is this:
Given a 2D float array X of size (n, x), and a boolean 2D array A of size (n, n), compute the mean over rows in X indexed by rows in A.
The problem is that the rows in A contain a variable number of True indices.
Example (numpy):
import numpy as np
A = np.array([[0, 1, 0, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 1, 1, 1, 0, 0]])
X = np.arange(6 * 3, dtype=np.float32).reshape(6, 3)
# Compute the mean in numpy with a for loop
means_np = np.array([X[A.astype(bool)[i]].mean(axis=0) for i in np.arange(len(A))])
So this example works, but this formulation has 3 problems:
The for loop is slow for larger A and X. I need to loop over a few tens of thousands of indices.
It can happen that A[i] contains no True indices. This results in np.mean(np.array([])), which is NaN. I want this to be 0 instead.
Implementing it this way in pytorch results in SIGFPE (floating point error) during the backward pass of backpropagation through this function. The cause is the case where nothing is selected.
The workaround that I am using now is (also see code below):
Set the diagonal elements of A to True so that there is always at least one element to select
Take the sum of all selected elements, subtract the corresponding value of X from that sum (the diagonal is guaranteed to be False in the beginning), and divide by the number of True elements minus 1, clamped to at least 1 in each row.
This works, is differentiable in pytorch and does not produce NaN, but I still need a loop over all indices.
How can I get rid of this loop?
This is my current pytorch code:
import torch
A = torch.from_numpy(A).byte()
X = torch.from_numpy(X)
A[np.diag_indices(len(A))] = 1  # Set the diagonal to 1
means = [(X[A[i]].sum(dim=0) - X[i]) / torch.clamp(A[i].sum() - 1, min=1.)  # Compute the mean safely
         for i in range(len(A))]  # Get rid of the loop somehow
means = torch.stack(means)
I don't mind if your version looks completely different, as long as it is differentiable and produces the same result.
We can leverage matrix-multiplication -
c = A.sum(1,keepdims=True)
means_np = np.where(c==0,0,A.dot(X)/c)
We can optimize it further by converting A to float32 dtype if it's not already so and if the loss of precision is okay there, as shown below -
In [57]: np.random.seed(0)
In [58]: A = np.random.randint(0,2,(1000,1000))
In [59]: X = np.random.rand(1000,1000).astype(np.float32)
In [60]: %timeit A.dot(X)
10 loops, best of 3: 27 ms per loop
In [61]: %timeit A.astype(np.float32).dot(X)
100 loops, best of 3: 10.2 ms per loop
In [62]: np.allclose(A.dot(X), A.astype(np.float32).dot(X))
Out[62]: True
Thus, use A.astype(np.float32).dot(X) to replace A.dot(X).
Alternatively, to handle the case where the row-sum is zero (which is what forces us to use np.where), we could assign any non-zero value, say 1, into c and then simply divide by it, like so -
c = A.sum(1,keepdims=True)
c[c==0] = 1
means_np = A.dot(X)/c
This would also avoid the warning that we would otherwise get from np.where in those zero row sum cases.
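Since the question asks for a differentiable pytorch solution, here is a minimal torch sketch of the same matrix-multiplication idea (the function name is mine; it assumes A and X are already tensors):
import torch

def masked_row_mean(A, X):
    # A: (n, n) boolean/byte mask, X: (n, x) float tensor
    A = A.to(X.dtype)                    # cast the mask so we can use matmul
    counts = A.sum(dim=1, keepdim=True)  # number of selected rows per output row
    sums = A @ X                         # all row sums in one matrix product
    return sums / counts.clamp(min=1.0)  # empty rows give 0 instead of NaN

Rows of A with no True entries produce a zero sum divided by 1, so the result is 0 rather than NaN, and the whole expression stays differentiable with respect to X.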

Is there a faster implementation of the following code?

I have a one-dimensional numpy array, which is quite large in size. For each entry of the array, I need to produce a linearly spaced sub-array up to that entry value. Here is what I have as an example.
import numpy as np
a = np.array([2, 3])
b = np.array([np.linspace(0, i, 4) for i in a])
In this case there is linear space of size 4. The last statement in the above code involves a for loop which is rather slow if a is very large. Is there a trick to implement this in numpy itself?
You can phrase this as an outer product:
In [37]: a = np.arange(100000)
In [38]: %timeit np.array([np.linspace(0, i, 4) for i in a])
1 loop, best of 3: 1.3 s per loop
In [39]: %timeit np.outer(a, np.linspace(0, 1, 4))
1000 loops, best of 3: 1.44 ms per loop
The idea is to take a unit linspace and then scale it separately by each element of a.
As you can see, this gives ~1000x speed up for n=100000.
For completeness, I'll mention that this code has slightly different roundoff properties than your original version (likely not an issue in practical applications):
In [52]: np.max(np.abs(np.array([np.linspace(0, i, 4) for i in a]) -
...: np.outer(a, np.linspace(0, 1, 4))))
Out[52]: 1.4551915228366852e-11
P. S. An alternative way to express the idea is by using element-wise multiplication with broadcasting (based on a suggestion by @Scott Gigante):
In [55]: %timeit a[:, np.newaxis] * np.linspace(0, 1, 4)
1000 loops, best of 3: 1.48 ms per loop
P. P. S. See the comments below for further ideas on making this faster.

Efficient permutation of each row (or column) of a numpy array given a permutation matrix

Let's assume I have two given ndarrays, where the matrix mapping contains information about how the rows of the matrix mask should be permuted. We may assume that the mapping matrix comes from some other algorithm.
import numpy as np
T, K, F = 2, 3, 5
mask = np.random.randint(4, size=(T, K, F))
mapping = np.asarray([
    [0, 1, 2],
    [0, 1, 2],
    [2, 0, 1],
    [0, 1, 2],
    [1, 0, 2]
])
The straightforward way to do this is with a for loop:
out = np.empty_like(mask)
for f in range(F):
    out[:, :, f] = mask[:, mapping[f, :], f]
This seems to be quite efficient, so I looked at Numpy advanced indexing and found this solution:
out = mask[
    np.arange(T)[:, None, None],
    mapping.T[None, :, :],
    np.arange(F)[None, None, :]
]
An answer to a related question suggests the use of ogrid:
ogrid = np.ogrid[:T, :1, :F]
out = mask[
    ogrid[0],
    mapping.T[None, :, :],
    ogrid[2]
]
It seems quite cumbersome to create all the intermediate arrays and broadcast them correctly. So what is the best way to perform the desired reordering?
Timing information:
To provide meaningful timing information, I used some shapes, closer to my application. The random permutation is just for brevity of the example.
T, K, F = 1000, 3, 257
mask = np.random.randint(4, size=(T, K, F))
mapping = np.stack([list(np.random.permutation(np.arange(3))) for _ in range(F)])
Here are the results:
for loop: 100 loops, best of 3: 8.4 ms per loop
three times broadcasting: 100 loops, best of 3: 8.37 ms per loop
ogrid: 100 loops, best of 3: 8.33 ms per loop
swapaxis: 100 loops, best of 3: 2.43 ms per loop
transpose: 100 loops, best of 3: 2.08 ms per loop
Defining "best" is debatable, but here's one way with advanced-indexing -
mask[:,mapping, np.arange(F)[:,None]].swapaxes(1,2)
Another way would be to transpose mapping and then use the range array for the last axis without extending it to 2D. Each row along the last axis (axis=-1) of mapping decides the order of the elements along the second-to-last axis (axis=-2) of mask, so we need that transpose on mapping. In the first approach, we achieved this transposed behaviour through the later swapping of axes. I would vouch for this one on efficiency.
Thus, we would have the implementation, like so -
mask[:,mapping.T, np.arange(F)]
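As a quick sanity check (my addition, reusing T, F, mask and mapping from the question), both forms can be compared against the explicit loop:
out_loop = np.empty_like(mask)
for f in range(F):
    out_loop[:, :, f] = mask[:, mapping[f, :], f]

out1 = mask[:, mapping, np.arange(F)[:, None]].swapaxes(1, 2)
out2 = mask[:, mapping.T, np.arange(F)]
print(np.array_equal(out_loop, out1), np.array_equal(out_loop, out2))  # True True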

Numpy: transpose result of advanced indexing

>>> import numpy as np
>>> X = np.arange(27).reshape(3, 3, 3)
>>> x = [0, 1]
>>> X[x, x, :]
array([[ 0,  1,  2],
       [12, 13, 14]])
I need to sum it along the 0 dimension but in the real world the matrix is huge and I would prefer to be summing it along -1 dimension which is faster due to memory layout. Hence I would like the result to be transposed:
array([[ 0, 12],
       [ 1, 13],
       [ 2, 14]])
How do I do that? I would like the result of numpy's "advanced indexing" to be implicitly transposed. Transposing it explicitly with .T at the end is even slower and is not an option.
Update1: in the real world advanced indexing is unavoidable and the subscripts are not guaranteed to be the same.
>>> x = [0, 0, 1]
>>> y = [0, 1, 1]
>>> X[x, y, :]
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [12, 13, 14]])
Update2: To clarify that this is not an XY problem, here is the actual problem:
I have a large matrix X which contains elements x coming from some probability distribution. The probability distribution of the element depends on the neighbourhood of the element. This distribution is unknown so I follow the Gibbs sampling procedure to build a matrix which has elements from this distribution. In a nutshell it means that I make some initial guess for matrix X and then I keep iterating over the elements of matrix X updating each element x with a formula that depends on the neighbouring values of x. So, for any element of a matrix I need to get its neighbours (advanced indexing) and perform some operation on them (summation in my example). I have used line_profiler to see that the line which takes most of the time in my code is taking the sum of an array with respect to dimension 0 rather than -1. Hence I would like to know if there is a way to produce an already-transposed matrix as a result of advanced indexing.
I would like to sum it along the 0 dimension but in the real world the matrix is huge and I would prefer to be summing it along -1 dimension which is faster due to memory layout.
I'm not totally sure what you mean by this. If the underlying array is row-major (the default, i.e. X.flags.c_contiguous == True), then it may be slightly faster to sum it along the 0th dimension. Simply transposing an array using .T or np.transpose() does not, in itself, change how the array is laid out in memory.
For example:
# X is row-major
print(X.flags.c_contiguous)
# True
# Y is just a transposed view of X
Y = X.T
# the indices of the elements in Y are transposed, but their layout in memory
# is the same as in X, therefore Y is column-major rather than row-major
print(Y.flags.c_contiguous)
# False
You can convert from row-major to column-major, for example by using np.asfortranarray(X), but there is no way to perform this conversion without making a full copy of X in memory. Unless you're going to be performing lots of operations over the columns of X then it almost certainly won't be worthwhile doing the conversion.
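A quick demonstration of that conversion (a small example added here for clarity):
import numpy as np

X = np.arange(9).reshape(3, 3)   # row-major (C-order) by default
Y = np.asfortranarray(X)         # column-major copy of the same data

print(X.flags.c_contiguous, Y.flags.f_contiguous)  # True True
print(np.shares_memory(X, Y))                      # False: a full copy was made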
If you want to store the result of your summation in a column-major array, you could use the out= kwarg to X.sum(), e.g.:
result = np.empty((3, 3), order='F') # Fortran-order, i.e. column-major
X.sum(0, out=result)
In your case the difference between summing over rows vs columns is likely to be very minimal, though - since you are already going to be indexing non-adjacent elements in X you will already be losing the benefit of spatial locality of reference that would normally make summing over rows slightly faster.
For example:
X = np.random.randn(100, 100, 100)
# summing over whole rows is slightly faster than summing over whole columns
%timeit X.sum(0)
# 1000 loops, best of 3: 438 µs per loop
%timeit X.T.sum(0)
# 1000 loops, best of 3: 486 µs per loop
# however, the locality advantage disappears when you are addressing
# non-adjacent elements using fancy indexing
%timeit X[[0, 0, 1], [0, 1, 1], :].sum()
# 100000 loops, best of 3: 4.72 µs per loop
%timeit X.T[[0, 0, 1], [0, 1, 1], :].sum()
# 100000 loops, best of 3: 4.63 µs per loop
Update
@senderle has mentioned in the comments that using numpy v1.6.2 he sees the opposite order for the timing, i.e. X.sum(-1) is faster than X.sum(0) for a row-major array. This seems to be related to the version of numpy he is using - using v1.6.2 I can reproduce the order that he observes, but using two newer versions (v1.8.2 and 1.10.0.dev-8bcb756) I observe the opposite (i.e. X.sum(0) is faster than X.sum(-1) by a small margin). Either way, I don't think changing the memory order of the array will help much in the OP's case.
