Related
I want to make the following computation, i use random arrays for demonstration:
a = np.random.randint(10, size=(100,3))
b = np.random.randint(10, size=(3,2))
result = np.zeros(100)
for i in range(100):
result[i] = a[i] # b # b.T # a[i].T
To speed up the calculation, i thought about removing the for loop by an einsteins sum.
So I tried the following, with the same vectors:
result = np.einsum('ij,jk,jk,ij->i', a, b, b, a)
I put the 'i' on the right hand side of the einsum, because the result vector shows a correct size. However, the result is slightly different.
Can my problem be solved with an einsum?
Franz
In one einsum, it would be -
np.einsum('ij,jl,kl,ik->i',a,b,b,a)
Bringing in matrix-multiplication with np.dot -
np.einsum('ij,jk,ik->i',a,b.dot(b.T),a)
Or with more of it -
np.einsum('ij,ij->i',a.dot(b.dot(b.T)),a)
With np.matmul/#-operator in Python 3.x, it translates to -
((a#(b#b.T))[:,None,:] # a[:,:,None])[:,0,0]
I am trying to expand the numpy "tensordot" such that things like:
K_ijklm = A_ki * B_jml can be written in a clear way like this: K = mytensordot(A,B,[2,0],[1,4,3])
To my understanding, numpy's tensordot (with optional argument 0) would be able to do something like this: K_kijml = A_ki * B_jml, i.e. keeping the order of the indexes. Therefore I would then have to do a number of np.swapaxes() to obtain the matrix `K_ijklm', which in a complicated case can be an easy source of errors (potentially very hard to debug).
The problem is that my implementation is slow (10x slower than tensordot [EDIT: It is actually MUCH slower than that]), even when using numba. I was wondering if anyone would have some insight on what could be done to improve the performance of my algorithm.
MWE
import numpy as np
import numba as nb
import itertools
import timeit
#nb.jit()
def myproduct(dimN):
N=np.prod(dimN)
L=len(dimN)
Product=np.zeros((N,L),dtype=np.int32)
rn=0
for n in range(1,N):
for l in range(L):
if l==0:
rn=1
v=Product[n-1,L-1-l]+rn
rn = 0
if v == dimN[L-1-l]:
v = 0
rn = 1
Product[n,L-1-l]=v
return Product
#nb.jit()
def mytensordot(A,B,iA,iB):
iA,iB = np.array(iA,dtype=np.int32),np.array(iB,dtype=np.int32)
dimA,dimB = A.shape,B.shape
NdimA,NdimB=len(dimA),len(dimB)
if len(iA) != NdimA: raise ValueError("iA must be same size as dim A")
if len(iB) != NdimB: raise ValueError("iB must be same size as dim B")
NdimN = NdimA + NdimB
dimN=np.zeros(NdimN,dtype=np.int32)
dimN[iA]=dimA
dimN[iB]=dimB
Out=np.zeros(dimN)
indexes = myproduct(dimN)
for nidxs in indexes:
idxA = tuple(nidxs[iA])
idxB = tuple(nidxs[iB])
v=A[(idxA)]*B[(idxB)]
Out[tuple(nidxs)]=v
return Out
A=np.random.random((4,5,3))
B=np.random.random((6,4))
def runmytdot():
return mytensordot(A,B,[0,2,3],[1,4])
def runtensdot():
return np.tensordot(A,B,0).swapaxes(1,3).swapaxes(2,3)
print(np.all(runmytdot()==runtensdot()))
print(timeit.timeit(runmytdot,number=100))
print(timeit.timeit(runtensdot,number=100))
Result:
True
1.4962144780438393
0.003484356915578246
You have run into a known issue. numpy.zeros requires a tuple when creating a multidimensional array. If you pass something other than a tuple, it sometimes works, but that's only because numpy is smart about converting the object into a tuple first.
The trouble is that numba does not currently support conversion of arbitrary iterables into tuples. So this line fails when you try to compile it in nopython=True mode. (A couple of others fail too, but this is the first.)
Out=np.zeros(dimN)
In theory you could call np.prod(dimN), create a flat array of zeros, and reshape it, but then you run into the very same problem: the reshape method of numpy arrays requires a tuple!
This is quite a vexing problem with numba -- I had not encountered it before. I really doubt the solution I have found is the correct one, but it is a working solution that allows us to compile a version in nopython=True mode.
The core idea is to avoid using tuples for indexing by directly implementing an indexer that follows the strides of the array:
#nb.jit(nopython=True)
def index_arr(a, ix_arr):
strides = np.array(a.strides) / a.itemsize
ix = int((ix_arr * strides).sum())
return a.ravel()[ix]
#nb.jit(nopython=True)
def index_set_arr(a, ix_arr, val):
strides = np.array(a.strides) / a.itemsize
ix = int((ix_arr * strides).sum())
a.ravel()[ix] = val
This allows us to get and set values without needing a tuple.
We can also avoid using reshape by passing the output buffer into the jitted function, and wrapping that function in a helper:
#nb.jit() # We can't use nopython mode here...
def mytensordot(A, B, iA, iB):
iA, iB = np.array(iA, dtype=np.int32), np.array(iB, dtype=np.int32)
dimA, dimB = A.shape, B.shape
NdimA, NdimB = len(dimA), len(dimB)
if len(iA) != NdimA:
raise ValueError("iA must be same size as dim A")
if len(iB) != NdimB:
raise ValueError("iB must be same size as dim B")
NdimN = NdimA + NdimB
dimN = np.zeros(NdimN, dtype=np.int32)
dimN[iA] = dimA
dimN[iB] = dimB
Out = np.zeros(dimN)
return mytensordot_jit(A, B, iA, iB, dimN, Out)
Since the helper contains no loops, it adds some overhead, but the overhead is pretty trivial. Here's the final jitted function:
#nb.jit(nopython=True)
def mytensordot_jit(A, B, iA, iB, dimN, Out):
for i in range(np.prod(dimN)):
nidxs = int_to_idx(i, dimN)
a = index_arr(A, nidxs[iA])
b = index_arr(B, nidxs[iB])
index_set_arr(Out, nidxs, a * b)
return Out
Unfortunately, this does not wind up generating as much of a speedup as we might like. On smaller arrays it's about 5x slower than tensordot; on larger arrays it's still 50x slower. (But at least it's not 1000x slower!) This is not too surprising in retrospect, since dot and tensordot are both using BLAS under the hood, as #hpaulj reminds us.
After finishing this code, I saw that einsum has solved your real problem -- nice!
But the underlying issue that your original question points to -- that indexing with arbitrary-length tuples is not possible in jitted code -- is still a frustration. So hopefully this will be useful to someone else!
tensordot with scalar axes values can be obscure. I explored it in
How does numpy.tensordot function works step-by-step?
There I deduced that np.tensordot(A, B, axes=0) is equivalent using axes=[[], []].
In [757]: A=np.random.random((4,5,3))
...: B=np.random.random((6,4))
In [758]: np.tensordot(A,B,0).shape
Out[758]: (4, 5, 3, 6, 4)
In [759]: np.tensordot(A,B,[[],[]]).shape
Out[759]: (4, 5, 3, 6, 4)
That in turn is equivalent to calling dot with a new size 1 sum-of-products dimenson:
In [762]: np.dot(A[...,None],B[...,None,:]).shape
Out[762]: (4, 5, 3, 6, 4)
(4,5,3,1) * (6,1,4) # the 1 is the last of A and 2nd to the last of B
dot is fast, using BLAS (or equivalent) code. Swapping axes and reshaping is also relatively fast.
einsum gives us a lot of control over axes
replicating the above products:
In [768]: np.einsum('jml,ki->jmlki',A,B).shape
Out[768]: (4, 5, 3, 6, 4)
and with swapping:
In [769]: np.einsum('jml,ki->ijklm',A,B).shape
Out[769]: (4, 4, 6, 3, 5)
A minor point - the double swap can be written as one transpose:
.swapaxes(1,3).swapaxes(2,3)
.transpose(0,3,1,2,4)
According to http://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html, if x and y are given and input arrays are 1-D, where is equivalent to [xv if c else yv for (c,xv, yv) in zip(x!=0, 1/x, x)]. When doing runtime benchmarks, however, they have significantly different speeds:
x = np.array(range(-500, 500))
%timeit np.where(x != 0, 1/x, x)
10000 loops, best of 3: 23.9 µs per loop
%timeit [xv if c else yv for (c,xv, yv) in zip(x!=0, 1/x, x)]
1000 loops, best of 3: 232 µs per loop
Is there a way I can rewrite the second form so that it has a similar runtime to the first? The reason I ask is because I'd like to use a slightly modified version of the second case to avoid division by zero errors:
[1 / xv if c else xv for (c,xv) in zip(x!=0, x)]
Another question: the first case returns a numpy array while the second case returns a list. Is the most efficient way to have the second case return an array is to first make a list and then convert the list to an array?
np.array([xv if c else yv for (c,xv, yv) in zip(x!=0, 1/x, x)])
Thanks!
You just asked about 'delaying' the 'where':
numpy.where : how to delay evaluating parameters?
and someone else just asked about divide by zero:
Replace all elements of a matrix by their inverses
When people say that where is similar to the list comprehension, they attempt to describe the action, not the actual implementation.
np.where called with just one argument is the same as np.nonzero. This quickly (in compiled code) loops through the argument, and collects the indices of all non-zero values.
np.where when called with 3 arguments, returns a new array, collecting values from the 2 and 3rd arguments based on the nonzero values. But it's important to realize that those arguments must be other arrays. They are not functions that it evaluates element by element.
So the where is more like:
m1 = 1/xv
m2 = xv
[v1 if c else v2 for (c, v1, v2) in zip(x!=0, m1, m2)]
It's easy to run this iteration in compiled code because it just involves 3 arrays of matching size (matching via broadcasting).
np.array([...]) is a reasonable way of converting a list (or list comprehension) into an array. It may be a little slower than some alternatives because np.array is a powerful general purpose function. np.fromiter([], dtype) may be faster in some cases, because it isn't as general (you have to specify dtype, and it it only works with 1d).
There are 2 time proven strategies for getting more speed in element-by-element calculations:
use packages like numba and cython to rewrite the problem as c code
rework your calculations to use existing numpy methods. The use of masking to avoid divide by zero is a good example of this.
=====================
np.ma.where, the version for masked arrays is written in Python. Its code might be instructive. Note in particular this piece:
# Construct an empty array and fill it
d = np.empty(fc.shape, dtype=ndtype).view(MaskedArray)
np.copyto(d._data, xv.astype(ndtype), where=fc)
np.copyto(d._data, yv.astype(ndtype), where=notfc)
It makes a target, and then selectively copies values from the 2 inputs arrays, based on the condition array.
You can avoid division by zero while maintaining performance by using advanced indexing:
x = np.arange(-500, 500)
result = np.empty(x.shape, dtype=float) # set the dtype to whatever is appropriate
nonzero = x != 0
result[nonzero] = 1/x[nonzero]
result[~nonzero] = 0
If you for some reason want to bypass an error with numpy it might be worth looking into the errstate context:
x = np.array(range(-500, 500))
with np.errstate(divide='ignore'): #ignore zero-division error
x = 1/x
x[x!=x] = 0 #convert inf and NaN's to 0
Consider changing the array in place by using np.put():
In [56]: x = np.linspace(-1, 1, 5)
In [57]: x
Out[57]: array([-1. , -0.5, 0. , 0.5, 1. ])
In [58]: indices = np.argwhere(x != 0)
In [59]: indices
Out[59]:
array([[0],
[1],
[3],
[4]], dtype=int64)
In [60]: np.put(x, indices, 1/x[indices])
In [61]: x
Out[61]: array([-1., -2., 0., 2., 1.])
The approach above does not create a new array, which could be very convenient if x is a large array.
I am new in Python/Numpy and is trying to improve my triple for loop into a more efficient calculation, but can't quiet figure out how to do it.
The calculations is carried out on a grid of the size (25,35) and the shapes of arrays is:
A = (8760,25,35)
B = (12,25,35)
The first dimensions in A corresponds to the number hours in a year (~8760), and the first dimension in B is the number of months(12). I want to use the values in B[0,:,:] for the first month, and B[1,:,:] for the second etc.
So far I created, in a very unrefined way, a index array filled with 1,1,1...,2,2,2...,12 to extract the values from B. My code with some random numbers
N,M = (25, 35)
A = np.random.rand(8760,N,M)
B = np.random.rand(12,N,M)
q = len(A)/12
index = np.hstack((np.full((1,q),1),np.full((1,q),2),np.full((1,q),3),np.full((1,q),4),np.full((1,q),5),np.full((1,q),6),np.full((1,q),7),np.full((1,q),8),np.full((1,q),9),np.full((1,q),10),np.full((1,q),11),np.full((1,q),12)))-1
results = np.zeros((len(A),N,M))
for t in xrange(len(A)):
for i in xrange(N):
for j in xrange(M):
results[t][i][j] = some_function(A[t][i][j], B[index[(0,t)]][i][j],H = 80.)
def some_function(A,B,H = 80.0):
results = A*np.log(H/B)/np.log(10./B)
return results
How can increase the speed of this calculation?
NumPy suports broadcasting that allows elementwise operations to be performed across different shaped arrays in a highly optimized manner. In your case, you have the number of rows and columns in A and B the same. But, at the first dimension, the number of elements are different across these two arrays. Looking at the implementation, it seems B 's first dimension elements are repeated per q number until we go over to the next element in it's first dimension. This coincides with the fact that the number of elements in first dimension of B is q times the number of elements in first dimension of A.
Now, going back to broadcasting, the solution would be to split the first dimension of A to have a 4D array, such that we have the number of elements in the first dimension of this reshaped 4D array matching up with the number of elements in B's first dimension. Next up, reshape B to a 4D array as well by creating singleton dimension (dimension with no elements) at the second dimension with B[:,None,:,:]. Then, NumPy would use broadcasting magic and perform broadcasted elementwise multiplications, as that's what we are doing in our some_function.
Here's the vectorized implementation using NumPy's broadcasting capability -
H = 80.0
M,N,R = B.shape
B4D = B[:,None,:,:]
out = ((A.reshape(M,-1,N,R)*np.log(H/B4D))/np.log(10./B4D)).reshape(-1,N,R)
Runtime tests and output verification -
In [50]: N,M = (25, 35)
...: A = np.random.rand(8760,N,M)
...: B = np.random.rand(12,N,M)
...: H = 80.0
...:
In [51]: def some_function(A,B,H = 80.0):
...: return A*np.log(H/B)/np.log(10./B)
...:
In [52]: def org_app(A,B,H):
...: q = len(A)/len(B)
...: index = np.repeat(np.arange(len(B))[:,None],q,axis=1).ravel()[None,:] # Simpler
...: results = np.zeros((len(A),N,M))
...: for t in xrange(len(A)):
...: for i in xrange(N):
...: for j in xrange(M):
...: results[t][i][j] = some_function(A[t][i][j], B[index[(0,t)]][i][j])
...: return results
...:
In [53]: def vectorized_app(A,B,H):
...: M,N,R = B.shape
...: B4D = B[:,None,:,:]
...: return ((A.reshape(M,-1,N,R)*np.log(H/B4D))/np.log(10./B4D)).reshape(-1,N,R)
...:
In [54]: np.allclose(org_app(A,B,H), vectorized_app(A,B,H))
Out[54]: True
In [55]: %timeit org_app(A,B,H)
1 loops, best of 3: 1min 32s per loop
In [56]: %timeit vectorized_app(A,B,H)
10 loops, best of 3: 217 ms per loop
I would like to implement a function that works like the numpy.sum function on arrays as on expects, e.g. np.sum([2,3],1) = [3,4] and np.sum([1,2],[3,4]) = [4,6].
Yet a trivial test implementation already behaves somehow awkward:
import numpy as np
def triv(a, b): return a, b
triv_vec = np.vectorize(fun, otypes = [np.int])
triv_vec([1,2],[3,4])
with result:
array([0, 0])
rather than the desired result:
array([[1,3], [2,4]])
Any ideas, what is going on here? Thx
You need otypes=[np.int,np.int]:
triv_vec = np.vectorize(triv, otypes=[np.int,np.int])
print triv_vec([1,2],[3,4])
(array([1, 2]), array([3, 4]))
otypes : str or list of dtypes, optional
The output data type. It must be specified as either a string of typecode characters or a list of data type specifiers. There should be one data type specifier for each output.
My original question was devoted to the fact that the vectorization is doing an internal type-cast and running an internally optimized loop and how much this would affect performance. So here is the answer:
It does, but not with only <23% the effect is not as considerable as I supposed.
import numpy as np
def make_tuple(a, b): return tuple([a, b])
make_tuple_vec = np.vectorize(make_tuple, otypes = [np.int, np.int])
v1 = np.random.random_integers(-5, high = 5, size = 100000)
v2 = np.random.random_integers(-5, high = 5, size = 100000)
%timeit [tuple([i,j]) for i,j in zip(v1,v2)] # ~ 596 µs per loop
%timeit make_tuple_vec(v1, v2) # ~ 544 µs per loop
Furthermore the tuple generating function doesn't vectorized as expected, like e.g. the map function map(make_tuple, v1, v2), which is the clear looser of the competition with a 100 times slower exectution time:
%timeit map(make_tuple, v1, v2) # ~ 64.4 ms per loop