I would like to implement a function that works on arrays the way one expects numpy.sum to, e.g. np.sum([2,3],1) = [3,4] and np.sum([1,2],[3,4]) = [4,6].
Yet a trivial test implementation already behaves somewhat awkwardly:
import numpy as np
def triv(a, b): return a, b
triv_vec = np.vectorize(triv, otypes=[int])
triv_vec([1,2],[3,4])
with result:
array([0, 0])
rather than the desired result:
array([[1,3], [2,4]])
Any ideas what is going on here? Thanks!
You need otypes=[int, int], one dtype per output:
triv_vec = np.vectorize(triv, otypes=[int, int])
print(triv_vec([1, 2], [3, 4]))
(array([1, 2]), array([3, 4]))
otypes : str or list of dtypes, optional
The output data type. It must be specified as either a string of typecode characters or a list of data type specifiers. There should be one data type specifier for each output.
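If the single-array layout asked for in the question is still wanted, the tuple of output arrays can be stacked afterwards. A small sketch, reusing triv_vec from above:
res = triv_vec([1, 2], [3, 4])   # (array([1, 2]), array([3, 4]))
arr = np.stack(res, axis=1)      # array([[1, 3], [2, 4]])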
My original question was also about whether the vectorization does an internal type cast and runs an internally optimized loop, and how much that would affect performance. So here is the answer:
It does, but with a difference of less than 23% the effect is not as considerable as I supposed.
import numpy as np
def make_tuple(a, b): return tuple([a, b])
make_tuple_vec = np.vectorize(make_tuple, otypes=[int, int])
v1 = np.random.randint(-5, 6, size=100000)  # high bound is exclusive, so this matches [-5, 5]
v2 = np.random.randint(-5, 6, size=100000)
%timeit [tuple([i,j]) for i,j in zip(v1,v2)] # ~ 596 µs per loop
%timeit make_tuple_vec(v1, v2) # ~ 544 µs per loop
Furthermore, the tuple-generating function compares favorably with e.g. the map function, map(make_tuple, v1, v2), which is the clear loser of the competition with a roughly 100 times slower execution time:
%timeit list(map(make_tuple, v1, v2)) # ~ 64.4 ms per loop
I have a 4 dimensional array called new_arr. Given a list of indices, I want to update new_arr based on an old array I have stored, old_arr. I am using a for loop to do this, but it's inefficient. My code looks something like this:
update_indices = [(2,33,1,8), (4,9,49,50), ...] #as an example
for index in update_indices:
i,j,k,l = index
new_arr[i][j][k][l] = old_arr[i][j][k][l]
It's taking a very long time because update_indices is large. Is there a way I can update all of the terms at once or do this more efficiently?
Out of curiosity I have benchmarked the various improvements posted in the comments and found that working on flat indices is fastest.
I used the following setup:
import numpy as np
n = 57
d = 4
k = int(1e6)
dt = np.double
new_arr = np.arange(n**d, dtype=dt).reshape(d * (n,))
new_arr2 = np.arange(n**d, dtype=dt).reshape(d * (n,))
old_arr = 2*np.arange(n**d, dtype=dt).reshape(d * (n,))
update_indices = list({tuple(np.random.randint(n, size=d)) for _ in range(int(k*1.1))})[:k]
where update_indices is a list of 1e6 unique index tuples.
Using the original technique from the question
%%timeit
for index in update_indices:
i,j,k,l = index
new_arr[i][j][k][l] = old_arr[i][j][k][l]
takes 1.47 s ± 19.3 ms.
Direct tuple-indexing as suggested by @defladamouse
%%timeit
for index in update_indices:
new_arr[index] = old_arr[index]
indeed gives us a speedup of 2: 778 ms ± 41.8 ms
If update_indices is not given but can be constructed as an ndarray, as suggested by @Jérôme Richard,
update_indices_array = np.array(update_indices, dtype=np.uint32)
(the conversion itself takes 1.34 s), the path to much faster implementations is open.
In order to index numpy arrays by a multidimensional list of locations, we cannot use update_indices_array directly as an index, but must pack its columns into a tuple:
%%timeit
idx = tuple(update_indices_array.T)
new_arr2[idx] = old_arr[idx]
This gives another speedup of roughly 9: 83.5 ms ± 1.45 ms.
If we don't leave the computation of memory offsets to ndarray.__getitem__, but compute the corresponding flat indices "by hand", we can become even faster:
%%timeit
idx_weights = np.cumprod((1,) + new_arr2.shape[:0:-1])[::-1]
update_flat = update_indices_array @ idx_weights
new_arr2.ravel()[update_flat] = old_arr.ravel()[update_flat]
resulting in 41.6 ms ± 1.04 ms, another factor of 2 and a cumulative speedup factor of 35 compared with the original version.
idx_weights is simply an off-by-one reverse-order cumulative product of the array dimensions.
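As a sanity check (a sketch assuming the (57, 57, 57, 57) shape from the setup above), these hand-computed weights match what np.ravel_multi_index produces:
shape = (57, 57, 57, 57)
idx_weights = np.cumprod((1,) + shape[:0:-1])[::-1]   # [57**3, 57**2, 57, 1]
multi = np.array([(2, 33, 1, 8), (4, 9, 49, 50)])     # two example index tuples
assert np.array_equal(multi @ idx_weights,
                      np.ravel_multi_index(multi.T, shape))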
I assume that this speedup of 2 comes from the fact that the memory offsets / flat indices are computed twice in new_arr2[idx] = old_arr[idx] and only once in update_flat = update_indices_array @ idx_weights.
Just do:
idx = np.array([(2,33,1,8), (4,9,49,50), ...])
new_arr[idx[:,0],idx[:,1],idx[:,2],idx[:,3]] = old_arr[idx[:,0],idx[:,1],idx[:,2],idx[:,3]]
no need for a loop.
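Equivalently, assuming the same idx array, the index columns can be packed into a tuple, which generalizes to any number of dimensions:
idx_t = tuple(idx.T)             # one index array per dimension
new_arr[idx_t] = old_arr[idx_t]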
Consider x, an n x 3 vector.
Is it possible, using built-in methods of numpy or tensorflow, or any Python library, to get a vector of order n x 1 such that each row is a vector of order 3 x 1? That is, if x is [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]^T, can a vector of the form [[1, 2, 3]^T, [4, 5, 6]^T, [7, 8, 9]^T, [10, 11, 12]^T]^T be obtained without for loops or introducing new axes like, say, np.newaxis?
The motive behind this is to get only the diagonal elements of the dot product of x and its transpose. We could, of course, do something like np.diag(x.dot(x.T)). But if n is significantly large, say 202933, one can hear the CPU's fan wheezing. How can one avoid computing the dot product of all the elements and obtain only the diagonal of that phantom dot product, without iteration?
Let's take a look at the formula for each element in the result of multiplying x by its own transpose. I don't feel like trying to coerce the Stack Overflow UI into allowing me to use tensor notation, so we'll look conceptually.
Each element at row i, column j of the result is the dot product of row i in x and column j in x.T. Now column j in x.T is just row j in x, and the diagonal is where i and j are the same. So what you want is a sum across the rows of the squared elements of x:
d = (x * x).sum(axis=1)
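A quick check on the small x from the question confirms this matches the diagonal of the full product:
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=float)
d = (x * x).sum(axis=1)                      # [ 14.,  77., 194., 365.]
assert np.allclose(d, np.diag(x.dot(x.T)))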
To address the first part of your question, the transpose operation in numpy rarely makes a copy of your data, so x.T or np.transpose(x) are constant-time operations for even the largest arrays. The reason is that numpy arrays are stored as a block of data along with some meta-data like dimensions, strides between elements in each dimension, and data size. Transposing an array only requires you to modify a small amount of meta-data in the array object, like sizes along each dimension and strides, not copy the whole data set.
The time consuming part is performing the multiplication. Simply having the objects x and x.T costs almost nothing: they both use the same data buffer.
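This is easy to verify; np.shares_memory is the standard check (a small sketch):
import numpy as np
x = np.random.rand(202933, 3)
xt = x.T                           # O(1): only shape/stride metadata changes
print(np.shares_memory(x, xt))     # True: same underlying buffer
print(x.strides, xt.strides)       # strides are simply swapped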
This function is likely one of the most efficient ways to handle this. (Taken from trimesh: https://github.com/mikedh/trimesh/blob/main/trimesh/util.py#L589)
def diagonal_dot(a, b):
"""
Dot product by row of a and b.
There are a lot of ways to do this though
performance varies very widely. This method
uses a dot product to sum the row and avoids
function calls if at all possible.
Parameters
------------
a : (m, d) float
First array
b : (m, d) float
Second array
Returns
-------------
result : (m,) float
Dot product of each row
"""
# make sure `a` is numpy array
# doing it for `a` will force the multiplication to
# convert `b` if necessary and avoid function call otherwise
a = np.asanyarray(a)
# 3x faster than (a * b).sum(axis=1)
# avoiding np.ones saves 5-10% sometimes
return np.dot(a * b, [1.0] * a.shape[1])
Comparing performance of some equivalent versions:
In [1]: import numpy as np; import trimesh
In [2]: a = np.random.random((10000, 3))
In [3]: b = np.random.random((10000, 3))
In [4]: %timeit (a * b).sum(axis=1)
1000 loops, best of 3: 181 us per loop
In [5]: %timeit np.einsum('ij,ij->i', a, b)
10000 loops, best of 3: 62.7 us per loop
In [6]: %timeit np.diag(np.dot(a, b.T))
1 loop, best of 3: 429 ms per loop
In [7]: %timeit np.dot(a * b, np.ones(a.shape[1]))
10000 loops, best of 3: 61.3 us per loop
In [8]: %timeit trimesh.util.diagonal_dot(a, b)
10000 loops, best of 3: 55.2 us per loop
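As a sanity check with the same a and b, the fast variants all agree with the naive one:
d1 = (a * b).sum(axis=1)
d2 = np.einsum('ij,ij->i', a, b)
d3 = np.dot(a * b, np.ones(a.shape[1]))
assert np.allclose(d1, d2) and np.allclose(d1, d3)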
I have two 3d arrays A and B with shape (N, 2, 2) that I would like to multiply along the N-axis, taking a matrix product of each pair of 2x2 matrices. With a loop implementation, it looks like
C[i] = dot(A[i], B[i])
Is there a way I could do this without using a loop? I've looked into tensordot, but haven't been able to get it to work. I think I might want something like tensordot(a, b, axes=([1,2], [2,1])) but that's giving me an NxN matrix.
It seems you are doing matrix-multiplications for each slice along the first axis. For the same, you can use np.einsum like so -
np.einsum('ijk,ikl->ijl',A,B)
We can also use np.matmul -
np.matmul(A,B)
On Python 3.5+, this matmul operation simplifies with the @ operator -
A @ B
Benchmarking
Approaches -
def einsum_based(A,B):
return np.einsum('ijk,ikl->ijl',A,B)
def matmul_based(A,B):
return np.matmul(A,B)
def forloop(A,B):
N = A.shape[0]
C = np.zeros((N,2,2))
for i in range(N):
C[i] = np.dot(A[i], B[i])
return C
Timings -
In [44]: N = 10000
...: A = np.random.rand(N,2,2)
...: B = np.random.rand(N,2,2)
In [45]: %timeit einsum_based(A,B)
...: %timeit matmul_based(A,B)
...: %timeit forloop(A,B)
100 loops, best of 3: 3.08 ms per loop
100 loops, best of 3: 3.04 ms per loop
100 loops, best of 3: 10.9 ms per loop
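As a quick sanity check with the same A and B, all three approaches agree:
C_ref = forloop(A, B)
assert np.allclose(einsum_based(A, B), C_ref)
assert np.allclose(matmul_based(A, B), C_ref)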
You just need to perform the operation on the first dimension of your tensors, which is labeled by 0:
c = np.tensordot(a, b, axes=(0,0))
This will work as you wish. Also, you don't need a list of axes, because it's just one dimension along which you're performing the operation. With axes=([1,2],[2,1]) you're cross-multiplying the 2nd and 3rd dimensions. Written in index notation (Einstein summation convention) this corresponds to c[i,j] = a[i,k,l]*b[j,k,l], so you're contracting the very indices you want to keep.
EDIT: OK, the problem is that the tensor product of two 3d objects is a 6d object. Since contractions involve pairs of indices, there's no way you'll get a 3d object out of a tensordot operation. The trick is to split your calculation in two: first do a tensordot over the index involved in the matrix multiplication, then take a tensor diagonal in order to reduce the 4d object to 3d. In one command:
d = np.diagonal(np.tensordot(a, b, axes=(2, 1)), axis1=0, axis2=2)
In tensor notation, d[j,k,i] = c[i,j,i,k] = a[i,j,l]*b[i,l,k]; note that np.diagonal puts the shared diagonal axis last.
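A short sketch verifying the tensordot-plus-diagonal trick against the direct batched product (shapes (N, 2, 2) as in the question; the final moveaxis undoes np.diagonal putting the shared axis last):
import numpy as np
N = 25
a = np.random.rand(N, 2, 2)
b = np.random.rand(N, 2, 2)
c = np.tensordot(a, b, axes=(2, 1))    # shape (N, 2, N, 2)
d = np.diagonal(c, axis1=0, axis2=2)   # shape (2, 2, N)
d = np.moveaxis(d, -1, 0)              # back to (N, 2, 2)
assert np.allclose(d, a @ b)
Note that this materializes the full (N, 2, N, 2) intermediate, so for large N the np.matmul / @ route above is the practical choice.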
I would like to speed up this code:
import numpy as np
import pandas as pd
a = pd.read_csv(path)
closep = a['Clsprc']
delta = np.array(closep.diff())
upgain = np.where(delta >= 0, delta, 0)
downloss = np.where(delta <= 0, -delta, 0)
up = sum(upgain[0:14]) / 14
down = sum(downloss[0:14]) / 14
u = []
d = []
for x in np.nditer(upgain[14:]):
u1 = 13 * up + x
u.append(u1)
up = u1
for y in np.nditer(downloss[14:]):
d1 = 13 * down + y
d.append(d1)
down = d1
The data looks like this:
0 49.00
1 48.76
2 48.52
3 48.28
...
36785758 13.88
36785759 14.65
36785760 13.19
Name: Clsprc, Length: 36785759, dtype: float64
The for loop is too slow, what can I do to speed up this code? Can I vectorize the entire operation?
It looks like you're trying to calculate an exponential moving average (rolling mean) but forgot the division. If that's the case then you may want to see this SO question. Meanwhile, here's a fast simple moving average using the cumsum() function, taken from the referenced link.
def moving_average(a, n=14) :
ret = np.cumsum(a, dtype=float)
ret[n:] = ret[n:] - ret[:-n]
return ret[n - 1:] / n
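A quick usage sketch (assuming a plain float array, like upgain from the question):
x = np.arange(1, 29, dtype=float)   # 1 .. 28
sma = moving_average(x)             # first value is mean(1..14) = 7.5, length 15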
If this is not the case and you really want the function described, you can increase the iteration speed by using the external_loop flag in your iteration. From the numpy documentation:
The nditer will try to provide chunks that are as large as possible to
the inner loop. By forcing ‘C’ and ‘F’ order, we get different
external loop sizes. This mode is enabled by specifying an iterator
flag.
Observe that with the default of keeping native memory order, the
iterator is able to provide a single one-dimensional chunk, whereas
when forcing Fortran order, it has to provide three chunks of two
elements each.
for x in np.nditer(upgain[14:], flags=['external_loop'], order='F'):
# x now has x[0],x[1], x[2], x[3], x[4], x[5] elements.
In simplified terms, I think this is what the loops are doing:
upgain=np.array([.1,.2,.3,.4])
u=[]
up=1
for x in upgain:
u1=10*up+x
u.append(u1)
up=u1
producing:
[10.1, 101.2, 1012.3, 10123.4]
np.cumprod([10,10,10,10]) is in there, plus a modified cumsum for the [.1,.2,.3,.4] terms. But I can't offhand think of a way of combining these with compiled numpy functions. We could write a custom ufunc and use its accumulate method. Or we could write it in Cython (or another C interface).
https://stackoverflow.com/a/27912352 suggests that frompyfunc is a way of writing a generalized accumulate. I don't expect big time savings, maybe 2x.
To use frompyfunc, define:
def foo(x, y): return 10*x + y
The loop application (above) would be
def loopfoo(upgain,u,u1):
for x in upgain:
u1=foo(u1,x)
u.append(u1)
return u
The 'vectorized' version would be:
vfoo=np.frompyfunc(foo,2,1) # 2 in arg, 1 out
vfoo.accumulate(upgain,dtype=object).astype(float)
The dtype=object requirement was noted in the prior SO, and https://github.com/numpy/numpy/issues/4155
In [1195]: loopfoo([1,.1,.2,.3,.4],[],0)
Out[1195]: [1, 10.1, 101.2, 1012.3, 10123.4]
In [1196]: vfoo.accumulate([1,.1,.2,.3,.4],dtype=object)
Out[1196]: array([1.0, 10.1, 101.2, 1012.3, 10123.4], dtype=object)
For this small list, loopfoo is faster (3 µs vs 21 µs).
For a 100 element array, e.g. biggain=np.linspace(.1,1,100), the vfoo.accumulate is faster:
In [1199]: timeit loopfoo(biggain,[],0)
1000 loops, best of 3: 281 µs per loop
In [1200]: timeit vfoo.accumulate(biggain,dtype=object)
10000 loops, best of 3: 57.4 µs per loop
For an even larger biggain=np.linspace(.001,.01,1000) (smaller number to avoid overflow), the 5x speed ratio remains.
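As an aside not covered above: the recurrence u[n] = 10*u[n-1] + x[n] is a first-order linear filter, so if depending on SciPy is acceptable it can be evaluated in compiled code with scipy.signal.lfilter. A sketch reproducing the toy example, where the zi initial state encodes the starting value up:
import numpy as np
from scipy.signal import lfilter
upgain = np.array([.1, .2, .3, .4])
up = 1.0
u, _ = lfilter([1.0], [1.0, -10.0], upgain, zi=np.array([10.0 * up]))
print(u)   # [   10.1   101.2  1012.3 10123.4]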
Input data
Produce n matrices of a given size (here, 3x2). I chose n=25, but I keep n symbolic to emphasize that what we have is a bunch of matrices.
import numpy as np
n = 25
data = np.random.rand(n, 3, 2)
This is just a format example: I can't change it. Or if I do, one must take into account the computational cost of that change.
Current implementation
What I want to achieve atomically is:
output = []
for datum in data: # this outputs one (3x2) matrix after the other
d0 = datum[0]
dr = datum[1:]
output.append(dr-d0)
or, in a faster fashion:
output = [dr-d0 for (dr, d0) in zip(datum[:,0], datum[:,1:])]
Problem
This is too slow and:
output = datum[:,1:] - datum[:,0]
does not work since the behavior of the subtraction operation is not well defined in that case. Plus, this kind of slicing is not very efficient.
Cython/Nuitka/PyPy and the like are possible solutions, but I'd like to stick with raw NumPy for now, if possible. Maybe there is some kind of function that can be applied quickly to the elements along the outer axis of a numpy array, without the overhead of Python-level looping...
The np.vectorize function doesn't work on:
def get_diff(mat):
return mat[1:] - mat[0]
So I invoke ye, High Priests of Numpy, servants of Python to enlighten my poor soul!
EDIT:
XY Problem
(I didn't know it had a name)
What I actually want to do is to determine the content (read "volume") of a lot of simplices (read "tetrahedra"). The easiest and most efficient way to do it, AFAIK is to calculate:
np.linalg.det(mat[1:] - mat[0])
Then let me rephrase my question: how can I efficiently compute the content of any ensemble of simplices of dimension k using plain python and numpy?
I suggest data[:,1:] - data[:,0,None]. The None creates a new axis (officially you're supposed to use np.newaxis, which makes it very clear what you're doing), and then the subtraction will behave the way you want it to.
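Equivalently, a length-1 slice keeps the axis and broadcasts the same way:
output = data[:, 1:] - data[:, :1]   # same result as data[:,1:] - data[:,0,None]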
Correcting what I think are errors in your list comprehension:
def loop(data):
output = []
    for datum in data: # this outputs one (3x2) matrix after the other
d0 = datum[0]
dr = datum[1:]
output.append(dr-d0)
return output
def listcomp(data):
output = [dr-d0 for (d0, dr) in zip(data[:,0], data[:,1:])]
return output
def sub(data):
output = data[:,1:] - data[:,0,None]
return output
we have
>>> import numpy as np
>>> n = 25
>>> data = np.random.rand(n, 3, 2)
>>> res_loop = loop(data)
>>> res_listcomp = listcomp(data)
>>> res_sub = sub(data)
>>> np.allclose(res_loop, res_listcomp)
True
>>> np.allclose(res_loop, res_sub)
True
>>>
>>> %timeit loop(data)
10000 loops, best of 3: 184 µs per loop
>>> %timeit listcomp(data)
10000 loops, best of 3: 158 µs per loop
>>> %timeit sub(data)
100000 loops, best of 3: 12.8 µs per loop
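To come back to the XY problem from the question: np.linalg.det is itself vectorized over stacked matrices, so the simplex contents can be computed without any Python loop. A sketch, assuming each of the n simplices is given as a (k+1, k) array of vertex coordinates (tetrahedra are (4, 3) blocks):
import numpy as np
from math import factorial

def simplex_contents(verts):
    # verts has shape (n, k+1, k); the content of a k-simplex is
    # |det(edge matrix)| / k!
    k = verts.shape[-1]
    edges = verts[:, 1:] - verts[:, :1]          # (n, k, k) edge matrices
    return np.abs(np.linalg.det(edges)) / factorial(k)

verts = np.random.rand(25, 4, 3)                 # 25 tetrahedra in 3D
vols = simplex_contents(verts)                   # shape (25,)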