numpy - einsum vs naive implementation runtime performance - python

I have a two dimensional array Y of size (N,M), say for instance:
N, M = 200, 100
Y = np.random.normal(0,1,(N,M))
For each of the N rows, I want to compute the dot product of the (M,1) vector with its transpose, which returns an (M,M) matrix. One inefficient way to do it is:
Y = Y[:,:,np.newaxis]
[Y[i,:,:] @ Y[i,:,:].T for i in range(N)]
which is quite slow: timeit on the second line returns
11.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I thought a much better way to do it is to use numpy's einsum function (https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html):
np.einsum('ijk,imk->ijm', Y, Y, optimize=True)
(which means: for each row i, create a (j,m) matrix whose elements result from the dot product over the last dimension k)
The two methods return exactly the same result, but the runtime of this new version is disappointing (only a bit more than twice as fast):
3.82 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One would expect much more improvement from the vectorized einsum function since the first method is very inefficient... Do you have an explanation for this? Is there a better way to do this calculation?

In [60]: N, M = 200, 100
...: Y = np.random.normal(0,1,(N,M))
In [61]: Y1 = Y[:,:,None]
Your iteration, 200 steps to produce (100,100) arrays:
In [62]: timeit [Y1[i,:,:]@Y1[i,:,:].T for i in range(N)]
18.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
einsum is only modestly faster:
In [64]: timeit np.einsum('ijk,imk->ijm', Y1,Y1)
14.5 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but you could apply the @ in full 'batch' mode with:
In [65]: timeit Y[:,:,None]@Y[:,None,:]
7.63 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But as Divakar notes, the sum axis is size 1, so you could use plain broadcasted multiply. This is an outer product, not a matrix one.
In [66]: timeit Y[:,:,None]*Y[:,None,:]
8.2 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
'vectorizing' gives big gains when doing many iterations of a simple operation. For fewer iterations of a more complex operation, the gain isn't as great.
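For reference, a quick sanity check (a sketch; it assumes the N and the 2-D Y defined above) that the loop, einsum and broadcasted-multiply formulations agree:
# Reference result built row by row with np.outer, then compared to the other two versions.
ref = np.array([np.outer(Y[i], Y[i]) for i in range(N)])
print(np.allclose(ref, np.einsum('ijk,imk->ijm', Y[:, :, None], Y[:, :, None])))  # True
print(np.allclose(ref, Y[:, :, None] * Y[:, None, :]))                            # True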

This is an old post, yet it covers the subject in much detail: efficient outer product.
In particular, if you are willing to add a numba dependency, that may be your fastest option.
Updating part of the numba code from the original post and adding the multi outer product:
import numpy as np
from numba import jit
from numba.typed import List
@jit(nopython=True)
def outer_numba(a, b):
    m = a.shape[0]
    n = b.shape[0]
    result = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            result[i, j] = a[i]*b[j]
    return result

@jit(nopython=True)
def multi_outer_numba(Y):
    all_result = List()
    for k in range(Y.shape[0]):
        y = Y[k]
        n = y.shape[0]
        tmp_res = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                tmp_res[i, j] = y[i]*y[j]
        all_result.append(tmp_res)
    return all_result
r = [outer_numba(Y[i],Y[i]) for i in range(N)]
r = multi_outer_numba(Y)
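As a quick sanity check (a sketch; it assumes Y is the original 2-D (N, M) array and that the jitted functions have been compiled by the calls above), the numba results can be compared against the broadcasted outer product:
expected = Y[:, :, None] * Y[:, None, :]
print(all(np.allclose(expected[i], r[i]) for i in range(N)))  # True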

Related

Fastest way to iterate through multiple 2d numpy arrays with numba

When using numba and accessing elements in multiple 2d numpy arrays, is it better to use the index or to iterate the arrays directly? I'm finding that a combination of the two is the fastest, which seems counterintuitive to me. Or is there another, better way to do it?
For context, I am trying to speed up the implementation of the raytracing approach in this paper https://iopscience.iop.org/article/10.1088/1361-6560/ac1f38/pdf.
I have a function which takes the intensity before propagation and the displacement maps that result from the propagation. The resulting intensity is then the original intensity displaced by the displacement maps pixel by pixel with sub-pixel displacements being proportionately shared between the respective adjacent pixels. On a side note, can this be implemented directly in numpy or in another library, as I've noticed it is similar to opencv's remap function.
import numpy as np
from numba import njit
@njit
def raytrace_range(intensity_0, d_y, d_x):
    """
    Args:
        intensity_0 (2d numpy array): intensity before propagation
        d_y (2d numpy array): Displacement along y in pixels
        d_x (2d numpy array): Displacement along x in pixels
    Returns:
        intensity_z (2d numpy array): intensity after propagation
    """
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i in range(n_x):
        for j in range(n_y):
            i_ij = intensity_0[i, j]
            dx_ij = d_x[i, j]
            dy_ij = d_y[i, j]
            # Always the same from here down
            if not dx_ij and not dy_ij:
                intensity_z[i, j] += i_ij
                continue
            i_new = i
            j_new = j
            # Calculating displacement bigger than a pixel
            if np.abs(dx_ij) > 1:
                x = np.floor(dx_ij)
                i_new = int(i + x)
                dx_ij = dx_ij - x
            if np.abs(dy_ij) > 1:
                y = np.floor(dy_ij)
                j_new = int(j + y)
                dy_ij = dy_ij - y
            # Calculating sub-pixel displacement
            if 0 <= i_new and i_new < n_y and 0 <= j_new and j_new < n_x:
                intensity_z[i_new, j_new] += i_ij*(1-np.abs(dx_ij))*(1-np.abs(dy_ij))
                if i_new < n_y-1 and dx_ij >= 0:
                    if j_new < n_y-1 and dy_ij >= 0:
                        intensity_z[i_new+1, j_new] += i_ij*dx_ij*(1-dy_ij)
                        intensity_z[i_new+1, j_new+1] += i_ij*dx_ij*dy_ij
                        intensity_z[i_new, j_new+1] += i_ij*(1-dx_ij)*dy_ij
                    if j_new and dy_ij < 0:
                        intensity_z[i_new+1, j_new] += i_ij*dx_ij*(1-np.abs(dy_ij))
                        intensity_z[i_new+1, j_new-1] += i_ij*dx_ij*np.abs(dy_ij)
                        intensity_z[i_new, j_new-1] += i_ij*(1-dx_ij)*np.abs(dy_ij)
                if i_new and dx_ij < 0:
                    if j_new < n_x-1 and dy_ij >= 0:
                        intensity_z[i_new-1, j_new] += i_ij*np.abs(dx_ij)*(1-dy_ij)
                        intensity_z[i_new-1, j_new+1] += i_ij*np.abs(dx_ij)*dy_ij
                        intensity_z[i_new, j_new+1] += i_ij*(1-np.abs(dx_ij))*dy_ij
                    if j_new and dy_ij < 0:
                        intensity_z[i_new-1, j_new] += i_ij*np.abs(dx_ij)*(1-np.abs(dy_ij))
                        intensity_z[i_new-1, j_new-1] += i_ij*dx_ij*dy_ij
                        intensity_z[i_new, j_new-1] += i_ij*(1-np.abs(dx_ij))*np.abs(dy_ij)
    return intensity_z
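A minimal smoke test of raytrace_range (a sketch, not from the question): with zero displacement maps, the intensity should pass through unchanged.
img = np.arange(9, dtype=np.float64).reshape(3, 3)
zero = np.zeros_like(img)
print(np.allclose(raytrace_range(img, zero, zero), img))  # True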
I've tried a few other approaches, of which these are the fastest (each includes the code from above after the comment # Always the same from here down, which I've omitted to keep the question relatively short):
@njit
def raytrace_enumerate(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, i_i in enumerate(intensity_0):
        for j, i_ij in enumerate(i_i):
            dx_ij = d_x[i, j]
            dy_ij = d_y[i, j]

@njit
def raytrace_npndenumerate(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for (i, j), i_ij in np.ndenumerate(intensity_0):
        dx_ij = d_x[i, j]
        dy_ij = d_y[i, j]

@njit
def raytrace_zip(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, (i_i, dy_i, dx_i) in enumerate(zip(intensity_0, d_y, d_x)):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(zip(i_i, dy_i, dx_i)):

@njit
def raytrace_stack1(idydx):
    n_y, _, n_x = idydx.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, (i_i, dy_i, dx_i) in enumerate(idydx):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(zip(i_i, dy_i, dx_i)):

@njit
def raytrace_stack2(idydx):
    n_y, n_x, _ = idydx.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, k in enumerate(idydx):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(k):
Make up some test data and time:
import timeit
rng = np.random.default_rng()
size = (2010, 2000)
margin = 10
test_data = np.pad(10000*rng.random(size=size), margin)
dx = np.pad(10*(rng.random(size=size)-0.5), margin)
dy = np.pad(10*(rng.random(size=size)-0.5), margin)
# Check results are the same
L = [
    raytrace_range(test_data, dy, dx),
    raytrace_enumerate(test_data, dy, dx),
    raytrace_npndenumerate(test_data, dy, dx),
    raytrace_zip(test_data, dy, dx),
    raytrace_stack1(np.stack([test_data, dy, dx], axis=1)),
    raytrace_stack2(np.stack([test_data, dy, dx], axis=2)),
]
print((np.diff(np.vstack(L).reshape(len(L),-1),axis=0)==0).all())
%timeit raytrace_range(test_data, dy, dx)
%timeit raytrace_enumerate(test_data, dy, dx)
%timeit raytrace_npndenumerate(test_data, dy, dx)
%timeit raytrace_zip(test_data, dy, dx)
%timeit raytrace_stack1(np.stack([test_data, dy, dx], axis=1)) #Note this would be the fastest if the arrays were pre-stacked
%timeit raytrace_stack2(np.stack([test_data, dy, dx], axis=2))
Output:
True
40.4 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
37.5 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.8 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
38.6 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
42 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) #Note this would be the fastest if the arrays were pre-stacked
47.4 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit 3: Turns out that removing if statements makes range faster than enumerate. See edit 2 below.
Interestingly, on my machine times get awful with the stack1 and stack2 options, and indeed enumerate seems to be the fastest. Maybe thanks to enumerate numba understands it is a looping variable?:
In [1]: %timeit raytrace_range(test_data, dy, dx)
...: %timeit raytrace_enumerate(test_data, dy, dx)
...: %timeit raytrace_npndenumerate(test_data, dy, dx)
...: %timeit raytrace_zip(test_data, dy, dx)
...: %timeit raytrace_stack1(np.stack([test_data, dy, dx], axis=1)) #Note this would be the fastest if the arrays were pre-stacked
...: %timeit raytrace_stack2(np.stack([test_data, dy, dx], axis=2))
61 ms ± 785 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
53.9 ms ± 998 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.9 ms ± 471 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
57.5 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
109 ms ± 478 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
146 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit: Using fastmath=True did not shave off much time, only ~3 ms.
Edit 2: Although it is not related to the OP's question, after playing a bit with the functions it turns out that removing "superfluous"(*) conditional statements makes it notably faster, around 20% on my machine. The implementation still works without them (at least the supplied test returns True):
(*) The operations seem to work regardless, as they are "caught" by the later operations. At least, the provided test vector did not report any issues.
#! Using this it is faster:
# Always the same from here down
# if dx_ij == 0 and dy_ij == 0:
#     intensity_z[i, j] += i_ij
#     continue
# Calculating displacement bigger than a pixel
x = np.floor(dx_ij)
i_new = int(i + x)
dx_ij = dx_ij - x
y = np.floor(dy_ij)
j_new = int(j + y)
dy_ij = dy_ij - y
# Calculating sub-pixel displacement
In [2]: %timeit raytrace_range(test_data, dy, dx)
...: %timeit raytrace_range2(test_data, dy, dx)
...: %timeit raytrace_enumerate(test_data, dy, dx)
64.8 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.9 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
56.1 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In general, the fastest way to iterate over an array is a basic low-level integer iterator. Such a pattern causes the minimum number of transformations in Numba, so the compiler should be able to optimize the code pretty well. Functions like zip and enumerate often add additional overhead due to indirect code transformations that are not perfectly optimized out.
Here is a basic example:
import numba as nb

@nb.njit('(int_[::1],)')
def test(arr):
    s1 = s2 = 0
    for i in range(arr.shape[0]):
        s1 += i
        s2 += arr[i]
    return (s1, s2)

arr = np.arange(200_000)
test(arr)
However, things are more complex when you read/write to multiple arrays simultaneously (which is your case). Indeed, Numpy arrays can be indexed with negative indices, so Numba needs to perform bounds checking every time. This check is expensive compared to the actual access, and it can even break some other optimizations (eg. vectorization).
Consequently, Numba has been optimized to analyse the code, detect cases where bounds checking is not needed, and avoid adding expensive checks at runtime. This is the case in the above code but not in your raytrace_range function. enumerate and enumerate+zip can help a lot to remove bounds checking because Numba can easily prove that the index lies within the bounds of the array (theoretically, it could prove this for raytrace_range too, but the current implementation is unfortunately not smart enough).
You can mostly solve this problem using assertions. They are not only good for optimization but also make your code more robust!
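As a minimal illustration of the idea (a sketch, not the answer's benchmark; the function names are made up and the actual speed-up depends on the Numba version), the only difference between the two variants below is the shape assertion:
from numba import njit

@njit
def dot2d_checked(a, b):
    # Numba cannot prove a and b share the same bounds, so accesses to b may keep their bound checks.
    n, m = a.shape
    s = 0.0
    for i in range(n):
        for j in range(m):
            s += a[i, j] * b[i, j]
    return s

@njit
def dot2d_asserted(a, b):
    assert a.shape == b.shape  # tells Numba both arrays have identical bounds
    n, m = a.shape
    s = 0.0
    for i in range(n):
        for j in range(m):
            s += a[i, j] * b[i, j]
    return s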
Moreover, the indexing of multidimensional arrays is sometimes not perfectly optimized by the underlying JIT (LLVM-Lite). There is no reason for it not to be optimized, but compilers use heuristics that are far from perfect (though pretty good on average). You can help by computing views of lines. This generally results in a tiny improvement, though.
Here is the improved code:
@njit
def raytrace_range_opt(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    assert intensity_0.shape == d_y.shape
    assert intensity_0.shape == d_x.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i in range(n_x):
        row_intensity_0 = intensity_0[i, :]
        row_d_x = d_x[i, :]
        row_d_y = d_y[i, :]
        for j in range(n_y):
            assert j >= 0  # Crazy optimization (see later)
            i_ij = row_intensity_0[j]
            dx_ij = row_d_x[j]
            dy_ij = row_d_y[j]
            # Always the same from here down
            if not dx_ij and not dy_ij:
                intensity_z[i, j] += i_ij
                continue
            # Remaining code left unmodified
Notes
Note that I think the indexing of the function raytrace_enumerate is bogus: it should be for i in range(n_y): and for j in range(n_x): instead, since the accesses are done with intensity_0[i, j] and you wrote n_y, n_x = intensity_0.shape. Note that swapping the axes also gives correct results based on your validation function (which is suspicious).
The assert j >= 0 instruction alone results in an 8% speed-up, which is crazy since the integer iterator j is guaranteed to be positive if n_x is positive, which is always the case since it is a shape! This is clearly a missed optimization of Numba that LLVM-Lite cannot make (since LLVM-Lite does not know what a shape is, nor that shapes are always positive). This apparently missing assumption in the Numba code causes additional bound checking (of each of the three arrays) that is pretty expensive.
Benchmark
Here are results on my machine:
raytrace_range: 47.8 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_enumerate: 38.9 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_npndenumerate: 54.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_zip: 41 ms ± 657 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_stack1: 86.7 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_stack2: 84 ms ± 432 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_range_opt: 38.6 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As you can see raytrace_range_opt is the fastest implementation on my machine.

Python: get maximum occurrence in array

I implemented code to try to get the maximum occurrence count in a numpy array. I was satisfied with numba, but ran into limitations. I wonder whether it can be improved to handle the general case.
numba implementation
import numba as nb
import numpy as np
import collections
#nb.njit("int64(int64[:])")
def max_count_unique_num(x):
"""
Counts maximum number of unique integer in x.
Args:
x (numpy array): Integer array.
Returns:
Int
"""
# get maximum value
m = x[0]
for v in x:
if v > m:
m = v
if m == 0:
return x.size
# count each unique value
num = np.zeros(m + 1, dtype=x.dtype)
for k in x:
num[k] += 1
# maximum count
m = 0
for k in num:
if k > m:
m = k
return m
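A quick usage sketch:
x = np.array([3, 1, 3, 2, 3, 1], dtype=np.int64)
print(max_count_unique_num(x))  # 3, because the value 3 occurs three times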
For comparison, I also implemented numpy's unique and collections.Counter:
def np_unique(x):
    """ Counts maximum occurrence using numpy's unique. """
    ux, uc = np.unique(x, return_counts=True)
    return uc.max()

def counter(x):
    """ Counts maximum occurrence using collections.Counter. """
    counts = collections.Counter(x)
    return max(counts.values())
timeit
Edit: Add np.bincount for additional comparison, as suggested by @MechanicPig.
In [1]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [2]: %timeit max_count_unique_num(x)
30 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: %timeit np_unique(x)
1.14 ms ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [4]: %timeit counter(x)
2.68 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [6]: %timeit counter(x)
3.07 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit np_unique(x)
1.3 ms ± 7.35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit max_count_unique_num(x)
490 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [10]: %timeit np.bincount(x).max()
32.3 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [11]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [12]: %timeit np.bincount(x).max()
830 µs ± 6.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The limitations of the numba implementation are quite obvious: it is only efficient when all values in x are small positive integers, its efficiency drops significantly for very large integers, and it is not applicable to floats or negative values.
Any way I can generalize the implementation and keep the speed?
Update
After checking the source code of np.unique, an implementation for general cases can be:
#nb.njit(["int64(int64[:])", "int64(float64[:])"])
def max_count_unique_num_2(x):
x.sort()
n = 0
k = 0
x0 = x[0]
for v in x:
if x0 == v:
k += 1
else:
if k > n:
n = k
k = 1
x0 = v
# for last item in x if it equals to previous one
if k > n:
n = k
return n
timeit
In [154]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [155]: %timeit max_count_unique_num(x)
519 µs ± 5.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [156]: %timeit np_unique(x)
1.3 ms ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [157]: %timeit max_count_unique_num_2(x)
240 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [158]: x = np.random.randint(0, 200000, size=300000).astype(np.int64)
In [159]: %timeit max_count_unique_num(x)
1.01 ms ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [160]: %timeit np_unique(x)
18.1 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [161]: %timeit max_count_unique_num_2(x)
3.58 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So:
If x contains large integers and the size is not large, max_count_unique_num_2 beats max_count_unique_num.
Both max_count_unique_num and max_count_unique_num_2 are significantly faster than np.unique.
A small modification of max_count_unique_num_2 can return the item that has the maximum occurrence, or even all items sharing the same maximum occurrence.
max_count_unique_num_2 can be accelerated further if x is already sorted, by removing x.sort() (see the sketch below).
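For instance, a hypothetical presorted variant (a sketch; the name is made up, and it assumes the caller guarantees x is already sorted, otherwise the result is wrong) simply drops the sort:
@nb.njit(["int64(int64[:])", "int64(float64[:])"])
def max_count_unique_num_2_presorted(x):
    # Identical to max_count_unique_num_2 except that x.sort() is removed.
    n = 0
    k = 0
    x0 = x[0]
    for v in x:
        if x0 == v:
            k += 1
        else:
            if k > n:
                n = k
            k = 1
            x0 = v
    if k > n:
        n = k
    return n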
What about shortening your code:
#nb.njit("int64(int64[:])", fastmath=True)
def shortened(x):
num = np.zeros(x.max() + 1, dtype=x.dtype)
for k in x:
num[k] += 1
return num.max()
or parallelized:
#nb.njit("int64(int64[:])", parallel=True, fastmath=True)
def shortened_paralleled(x):
num = np.zeros(x.max() + 1, dtype=x.dtype)
for k in nb.prange(x.size):
num[x[k]] += 1
return num.max()
Parallelizing pays off for larger data sizes. Note that the parallel version can give different results in some runs (the unsynchronized increments race with each other) and would need to be fixed if possible.
For handling floats (or negative values) with Numba:
#nb.njit("int8(float64[:])", fastmath=True)
def shortened_float(x):
num = np.zeros(x.size, dtype=np.int8)
for k in x:
for j in range(x.shape[0]):
if k == x[j]:
num[j] += 1
return num.max()
IMO, np.unique(x, return_counts=True)[1].max() is the best choice since it handles both integers and floats in a very fast implementation. Numba can be faster for integers (it depends on the data size; larger sizes give weaker performance, AFAIK because of the loop rather than the array operations), but for floats the code would need to be optimized for performance if that is possible; I don't think Numba can beat NumPy's unique, particularly for large data.
Note: np.bincount can handle just integers.
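For example, a common workaround for negative integers (a sketch, not part of the answer) is to shift the values before calling np.bincount:
x = np.array([-3, -1, -1, 0, 2, 2, 2], dtype=np.int64)
print(np.bincount(x - x.min()).max())  # 3, because the value 2 occurs three times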
You can do that without using numpy too.
arr = [1,1,2,2,3,3,4,5,6,1,3,5,7,1]
counts = list(map(list(arr).count, set(arr)))
list(set(arr))[counts.index(max(counts))]
If you want to use numpy then try this,
arr = np.array([1,1,2,2,3,3,4,5,6,1,3,5,7,1])
uniques, counts = np.unique(arr, return_counts = True)
uniques[np.where(counts == counts.max())]
Both do the exact same job. To check which method is more efficient just do this,
time_i = time.time()
<arr declaration> # Creating a new array each iteration can cause the total time to increase which would be biased against the numpy method.
for i in range(10**5):
    <method you want>
time_f = time.time()
When I ran this I got 0.39 seconds for the first method and 2.69 for the second one. So it's pretty safe to say that the first method is more efficient.
What I want to say is that your implementation is almost the same as numpy.bincount. If you want to make it universal, you can consider encoding the original data:
def encode(ar):
    # Equivalent to numpy.unique(ar, return_inverse=True)[1] when ar.ndim == 1
    flatten = ar.ravel()
    perm = flatten.argsort()
    sort = flatten[perm]
    mask = np.concatenate(([False], sort[1:] != sort[:-1]))
    encoded = np.empty(sort.shape, np.int64)
    encoded[perm] = mask.cumsum()
    encoded.shape = ar.shape
    return encoded

def count_max(ar):
    return max_count_unique_num(encode(ar))
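A quick usage sketch (it assumes the jitted max_count_unique_num from the question is available), now also working for floats and negative values:
x = np.array([-1.5, 2.25, 2.25, -1.5, 2.25, 7.0])
print(encode(x))     # [0 1 1 0 1 2]
print(count_max(x))  # 3, because 2.25 occurs three times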

Fastest way of list comprehension while calculating the distance with a fixed list

I have a list,
a = [1,2,3]
Now I have another list of lists(which is same size as above),
x=[[1,2,3], [4,5,6], [7,8,9]]
Now I want to calculate the distance between each item in x and a using cosine distance, so I am using this:
from scipy import spatial
distances = [spatial.distance.cosine(a, i) for i in x]
The above method is taking a very long time to execute; I am looking for a more efficient alternative.
With numpy, you can use broadcasting to do the same computation and take advantage of vectorized operations for more efficiency.
import numpy as np

def cosine_distance(a, x):
    a = np.array(a)
    x = np.array(x)
    return 1 - x.dot(a) / (np.linalg.norm(a) * np.linalg.norm(x, axis=1))
Using ipython's %timeit for the example data to compare the execution time:
%timeit [spatial.distance.cosine(a, i) for i in x]
140 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit cosine_distance(a, x)
27.8 µs ± 315 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
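A quick correctness check (a sketch using the a and x from the question):
print(np.allclose(cosine_distance(a, x),
                  [spatial.distance.cosine(a, i) for i in x]))  # True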

Creating an array of numbers that add up to 1 with given length

I'm trying to use different weights for my model and I need those weights to add up to 1, like this:
def func(length):
    return ['a list of numbers add up to 1 with given length']
func(4) returns [0.1, 0.2, 0.3, 0.4]
The numbers should be linearly spaced and they should not start from 0. Is there any way to achieve this with numpy or scipy?
This can be done quite simply using numpy arrays:
def func(length):
    linArr = np.arange(1, length+1)
    return linArr/linArr.sum()
First we create an array of the given length, ranging from 1 to length. Then we normalize by its sum.
Thanks to Paul Panzer for pointing out that the efficiency of this function can be improved by using Gauss's formula for the sum of the first n integers:
def func(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum
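A quick check of the result (a sketch):
weights = func(4)
print(weights)        # [0.1 0.2 0.3 0.4]
print(weights.sum())  # 1.0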
For large inputs, you might find that using np.linspace is faster than the accepted answer
def f1(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum

def f2(l):
    delta = 2/(l*(l+1))
    return np.linspace(delta, l*delta, l)
Ensure that the two things produce the same result:
In [39]: np.allclose(f1(1000000), f2(1000000))
Out[39]: True
Check timing of both:
In [68]: %timeit f1(10000000)
515 ms ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [69]: %timeit f2(10000000)
247 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's tempting to just use np.arange(delta, l*delta, delta), which should be even faster, but this presents the risk of rounding errors causing the array to have a length different from l (as will happen e.g. for l = 10000000).
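A quick way to check that risk for a given l (a sketch; whether the length actually differs depends on floating-point rounding):
l = 10000000
delta = 2/(l*(l+1))
print(len(np.arange(delta, l*delta, delta)) == l)  # expected False for this l, per the note above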
If speed is more important than code style, it might also be possible to squeeze out a bit more by using Numba:
from numba import jit

@jit
def f3(l):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for n in range(l):
        a[n] = (n+1)*delta
    return a
In [96]: %timeit f3(10000000)
216 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While we're at it, let's note that it's possible to parallelize this loop. Doing so naively with Numba doesn't appear to give much, but helping it out a bit and pre-splitting the array into num_parallel parts does give further improvement on a quad core system:
from numba import njit, prange

@njit(parallel=True)
def f4(l, num_parallel=4):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for j in prange(num_parallel):
        # The last iteration gets whatever's left from rounding
        offset = 0 if j != num_parallel - 1 else l % num_parallel
        for n in range(l//num_parallel + offset):
            i = j*(l//num_parallel) + n
            a[i] = (i+1)*delta
    return a
In [171]: %timeit f4(10000000, 4)
163 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [172]: %timeit f4(10000000, 8)
158 ms ± 5.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [173]: %timeit f4(10000000, 12)
157 ms ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Python: Taking the outer product of each row of a matrix with itself, taking the sum, then returning a vector of sums

Say I have a matrix A of dimension N by M.
I wish to return an N-dimensional vector V where the nth element is the double sum of all pairwise products of the entries in the nth row of A.
In loops, I guess I could do:
V = np.zeros(A.shape[0])
for n in range(A.shape[0]):
    for i in range(A.shape[1]):
        for j in range(A.shape[1]):
            V[n] += A[n,i] * A[n,j]
I want to vectorise this and I guess I could do:
V_temp = np.einsum('ij,ik->ijk', A, A)
V = np.einsum('ijk->i', V_temp)
But I don't think this is a very memory-efficient way, as the intermediate step V_temp unnecessarily stores all the outer products when all I need are sums. Is there a better way to do this?
Thanks
You can use
V=np.einsum("ni,nj->n",A,A)
You are actually calculating
A.sum(-1)**2
In other words, the sum over an outer product is just the product of the sums of the factors.
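That is, for each row n, V[n] = sum_i sum_j A[n,i]*A[n,j] = (sum_i A[n,i]) * (sum_j A[n,j]) = (sum_i A[n,i])**2, which is exactly A.sum(-1)**2.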
Demo:
A = np.random.random((1000,1000))
np.allclose(np.einsum('ij,ik->i', A, A), A.sum(-1)**2)
# True
t = timeit.timeit('np.einsum("ij,ik->i",A,A)', globals=dict(A=A,np=np), number=10)*100; f"{t:8.4f} ms"
# '948.4210 ms'
t = timeit.timeit('A.sum(-1)**2', globals=dict(A=A,np=np), number=10)*100; f"{t:8.4f} ms"
# ' 0.7396 ms'
Perhaps you can use
np.einsum('ij,ik->i', A, A)
or the equivalent
np.einsum(A, [0,1], A, [0,2], [0])
On a 2015 Macbook, I get
In [35]: A = np.random.rand(100,100)
In [37]: %timeit for_loops(A)
640 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [38]: %timeit np.einsum('ij,ik->i', A, A)
658 µs ± 7.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit np.einsum(A, [0,1], A, [0,2], [0])
672 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
