Setup
I have the following two implementations of a matrix-calculation:
1. The first implementation uses a matrix of shape (n, m), and the calculation is repeated repetition times in a for-loop:
import numpy as np
from numba import jit
@jit
def foo():
    for i in range(1, n):
        for j in range(1, m):
            _deleteA = (
                matrix[i, j]  # + some constants added here
            )
            _deleteB = (
                matrix[i, j-1]  # + some constants added here
            )
            matrix[i, j] = min(_deleteA, _deleteB)
    return matrix

repetition = 3
for x in range(repetition):
    foo()
2. The second implementation avoids the extra for-loop and, hence, includes repetition = 3 into the matrix, which is then of shape (repetition, n, m):
@jit
def foo():
    for i in range(1, n):
        for j in range(1, m):
            _deleteA = (
                matrix[:, i, j]  # + some constants added here
            )
            _deleteB = (
                matrix[:, i, j-1]  # + some constants added here
            )
            matrix[:, i, j] = np.amin(np.stack((_deleteA, _deleteB), axis=1), axis=1)
    return matrix
Questions
Comparing both implementations with %timeit in IPython, I discovered two things about their performance.
The first implementation profits hugely from @jit, while the second does not at all (28 ms vs. 25 s in my test case). Can anybody imagine why @jit no longer works with a numpy array of shape (repetition, n, m)?
Edit
I moved the former second question to a separate post, since asking multiple questions is considered bad SO style.
The question was:
When neglecting @jit, the first implementation is still a lot faster (same test case: 17 s vs. 26 s). Why is numpy slower when working on three instead of two dimensions?
I'm not sure what your setup is here, but I re-wrote your example slightly:
import numpy as np
from numba import jit
# @jit(nopython=True)
def foo(matrix):
    n, m = matrix.shape
    for i in range(1, n):
        for j in range(1, m):
            _deleteA = (
                matrix[i, j] #+
                #some constants added here
            )
            _deleteB = (
                matrix[i, j-1] #+
                #some constants added here
            )
            matrix[i, j] = min(_deleteA, _deleteB)
    return matrix

foo_jit = jit(nopython=True)(foo)
and then timings:
m = np.random.normal(size=(100,50))
%timeit foo(m) # in a jupyter notebook
# 2.84 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit foo_jit(m) # in a jupyter notebook
# 3.18 µs ± 38.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So here numba is a lot faster as expected. One thing to consider is that global numpy arrays do not behave in numba as you might expect:
https://numba.pydata.org/numba-doc/dev/user/faq.html#numba-doesn-t-seem-to-care-when-i-modify-a-global-variable
It's usually better to pass in the data as I did in the example.
Your issue in the second case is that numba does not support amin at this time. See:
https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
You can see this if you pass nopython=True to jit. In current versions of numba (0.44 or earlier at the time of writing), the function instead falls back to object mode, which is often no faster than not using numba at all and is sometimes slower because of the extra call overhead.
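One possible way around the missing amin support, sketched under the assumption that an elementwise minimum of the two candidates is all the second implementation needs: np.minimum gives the same values as np.amin over a stack of the two operands. The names foo_2d and foo_3d are illustrative, the constants from the question are left out, and whether foo_3d compiles in nopython mode on a given numba version should be checked against the supported-features page linked above.

```python
import numpy as np

def foo_2d(matrix):
    # Reference: the first implementation, one (n, m) matrix at a time.
    n, m = matrix.shape
    for i in range(1, n):
        for j in range(1, m):
            matrix[i, j] = min(matrix[i, j], matrix[i, j - 1])
    return matrix

def foo_3d(matrix):
    # Second implementation with np.minimum instead of np.amin(np.stack(...)).
    _, n, m = matrix.shape
    for i in range(1, n):
        for j in range(1, m):
            matrix[:, i, j] = np.minimum(matrix[:, i, j], matrix[:, i, j - 1])
    return matrix

rng = np.random.default_rng(0)
stacked = rng.normal(size=(3, 20, 10))  # (repetition, n, m)

# Running the 2D version on each layer must match the 3D version.
expected = np.stack([foo_2d(layer.copy()) for layer in stacked])
result = foo_3d(stacked.copy())
print(np.array_equal(result, expected))
```

This only checks that the two formulations agree; it says nothing about speed.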
Related
I want to vectorise the triple sum
\sum_{i=1}^I\sum_{j=1}^J\sum_{m=1}^J a_{ijm}
such that I end up with a matrix
A \in \mathbb{R}^{I \times J}
where A_{kl} = \sum_{i=1}^k\sum_{j=1}^l\sum_{m=1}^l a_{ijm} for k = 1,...,I and l = 1, ...,J
carrying forward the sums to avoid pointless recomputation.
I currently use this code:
np.cumsum(np.cumsum(np.cumsum(a, axis = 0), axis = 1), axis = 2).diagonal(axis1 = 1, axis2 = 2)
but it is inefficient as it does lots of extra work and extracts the correct matrix at the end with the diagonal method. I can't think of how to make this faster.
The main challenge here is to compute the inner two sums, i.e. the sum of the square slices of a matrix originating from the top left. The final sum is just a cumsum on top of that along the 0th axis.
Setup:
import numpy as np
I, J = 100, 100
arr = np.random.rand(I, J, J)
Your implementation:
%%timeit
out = np.cumsum(np.cumsum(np.cumsum(arr, axis = 0), axis = 1), axis = 2).diagonal(axis1 = 1, axis2 = 2)
# 10.9 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Your implementation improved by taking the diagonal before cumsumming over the 0th axis:
%%timeit
out = arr.cumsum(axis=1).cumsum(axis=2).diagonal(axis1=1, axis2=2).cumsum(axis=0)
# 6.25 ms ± 34.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Finally, some tril/triu trickery:
%%timeit
out = np.cumsum(np.cumsum(np.tril(arr, k=-1).sum(axis=2) + np.triu(arr).sum(axis=1), axis=1), axis=0)
# 3.15 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
which is already better, but admittedly still not ideal. I don't see a better way to compute the inner two sums noted above with pure numpy.
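To see why the tril/triu version works: going from the (l-1) x (l-1) square to the l x l square adds an L-shaped border, which is row l strictly left of the diagonal plus column l down to and including the diagonal. A small brute-force check, with illustrative function names and a deliberately tiny array, might look like this:

```python
import numpy as np

def brute_force(a):
    # A[k, l] = sum of a[i, j, m] over i <= k, j <= l, m <= l (0-based).
    I, J, _ = a.shape
    out = np.empty((I, J))
    for k in range(I):
        for l in range(J):
            out[k, l] = a[:k + 1, :l + 1, :l + 1].sum()
    return out

def tril_triu(a):
    # Row part of each L-shaped border (strictly below the diagonal) ...
    rows = np.tril(a, k=-1).sum(axis=2)
    # ... plus the column part, diagonal included.
    cols = np.triu(a).sum(axis=1)
    # Cumulating the borders gives the squares; then cumulate over axis 0.
    return np.cumsum(np.cumsum(rows + cols, axis=1), axis=0)

rng = np.random.default_rng(0)
small = rng.random((4, 5, 5))

reference = brute_force(small)
fast = tril_triu(small)
diag_first = small.cumsum(axis=1).cumsum(axis=2).diagonal(axis1=1, axis2=2).cumsum(axis=0)
print(np.allclose(fast, reference), np.allclose(diag_first, reference))
```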
You can use Numba to produce a very fast implementation. Here is the code:
import numba as nb
import numpy as np
@nb.njit('(float64[:,:,::1],)', parallel=True)
def compute(arr):
    ni, nj, nk = arr.shape
    assert nj == nk
    result = np.empty((ni, nj))

    # Parallel cumsum along the axis 1 and 2 + extraction of the diagonal
    for i in nb.prange(ni):
        tmp = np.zeros(nk)
        for j in range(nj):
            for k in range(nk):
                tmp[k] += arr[i, j, k]
            result[i, j] = np.sum(tmp[:j+1])

    # Cumsum along the axis 0
    for i in range(1, ni):
        for k in range(nk):
            result[i, k] += result[i-1, k]

    return result

result = compute(a)
Here are performance results on my 6-core i5-9600KF with a 100x100x100 float64 input array:
Initial code: 12.7 ms
Chryophylaxs v1: 7.1 ms
Chryophylaxs v2: 5.5 ms
Numba: 0.2 ms
This implementation is significantly faster than all the others: it is about 64 times faster than the initial implementation. It is also effectively optimal on my machine, since it completely saturates the bandwidth of my RAM just by reading the input array (which is mandatory). Note that it is better not to use multiple threads for very small arrays.
Note that this code also uses far less memory, as it only needs 8 * nk * num_threads bytes of temporary storage, as opposed to 16 * ni * nj * nk bytes for the initial solution.
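To put the two memory formulas in perspective, here is the arithmetic for the 100x100x100 benchmark array. The thread count of 6 is an assumption (one thread per core of the 6-core CPU mentioned above); the formulas themselves are the ones quoted in the answer.

```python
ni = nj = nk = 100
num_threads = 6  # assumption: one thread per core

# One float64 buffer of length nk per thread.
numba_temp_bytes = 8 * nk * num_threads

# Two full-size float64 temporaries in the chained-cumsum version.
initial_temp_bytes = 16 * ni * nj * nk

print(numba_temp_bytes)    # 4800
print(initial_temp_bytes)  # 16000000
```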
Here's some data I've generated:
import numpy as np
import pandas as pd
import scipy
import scipy.spatial
df = pd.DataFrame(
    {
        "item_1": np.random.randint(low=0, high=10, size=1000),
        "item_2": np.random.randint(low=0, high=10, size=1000),
    }
)
embeddings = {item_id: np.random.randn(100) for item_id in range(0, 10)}

def get_distance(item_1, item_2):
    arr1 = embeddings[item_1]
    arr2 = embeddings[item_2]
    return scipy.spatial.distance.cosine(arr1, arr2)
I'd like to apply get_distance to each row. I can do:
df.apply(lambda row: get_distance(row["item_1"], row["item_2"]), axis=1)
But that would be very slow for large datasets.
Is there a way to calculate the cosine similarity of the embeddings corresponding to each row, without using DataFrame.apply?
For reference, the scipy version:
%%timeit
df.apply(lambda row: get_distance(row["item_1"], row["item_2"]), axis=1)
# 38.3 ms ± 84 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For what it's worth, I added a numba version, at the cost of some extra complexity.
Numpy broadcasting allocates temporary arrays, so with memory in mind I used explicit for loops.
It is also worth reconsidering how the arguments are passed: you may be able to pass arrays of vectors instead of a dictionary.
Note that the first run is slow because of compilation.
You can also make it parallel with numba.
import numba as nb

@nb.njit((nb.float64[:, ::100], nb.float64[:, ::100]))
def cos(a, b):
    norm_a = np.empty((a.shape[0],), dtype=np.float64)
    norm_b = np.empty((b.shape[0],), dtype=np.float64)
    cos_ab = np.empty((a.shape[0],), dtype=np.float64)
    # Row norms of a
    for i in nb.prange(a.shape[0]):
        sq_norm = 0.0
        for j in range(100):
            sq_norm += a[i][j] ** 2
        norm_a[i] = sq_norm ** 0.5
    # Row norms of b
    for i in nb.prange(b.shape[0]):
        sq_norm = 0.0
        for j in range(100):
            sq_norm += b[i][j] ** 2
        norm_b[i] = sq_norm ** 0.5
    # Row-wise cosine distance
    for i in nb.prange(a.shape[0]):
        dot = 0.0
        for j in range(100):
            dot += a[i][j] * b[i][j]
        cos_ab[i] = 1 - dot / (norm_a[i] * norm_b[i])
    return cos_ab
%%timeit
cos(item_1_embedded, item_2_embedded)
# 218 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using vectorized numpy operations directly is much faster:
item_1_embedded = np.array([embeddings[x] for x in df.item_1])
item_2_embedded = np.array([embeddings[x] for x in df.item_2])
cos_dist = 1 - np.sum(item_1_embedded * item_2_embedded, axis=1) / (
    np.linalg.norm(item_1_embedded, axis=1) * np.linalg.norm(item_2_embedded, axis=1)
)
(This version runs in 771 µs on average on my pc, vs 37.4 ms for the DataFrame.apply, which makes the pure numpy version about 50 times faster).
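As a sanity check on the vectorized formula, the following sketch compares it against a one-pair-at-a-time computation using numpy only. The lookup table built with np.stack and the helper cosine_single are illustrative additions, not part of the answers here, and the random data merely mimics the shapes in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = {item_id: rng.standard_normal(100) for item_id in range(10)}
item_1 = rng.integers(0, 10, size=1000)
item_2 = rng.integers(0, 10, size=1000)

# Stack the dict into a (10, 100) table once, then gather rows by index.
table = np.stack([embeddings[i] for i in range(10)])
a = table[item_1]
b = table[item_2]

# Row-wise dot products and norms, no Python loop over rows.
dots = np.einsum('ij,ij->i', a, b)
cos_dist = 1 - dots / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

def cosine_single(u, v):
    # Cosine distance for one pair of vectors.
    return 1 - np.dot(u, v) / (np.sqrt(np.dot(u, u)) * np.sqrt(np.dot(v, v)))

loop = np.array([cosine_single(embeddings[i], embeddings[j])
                 for i, j in zip(item_1, item_2)])
print(np.allclose(cos_dist, loop))
```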
You can vectorize the call to cosine with numpy.vectorize. There is a slight gain in speed (34 ms vs 53 ms)
vec_cosine = np.vectorize(scipy.spatial.distance.cosine)
vec_cosine(df['item_1'].map(embeddings),
df['item_2'].map(embeddings))
output:
array([0.90680875, 0.90999454, 0.99212814, 1.12455852, 1.06354469,
0.95542037, 1.07133003, 1.07133003, 0. , 1.00837058,
0. , 0.93961103, 0.8943738 , 1.04872436, 1.21171375,
1.04621226, 0.90392229, 1.0365102 , 0. , 0.90180297,
0.90180297, 1.04516879, 0.94877277, 0.90180297, 0.93713404,
...
1.17548653, 1.11700641, 0.97926805, 0.8943738 , 0.93961103,
1.21171375, 0.91817959, 0.91817959, 1.04674315, 0.88210679,
1.11806218, 1.07816675, 1.00837058, 1.12455852, 1.04516879,
0.93713404, 0.93713404, 0.95542037, 0.93876964, 0.91817959])
I have a somewhat contrived example to cythonize, where I want a function to:
accept a 1D numpy array of arbitrary length (roughly 100,000 to 1,000,000 np.float64 values)
do some filtering on it
return the results as a new [numpy?] array of the same length
The code and profiling are as follows:
%%cython -a
from libc.stdlib cimport malloc, free
from cython cimport boundscheck, wraparound
import numpy as np
@boundscheck(False)
@wraparound(False)
def func_memview(double[:] arr):
    cdef:
        int N = arr.shape[0], i
        double *out_ptr = <double *> malloc(N * sizeof(double))
        double[:] out = <double[:N]>out_ptr
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.
    free(out_ptr)
    return np.asarray(out)
My question is: can I do any better than this?
As DavidW has pointed out, your code has some issues with memory management and it would be better to use a numpy-array directly:
%%cython
from cython cimport boundscheck, wraparound
import numpy as np
@boundscheck(False)
@wraparound(False)
def func_memview_correct(double[:] arr):
    cdef:
        int N = arr.shape[0], i
        double[:] out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)
It is about as fast as the faulty original version:
import numpy as np
np.random.seed(0)
k= np.random.rand(5*10**7)
%timeit func_memview(k) # 413 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_memview_correct(k) # 412 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The question is how this code could be made faster. The most obvious options are:
Parallelization.
Using vectorization/SIMD instructions.
It is notoriously hard to ensure that the C code generated by Cython gets vectorized (see for example this SO post). For many compilers it is necessary to use contiguous memory views to improve the situation, i.e.:
%%cython -c=/O3
from cython cimport boundscheck, wraparound
import numpy as np
@boundscheck(False)
@wraparound(False)
def func_memview_correct_cont(double[::1] arr):  # <---- HERE
    cdef:
        int N = arr.shape[0], i
        double[::1] out = np.empty(N)  # <--- HERE
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)
On my machine it is not really much faster
%timeit func_memview_correct_cont(k) # 402 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Other compilers might do better. However, I've often seen gcc and msvc struggle to produce optimal assembly for typical filtering code (see for example this SO question). Clang is much better at this, so the easiest solution is probably to use numba, which is built on the same LLVM backend:
import numba as nb
@nb.njit
def nb_func(arr):
    N = arr.shape[0]
    out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return out
which outperforms the cython code by almost a factor of 3:
%timeit nb_func(k) # 151 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is easy to parallelize the numba version using prange, but the gain is modest: the parallelized version runs in 116 ms on my machine.
To summarize: for this type of task my advice is to use numba. Using cython is trickier, and the final performance depends on the compiler used in the background.
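For comparison, the same filter can also be written without an explicit loop in plain numpy. This is only a sketch and was not benchmarked against the cython or numba versions above; the function names are illustrative. One assumption is made explicit here: the loops above never write out[0], so this sketch sets it to 0.0.

```python
import numpy as np

def filter_loop(arr):
    # Reference loop, same logic as the versions above.
    out = np.empty(arr.shape[0])
    out[0] = 0.0  # assumption: the first element has no predecessor, use 0.0
    for i in range(1, arr.shape[0]):
        out[i] = arr[i] if arr[i] > arr[i - 1] else 0.0
    return out

def filter_vectorized(arr):
    # Compare each element with its predecessor via two shifted views.
    out = np.zeros(arr.shape[0])
    out[1:] = np.where(arr[1:] > arr[:-1], arr[1:], 0.0)
    return out

sample = np.array([1.0, 3.0, 2.0, 5.0, 5.0, 6.0])
print(filter_vectorized(sample))  # [0. 3. 0. 5. 0. 6.]

rng = np.random.default_rng(0)
big = rng.random(10_000)
print(np.array_equal(filter_loop(big), filter_vectorized(big)))
```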
I am trying to define a few well-known kernels (RBF, hyperbolic tangent, Fourier, etc.) for the svm.SVR method in the sklearn library. I started with RBF (I know svm has a built-in RBF kernel, but I need to define my own to be able to customize it later), found some useful links here, and chose this one:
def my_kernel(X,Y):
    K = np.zeros((X.shape[0],Y.shape[0]))
    for i,x in enumerate(X):
        for j,y in enumerate(Y):
            K[i,j] = np.exp(-1*np.linalg.norm(x-y)**2)
    return K

clf=SVR(kernel=my_kernel)
I used this one because it works with my training data (shape [3850, 4]) and test data (shape [1200, 4]), which have different shapes. The problem is that it is too slow, and I have to wait a long time for the results. I even tried static typing and memoryviews in cython, but the performance is still not as good as the default svm RBF kernel. I also found this link about the same problem, but working with numpy.einsum and numexpr.evaluate is a little confusing for me. It turns out that this was the best code in terms of speed:
import numexpr as ne
from scipy.linalg.blas import sgemm

def app2(X, gamma, var):
    X_norm = -gamma*np.einsum('ij,ij->i',X,X)
    return ne.evaluate('v * exp(A + B + C)', {
        'A' : X_norm[:,None],
        'B' : X_norm[None,:],
        'C' : sgemm(alpha=2.0*gamma, a=X, b=X, trans_b=True),
        'g' : gamma,
        'v' : var
    })
This code only works for a single input (X), and I couldn't find a way to adapt it to my case: two inputs of different sizes. According to the svm docs, the kernel function receives matrices of shape (m, n) and (l, n) and returns a matrix of shape (m, l). I guess I only need to express K[i,j] = np.exp(-1*np.linalg.norm(x-y)**2) from the first snippet in the style of the second one to speed it up. Any help would be appreciated.
You can just pipe your original kernel through pythran
kernel.py:
import numpy as np
#pythran export my_kernel(float64[:,:], float64[:,:])
def my_kernel(X,Y):
    K = np.zeros((X.shape[0],Y.shape[0]))
    for i,x in enumerate(X):
        for j,y in enumerate(Y):
            K[i,j] = np.exp(-1*np.linalg.norm(x-y)**2)
    return K
Compilation step:
> pythran kernel.py
There's no rewriting step (you need to put the kernel in a separate file though) and the acceleration is significant: 19 times faster on my laptop, using
> python -m timeit -s 'from numpy.random import random; x = random((100,100)); y = random((100,100)); from kernel import my_kernel as k' 'k(x,y)'
to gather timings.
Three possible variants
Variants 1 and 3 make use of the identity
(a-b)^2 = a^2 + b^2 - 2ab
as described here or here. For special cases such as a small second dimension, Variant 2 is also fine.
import numpy as np
import numba as nb
import numexpr as ne
from scipy.linalg.blas import sgemm
def vers_1(X,Y, gamma, var):
    X_norm = -gamma*np.einsum('ij,ij->i',X,X)
    Y_norm = -gamma*np.einsum('ij,ij->i',Y,Y)
    return ne.evaluate('v * exp(A + B + C)', {
        'A' : X_norm[:,None],
        'B' : Y_norm[None,:],
        'C' : sgemm(alpha=2.0*gamma, a=X, b=Y, trans_b=True),
        'g' : gamma,
        'v' : var
    })

#Maybe easier to read but slow, if X.shape[1] gets bigger
@nb.njit(fastmath=True,parallel=True)
def vers_2(X,Y):
    K = np.empty((X.shape[0],Y.shape[0]),dtype=X.dtype)
    for i in nb.prange(X.shape[0]):
        for j in range(Y.shape[0]):
            sum=0.
            for k in range(X.shape[1]):
                sum+=(X[i,k]-Y[j,k])**2
            K[i,j] = np.exp(-1*sum)
    return K

@nb.njit(fastmath=True,parallel=True)
def vers_3(A,B):
    dist=np.dot(A,B.T)
    TMP_A=np.empty(A.shape[0],dtype=A.dtype)
    for i in nb.prange(A.shape[0]):
        sum=0.
        for j in range(A.shape[1]):
            sum+=A[i,j]**2
        TMP_A[i]=sum
    TMP_B=np.empty(B.shape[0],dtype=A.dtype)
    for i in nb.prange(B.shape[0]):
        sum=0.
        for j in range(B.shape[1]):
            sum+=B[i,j]**2
        TMP_B[i]=sum
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            dist[i,j]=np.exp((-2.*dist[i,j]+TMP_A[i]+TMP_B[j])*-1)
    return dist
Timings
gamma = 1.
var = 1.
X=np.random.rand(3850,4)
Y=np.random.rand(1200,4)
res_1=vers_1(X,Y, gamma, var)
res_2=vers_2(X,Y)
res_3=vers_3(X,Y)
np.allclose(res_1,res_2)
np.allclose(res_1,res_3)
%timeit res_1=vers_1(X,Y, gamma, var)
19.8 ms ± 615 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit res_2=vers_2(X,Y)
16.1 ms ± 938 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=vers_3(X,Y)
13.5 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
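The identity behind Variants 1 and 3 can be checked with numpy alone. The sketch below expands ||x - y||^2 into ||x||^2 + ||y||^2 - 2 x.y for two inputs of different sizes and compares the result with the double loop from the question. The function names and the gamma parameter are illustrative, and the clipping step is an added precaution against small negative values from rounding, not something the variants above do.

```python
import numpy as np

def rbf_loops(X, Y, gamma=1.0):
    # Reference: the double loop from the question, with an explicit gamma.
    K = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            K[i, j] = np.exp(-gamma * np.linalg.norm(x - y) ** 2)
    return K

def rbf_expanded(X, Y, gamma=1.0):
    # Squared row norms, shapes (m,) and (l,).
    x_sq = np.einsum('ij,ij->i', X, X)
    y_sq = np.einsum('ij,ij->i', Y, Y)
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, broadcast to shape (m, l).
    sq_dist = x_sq[:, None] + y_sq[None, :] - 2.0 * (X @ Y.T)
    np.maximum(sq_dist, 0.0, out=sq_dist)  # clip tiny negatives from rounding
    return np.exp(-gamma * sq_dist)

rng = np.random.default_rng(0)
X = rng.random((30, 4))
Y = rng.random((12, 4))

K_loops = rbf_loops(X, Y)
K_fast = rbf_expanded(X, Y)
print(K_fast.shape, np.allclose(K_loops, K_fast))
```

The output has shape (m, l), which matches what the question says the svm docs require for a callable kernel.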
In my project (about clustering algorithms, specifically k-medoids) it is crucial to be able to compute pairwise distances efficiently. I have a dataset of ~60,000 objects. The problem is that distances must be computed between inhomogeneous vectors, i.e. vectors which may differ in length (in that case, missing items are treated as if they were 0).
Here is a minimal working example:
# %%
MAX_LEN = 11
N = 100

import random

def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist

def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])

data = []
for i in range(N):
    data.append([])
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))

%timeit compute_distances()

import numpy as np

def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()

def compute_distances_np():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])

data_np = [np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data]

%timeit compute_distances_np()
I was testing my Python lists implementation versus a numpy implementation.
Here are the results (computation times):
Python lists: 79.6 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
numpy arrays: 226 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Why is there such a huge difference? I supposed numpy arrays were really fast.
Is there a way to improve my code? Am I misunderstanding the inner workings of numpy?
Edit: I may need, in the future, to be able to use a custom distance function for pairwise distances computations. The method should work also for data sets of length 60'000 without running out of memory.
I believe you can just make your arrays dense and set the unused last elements to 0s.
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform
def batch_pdist(x, metric, batchsize=1000):
    dists = np.zeros((len(x), len(x)))
    for i in range(0, len(x), batchsize):
        for j in range(0, len(x), batchsize):
            dist_batch = cdist(x[i:i+batchsize], x[j:j+batchsize], metric=metric)
            dists[i:i+batchsize, j:j+batchsize] = dist_batch
    return dists

MIN_LEN = 5
MAX_LEN = 11
N = 10000
M = 10
data = np.zeros((N,MAX_LEN))
for i in range(N):
    num_nonzero = np.random.randint(MIN_LEN, MAX_LEN)
    data[i, :num_nonzero] = np.random.randint(0, M, num_nonzero)
dists = squareform(pdist(data, metric='cityblock'))
dists2 = batch_pdist(data, metric='cityblock', batchsize=500)
print((dists == dists2).all())
Timing Output:
%timeit squareform(pdist(data, metric='cityblock'))
43.8 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Edit:
For a custom distance function see the very bottom of this documentation.
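The padding idea can also be illustrated with plain numpy broadcasting, one block of rows at a time, so that the temporary array stays at (block, N, width) instead of (N, N, width). This is a minimal sketch on toy data; the helpers pad, manhattan_rows and manhattan_ragged are hypothetical names introduced here, and the block size of 2 is arbitrary.

```python
import numpy as np

def manhattan_ragged(vec1, vec2):
    # Reference on Python lists: missing trailing items count as 0.
    n = max(len(vec1), len(vec2))
    a = list(vec1) + [0] * (n - len(vec1))
    b = list(vec2) + [0] * (n - len(vec2))
    return sum(abs(x - y) for x, y in zip(a, b))

def pad(data, width):
    # Dense (N, width) array, unused trailing entries left at 0.
    out = np.zeros((len(data), width))
    for i, row in enumerate(data):
        out[i, :len(row)] = row
    return out

def manhattan_rows(padded, start, stop):
    # Distances from rows start:stop to every row, shape (stop - start, N).
    block = padded[start:stop, None, :]
    return np.abs(block - padded[None, :, :]).sum(axis=2)

data = [[1, 2, 3], [4], [0, 5, 1, 2]]
padded = pad(data, width=4)

# Process two rows at a time and stack the partial results.
dists = np.vstack([manhattan_rows(padded, s, s + 2)
                   for s in range(0, len(data), 2)])
print(dists)
```

For 60,000 objects the full (N, N) result itself is large, so in practice each block would probably be consumed or written out rather than stacked.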
I finally found what is probably the most straightforward way to solve this problem without changing the code too much, relying solely on computation rather than memory (which could be infeasible for very large datasets).
Based on juanpa.arrivillaga's suggestion, I tried numba, a library that speeds up array-oriented and math-heavy Python code and mainly targets numpy. You can read a good guide on optimizing Python code here: https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/.
MAX_LEN = 11
N = 100

# Pure Python lists implementation.
import random

def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist

def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])

data = []
for i in range(N):
    data.append([])
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))

%timeit compute_distances()

# numpy+numba implementation.
import numpy as np
from numba import jit

@jit
def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()

@jit
def compute_distances_np():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])

data_np = np.array([np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data])

%timeit compute_distances_np()
Timing output:
%timeit compute_distances()
78.4 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit compute_distances_np()
4.1 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, the numpy version with numba optimizations is about 19 times faster (with no other code optimization involved).