Python: faster function for kernel evaluation

I have a function like the one below that evaluates a kernel between the instances x and y:
def my_hik(x, y):
    """Histogram-Intersection-Kernel """
    summe = 0
    for i in range(len(x)):
        summe += min(x[i], y[i])
    return summe
    #return np.sum(np.min(np.array([[x],[y]]),0))
metrics.pairwise.pairwise_kernels(instances, metric=my_hik, n_jobs=-1)
I call it through sklearn's pairwise_kernels function. But my data (about 3000 instances with a hundred attributes) seems to be too large: computing one kernel matrix takes minutes, because the function is called 9*10^6 times. Is there a way to make the function run faster?

def fast_hik(x, y):
    return np.minimum(x, y).sum()
>>> x = np.random.randn(100)
>>> y = np.random.randn(100)
>>> %timeit my_hik(x, y)
10000 loops, best of 3: 50.3 µs per loop
>>> %timeit fast_hik(x, y)
100000 loops, best of 3: 5.55 µs per loop
Greater speedups are obtained for longer vectors:
>>> x = np.random.randn(1000)
>>> y = np.random.randn(1000)
>>> %timeit my_hik(x, y)
1000 loops, best of 3: 498 µs per loop
>>> %timeit fast_hik(x, y)
100000 loops, best of 3: 7.92 µs per loop
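Even with a vectorized kernel function, passing a Python callable as the metric still means one Python-level call per pair of instances. A minimal sketch of computing the whole kernel matrix with broadcasting instead is shown here. The name hik_matrix and the chunk parameter are assumptions made for illustration, not part of sklearn or of the original answer; the chunking only keeps the temporary (chunk, n, d) array small.

```python
import numpy as np

def hik_matrix(X, Y=None, chunk=32):
    """Histogram-intersection kernel matrix: K[i, j] = sum_k min(X[i, k], Y[j, k]).

    Hypothetical helper; `chunk` bounds the size of the temporary array.
    """
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    K = np.empty((X.shape[0], Y.shape[0]))
    for start in range(0, X.shape[0], chunk):
        block = X[start:start + chunk]                     # shape (c, d)
        # Broadcast to (c, n_y, d), take the element-wise minimum,
        # then sum over the attribute axis.
        K[start:start + chunk] = np.minimum(
            block[:, None, :], Y[None, :, :]
        ).sum(axis=2)
    return K

# Small demonstration on random data
rng = np.random.default_rng(0)
instances = rng.random((50, 10))
K = hik_matrix(instances)
print(K.shape)
```

The resulting matrix should match what the per-pair function produces, so it can be checked against my_hik on a small sample before using it on the full data.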


Why is NumPy operating on a 0.1M matrix 1000 times much faster than operating on a 1M matrix 100 times?

I noticed something odd: NumPy seems much faster when operating on the smaller matrix, even though the total amount of data is identical. Why does this happen?
import time
import numpy as np
def a():
    ts = time.time()
    for i in range(100):
        x = np.random.rand(100000, 2).reshape(-1, 2)
        y = np.random.rand(100000)
    te = time.time()
    print(te - ts)
def b():
    ts = time.time()
    for i in range(1000):
        x = np.random.rand(10000, 2).reshape(-1, 2)
        y = np.random.rand(10000)
    te = time.time()
    print(te - ts)
They take roughly the same time if you measure them with %%timeit instead of a single time.time() interval, which makes sense. Run-to-run variation in timing is large, so a single measurement cannot show that one function is faster than the other unless the difference is substantial.
def a():
    for i in range(100):
        x = np.random.rand(100000, 2).reshape(-1, 2)
        y = np.random.rand(100000)
def b():
    for i in range(1000):
        x = np.random.rand(10000, 2).reshape(-1, 2)
        y = np.random.rand(10000)
%%timeit -n100 -r10
a()
225 ms ± 4.78 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%%timeit -n100 -r10
b()
224 ms ± 2.86 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
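Outside IPython, the same kind of repeated measurement can be sketched with the standard-library timeit module. The helper name summarize and the outer/n parameters added to a and b are assumptions introduced here so that the workload can be scaled; the defaults reproduce the sizes used in the question.

```python
import timeit
import numpy as np

def a(outer=100, n=100000):
    for _ in range(outer):
        x = np.random.rand(n, 2).reshape(-1, 2)
        y = np.random.rand(n)

def b(outer=1000, n=10000):
    for _ in range(outer):
        x = np.random.rand(n, 2).reshape(-1, 2)
        y = np.random.rand(n)

def summarize(func, repeat=3, number=1):
    """Return (best, mean, std) in seconds per call over `repeat` timing runs."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = np.array(totals) / number
    return per_call.min(), per_call.mean(), per_call.std()

for func in (a, b):
    best, mean, std = summarize(func)
    print("%s: best %.3f s, mean %.3f s +/- %.3f s"
          % (func.__name__, best, mean, std))
```

Comparing the mean and spread of several runs, rather than one wall-clock interval, makes it easier to see whether a difference between the two functions is larger than the noise.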

Fast way to calculate conditional function

What is the fastest way to compute a function like
# here x is just a number
def f(x):
    if x >= 0:
        return np.log(x+1)
    return -np.log(-x+1)
One possible way is:
# here x is an array
def loga(x):
    cond = [x >= 0, x < 0]
    choice = [np.log(x+1), -np.log(-x+1)]
    return np.select(cond, choice)
But it seems that numpy goes through the array element by element.
Is there any way to use something conceptually similar to np.exp(x) to achieve better performance?
def f(x):
    # note: x/abs(x) is nan where x == 0; np.sign(x) avoids that
    return (x/abs(x)) * np.log(1+abs(x))
In cases like these, masking helps -
def mask_vectorized_app(x):
    out = np.empty_like(x)
    mask = x>=0
    mask_rev = ~mask
    out[mask] = np.log(x[mask]+1)
    out[mask_rev] = -np.log(-x[mask_rev]+1)
    return out
Introducing the numexpr module helps us further.
import numexpr as ne
def mask_vectorized_numexpr_app(x):
    out = np.empty_like(x)
    mask = x>=0
    mask_rev = ~mask
    x_masked = x[mask]
    x_rev_masked = x[mask_rev]
    out[mask] = ne.evaluate('log(x_masked+1)')
    out[mask_rev] = ne.evaluate('-log(-x_rev_masked+1)')
    return out
Inspired by @user2685079's post, and using the logarithm property log(A**B) = B*log(A), we can push the sign into the log computation. This lets us do more of the work inside numexpr's evaluate expression, like so -
s = (-2*(x<0))+1 # np.sign(x)
out = ne.evaluate('log( (abs(x)+1)**s)')
Computing sign using comparison gives us s in another way -
s = (-2*(x<0))+1
Finally, we can push this into the numexpr evaluate expression -
def mask_vectorized_numexpr_app2(x):
    return ne.evaluate('log( (abs(x)+1)**((-2*(x<0))+1))')
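The sign-folding identity can be checked with plain NumPy, without numexpr. The sketch below uses two hypothetical helper names (signed_log_where and signed_log_power) that are not from the answer; it only confirms that log((|x|+1)**s) with s = +1 or -1 agrees with the branch-wise definition.

```python
import numpy as np

def signed_log_where(x):
    """Branch-wise reference: log(x+1) for x >= 0, -log(-x+1) otherwise."""
    x = np.asarray(x, dtype=float)
    ax = np.abs(x)
    # Both branches reduce to log(1 + |x|); only the sign differs.
    return np.where(x >= 0, np.log(ax + 1), -np.log(ax + 1))

def signed_log_power(x):
    """Sign folded into the exponent: log((|x| + 1) ** s)."""
    x = np.asarray(x, dtype=float)
    s = (-2 * (x < 0)) + 1        # +1 where x >= 0, -1 where x < 0
    return np.log((np.abs(x) + 1) ** s)

x = np.random.randn(1000)
print(np.allclose(signed_log_where(x), signed_log_power(x)))
```

Because s is only ever +1 or -1, the power is either |x|+1 or its reciprocal, and the logarithm of the reciprocal is the negated logarithm.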
Runtime test
Loopy approach for comparison -
def loopy_app(x):
    out = np.empty_like(x)
    for i in range(len(out)):
        out[i] = f(x[i])
    return out
Timings and verification -
In [141]: x = np.random.randn(100000)
...: print np.allclose(loopy_app(x), mask_vectorized_app(x))
...: print np.allclose(loopy_app(x), mask_vectorized_numexpr_app(x))
...: print np.allclose(loopy_app(x), mask_vectorized_numexpr_app2(x))
In [142]: %timeit loopy_app(x)
...: %timeit mask_vectorized_numexpr_app(x)
...: %timeit mask_vectorized_numexpr_app2(x)
10 loops, best of 3: 108 ms per loop
100 loops, best of 3: 3.6 ms per loop
1000 loops, best of 3: 942 µs per loop
Using @user2685079's solution, with np.sign replacing the first part, with and without numexpr evaluation -
In [143]: %timeit np.sign(x) * np.log(1+abs(x))
100 loops, best of 3: 3.26 ms per loop
In [144]: %timeit np.sign(x) * ne.evaluate('log(1+abs(x))')
1000 loops, best of 3: 1.66 ms per loop
Using numba
Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
The Numba project is supported by Continuum Analytics and The Gordon and Betty Moore Foundation (Grant GBMF5423).
from numba import njit
import numpy as np
@njit
def pir(x):
    a = np.empty_like(x)
    for i in range(a.size):
        x_ = x[i]
        _x = abs(x_)
        a[i] = np.sign(x_) * np.log(1 + _x)
    return a
np.isclose(pir(x), f(x)).all()
x = np.random.randn(100000)
# My proposal
%timeit pir(x)
1000 loops, best of 3: 881 µs per loop
# OP test
%timeit f(x)
1000 loops, best of 3: 1.26 ms per loop
# Divakar-1
%timeit mask_vectorized_numexpr_app(x)
100 loops, best of 3: 2.97 ms per loop
# Divakar-2
%timeit mask_vectorized_numexpr_app2(x)
1000 loops, best of 3: 621 µs per loop
Function definitions
from numba import njit
import numpy as np
@njit
def pir(x):
    a = np.empty_like(x)
    for i in range(a.size):
        x_ = x[i]
        _x = abs(x_)
        a[i] = np.sign(x_) * np.log(1 + _x)
    return a
import numexpr as ne
def mask_vectorized_numexpr_app(x):
    out = np.empty_like(x)
    mask = x>=0
    mask_rev = ~mask
    x_masked = x[mask]
    x_rev_masked = x[mask_rev]
    out[mask] = ne.evaluate('log(x_masked+1)')
    out[mask_rev] = ne.evaluate('-log(-x_rev_masked+1)')
    return out
def mask_vectorized_numexpr_app2(x):
    return ne.evaluate('log( (abs(x)+1)**((-2*(x<0))+1))')
def f(x):
    return (x/abs(x)) * np.log(1+abs(x))
You can slightly improve the speed of your second solution by using np.where instead of np.select:
def loga(x):
    cond = [x >= 0, x < 0]
    choice = [np.log(x+1), -np.log(-x+1)]
    return np.select(cond, choice)
def logb(x):
    return np.where(x>=0, np.log(x+1), -np.log(-x+1))
In [16]: %timeit loga(arange(-1000,1000))
10000 loops, best of 3: 169 µs per loop
In [17]: %timeit logb(arange(-1000,1000))
10000 loops, best of 3: 98.3 µs per loop
In [18]: np.all(loga(arange(-1000,1000)) == logb(arange(-1000,1000)))
Out[18]: True
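One thing to keep in mind with np.where (and np.select) is that both branches are evaluated for every element before the selection happens, so the log of a negative number or of zero is computed and then discarded, which triggers runtime warnings. A minimal sketch of two ways around that, assuming the hypothetical name log_signed for the sign-based variant:

```python
import numpy as np

def logb(x):
    # Both np.log calls run on the full array; np.where only picks
    # which result to keep for each element.
    return np.where(x >= 0, np.log(x + 1), -np.log(-x + 1))

def log_signed(x):
    # Never takes the log of a value below 1, so no warnings are raised.
    return np.sign(x) * np.log1p(np.abs(x))

x = np.linspace(-5.0, 5.0, 11)

# Option 1: keep np.where and silence the warnings from the unused branch.
with np.errstate(invalid="ignore", divide="ignore"):
    reference = logb(x)

# Option 2: avoid the invalid evaluations altogether.
print(np.allclose(reference, log_signed(x)))
```

np.log1p(t) computes log(1 + t) and tends to be more accurate than np.log(1 + t) when t is very small.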

Python Multiple Simple Linear Regression

Note this is not a question about multiple regression, it is a question about doing simple (single-variable) regression multiple times in Python/NumPy (2.7).
I have two m x n arrays x and y. The rows correspond to each other, and each pair is the set of (x,y) points for a measurement. That is, plt.plot(x.T, y.T, '.') would plot each of m datasets/measurements.
I'm wondering what the best way to perform the m linear regressions is. Currently I loop over the rows and use scipy.stats.linregress(). (Assume I don't want solutions based on doing linear algebra with the matrices but instead want to work with this function, or an equivalent black-box function.) I could try np.vectorize, but the docs indicate it also loops.
With some experimenting, I've also found a way to use list comprehensions with map() and get correct results. I've put both solutions below. In IPython, %%timeit returns the following for a small dataset (commented out):
(loop) 1000 loops, best of 3: 642 µs per loop
(map) 1000 loops, best of 3: 634 µs per loop
To try magnifying this, I made a much bigger random dataset (dimension trials x trials):
(loop, trials = 1000) 1 loops, best of 3: 299 ms per loop
(loop, trials = 10000) 1 loops, best of 3: 5.64 s per loop
(map, trials = 1000) 1 loops, best of 3: 256 ms per loop
(map, trials = 10000) 1 loops, best of 3: 2.37 s per loop
That's a decent speedup on a really big set, but I was expecting a bit more. Is there a better way?
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
#y = np.array(((0,1,2,3),(1,2,3,4),(2,4,6,8)))
#x = np.tile(np.arange(4), (3,1))
trials = 1000
y = np.random.rand(trials,trials)
x = np.tile(np.arange(trials), (trials,1))
num_rows = np.shape(y)[0]
slope = np.zeros(num_rows)
inter = np.zeros(num_rows)
for k, xrow in enumerate(x):
    yrow = y[k,:]
    slope[k], inter[k], t1, t2, t3 = stats.linregress(xrow, yrow)
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope + inter)
# Can the loop be removed?
tempx = [x[k,:] for k in range(num_rows)]
tempy = [y[k,:] for k in range(num_rows)]
results = np.array(map(stats.linregress, tempx, tempy))
slope_vec = results[:,0]
inter_vec = results[:,1]
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope_vec + inter_vec)
print "Slopes equal by both methods?: ", np.allclose(slope, slope_vec)
print "Inters equal by both methods?: ", np.allclose(inter, inter_vec)
Single-variable linear regression is simple enough to vectorize manually:
def multiple_linregress(x, y):
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean
    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]
    return np.column_stack((slope, intercept))
With some made up data:
m = 1000
n = 1000
x = np.random.rand(m, n)
y = np.random.rand(m, n)
it outperforms your looping options by a fair margin:
%timeit multiple_linregress(x, y)
100 loops, best of 3: 14.1 ms per loop
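As a sanity check, the vectorized result can be compared row by row against an independent least-squares fit. The sketch below uses np.polyfit with degree 1 as the reference (an assumption made here to keep the example NumPy-only; the question itself uses scipy.stats.linregress), on a deliberately small dataset.

```python
import numpy as np

def multiple_linregress(x, y):
    # Row-wise slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean
    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]
    return np.column_stack((slope, intercept))

rng = np.random.default_rng(0)
m, n = 5, 50
x = rng.random((m, n))
y = rng.random((m, n))

fast = multiple_linregress(x, y)
# np.polyfit(..., 1) returns [slope, intercept] for each row
reference = np.array([np.polyfit(x[i], y[i], 1) for i in range(m)])
print(np.allclose(fast, reference))
```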

Is there a faster way to separate the minimum and maximum of two arrays?

In [3]: f1 = rand(100000)
In [5]: f2 = rand(100000)
# Obvious method:
In [12]: timeit fmin = np.amin((f1, f2), axis=0); fmax = np.amax((f1, f2), axis=0)
10 loops, best of 3: 59.2 ms per loop
In [13]: timeit fmin, fmax = np.sort((f1, f2), axis=0)
10 loops, best of 3: 30.8 ms per loop
In [14]: timeit fmin = np.where(f2 < f1, f2, f1); fmax = np.where(f2 < f1, f1, f2)
100 loops, best of 3: 5.73 ms per loop
In [36]: f1 = rand(1000,100,100)
In [37]: f2 = rand(1000,100,100)
In [39]: timeit fmin = np.amin((f1, f2), axis=0); fmax = np.amax((f1, f2), axis=0)
1 loops, best of 3: 6.13 s per loop
In [40]: timeit fmin, fmax = np.sort((f1, f2), axis=0)
1 loops, best of 3: 3.3 s per loop
In [41]: timeit fmin = np.where(f2 < f1, f2, f1); fmax = np.where(f2 < f1, f1, f2)
1 loops, best of 3: 617 ms per loop
Is there perhaps a way to do both where calls in one step, with two return values?
Why isn't amin implemented the same way as where, if where is so much faster?
Use numpy's built-in element-wise maximum and minimum; they are faster than where.
The notes in the numpy docs for maximum confirm this:
Equivalent to np.where(x1 > x2, x1, x2), but faster and does proper broadcasting.
The line you would want for your first test would be something like:
fmin = np.minimum(f1, f2); fmax = np.maximum(f1, f2)
My own results show this to be quite a bit faster. Note that minimum and maximum work on arrays of any dimensionality, as long as the two arguments have the same shape or can be broadcast to a common shape.
Using amax 3.506
Using sort 1.830
Using where 0.635
Using numpy maximum, minimum 0.178
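A short sketch confirming that the three approaches from the question give identical results for NaN-free inputs, plus a small broadcasting example. The array sizes and values are chosen here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.random(100000)
f2 = rng.random(100000)

# Element-wise ufuncs: one pass each, no stacking of the inputs
fmin = np.minimum(f1, f2)
fmax = np.maximum(f1, f2)

# The alternatives from the question
fmin_sort, fmax_sort = np.sort((f1, f2), axis=0)
fmin_where = np.where(f2 < f1, f2, f1)
fmax_where = np.where(f2 < f1, f1, f2)

print(np.array_equal(fmin, fmin_sort), np.array_equal(fmax, fmax_sort))
print(np.array_equal(fmin, fmin_where), np.array_equal(fmax, fmax_where))

# Broadcasting: a (2, 1) column against a length-3 row gives a (2, 3) result
col = np.array([[1], [5]])
row = np.array([2, 3, 4])
print(np.maximum(col, row))
```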

Numpy max slow when applied to list of arrays

I carry out some computations to obtain a list of numpy arrays. Subsequently, I would like to find the largest values along the first axis. My current implementation (see below) is very slow and I would like to find alternatives.
pending = [<list of items>]
matrix = [compute(item) for item in pending if <some condition on item>]
dominant = np.max(matrix, axis = 0)
Revision 1: This implementation is faster (~10x; presumably because numpy does not need to figure out the shape of the array)
pending = [<list of items>]
matrix = [compute(item) for item in pending if <some condition on item>]
matrix = np.vstack(matrix)
dominant = np.max(matrix, axis = 0)
I ran a couple of tests, and the slowdown seems to be due to an internal conversion of the list of arrays to a numpy array:
Timer unit: 1e-06 s
Total time: 1.21389 s
Line #      Hits         Time  Per Hit   % Time  Line Contents
     4                                           def direct_max(list_of_arrays):
     5      1000      1213886   1213.9    100.0      np.max(list_of_arrays, axis = 0)
Total time: 1.20766 s
Line #      Hits         Time  Per Hit   % Time  Line Contents
     8                                           def numpy_max(list_of_arrays):
     9      1000      1151281   1151.3     95.3      list_of_arrays = np.array(list_of_arrays)
    10      1000        56384     56.4      4.7      np.max(list_of_arrays, axis = 0)
Total time: 0.15437 s
Line #      Hits         Time  Per Hit   % Time  Line Contents
    12                                           #profile
    13                                           def stack_max(list_of_arrays):
    14      1000       102205    102.2     66.2      list_of_arrays = np.vstack(list_of_arrays)
    15      1000        52165     52.2     33.8      np.max(list_of_arrays, axis = 0)
Is there any way to speed up the max function or is it possible to populate a numpy array efficiently with the results of my calculation such that max is fast?
You can use reduce(np.maximum, matrix); here is a test:
import numpy as np
from functools import reduce  # a builtin on Python 2, must be imported on Python 3
N, M = 1000, 1000
matrix = [np.random.rand(N) for _ in range(M)]
%timeit np.max(matrix, axis = 0)
%timeit np.max(np.vstack(matrix), axis = 0)
%timeit reduce(np.maximum, matrix)
The result is:
10 loops, best of 3: 116 ms per loop
10 loops, best of 3: 10.6 ms per loop
100 loops, best of 3: 3.66 ms per loop
argmax() is more difficult, but you can use a for loop:
def argmax_list(matrix):
    m = matrix[0].copy()
    idx = np.zeros(len(m), dtype=int)
    for i, a in enumerate(matrix[1:], 1):
        mask = m < a
        m[mask] = a[mask]
        idx[mask] = i
    return idx
It's still faster than argmax():
%timeit np.argmax(matrix, axis=0)
%timeit np.argmax(np.vstack(matrix), axis=0)
%timeit argmax_list(matrix)
10 loops, best of 3: 131 ms per loop
10 loops, best of 3: 21 ms per loop
100 loops, best of 3: 13.1 ms per loop
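If only the maximum is needed, the intermediate list can be skipped entirely by keeping a running maximum and updating it in place as each array is computed. This is a sketch under assumptions: running_max is a hypothetical helper, and compute below is a stand-in for the question's own function, which is not shown in the source.

```python
import numpy as np

def running_max(arrays):
    """Element-wise maximum over an iterable of equal-shape arrays."""
    it = iter(arrays)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError("need at least one array")
    out = np.array(first, dtype=float)    # copy, so the input is not modified
    for arr in it:
        np.maximum(out, arr, out=out)     # in-place update, no temporaries
    return out

def compute(item):
    # Hypothetical stand-in for the question's compute(): a reproducible
    # random vector per item.
    return np.random.default_rng(item).random(1000)

pending = range(200)
# A generator expression keeps only one computed array alive at a time.
dominant = running_max(compute(item) for item in pending if item % 2 == 0)
print(dominant.shape)
```

This trades the single large stacked array for one output buffer, which may matter when the individual arrays are large or numerous.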

