I've recently been trying to compute the distances to the top 2 nearest neighbors in Python with Numba, as follows:
import numpy as np
from numba import jit

@jit(nopython=True)
def _latent_dim_kernel(data, pointers, indices, nrange, sampling_percentage=1):
    pdists_t2 = np.zeros((nrange, 2))
    for a in range(nrange):
        rct = 0
        for b in range(nrange):
            if np.random.random() > 1 - sampling_percentage:
                if a == b:
                    continue
                r1 = _get_sparse_row(a, data, pointers, indices)
                r2 = _get_sparse_row(b, data, pointers, indices)
                dist = np.linalg.norm(r2 - r1)
                if rct > 1:
                    if pdists_t2[a, 0] > dist:
                        pdists_t2[a, 0] = dist
                    elif pdists_t2[a, 1] > dist:
                        pdists_t2[a, 1] = dist
                else:
                    pdists_t2[a, rct] = dist
                    rct += 1
    return pdists_t2
The data, pointers and indices arguments are the x.data, x.indptr and x.indices of a scipy CSR matrix.
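(The helper _get_sparse_row is not shown in the question. A minimal sketch of what it presumably does, densifying one CSR row, is below; the ncols argument is my addition, since the question's version must obtain the column count some other way.)

@jit(nopython=True)
def _get_sparse_row(i, data, pointers, indices, ncols):
    # Hypothetical reconstruction: densify row i of a CSR matrix.
    row = np.zeros(ncols)
    for k in range(pointers[i], pointers[i + 1]):
        row[indices[k]] = data[k]
    return row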
This works fine; however, it is substantially slower than doing
np.sort(squareform(pdist(matrix)), axis=1)[:, 1:3]
How can I speed this up further without additional memory overhead?
Thanks!
Make use of pairwise distances from sklearn.
Pairwise distances of sparse matrices are supported, so no dense temporary array is needed.
This algorithm uses an algebraic reformulation like in this answer.
It can be a lot faster on high-dimensional problems like yours (20k features), since most of the calculation is done within a highly optimized matrix-matrix product.
Check whether this method is precise enough; it is less numerically stable than the "naive" approach pdist uses.
Example
import numpy as np
from scipy import sparse
from sklearn import metrics
from scipy.spatial import distance
matrix=sparse.random(1_000, 20_000, density=0.05, format='csr', dtype=np.float64)
%%timeit
dist_2=distance.squareform(distance.pdist(matrix.todense()))
dist_2.sort(axis=1)
dist_2=dist_2[:,1:3]
#10.1 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dist=metrics.pairwise.euclidean_distances(matrix,squared=True)
dist.sort(axis=1)
dist=np.sqrt(dist[:,1:3])
#401 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
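If only the two smallest nonzero distances per row are needed, a further tweak (my addition, not part of the benchmark above) is np.partition, which avoids the full row-wise sort:

dist = metrics.pairwise.euclidean_distances(matrix, squared=True)
part = np.partition(dist, 2, axis=1)[:, :3]  # three smallest per row, unordered
part.sort(axis=1)                            # cheap: only three columns
dist_top2 = np.sqrt(part[:, 1:3])            # drop the zero self-distance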
I am trying to optimize some code that uses logs (the mathematical kind, not the timestamp record kind :)) and I found something strange that I haven't been able to find any answers for online. We have log(a/b) = log(a) - log(b), so I have written some code to compare the performance of the two methods.
import numpy as np
import numba as nb

# create some large random walk data
x = np.random.normal(0, 0.1, int(1e7))
x = abs(x.min()) + 100 + x  # make all values >= 100

@nb.njit
def subtract_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    for t in range(tau, arr.shape[0]):
        a = np.log(arr[t]) - np.log(arr[t - tau])
    return None

@nb.njit
def divide_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    for t in range(tau, arr.shape[0]):
        a = np.log(arr[t] / arr[t - tau])
    return None
%timeit subtract_log(x, 100)
>>> 252 ns ± 0.319 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit divide_log(x, 100)
>>> 5.57 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So we see that subtracting logs is ~20,000 times faster than taking the log of the quotient. I find this strange because I would have thought that when subtracting logs, the log series approximation would have to be calculated twice. But perhaps it's something to do with how numpy broadcasts operations?
The above example is trivial as we don't do anything with the result of the calculation. Below is a more realistic example where we return the result of the calculation.
@nb.njit
def subtract_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    out = np.empty(arr.shape[0] - tau)
    for t in range(tau, arr.shape[0]):
        f = t - tau
        out[f] = np.log(arr[t]) - np.log(arr[f])
    return out

@nb.njit
def divide_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    out = np.empty(arr.shape[0] - tau)
    for t in range(tau, arr.shape[0]):
        f = t - tau
        out[f] = np.log(arr[t] / arr[f])
    return out
out1 = subtract_log(x, 100)
out2 = divide_log(x, 100)
np.testing.assert_allclose(out1, out2, atol=1e-8) # True
%timeit subtract_log(x, 100)
>>> 129 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit divide_log(x, 100)
>>> 93.4 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now we see the times are the same order of magnitude, but subtracting logs is some 40% slower than dividing.
Can anyone explain these discrepancies?
Why is subtracting logs so much faster than taking the log of the quotient in the trivial case?
Why is subtracting logs some 40% slower than taking the log of the quotient when we store the values in an array? I know there is a significant setup cost in allocating an array with np.empty(): adding the allocation to subtract_log() in the trivial case, without storing any values in it, brings the time up from 252 ns to 311 µs.
Don't measure "useless" things; a compiler may optimize them away completely.
If you turn off the division-by-zero check (error_model="numpy"), both functions take about 280 ns, not because the calculation is fast, but because they are actually doing nothing.
Optimizing away useless calculations is expected, but sometimes LLVM can't detect all of it.
In the second case you are comparing the runtime of two logarithms to one logarithm plus one division (the subtractions/additions, as well as the multiplications, are a lot faster). There can be differences in calculation time depending on the log implementation and the processor. Also have a look at the results: they are not exactly the same.
At least for a float64 division (FDIV) you can have a look at the instruction tables
from Agner Fog.
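As an illustration of the error_model point (my sketch, not part of the original timings), the check is disabled via the decorator:

import numpy as np
import numba as nb

@nb.njit(error_model="numpy")  # skip the Python-level ZeroDivisionError check
def divide_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    for t in range(tau, arr.shape[0]):
        a = np.log(arr[t] / arr[t - tau])
    return None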
I want to compute the squared Euclidean distance in a fast way, as described here:
What is the fastest way to compute an RBF kernel in python?
Note1: I am only interested in the distance, not the RBF kernel.
Note2: I am neglecting numexpr here and only use numpy directly.
In short, I compute:
|| x - y ||^2 = ||x||^2 + ||y||^2 - 2. * (x @ y.T)
I am able to compute the distance matrix a factor of ~10 faster than scipy.pdist with this. However, I observe numerical issues, which get worse when I take the square root to get the Euclidean distance. I get values on the order of 1E-8 to 1E-7 that should be exactly zero (i.e. for duplicated points or the distance of a point to itself).
Question:
Are there ways or ideas to overcome these numerical issues (preferably without sacrificing too much of the evaluation speed)? Or are the numerical issues the reason why this path is not taken (e.g. by scipy.pdist) in the first place?
Example:
This is a small code example to show the numerical issues (not the speed-ups; for those, please look at the answers in the SO thread linked above).
import numpy as np
M = np.random.rand(1000, 10)
M_norm = np.sum(M**2, axis=1)
res = M_norm[:, np.newaxis] + M_norm[np.newaxis, :] - 2. * M @ M.T
unique = np.unique(np.diag(res)) # analytically all diag values are exactly zero
sqrt_unique = np.sqrt(unique)
print(unique)
print(sqrt_unique)
Example output:
[-2.66453526e-15 -1.77635684e-15 -8.88178420e-16 -4.44089210e-16
0.00000000e+00 4.44089210e-16 8.88178420e-16 1.77635684e-15
3.55271368e-15]
[ nan nan nan nan
0.00000000e+00 2.10734243e-08 2.98023224e-08 4.21468485e-08
5.96046448e-08]
As you can see, some values are also negative (which results in nan after taking the sqrt). Of course these are easy to catch, but the small positive values carry a large absolute error for the Euclidean case (e.g. abs_error=5.96046448e-08).
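For illustration (my addition), catching them can be as simple as clamping in place before the square root:

np.maximum(res, 0., out=res)  # clamp tiny negative round-off to zero
np.sqrt(res, out=res)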
As per my comment, using abs is probably your best option for cleaning up the numerical noise inherent in this algorithm. Since you're concerned about performance, you should probably use the in-place (mutating) assignment operators, as they create less garbage and hence can be much faster. Also, when running this with many features (e.g. 10k) I see pdist being slower than this implementation.
Putting the above together we get:
import numpy as np

def edist0(M):
    "calculate pairwise euclidean distance"
    M_norm = np.sum(M**2, axis=1)
    res = M_norm[:, np.newaxis] + M_norm[np.newaxis, :] - 2. * M @ M.T
    return np.sqrt(np.abs(res))

def edist1(M):
    "optimised calculation of pairwise euclidean distance"
    M_norm = np.einsum('ij,ij->i', M, M)
    res = M @ M.T
    res *= -2.
    res += M_norm[:, np.newaxis]
    res += M_norm[np.newaxis, :]
    return np.sqrt(np.abs(res, out=res), out=res)
timing this in IPython with:
from scipy.spatial import distance
M = np.random.rand(1000, 10000)
%timeit distance.squareform(distance.pdist(M))
%timeit edist0(M)
%timeit edist1(M)
I get:
2.82 s ± 60.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
296 ms ± 6.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
153 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
and no errors or warnings from sqrt.
The linked question also points to scikit-learn as having good distance kernel implementations; the Euclidean one is pairwise_distances, which benchmarks as:
from sklearn.metrics import pairwise_distances
%timeit pairwise_distances(M)
170 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This might be nice to use if you're already using that package.
I want to make the code below more efficient, but I'm not sure how. I want to use numpy and native python libraries only.
iterations = 100
aggregation = 0
for i in range(iterations):
    aggregation += np.sum(np.linalg.norm(dat[dat_filter == i] - dat_points[i], axis=1))
dat is an n x D matrix,
dat_filter is a vector of length n containing a class identifier from 0 to iterations, and
dat_points is an iterations x D matrix.
Basically, I am computing the distances between the rows of dat, each of which belongs to a class, and the reference point of that class.
It isn't very easy to vectorize the problem fully, since you take square roots over groups of your data that are not necessarily the same length. You could vectorize parts of it for a small speed-up:
import numpy as np

# Make some data
n = 200000
d = 100
n_it = 2000
np.random.seed(42)
dat = np.random.random((n, d))
dat_filter = np.random.randint(0, n_it, size=n)
dat_points = np.random.random((n_it, d))

def slow(dat, dat_filter, dat_points, iterations):
    aggregation = 0
    for i in range(iterations):
        # Wrote linalg.norm as standard numpy operations,
        # such that numba can be used on the code as well
        aggregation += np.sum(np.sqrt(np.sum((dat[dat_filter == i] - dat_points[i])**2, axis=1)))
    return aggregation

def fast(dat, dat_filter, dat_points, iterations):
    # Rearrange the arrays such that the correct operations are done
    sort_idx = np.argsort(dat_filter)
    filtered_dat_squared_sum = np.sum((dat - dat_points[dat_filter])**2, axis=1)[sort_idx]
    # Count the number of different 'iterations'
    counts = np.unique(dat_filter, return_counts=True)[1]
    aggregation = 0
    idx = 0
    for c in counts:
        aggregation += np.sum(np.sqrt(filtered_dat_squared_sum[idx:idx+c]))
        idx += c
    return aggregation
timings:
In [1]: %timeit slow(dat, dat_filter, dat_points, n_it)
3.47 s ± 314 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [2]: %timeit fast(dat, dat_filter, dat_points, n_it)
846 ms ± 81.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using numba on the slow function speeds it up slightly, but it is still not as fast as the fast method; a sketch of that numba variant is shown below. Numba on the fast function makes the call slower for the matrix sizes I tested.
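For reference, the numba variant of the slow loop might look like this (a sketch under the assumption that nb.njit is applied to the otherwise unchanged function):

import numba as nb

@nb.njit
def slow_nb(dat, dat_filter, dat_points, n_it):
    # Same loop as slow(), written with plain numpy operations so numba can compile it.
    aggregation = 0.0
    for i in range(n_it):
        diff = dat[dat_filter == i] - dat_points[i]
        aggregation += np.sum(np.sqrt(np.sum(diff**2, axis=1)))
    return aggregation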
Not sure I titled this well, but basically I have a reference coordinate in the format (x, y, z) and a large list/array of coordinates in the same format. I need to get the Euclidean distance between the reference and each coordinate, so with numpy and scipy I should in theory be able to do an operation such as:
import numpy, scipy.spatial.distance
a = numpy.array([1,1,1])
b = numpy.random.rand(20,3)
distances = scipy.spatial.distance.euclidean(b, a)
But instead of getting an array back I get an error: ValueError: Input vector should be 1-D.
Not sure how to resolve this error and get what I want without having to resort to loops and such, which sort of defeats the purpose of using Numpy.
Long term I want to use these distances to calculate truth masks for counting distance values in bins.
I'm not sure if I'm just using the function wrong or using the wrong function; I haven't been able to find anything in the documentation that would work better.
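(For the binning mentioned above: once a 1-D array of distances exists, np.histogram or an explicit truth mask per bin covers it. A sketch, with made-up bin edges:)

counts, edges = np.histogram(distances, bins=10, range=(0.0, 2.0))
# or an explicit truth mask for a single bin:
in_first_bin = (distances >= edges[0]) & (distances < edges[1])
n_in_first_bin = in_first_bin.sum()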
The documentation of scipy.spatial.distance.euclidean states that only 1-D vectors are allowed as inputs. Thus you must loop over your arrays, like:
distances = np.empty(b.shape[0])
for i in range(b.shape[0]):
    distances[i] = scipy.spatial.distance.euclidean(a, b[i])
If you want a vectorized implementation, you need to write your own function. Perhaps using np.vectorize with a correct signature will also work (a sketch follows below), but this is in fact just shorthand for a for-loop and will thus have the same performance as a simple for-loop.
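A sketch of that np.vectorize variant (my illustration, reusing a and b from the question):

import numpy as np
import scipy.spatial.distance

vec_euclidean = np.vectorize(scipy.spatial.distance.euclidean, signature='(d),(d)->()')
distances = vec_euclidean(a, b)  # broadcasts a against every row of b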
As stated in my comment on hannes wittingham's solution, I'll post a one-liner which is focused on performance:
distances = ((b - a)**2).sum(axis=1)**0.5
Writing out all the calculations reduces the number of separate function calls, and thus the number of intermediate results assigned to new arrays. It is about 22% faster than hannes wittingham's solution for an array of shape b.shape == (20, 3) and about 5% faster for b.shape == (20000, 3):
a = np.array([1, 1, 1,])
b = np.random.rand(20, 3)
%timeit ((b - a)**2).sum(axis=1)**0.5
# 5.37 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit euclidean_distances(a, b)
# 6.89 µs ± 345 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
b = np.random.rand(20000, 3)
%timeit ((b - a)**2).sum(axis=1)**0.5
# 588 µs ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distances(a, b)
# 616 µs ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But you lose the flexibility of being able to easily switch the distance calculation routine. When using the scipy.spatial.distance module, you can change the calculation routine by simply calling another function, for example:
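A sketch (my illustration) of switching metrics with scipy.spatial.distance.cdist, which expects 2-D inputs:

from scipy.spatial import distance

d_euclid = distance.cdist(b, a[np.newaxis, :], metric='euclidean')
d_manhattan = distance.cdist(b, a[np.newaxis, :], metric='cityblock')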
To improve the calculation performance even further, you can use a JIT (just-in-time) compiler like numba for your functions:
import numba as nb

@nb.njit
def euc(a, b):
    return ((b - a)**2).sum(axis=1)**0.5
This reduces the time needed to do the calculations by about 70% for small arrays and by about 60% for large arrays. Unfortunately, the axis keyword of np.linalg.norm is not yet supported by numba.
It's not actually too hard to write your own function to do this - here's mine, which you're welcome to use.
If you are carrying out this operation over a large number of points and speed matters, I would guess this function will beat a for-loop based solution for speed by a long way - numpy is designed to be efficient when carrying out operations on a whole matrix.
import numpy

a = numpy.array([1, 1, 1])
b = numpy.random.rand(20, 3)

def euclidean_distances(ref_point, co_ords_array):
    diffs = co_ords_array - ref_point
    sqrd_diffs = numpy.square(diffs)
    sum_sqrd_diffs = numpy.sum(sqrd_diffs, axis=1)
    euc_dists = numpy.sqrt(sum_sqrd_diffs)
    return euc_dists
This code gets the Euclidean norm, should work in many cases, is fairly quick, and fits on one line. Other methods are more efficient or flexible depending on the needs, and I would favour some of the other solutions posted depending on the work being done.
import numpy
a = numpy.array([1,1,1])
b = numpy.random.rand(20,3)
distances = numpy.linalg.norm(a - b, axis = 1)
Note the extra set of [] in the definition of a, which makes it a 2-D array as cdist requires:
import numpy, scipy.spatial.distance
a = numpy.array([[1,1,1]])
b = numpy.random.rand(20,3)
distances = scipy.spatial.distance.cdist(b, a, metric='euclidean')
I have a very simple problem: in my Python toolbox I have to compute the values of polynomials (usually degree 3 or 2, seldom others, always integer degree) on a large vector (size >> 10^6). Storing the result in a buffer is not an option, because I have several of these vectors and would quickly run out of memory, and I usually have to compute each one only once in any case. The performance of numpy.polyval is actually quite good, but it is still my bottleneck. Can I somehow make the evaluation of the polynomial faster?
Addendum
I think that the pure-numpy solution of Joe Kington is good for me, in particular because it avoids potential installation issues with other libraries or Cython. For those who asked, the numbers in the vector are large (order 10^4), so I don't think that the suggested approximations would work.
You actually can speed it up slightly by doing the operations in-place (or using numexpr or numba which will automatically do what I'm doing manually below).
numpy.polyval is a very short function. Leaving out a few type checks, etc, it amounts to:
def polyval(p, x):
    y = np.zeros_like(x)
    for i in range(len(p)):
        y = x * y + p[i]
    return y
The downside to this approach is that a temporary array will be created inside the loop as opposed to doing the operation in-place.
What I'm about to do is a micro-optimization and is only worthwhile for very large x inputs. Furthermore, we'll have to assume floating-point output instead of letting the upcasting rules determine the output's dtype. However, it will speed this up slightly and make it use less memory:
def faster_polyval(p, x):
    y = np.zeros(x.shape, dtype=float)
    for i, v in enumerate(p):
        y *= x
        y += v
    return y
As an example, let's say we have the following input:
# Third order polynomial
p = [4.5, 9.8, -9.2, 1.2]
# One-million element array
x = np.linspace(-10, 10, int(1e6))
The results are identical:
In [3]: np_result = np.polyval(p, x)
In [4]: new_result = faster_polyval(p, x)
In [5]: np.allclose(np_result, new_result)
Out[5]: True
And we get a modest 2-3x speedup (which is mostly independent of array size, as it relates to memory allocation, not number of operations):
In [6]: %timeit np.polyval(p, x)
10 loops, best of 3: 20.7 ms per loop
In [7]: %timeit faster_polyval(p, x)
100 loops, best of 3: 7.46 ms per loop
For really huge inputs, the memory usage difference will matter more than the speed differences. The "bare" numpy version will use ~2x more memory at peak usage than the faster_polyval version.
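For reference, the numexpr route mentioned at the top could look like this for a fixed cubic (my sketch; numexpr fuses the whole Horner expression into a single pass over x without temporaries):

import numexpr as ne

def numexpr_polyval3(p, x):
    # Hypothetical helper for a degree-3 polynomial in Horner form.
    c3, c2, c1, c0 = p
    return ne.evaluate("((c3 * x + c2) * x + c1) * x + c0")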
I ended up here when I wanted to know whether np.polyval or np.polynomial.polynomial.polyval is faster.
It is interesting to see that simple implementations are faster, as @Joe Kington shows (I had hoped for some optimisation by numpy).
So here is my comparison with np.polynomial.polynomial.polyval and a slightly faster version.
def fastest_polyval(x, a):
    y = a[-1]
    for ai in a[-2::-1]:
        y *= x
        y += ai
    return y
It avoids the initial zero array and needs one fewer loop iteration.
y_np = np.polyval(p, x)
y_faster = faster_polyval(p, x)
prev = 1 * p[::-1] # reverse coefficients
y_np2 = np.polynomial.polynomial.polyval(x, prev)
y_fastest = fastest_polyval(x, prev)
np.allclose(y_np, y_faster), np.allclose(y_np, y_np2), np.allclose(y_np, y_fastest)
# (True, True, True)
%timeit np.polyval(p, x)
%timeit faster_polyval(p, x)
%timeit np.polynomial.polynomial.polyval(x, prev)
%timeit fastest_polyval(x, prev)
# 6.51 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3.69 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 6.28 ms ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2.65 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)