I need to do a few hundred million euclidean distance calculations every day in a Python project.
Here is what I started out with:
def euclidean_dist_square(x, y):
diff = np.array(x) - np.array(y)
return np.dot(diff, diff)
This is quite fast and I already dropped the sqrt calculation since I need to rank items only (nearest-neighbor search). It is still the bottleneck of the script though. Therefore I have written a C extension, which calculates the distance. The calculation is always done with 128-dimensional vectors.
#include "euclidean.h"
#include <math.h>
double euclidean(double x[128], double y[128])
{
double Sum;
for(int i=0;i<128;i++)
{
Sum = Sum + pow((x[i]-y[i]),2.0);
}
return Sum;
}
Complete code for the extension is here: https://gist.github.com/herrbuerger/bd63b73f3c5cf1cd51de
Now this gives a nice speedup in comparison to the numpy version.
But is there any way to speed this up further (this is my first C extension ever so I assume there is)? With the number of times this function is used every day, every microsecond would actually provide a benefit.
Some of you might suggest porting this completely from Python to another language, unfortunately this is a larger project and not an option :(
Thanks.
Edit
I have posted this question on CodeReview: https://codereview.stackexchange.com/questions/52218/possible-optimizations-for-calculating-squared-euclidean-distance
I will delete this question in an hour in case someone has started to write an answer.
The fastest way to compute Euclidean distances in NumPy that I know is the one in scikit-learn, which can be summed up as
def squared_distances(X, Y):
"""Return a distance matrix for each pair of rows i, j in X, Y."""
# http://stackoverflow.com/a/19094808/166749
X_row_norms = np.einsum('ij,ij->i', X, X)
Y_row_norms = np.einsum('ij,ij->i', Y, Y)
distances = np.dot(X, Y)
distances *= -2
distances += X_row_norms
distances += Y_row_norms
np.maximum(distances, 0, distances) # get rid of negatives; optional
return distances
The bottleneck in this piece of code is matrix multiplication (np.dot), so make sure your NumPy is linked against a good BLAS implementation; with a multithreaded BLAS on a multicore machine and large enough input matrices, it should be faster than anything you can whip up in C. Note that it relies on the binomial formula
||x - y||² = ||x||² + ||y||² - 2 x⋅y
and that either X_row_norms or Y_row_norms can be cached across invocations for the k-NN use case.
(I'm a coauthor of this code, and I spent quite some time optimizing both it and the SciPy implementation; scikit-learn is faster at the expense of some accuracy, but for k-NN that shouldn't matter too much. The SciPy implementation, available in scipy.spatial.distance, is actually an optimized version of the code you just wrote and is more accurate.)
Related
I have the following problem. I have a function f defined in python using numpy functions. The function is smooth and integrable on positive reals. I want to construct the double antiderivative of the function (assuming that both the value and the slope of the antiderivative at 0 are 0) so that I can evaluate it on any positive real smaller than 100.
Definition of antiderivative of f at x:
integrate f(s) with s from 0 to x
Definition of double antiderivative of f at x:
integrate (integrate f(t) with t from 0 to s) with s from 0 to x
The actual form of f is not important, so I will use a simple one for convenience. But please note that even though my example has a known closed form, my actual function does not.
import numpy as np
f = lambda x: np.exp(-x)*x
My solution is to construct the antiderivative as an array using naive numerical integration:
N = 10000
delta = 100/N
xs = np.linspace(0,100,N+1)
vs = f(xs)
avs = np.cumsum(vs)*delta
aavs = np.cumsum(avs)*delta
This of course works but it gives me arrays instead of functions. But this is not a big problem as I can interpolate aavs using a spline to get a function and get rid of the arrays.
from scipy.interpolate import UnivariateSpline
aaf = UnivariateSpline(xs, aavs)
The function aaf is approximately the double antiderivative of f.
The problem is that even though it works, there is quite a bit of overhead before I can get my function and precision is expensive.
My other idea was to interpolate f by a spline and take the antiderivative of that, however this introduces numerical errors that are too big for what I want to use the function.
Is there any better way to do that? By better I mean faster without sacrificing accuracy.
Edit: What I hope is possible is to use some kind of Fourier transform to avoid integrating twice. I hope that there is some convenient transform of vs that allows to multiply the values component-wise with xs and transform back to get the double antiderivative. I played with this a bit, but I got lost.
Edit: I figured out that by using the trapezoidal rule instead of a naive sum, increases the accuracy quite a bit. Using Simpson's rule should increase the accuracy further, but it's somewhat fiddly to do with numpy arrays.
Edit: As #user202729 rightfully complains, this seems off. The reason it seems off is because I have skipped some details. I explain here why what I say makes sense, but it does not affect my question.
My actual goal is not to find the double antiderivative of f, but to find a transformation of this. I have skipped that because I think it only confuses the matter.
The function f decays exponentially as x approaches 0 or infinity. I am minimizing the numerical error in the integration by starting the sum from 0 and going up to approximately the peak of f. This ensure that the relative error is approximately constant. Then I start from the opposite direction from some very big x and go back to the peak. Then I do the same for the antiderivative values.
Then I transform the aavs by another function which is sensitive to numerical errors. Then I find the region where the errors are big (the values oscillate violently) and drop these values. Finally I approximate what I believe are good values by a spline.
Now if I use spline to approximate f, it introduces an absolute error which is the dominant term in a rather large interval. This gets "integrated" twice and it ends up being a rather large relative error in aavs. Then once I transform aavs, I find that the 'good region' has shrunk considerably.
EDIT: The actual form of f is something I'm still looking into. However, it is going to be a generalisation of the lognormal distribution. Right now I am playing with the following family.
I start by defining a generalization of the normal distribution:
def pdf_n(params, center=0.0, slope=8):
scale, min, diff = params
if diff > 0:
r = min
l = min + diff
else:
r = min - diff
l = min
def retfun(m):
x = (m - center)/scale
E = special.expit(slope*x)*(r - l) + l
return np.exp( -np.power(1 + x*x, E)/2 )
return np.vectorize(retfun)
It may not be obvious what is happening here, but the result is quite simple. The function decays as exp(-x^(2l)) on the left and as exp(-x^(2r)) on the right. For min=1 and diff=0, this is the normal distribution. Note that this is not normalized. Then I define
g = pdf(params)
f = np.vectorize(lambda x:g(np.log(x))/x/area)
where area is the normalization constant.
Note that this is not the actual code I use. I stripped it down to the bare minimum.
You can compute the two np.cumsum (and the divisions) at once more efficiently using Numba. This is significantly faster since there is no need for several temporary arrays to be allocated, filled, read again and freed. Here is a naive implementation:
import numba as nb
#nb.njit('float64[::1](float64[::1], float64)') # Assume vs is contiguous
def doubleAntiderivative_naive(vs, delta):
res = np.empty(vs.size, dtype=np.float64)
sum1, sum2 = 0.0, 0.0
for i in range(vs.size):
sum1 += vs[i] * delta
sum2 += sum1 * delta
res[i] = sum2
return res
However, the sum is not very good in term of numerical stability. A Kahan summation is needed to improve the accuracy (or possibly the alternative Kahan–Babuška-Klein algorithm if you are paranoid about the accuracy and performance do not matter so much). Note that Numpy use a pair-wise algorithm which is quite good but far from being prefect in term of accuracy (this is a good compromise for both performance and accuracy).
Moreover, delta can be factorized during in the summation (ie. the result just need to be premultiplied by delta**2).
Here is an implementation using the more accurate Kahan summation:
#nb.njit('float64[::1](float64[::1], float64)')
def doubleAntiderivative_accurate(vs, delta):
res = np.empty(vs.size, dtype=np.float64)
delta2 = delta * delta
sum1, sum2 = 0.0, 0.0
c1, c2 = 0.0, 0.0
for i in range(vs.size):
# Kahan summation of the antiderivative of vs
y1 = vs[i] - c1
t1 = sum1 + y1
c1 = (t1 - sum1) - y1
sum1 = t1
# Kahan summation of the double antiderivative of vs
y2 = sum1 - c2
t2 = sum2 + y2
c2 = (t2 - sum2) - y2
sum2 = t2
res[i] = sum2 * delta2
return res
Here is the performance of the approaches on my machine (with an i5-9600KF processor):
Numpy cumsum: 51.3 us
Naive Numba: 11.6 us
Accutate Numba: 37.2 us
Here is the relative error of the approaches (based on the provided input function):
Numpy cumsum: 1e-13
Naive Numba: 5e-14
Accutate Numba: 2e-16
Perfect precision: 1e-16 (assuming 64-bit numbers are used)
If f can be easily computed using Numba (this is the case here), then vs[i] can be replaced by calls to f (inlined by Numba). This helps to reduce the memory consumption of the computation (N can be huge without saturating your RAM).
As for the interpolation, the splines often gives good numerical result but they are quite expensive to compute and AFAIK they require the whole array to be computed (each item of the array impact all the spline although some items may have a negligible impact alone). Regarding your needs, you could consider using Lagrange polynomials. You should be careful when using Lagrange polynomials on the edges. In your case, you can easily solve the numerical divergence issue on the edges by extending the array size with the border values (since you know the derivative on each edges of vs is 0). You can apply the interpolation on the fly with this method which can be good for both performance (typically if the computation is parallelized) and memory usage.
First, I created a version of the code I found more intuitive. Here I multiply cumulative sum values by bin widths. I believe there is a small error in the original version of the code related to the bin width issue.
import numpy as np
f = lambda x: np.exp(-x)*x
N = 1000
xs = np.linspace(0,100,N+1)
domainwidth = ( np.max(xs) - np.min(xs) )
binwidth = domainwidth / N
vs = f(xs)
avs = np.cumsum(vs)*binwidth
aavs = np.cumsum(avs)*binwidth
Next, for visualization here is some very simple plotting code:
import matplotlib
import matplotlib.pyplot as plt
plt.figure()
plt.scatter( xs, vs )
plt.figure()
plt.scatter( xs, avs )
plt.figure()
plt.scatter( xs, aavs )
plt.show()
The first integral matches the known result of the example expression and can be seen on wolfram
Below is a simple function that extracts an element from the second derivative. Note that int is a bad rounding function. I assume this is what you have implemented already.
def extract_double_antideriv_value(x):
return aavs[int(x/binwidth)]
singleresult = extract_double_antideriv_value(50.24)
print('singleresult', singleresult)
Whatever full computation steps are required, we need to know them before we can start optimizing. Do you have a million different functions to integrate? If you only need to query a single double anti-derivative many times, your original solution should be fairly ideal.
Symbolic Approximation:
Have you considered approximations to the original function f, which can have closed form integration solutions? You have a limited domain on which the function lives. Perhaps approximate f with a Taylor series (which can be constructed with known maximum error) then integrate exactly? (consider Pade, Taylor, Fourier, Cheby, Lagrange(as suggested by another answer), etc...)
Log Tricks:
Another alternative to dealing with spiky errors, would be to take the log of your original function. Is f always positive? Is the integration error caused because the neighborhood around the max is very small? If so, you can study ln(f) or even ln(ln(f)) instead. It would really help to understand what f looks like more.
Approximation Integration Tricks
There exist countless integration tricks in general, which can make approximate closed form solutions to undo-able integrals. A very common one when exponetnial functions are involved (I think yours is expoential?) is to use Laplace's Method. But which trick to pull out of the bag is highly dependent upon the conditions which f satisfies.
I want to multiply two polynomials fast in python. As my polynomials are rather large (> 100000) elements and I have to multiply lots of them. Below, you will find my approach,
from numpy.random import seed, randint
from numpy import polymul, pad
from numpy.fft import fft, ifft
from timeit import default_timer as timer
length=100
def test_mul(arr_a,arr_b): #inbuilt python multiplication
c=polymul(arr_a,arr_b)
return c
def sb_mul(arr_a,arr_b): #my schoolbook multiplication
c=[0]*(len(arr_a) + len(arr_b) - 1 )
for i in range( len(arr_a) ):
for j in range( len(arr_b) ):
k=i+j
c[k]=c[k]+arr_a[i]*arr_b[j]
return c
def fft_test(arr_a,arr_b): #fft based polynomial multuplication
arr_a1=pad(arr_a,(0,length),'constant')
arr_b1=pad(arr_b,(0,length),'constant')
a_f=fft(arr_a1)
b_f=fft(arr_b1)
c_f=[0]*(2*length)
for i in range( len(a_f) ):
c_f[i]=a_f[i]*b_f[i]
return c_f
if __name__ == '__main__':
seed(int(timer()))
random=1
if(random==1):
x=randint(1,1000,length)
y=randint(1,1000,length)
else:
x=[1]*length
y=[1]*length
start=timer()
res=test_mul(x,y)
end=timer()
print("time for built in pol_mul", end-start)
start=timer()
res1=sb_mul(x,y)
end=timer()
print("time for schoolbook mult", end-start)
res2=fft_test(x,y)
print(res2)
#########check############
if( len(res)!=len(res1) ):
print("ERROR");
for i in range( len(res) ):
if( res[i]!=res1[i] ):
print("ERROR at pos ",i,"res[i]:",res[i],"res1[i]:",res1[i])
Now, here are my approach in detail,
1. First, I tried myself with a naive implementation of Schoolbook with complexity O(n^2). But as you may expect it turned out to be very slow.
Second, I came to know polymul in the Numpy library. This function is a lot faster than the previous one. But I realized this is also a O(n^2) complexity. You can see, if you increase the length k the time increases by k^2 times.
My third approach is to try a FFT based multiplication using the inbuilt FFT functions. I followed the the well known approach also described here but Iam not able to get it work.
Now my questions are,
Where am I going wrong in my FFT based approach? Can you please tell me how can I fix it?
Is my observation that polymul function has O(n^2) complexity correct?
Please, let me know if you have any question.
Thanks in advance.
Where am I going wrong in my FFT based approach? Can you please tell me how can I fix it?
The main problem is that in the FFT based approach, you should be taking the inverse transform after the multiplication, but that step is missing from your code. With this missing step your code should look like the following:
def fft_test(arr_a,arr_b): #fft based polynomial multiplication
arr_a1=pad(arr_a,(0,length),'constant')
arr_b1=pad(arr_b,(0,length),'constant')
a_f=fft(arr_a1)
b_f=fft(arr_b1)
c_f=[0]*(2*length)
for i in range( len(a_f) ):
c_f[i]=a_f[i]*b_f[i]
return ifft(c_f)
Note that there may also a few opportunities for improvements:
The zero padding can be handled directly by passing the required FFT length as the second argument (e.g. a_f = fft(arr_a, length))
The coefficient multiplication in your for loop may be directly handled by numpy.multiply.
If the polynomial coefficients are real-valued, then you can use numpy.fft.rfft and numpy.fft.irfft (instead of numpy.fft.fft and numpy.fft.ifft) for some extra performance boost.
So an implementation for real-valued inputs may look like:
from numpy.fft import rfft, irfft
def fftrealpolymul(arr_a, arr_b): #fft based real-valued polynomial multiplication
L = len(arr_a) + len(arr_b)
a_f = rfft(arr_a, L)
b_f = rfft(arr_b, L)
return irfft(a_f * b_f)
Is my observation that polymul function has O(n2) complexity correct?
That also seem to be the performance I am observing, and matches the available code in my numpy installation (version 1.15.4, and there doesn't seem any change in that part in the more recent 1.16.1 version).
Right now I am doing assignments from cs 231 n , and I wanted to calculate euclidean distance between points:
dists[i, j]=0
for k in range(3072):
dists[i, j]+=math.pow((X[i,k] - self.X_train[j,k]),2)
dists[i, j] = math.sqrt(dists[i,j])
however, this code is very slow. Then I tried
dists[i,j] = dist = np.linalg.norm(X[i,:] - self.X_train[j,:])
which is way faster. The question is why? Doesn't np.linalg.norm also loop through all coordinates of all points, subtracts, puts into power, sums and squares them? Could someone give me a detailed answer : is it because of how does np.linalg.norm access elements or there is other reason?
NumPy can do the entire calculation in one fell swoop in optimized, accelerated (e.g. SSE, AVX, what-have-you) C code.
The original code does all of its work in Python (aside from the math functions, which are implemented in C, but also take time roundtripping Python objects), which just, well, is slower.
I am relatively new to python and am interested in any ideas to optimize and speed up this function. I have to call it tens~hundreds of thousands of times for a numerical computation I am doing and it takes a major fraction of the code's overall computational time.
I have written this in c, but I am interested to see any tricks to make it run faster in python specifically.
This code calculates a stereographic projection of a bigD-length vector to a littleD-length vector, per http://en.wikipedia.org/wiki/Stereographic_projection. The variable a is a numpy array of length ~ 96.
import numpy as np
def nsphere(a):
bigD = len(a)
littleD = 3
temp = a
# normalize before calculating projection
temp = temp/np.sqrt(np.dot(temp,temp))
# calculate projection
for i in xrange(bigD-littleD + 2,2,-1 ):
temp = temp[0:-1]/(1.0 - temp[-1])
return temp
#USAGE:
q = np.random.rand(96)
b = nsphere(q)
print b
This should be faster:
def nsphere(a, littleD=3):
a = a / np.sqrt(np.dot(a, a))
z = a[littleD:].sum()
return a[:littleD] / (1. - z)
Please do the math to double check that this is in fact the same as your iterative algorithm.
Obviously the main speedup here is going to come from the fact that this is a O(n) algorithm that replaces your O(n**2) algorithm for computing the projection. But specifically to speeding things up in python, you want to "vectorize your inner loop". Meaning try and avoid loops and anything else that is going to have high python overhead in the most performance critical parts of your code and instead try and use python and numpy builtins which are highly optimized. Hope that helps.
I'm trying to interpolate some data for the purpose of plotting. For instance, given N data points, I'd like to be able to generate a "smooth" plot, made up of 10*N or so interpolated data points.
My approach is to generate an N-by-10*N matrix and compute the inner product the original vector and the matrix I generated, yielding a 1-by-10*N vector. I've already worked out the math I'd like to use for the interpolation, but my code is pretty slow. I'm pretty new to Python, so I'm hopeful that some of the experts here can give me some ideas of ways I can try to speed up my code.
I think part of the problem is that generating the matrix requires 10*N^2 calls to the following function:
def sinc(x):
import math
try:
return math.sin(math.pi * x) / (math.pi * x)
except ZeroDivisionError:
return 1.0
(This comes from sampling theory. Essentially, I'm attempting to recreate a signal from its samples, and upsample it to a higher frequency.)
The matrix is generated by the following:
def resampleMatrix(Tso, Tsf, o, f):
from numpy import array as npar
retval = []
for i in range(f):
retval.append([sinc((Tsf*i - Tso*j)/Tso) for j in range(o)])
return npar(retval)
I'm considering breaking up the task into smaller pieces because I don't like the idea of an N^2 matrix sitting in memory. I could probably make 'resampleMatrix' into a generator function and do the inner product row-by-row, but I don't think that will speed up my code much until I start paging stuff in and out of memory.
Thanks in advance for your suggestions!
This is upsampling. See Help with resampling/upsampling for some example solutions.
A fast way to do this (for offline data, like your plotting application) is to use FFTs. This is what SciPy's native resample() function does. It assumes a periodic signal, though, so it's not exactly the same. See this reference:
Here’s the second issue regarding time-domain real signal interpolation, and it’s a big deal indeed. This exact interpolation algorithm provides correct results only if the original x(n) sequence is periodic within its full time interval.
Your function assumes the signal's samples are all 0 outside of the defined range, so the two methods will diverge away from the center point. If you pad the signal with lots of zeros first, it will produce a very close result. There are several more zeros past the edge of the plot not shown here:
Cubic interpolation won't be correct for resampling purposes. This example is an extreme case (near the sampling frequency), but as you can see, cubic interpolation isn't even close. For lower frequencies it should be pretty accurate.
If you want to interpolate data in a quite general and fast way, splines or polynomials are very useful. Scipy has the scipy.interpolate module, which is very useful. You can find many examples in the official pages.
Your question isn't entirely clear; you're trying to optimize the code you posted, right?
Re-writing sinc like this should speed it up considerably. This implementation avoids checking that the math module is imported on every call, doesn't do attribute access three times, and replaces exception handling with a conditional expression:
from math import sin, pi
def sinc(x):
return (sin(pi * x) / (pi * x)) if x != 0 else 1.0
You could also try avoiding creating the matrix twice (and holding it twice in parallel in memory) by creating a numpy.array directly (not from a list of lists):
def resampleMatrix(Tso, Tsf, o, f):
retval = numpy.zeros((f, o))
for i in xrange(f):
for j in xrange(o):
retval[i][j] = sinc((Tsf*i - Tso*j)/Tso)
return retval
(replace xrange with range on Python 3.0 and above)
Finally, you can create rows with numpy.arange as well as calling numpy.sinc on each row or even on the entire matrix:
def resampleMatrix(Tso, Tsf, o, f):
retval = numpy.zeros((f, o))
for i in xrange(f):
retval[i] = numpy.arange(Tsf*i / Tso, Tsf*i / Tso - o, -1.0)
return numpy.sinc(retval)
This should be significantly faster than your original implementation. Try different combinations of these ideas and test their performance, see which works out the best!
I'm not quite sure what you're trying to do, but there are some speedups you can do to create the matrix. Braincore's suggestion to use numpy.sinc is a first step, but the second is to realize that numpy functions want to work on numpy arrays, where they can do loops at C speen, and can do it faster than on individual elements.
def resampleMatrix(Tso, Tsf, o, f):
retval = numpy.sinc((Tsi*numpy.arange(i)[:,numpy.newaxis]
-Tso*numpy.arange(j)[numpy.newaxis,:])/Tso)
return retval
The trick is that by indexing the aranges with the numpy.newaxis, numpy converts the array with shape i to one with shape i x 1, and the array with shape j, to shape 1 x j. At the subtraction step, numpy will "broadcast" the each input to act as a i x j shaped array and the do the subtraction. ("Broadcast" is numpy's term, reflecting the fact no additional copy is made to stretch the i x 1 to i x j.)
Now the numpy.sinc can iterate over all the elements in compiled code, much quicker than any for-loop you could write.
(There's an additional speed-up available if you do the division before the subtraction, especially since inthe latter the division cancels the multiplication.)
The only drawback is that you now pay for an extra Nx10*N array to hold the difference. This might be a dealbreaker if N is large and memory is an issue.
Otherwise, you should be able to write this using numpy.convolve. From what little I just learned about sinc-interpolation, I'd say you want something like numpy.convolve(orig,numpy.sinc(numpy.arange(j)),mode="same"). But I'm probably wrong about the specifics.
If your only interest is to 'generate a "smooth" plot' I would just go with a simple polynomial spline curve fit:
For any two adjacent data points the coefficients of a third degree polynomial function can be computed from the coordinates of those data points and the two additional points to their left and right (disregarding boundary points.) This will generate points on a nice smooth curve with a continuous first dirivitive. There's a straight forward formula for converting 4 coordinates to 4 polynomial coefficients but I don't want to deprive you of the fun of looking it up ;o).
Here's a minimal example of 1d interpolation with scipy -- not as much fun as reinventing, but.
The plot looks like sinc, which is no coincidence:
try google spline resample "approximate sinc".
(Presumably less local / more taps ⇒ better approximation,
but I have no idea how local UnivariateSplines are.)
""" interpolate with scipy.interpolate.UnivariateSpline """
from __future__ import division
import numpy as np
from scipy.interpolate import UnivariateSpline
import pylab as pl
N = 10
H = 8
x = np.arange(N+1)
xup = np.arange( 0, N, 1/H )
y = np.zeros(N+1); y[N//2] = 100
interpolator = UnivariateSpline( x, y, k=3, s=0 ) # s=0 interpolates
yup = interpolator( xup )
np.set_printoptions( 1, threshold=100, suppress=True ) # .1f
print "yup:", yup
pl.plot( x, y, "green", xup, yup, "blue" )
pl.show()
Added feb 2010: see also basic-spline-interpolation-in-a-few-lines-of-numpy
Small improvement. Use the built-in numpy.sinc(x) function which runs in compiled C code.
Possible larger improvement: Can you do the interpolation on the fly (as the plotting occurs)? Or are you tied to a plotting library that only accepts a matrix?
I recommend that you check your algorithm, as it is a non-trivial problem. Specifically, I suggest you gain access to the article "Function Plotting Using Conic Splines" (IEEE Computer Graphics and Applications) by Hu and Pavlidis (1991). Their algorithm implementation allows for adaptive sampling of the function, such that the rendering time is smaller than with regularly spaced approaches.
The abstract follows:
A method is presented whereby, given a
mathematical description of a
function, a conic spline approximating
the plot of the function is produced.
Conic arcs were selected as the
primitive curves because there are
simple incremental plotting algorithms
for conics already included in some
device drivers, and there are simple
algorithms for local approximations by
conics. A split-and-merge algorithm
for choosing the knots adaptively,
according to shape analysis of the
original function based on its
first-order derivatives, is
introduced.