Why does the performance of the same loop algorithm differ?

Right now I am doing assignments from CS231n, and I wanted to calculate the Euclidean distance between points:
dists[i, j] = 0
for k in range(3072):
    dists[i, j] += math.pow(X[i, k] - self.X_train[j, k], 2)
dists[i, j] = math.sqrt(dists[i, j])
However, this code is very slow. Then I tried
dists[i, j] = np.linalg.norm(X[i, :] - self.X_train[j, :])
which is way faster. The question is: why? Doesn't np.linalg.norm also loop through all coordinates of all points, subtract, square, sum, and take the square root? Could someone give me a detailed answer: is it because of how np.linalg.norm accesses elements, or is there another reason?

NumPy can do the entire calculation in one fell swoop in optimized, accelerated (e.g. SSE, AVX, what-have-you) C code.
The original code does all of its work in Python (aside from the math functions, which are implemented in C, but also take time roundtripping Python objects), which just, well, is slower.
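A rough way to see the difference is to time both versions on a single pair of vectors. The sketch below uses made-up random data of length 3072 (the dimension from the question); the names x, y, loop_distance and numpy_distance are illustrative only, and the actual timings will vary by machine.

```python
import math
import timeit

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(3072)  # stand-in for X[i, :]
y = rng.random(3072)  # stand-in for self.X_train[j, :]


def loop_distance(a, b):
    """Euclidean distance with an explicit Python loop, as in the question."""
    total = 0.0
    for k in range(a.shape[0]):
        # Every iteration indexes the arrays and creates Python-level objects.
        total += math.pow(a[k] - b[k], 2)
    return math.sqrt(total)


def numpy_distance(a, b):
    """Same distance; the subtraction and the norm each run as one C-level loop."""
    return np.linalg.norm(a - b)


d_loop = loop_distance(x, y)
d_numpy = numpy_distance(x, y)

t_loop = timeit.timeit(lambda: loop_distance(x, y), number=20)
t_numpy = timeit.timeit(lambda: numpy_distance(x, y), number=20)
print("loop: %.4f s, numpy: %.4f s" % (t_loop, t_numpy))
```

Both functions return the same value up to rounding; only the place where the loop runs differs.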

Related

Possible optimizations for calculating squared euclidean distance

I need to do a few hundred million euclidean distance calculations every day in a Python project.
Here is what I started out with:
def euclidean_dist_square(x, y):
    diff = np.array(x) - np.array(y)
    return np.dot(diff, diff)
This is quite fast and I already dropped the sqrt calculation since I need to rank items only (nearest-neighbor search). It is still the bottleneck of the script though. Therefore I have written a C extension, which calculates the distance. The calculation is always done with 128-dimensional vectors.
#include "euclidean.h"
#include <math.h>
double euclidean(double x[128], double y[128])
{
    double Sum = 0.0;  /* must be initialized before accumulating */
    for(int i=0;i<128;i++)
    {
        Sum = Sum + pow((x[i]-y[i]),2.0);
    }
    return Sum;
}
Complete code for the extension is here: https://gist.github.com/herrbuerger/bd63b73f3c5cf1cd51de
Now this gives a nice speedup in comparison to the numpy version.
But is there any way to speed this up further (this is my first C extension ever so I assume there is)? With the number of times this function is used every day, every microsecond would actually provide a benefit.
Some of you might suggest porting this completely from Python to another language, unfortunately this is a larger project and not an option :(
Thanks.
Edit
I have posted this question on CodeReview: https://codereview.stackexchange.com/questions/52218/possible-optimizations-for-calculating-squared-euclidean-distance
I will delete this question in an hour in case someone has started to write an answer.

The fastest way to compute Euclidean distances in NumPy that I know is the one in scikit-learn, which can be summed up as
def squared_distances(X, Y):
    """Return a distance matrix for each pair of rows i, j in X, Y."""
    # http://stackoverflow.com/a/19094808/166749
    X_row_norms = np.einsum('ij,ij->i', X, X)
    Y_row_norms = np.einsum('ij,ij->i', Y, Y)
    distances = np.dot(X, Y.T)
    distances *= -2
    distances += X_row_norms[:, np.newaxis]
    distances += Y_row_norms[np.newaxis, :]
    np.maximum(distances, 0, distances)  # get rid of negatives; optional
    return distances
The bottleneck in this piece of code is matrix multiplication (np.dot), so make sure your NumPy is linked against a good BLAS implementation; with a multithreaded BLAS on a multicore machine and large enough input matrices, it should be faster than anything you can whip up in C. Note that it relies on the binomial formula
||x - y||² = ||x||² + ||y||² - 2 x⋅y
and that either X_row_norms or Y_row_norms can be cached across invocations for the k-NN use case.
(I'm a coauthor of this code, and I spent quite some time optimizing both it and the SciPy implementation; scikit-learn is faster at the expense of some accuracy, but for k-NN that shouldn't matter too much. The SciPy implementation, available in scipy.spatial.distance, is actually an optimized version of the code you just wrote and is more accurate.)
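To illustrate the caching remark for the k-NN use case, here is a minimal sketch that precomputes the squared norms of a reference set once and reuses them for a batch of queries. The arrays train and queries and the helper nearest_neighbors are hypothetical names with made-up data; this is not scikit-learn's code, only the same expansion applied to ranking.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((500, 128))   # hypothetical reference set
queries = rng.random((20, 128))  # hypothetical query batch

# Computed once and reused for every query batch.
train_sq_norms = np.einsum('ij,ij->i', train, train)


def nearest_neighbors(Q, X, X_sq_norms):
    """Index of the nearest row of X for each row of Q."""
    # ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x; the ||q||^2 term is constant
    # for a given query, so it can be dropped when only the ranking matters.
    scores = X_sq_norms[np.newaxis, :] - 2.0 * np.dot(Q, X.T)
    return np.argmin(scores, axis=1)


fast = nearest_neighbors(queries, train, train_sq_norms)

# Brute-force reference using explicit differences.
diffs = queries[:, np.newaxis, :] - train[np.newaxis, :, :]
slow = np.argmin(np.einsum('ijk,ijk->ij', diffs, diffs), axis=1)
print(np.array_equal(fast, slow))
```

The brute-force version materializes a 20 x 500 x 128 array of differences, which is what the matrix-multiplication form avoids.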

python numpy optimization n-dimensional projection

I am relatively new to python and am interested in any ideas to optimize and speed up this function. I have to call it tens~hundreds of thousands of times for a numerical computation I am doing and it takes a major fraction of the code's overall computational time.
I have written this in c, but I am interested to see any tricks to make it run faster in python specifically.
This code calculates a stereographic projection of a bigD-length vector to a littleD-length vector, per http://en.wikipedia.org/wiki/Stereographic_projection. The variable a is a numpy array of length ~ 96.
import numpy as np
def nsphere(a):
    bigD = len(a)
    littleD = 3
    temp = a
    # normalize before calculating projection
    temp = temp/np.sqrt(np.dot(temp,temp))
    # calculate projection
    for i in range(bigD-littleD + 2,2,-1 ):
        temp = temp[0:-1]/(1.0 - temp[-1])
    return temp
#USAGE:
q = np.random.rand(96)
b = nsphere(q)
print(b)

This should be faster:
def nsphere(a, littleD=3):
    a = a / np.sqrt(np.dot(a, a))
    z = a[littleD:].sum()
    return a[:littleD] / (1. - z)
Please do the math to double check that this is in fact the same as your iterative algorithm.
Obviously the main speedup here comes from the fact that this is an O(n) algorithm that replaces your O(n**2) algorithm for computing the projection. But for speeding things up in Python specifically, you want to "vectorize your inner loop": avoid loops and anything else with high Python overhead in the most performance-critical parts of your code, and use Python and NumPy builtins, which are highly optimized, instead. Hope that helps.
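The equivalence can be checked both on paper and numerically. After one step the vector is a[:-1] / (1 - a[-1]); after a second step the denominator becomes 1 - a[-1] - a[-2], and by induction the removed coordinates simply accumulate in the denominator. A small sketch comparing the two versions on random input follows; the function names and the little_d parameter are illustrative.

```python
import numpy as np


def nsphere_iterative(a, little_d=3):
    """Repeated one-step projection, as in the question."""
    temp = a / np.sqrt(np.dot(a, a))
    for _ in range(len(a) - little_d):
        temp = temp[:-1] / (1.0 - temp[-1])
    return temp


def nsphere_closed_form(a, little_d=3):
    """Single-step version: dropped coordinates are summed in the denominator."""
    a = a / np.sqrt(np.dot(a, a))
    return a[:little_d] / (1.0 - a[little_d:].sum())


rng = np.random.default_rng(0)
q = rng.random(96)
iterative = nsphere_iterative(q)
closed = nsphere_closed_form(q)
print(iterative)
print(closed)
```

Both versions become numerically unstable when the sum of the dropped coordinates is close to 1, since the denominator then approaches zero.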

Numpy, all pairwise correlations of a 3d array

I have an array of shape (l,m,n). I'm trying to calculate a distance matrix of shape (l,m,n) where entry (i,j,k) is the coefficient between vectors (i,j,:) and (i,:,k). I haven't found anything in numpy or scipy that fits the bill.
I tried using a for loop and iterating along axis 0, then feeding that to scipy.spatial.distance.pdist, but that takes a long time as pdist itself uses a nested for loop. In essence, what I would like to do would be to perform pdist down axis 0, but ideally make it so pdist doesn't use for loops either....
Any thoughts?

I would personally write a little Cython function to do this ( http://cython.org). Write and test an iterative pure Python version (with for loops), move it to a .pyx Cython file, add type declarations and follow the NumPy integration guide:
http://docs.cython.org/src/tutorial/numpy.html
Might seem like work but if you're doing computing in Python, some basic Cython skills are well worth cultivating as it makes writing C extensions much easier.

Any thoughts?
First thought is that you cannot compute such distances as long as m != n
Second thought is that internal loops of pdist should not bother you if those are written in C, so the probable reason is not in implementation, but in the amount of computations needed
Final thought is that your problem may be solved by numpy.einsum and linear algebra:
Code (which I assume to be optimal):
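# a is the input array of shape (l, m, m); the result holds squared distances
products = numpy.einsum('ijl,ilk->ijk', a, a)   # dot products a[i,j,:] . a[i,:,k]
row_sq = numpy.einsum('ijl,ijl->ij', a, a)      # squared norms of a[i,j,:]
col_sq = numpy.einsum('ilk,ilk->ik', a, a)      # squared norms of a[i,:,k]
distances = row_sq[:, :, None] + col_sq[:, None, :] - 2 * products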
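A brute-force check of the einsum idea on a small array may help. The sketch below assumes the intended quantity is the squared Euclidean distance between a[i, j, :] and a[i, :, k], expanded as ||u||² + ||v||² - 2 u·v, with the squared norms of the row vectors and column vectors computed separately. The array a and its shape are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4, 5, 5))  # shape (l, m, m); the vectors only match when m == n

products = np.einsum('ijl,ilk->ijk', a, a)   # u . v for every (i, j, k)
row_sq = np.einsum('ijl,ijl->ij', a, a)      # ||a[i, j, :]||^2
col_sq = np.einsum('ilk,ilk->ik', a, a)      # ||a[i, :, k]||^2
sq_dist = row_sq[:, :, None] + col_sq[:, None, :] - 2.0 * products

# Reference: explicit loops over every (i, j, k).
expected = np.empty_like(sq_dist)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        for k in range(a.shape[2]):
            diff = a[i, j, :] - a[i, :, k]
            expected[i, j, k] = np.dot(diff, diff)

print(np.allclose(sq_dist, expected))
```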

pseudo inverse of sparse matrix in python

I am working with data from neuroimaging and because of the large amount of data, I would like to use sparse matrices for my code (scipy.sparse.lil_matrix or csr_matrix).
In particular, I will need to compute the pseudo-inverse of my matrix to solve a least-square problem.
I have found the method sparse.lsqr, but it is not very efficient. Is there a method to compute the Moore-Penrose pseudo-inverse (corresponding to pinv for dense matrices)?
The size of my matrix A is about 600'000x2000 and in every row of the matrix I'll have from 0 up to 4 non zero values. The matrix A size is given by voxel x fiber bundle (white matter fiber tracts) and we are expecting maximum 4 tracts to cross in a voxel. In most of the white matter voxels we expect to have at least 1 tract, but I will say that around 20% of the lines could be zeros.
The vector b should not be sparse, actually b contains the measure for each voxel, which is in general not zero.
I would need to minimize the error, but there are also some conditions on the vector x. As I tried the model on smaller matrices, I never needed to constrain the system in order to satisfy these conditions (in general 0
Is that of any help? Is there a way to avoid taking the pseudo-inverse of A?
Thanks
Update 1st June:
thanks again for the help.
I can't really show you anything about my data, because the code in Python gives me some problems. However, in order to understand how I could choose a good k, I've tried to create a test function in Matlab.
The code is as follows:
F=zeros(100000,1000);
for k=1:150000
    p=rand(1);
    a=0;
    b=0;
    while a<=0 || b<=0
        a=random('Binomial',100000,p);
        b=random('Binomial',1000,p);
    end
    F(a,b)=rand(1);
end
solution=repmat([0.5,0.5,0.8,0.7,0.9,0.4,0.7,0.7,0.9,0.6],1,100);
size(solution)
solution=solution';
measure=F*solution;
%check=pinvF*measure;
k=250;
F=sparse(F);
[U,S,V]=svds(F,k);
s=svds(F,k);
plot(s)
max(max(U*S*V'-F))
for s=1:k
    if S(s,s)~=0
        S(s,s)=1/S(s,s);
    end
end
inv=V*S'*U';
inv*measure
max(inv*measure-solution)
Do you have any idea what k should be compared to the size of F? I've taken 250 (out of 1000) and the results are not satisfactory (the waiting time is acceptable, but not short).
Also, now I can compare the results with the known solution, but how could one choose k in general?
I also attached the plot of the 250 singular values that I get and their normalized squares. I don't know exactly how best to do a scree plot in Matlab. I'm now proceeding with bigger k to see if the values suddenly become much smaller.
Thanks again,
Jennifer

You could study more on the alternatives offered in scipy.sparse.linalg.
Anyway, please note that a pseudo-inverse of a sparse matrix is most likely to be a (very) dense one, so it's not really a fruitful avenue (in general) to follow, when solving sparse linear systems.
You may like to describe your particular problem (dot(A, x)= b+ e) in slightly more detail. At least specify:
'typical' size of A
'typical' percentage of nonzero entries in A
least-squares implies that norm(e) is minimized, but please indicate whether your main interest is on x_hat or on b_hat, where e= b- b_hat and b_hat= dot(A, x_hat)
Update: If you have some idea of the rank of A (and it is much smaller than the number of columns), you could try the total least squares method. Here is a simple implementation, where k is the number of first singular values and vectors to use (i.e. the 'effective' rank).
from scipy.sparse import hstack
from scipy.sparse.linalg import svds
def tls(A, b, k= 6):
    """A tls solution of Ax= b, for sparse A."""
    u, s, v= svds(hstack([A, b]), k)
    return v[-1, :-1]/ -v[-1, -1]

Regardless of the answer to my comment, I would think you could accomplish this fairly easily using the Moore-Penrose SVD representation. Find the SVD with scipy.sparse.linalg.svds, replace Sigma by its pseudoinverse, and then multiply V*Sigma_pi*U' to find the pseudoinverse of your original matrix.
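The SVD recipe can be sketched on a small dense matrix with plain NumPy. This is only an illustration of the steps (decompose, invert the nonzero singular values, multiply back); for a large sparse A, the text's scipy.sparse.linalg.svds would take the place of np.linalg.svd and return only the k largest singular triplets. The matrix, the right-hand side and the tolerance below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.random((50, 5))
A = np.hstack([base, base])  # 50 x 10, rank 5 by construction
b = rng.random(50)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Invert only the singular values that are clearly nonzero.
tol = s.max() * max(A.shape) * np.finfo(float).eps
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]

# Apply V * Sigma^+ * U' to b without forming the pseudoinverse matrix itself.
x_hat = Vt.T @ (s_inv * (U.T @ b))

# Minimum-norm least-squares solution for comparison.
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_hat, x_ref))
```

Applying the three factors to b one after another keeps every intermediate result a vector, which sidesteps the dense pseudoinverse that the other answer warns about.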

resampling, interpolating matrix

I'm trying to interpolate some data for the purpose of plotting. For instance, given N data points, I'd like to be able to generate a "smooth" plot, made up of 10*N or so interpolated data points.
My approach is to generate an N-by-10*N matrix and compute the inner product of the original vector and the matrix I generated, yielding a 1-by-10*N vector. I've already worked out the math I'd like to use for the interpolation, but my code is pretty slow. I'm pretty new to Python, so I'm hopeful that some of the experts here can give me some ideas of ways I can try to speed up my code.
I think part of the problem is that generating the matrix requires 10*N^2 calls to the following function:
def sinc(x):
    import math
    try:
        return math.sin(math.pi * x) / (math.pi * x)
    except ZeroDivisionError:
        return 1.0
(This comes from sampling theory. Essentially, I'm attempting to recreate a signal from its samples, and upsample it to a higher frequency.)
The matrix is generated by the following:
def resampleMatrix(Tso, Tsf, o, f):
    from numpy import array as npar
    retval = []
    for i in range(f):
        retval.append([sinc((Tsf*i - Tso*j)/Tso) for j in range(o)])
    return npar(retval)
I'm considering breaking up the task into smaller pieces because I don't like the idea of an N^2 matrix sitting in memory. I could probably make 'resampleMatrix' into a generator function and do the inner product row-by-row, but I don't think that will speed up my code much until I start paging stuff in and out of memory.
Thanks in advance for your suggestions!

This is upsampling. See Help with resampling/upsampling for some example solutions.
A fast way to do this (for offline data, like your plotting application) is to use FFTs. This is what SciPy's native resample() function does. It assumes a periodic signal, though, so it's not exactly the same. See this reference:
Here’s the second issue regarding time-domain real signal interpolation, and it’s a big deal indeed. This exact interpolation algorithm provides correct results only if the original x(n) sequence is periodic within its full time interval.
Your function assumes the signal's samples are all 0 outside of the defined range, so the two methods will diverge away from the center point. If you pad the signal with lots of zeros first, it will produce a very close result. There are several more zeros past the edge of the plot not shown here:
Cubic interpolation won't be correct for resampling purposes. This example is an extreme case (near the sampling frequency), but as you can see, cubic interpolation isn't even close. For lower frequencies it should be pretty accurate.
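The FFT approach mentioned above can be sketched with NumPy alone by zero-padding the spectrum. This is a simplified illustration, not SciPy's resample() implementation: it assumes a real signal that is periodic over the window and has no energy at the Nyquist bin, and the sizes n, factor and cycles are made up.

```python
import numpy as np

n = 32          # number of original samples
factor = 4      # upsampling factor
cycles = 3      # whole cycles in the window, so the signal is periodic

t = np.arange(n) / n
x = np.sin(2 * np.pi * cycles * t)

# Zero-pad the spectrum: irfft fills the missing high-frequency bins with zeros.
spectrum = np.fft.rfft(x)
m = n * factor
x_up = np.fft.irfft(spectrum, n=m) * factor  # rescale for the longer inverse FFT

# For this band-limited periodic input the result matches the underlying sine.
t_up = np.arange(m) / m
expected = np.sin(2 * np.pi * cycles * t_up)
print(np.allclose(x_up, expected))
```

For a signal that is not periodic over the window, the edges of x_up will show the divergence described above.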

If you want to interpolate data in a quite general and fast way, splines or polynomials are very useful. Scipy has the scipy.interpolate module, which is very useful. You can find many examples in the official pages.

Your question isn't entirely clear; you're trying to optimize the code you posted, right?
Re-writing sinc like this should speed it up considerably. This implementation avoids checking that the math module is imported on every call, doesn't do attribute access three times, and replaces exception handling with a conditional expression:
from math import sin, pi
def sinc(x):
    return (sin(pi * x) / (pi * x)) if x != 0 else 1.0
You could also try avoiding creating the matrix twice (and holding it twice in parallel in memory) by creating a numpy.array directly (not from a list of lists):
import numpy
def resampleMatrix(Tso, Tsf, o, f):
    retval = numpy.zeros((f, o))
    for i in xrange(f):
        for j in xrange(o):
            retval[i, j] = sinc((Tsf*i - Tso*j)/Tso)
    return retval
(replace xrange with range on Python 3.0 and above)
Finally, you can create rows with numpy.arange as well as calling numpy.sinc on each row or even on the entire matrix:
def resampleMatrix(Tso, Tsf, o, f):
    retval = numpy.zeros((f, o))
    for i in xrange(f):
        # row i holds Tsf*i/Tso - j for j = 0 .. o-1
        retval[i] = float(Tsf) * i / Tso - numpy.arange(o)
    return numpy.sinc(retval)
This should be significantly faster than your original implementation. Try different combinations of these ideas and test their performance, see which works out the best!

I'm not quite sure what you're trying to do, but there are some speedups available for creating the matrix. Braincore's suggestion to use numpy.sinc is a first step, but the second is to realize that numpy functions want to work on numpy arrays, where they can loop at C speed, much faster than calling them on individual elements.
def resampleMatrix(Tso, Tsf, o, f):
    retval = numpy.sinc((Tsf*numpy.arange(f)[:,numpy.newaxis]
                         -Tso*numpy.arange(o)[numpy.newaxis,:])/Tso)
    return retval
The trick is that by indexing the aranges with numpy.newaxis, numpy converts the array of shape f to one of shape f x 1, and the array of shape o to one of shape 1 x o. At the subtraction step, numpy will "broadcast" each input to act as an f x o shaped array and then do the subtraction. ("Broadcast" is numpy's term, reflecting the fact that no additional copy is made to stretch the f x 1 to f x o.)
Now numpy.sinc can iterate over all the elements in compiled code, much quicker than any for-loop you could write.
(There's an additional speed-up available if you do the division before the subtraction, especially since in the latter term the division cancels the multiplication.)
The only drawback is that you now pay for an extra Nx10*N array to hold the difference. This might be a dealbreaker if N is large and memory is an issue.
Otherwise, you should be able to write this using numpy.convolve. From what little I just learned about sinc-interpolation, I'd say you want something like numpy.convolve(orig,numpy.sinc(numpy.arange(j)),mode="same"). But I'm probably wrong about the specifics.
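A short check that the broadcasting approach reproduces the element-by-element matrix, followed by its use for upsampling, might look like the sketch below. The function names, the sample periods Tso and Tsf, and the cosine test signal are all made up for illustration.

```python
import math

import numpy as np


def sinc_scalar(x):
    """Normalized sinc for a single value."""
    return math.sin(math.pi * x) / (math.pi * x) if x != 0 else 1.0


def resample_matrix_loop(Tso, Tsf, o, f):
    """Reference: one Python-level call per matrix entry."""
    return np.array([[sinc_scalar((Tsf * i - Tso * j) / Tso) for j in range(o)]
                     for i in range(f)])


def resample_matrix_broadcast(Tso, Tsf, o, f):
    """Same matrix from one broadcast expression of shape (f, o)."""
    rows = np.arange(f)[:, np.newaxis] * (Tsf / Tso)  # shape (f, 1)
    cols = np.arange(o)[np.newaxis, :]                # shape (1, o)
    return np.sinc(rows - cols)


Tso, Tsf, o = 1.0, 0.1, 20      # hypothetical sample periods and input length
f = 10 * o
M_loop = resample_matrix_loop(Tso, Tsf, o, f)
M_fast = resample_matrix_broadcast(Tso, Tsf, o, f)

samples = np.cos(0.3 * np.arange(o))   # made-up input signal
upsampled = M_fast @ samples           # length f; every 10th value is an input sample
print(np.allclose(M_loop, M_fast))
```

Here the division by Tso is done before the subtraction, as the parenthetical remark above suggests.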

If your only interest is to 'generate a "smooth" plot' I would just go with a simple polynomial spline curve fit:
For any two adjacent data points, the coefficients of a third-degree polynomial can be computed from the coordinates of those data points and the two additional points to their left and right (disregarding boundary points). This will generate points on a nice smooth curve with a continuous first derivative. There's a straightforward formula for converting 4 coordinates to 4 polynomial coefficients, but I don't want to deprive you of the fun of looking it up ;o).
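One common scheme that fits this description is the Catmull-Rom spline; whether it is the formula the answer has in mind is an assumption. The sketch below handles equally spaced samples, and it repeats the end points so the first and last segments also have four neighbours, which is another choice made here rather than something stated above.

```python
import numpy as np


def catmull_rom(y, points_per_segment=10):
    """Interpolate equally spaced samples y with Catmull-Rom cubic segments."""
    y = np.asarray(y, dtype=float)
    # Repeat the end points so the boundary segments have four neighbours.
    padded = np.concatenate(([y[0]], y, [y[-1]]))
    p0, p1, p2, p3 = padded[:-3], padded[1:-2], padded[2:-1], padded[3:]

    t = np.linspace(0.0, 1.0, points_per_segment, endpoint=False)[:, np.newaxis]
    # Cubic through p1 (t=0) and p2 (t=1) with slopes (p2-p0)/2 and (p3-p1)/2.
    seg = 0.5 * (2 * p1
                 + (p2 - p0) * t
                 + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                 + (3 * p1 - p0 - 3 * p2 + p3) * t ** 3)
    # seg has shape (points_per_segment, n_segments); flatten segment by segment.
    return np.append(seg.T.ravel(), y[-1])


y = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
smooth = catmull_rom(y, points_per_segment=10)
print(smooth.shape)
```

The curve passes through every original sample, and neighbouring segments share the same slope at each sample, which gives the continuous first derivative.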

Here's a minimal example of 1d interpolation with scipy -- not as much fun as reinventing, but.
The plot looks like sinc, which is no coincidence:
try google spline resample "approximate sinc".
(Presumably less local / more taps ⇒ better approximation,
but I have no idea how local UnivariateSplines are.)
""" interpolate with scipy.interpolate.UnivariateSpline """
from __future__ import division, print_function
import numpy as np
from scipy.interpolate import UnivariateSpline
import pylab as pl
N = 10
H = 8
x = np.arange(N+1)
xup = np.arange( 0, N, 1/H )
y = np.zeros(N+1); y[N//2] = 100
interpolator = UnivariateSpline( x, y, k=3, s=0 ) # s=0 interpolates
yup = interpolator( xup )
np.set_printoptions( 1, threshold=100, suppress=True ) # .1f
print("yup:", yup)
pl.plot( x, y, "green", xup, yup, "blue" )
pl.show()
Added feb 2010: see also basic-spline-interpolation-in-a-few-lines-of-numpy

Small improvement. Use the built-in numpy.sinc(x) function which runs in compiled C code.
Possible larger improvement: Can you do the interpolation on the fly (as the plotting occurs)? Or are you tied to a plotting library that only accepts a matrix?

I recommend that you check your algorithm, as it is a non-trivial problem. Specifically, I suggest you gain access to the article "Function Plotting Using Conic Splines" (IEEE Computer Graphics and Applications) by Hu and Pavlidis (1991). Their algorithm implementation allows for adaptive sampling of the function, such that the rendering time is smaller than with regularly spaced approaches.
The abstract follows:
A method is presented whereby, given a mathematical description of a function, a conic spline approximating the plot of the function is produced. Conic arcs were selected as the primitive curves because there are simple incremental plotting algorithms for conics already included in some device drivers, and there are simple algorithms for local approximations by conics. A split-and-merge algorithm for choosing the knots adaptively, according to shape analysis of the original function based on its first-order derivatives, is introduced.
