Parallel Scipy COO Matrix Computations - python

I am trying to speed up sparse matrix computations in scipy for an algorithm that requires intensive, dependent computations (PageRank) on very large RDF datasets. I want to use multiple cores for the scipy calculation within the following code:
import numpy as np
from scipy import sparse

F = sparse.coo_matrix((y['data'], (y['row'], y['col'])), shape=y['shape'])
W = sparse.coo_matrix((y['data'], (y['row'], y['col'])), shape=y['shape'])
P = sparse.bmat([[None, W], [F, None]])
previous = np.ones(n)/n
ones = np.ones(n)/n
error = np.inf  # ensure the loop runs at least once
while error > epsilon:
    tmp = np.array(previous)
    previous = damping*P.T.dot(previous) + (1-damping)*ones
    error = np.linalg.norm(tmp - previous)
    if printerror:
        print(error)
I have searched every answer I could find, and I tried integrating MKL (the Anaconda build) within the code, but the performance does not scale up across multiple cores. My understanding is that scipy's csr.h routines do not make use of BLAS calls. I am wondering whether I need to replace the call to csr_matvec in scipy/sparsetools with an appropriate Sparse BLAS call, since MKL has those, and then link scipy against MKL. Am I misunderstanding or missing something? I would really appreciate some help in the matter. One similar question is here. Thanks!!
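One route worth sketching (my addition; the package named here is an assumption, not something from the original question): instead of patching csr_matvec in scipy/sparsetools, a third-party wrapper such as sparse_dot_mkl can hand the product to MKL's multithreaded sparse BLAS. A minimal sketch, assuming that package is installed:
import numpy as np
from scipy import sparse
from sparse_dot_mkl import dot_product_mkl  # assumed dependency: pip install sparse_dot_mkl

# Toy stand-in for the PageRank transition matrix from the question.
P = sparse.random(200_000, 200_000, density=1e-5, format='csc')
v = np.ones(P.shape[0]) / P.shape[0]

# dot_product_mkl dispatches to MKL's threaded sparse routines,
# whereas scipy's own csr_matvec stays single-threaded.
result = dot_product_mkl(P.T.tocsr(), v)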

Related

Does calling a numpy function in a vectorized operation affect performance?

I am new to Python and currently studying the NumPy package. I come from the C/C++ world, so maybe my question is stupid. When using vectorized operations in NumPy, I assumed that they parallelize the execution like OpenMP does.
I came across a piece of code in a Udacity tutorial, which calculated a standardized 1D array in the following way:
standardized = (array - array.mean()) / array.std()
where array is a NumPy array. So in my eyes NumPy would parallelize the following 'single' instructions to get better performance:
standardized[0] = (array[0] - array.mean()) / array.std()
standardized[1] = (array[1] - array.mean()) / array.std()
...
...
standardized[n] = (array[n] - array.mean()) / array.std()
where n is the size of the array. So in every iteration I would call mean() and std(), which would be recalculated each time and therefore cost a lot of time. In a 'C way' I would do something like this to increase performance:
mean = array.mean()
std = array.std()
standardized = (array - mean) / std
I measured times for both calculations and nearly always got the same time; in fact, whichever method I ran first came out fastest. Additionally, I only used arrays filled with zeros; maybe this has an impact, too.
So my question is: how does Python (or NumPy) 'parallelize' the vectorized execution, and how does it deal with function calls that should always return the same value within one iteration?
I hope my questions are clear and understandable. I could not find any sources that deal with this use case.
standardized = (array - array.mean()) / array.std()
is a Python expression which gets evaluated as:
temp1 = array.mean()
temp2 = array.std()
temp3 = (array - temp1)
temp4 = temp3 / temp2
array.mean is a numpy 'builtin' method, which means it's written in compiled code. The same goes for std, and for the subtraction and division of two arrays.
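A quick check (my sketch, using timeit; the data is arbitrary) shows why hoisting mean() and std() out of the expression changes little: each is called exactly once per evaluation in both versions.
import timeit
import numpy as np

array = np.random.randn(1_000_000)

def inline():
    return (array - array.mean()) / array.std()

def hoisted():
    mean = array.mean()
    std = array.std()
    return (array - mean) / std

# Both variants call mean() and std() once per evaluation,
# so their timings should come out nearly identical.
print("inline: ", timeit.timeit(inline, number=100))
print("hoisted:", timeit.timeit(hoisted, number=100))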
numpy provides the building blocks; Python provides the glue to join them together. Generally the best strategy is to maximize the use of those numpy methods and avoid loops at the Python level. Sometimes a few Python loops around a complex operation are better, and sometimes using basic Python is better (creating an array from lists takes time).
There are tools for building custom compiled blocks: Cython, Numba, etc.
I'm not aware of any OpenMP-style parallelization in numpy. Speed gains come from using C/Fortran/specialised libraries such as LAPACK/BLAS etc. You can roll your own parallelization using multiprocessing if you can afford the marshaling cost.
There seems to be a way to enable OpenMP if you build yourself: https://docs.scipy.org/doc/scipy/reference/building/linux.html

For loop vs Numpy vectorization computation time

I was randomly comparing the computation times of an explicit for-loop with a vectorized implementation in numpy. I ran exactly 1 million iterations and found some astounding differences. The for-loop took about 646 ms, while the np.exp() function computed the same result in less than 20 ms.
import time
import math
import numpy as np

iter = 1000000
x = np.zeros((iter, 1))
v = np.random.randn(iter, 1)

before = time.time()
for i in range(iter):
    x[i] = math.exp(v[i])
after = time.time()
print(x)
print("Non vectorized= " + str((after-before)*1000) + "ms")

before = time.time()
x = np.exp(v)
after = time.time()
print(x)
print("Vectorized= " + str((after-before)*1000) + "ms")
The result I got:
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Non vectorized= 646.1577415466309ms
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Vectorized= 19.547224044799805ms
My questions are:
1. What exactly is happening in the second case? The first one uses an explicit for-loop, so its computation time is justified; what is happening "behind the scenes" in the second case?
2. How can one implement such computations (the second case) without using numpy, in plain Python? (A plain-Python sketch follows the answers below.)
What is happening is that NumPy dispatches the work to compiled code: tight C loops for elementwise functions such as np.exp, and high-quality numerical libraries (BLAS, for instance) for linear algebra, all of which are very good at vector arithmetic.
I imagine you could call the underlying libraries yourself; however, NumPy would likely know best which to use.
NumPy is a Python wrapper over libraries and code written in C. This is a large part of the efficiency of NumPy. C code compiles directly to instructions which are executed by your processor or GPU. On the other hand, Python code must be interpreted as it executes. Despite the ever increasing speed we can get from interpreted languages with advances like Just In Time Compilers, for some tasks they will never be able to approach the speed of compiled languages.
It comes down to the fact that Python does not have direct access to the hardware level.
Python can't use the SIMD (single instruction, multiple data) assembly instructions that most modern CPUs and GPUs have. These SIMD instructions allow a single operation to execute on a vector of data all at once (within a single clock cycle) at the hardware level.
NumPy, on the other hand, has functions built in C, and C is a language capable of using SIMD instructions. Therefore NumPy can take advantage of the vectorization hardware in your processor.
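To the second question (how to do this without numpy): a plain-Python sketch is below; the variable names are mine. It avoids NumPy entirely, but every math.exp call still goes through the interpreter one element at a time, so it runs at roughly for-loop speed; there is no way to reach vectorized speed from pure interpreted Python.
import math
import random

# One million samples, standing in for np.random.randn(iter, 1).
v = [random.gauss(0.0, 1.0) for _ in range(1_000_000)]

# Plain-Python equivalent of x = np.exp(v): a list comprehension.
x = [math.exp(val) for val in v]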

How does scipy.integrate.ode.integrate() work?

I have obviously read through the documentation, but I have not been able to find a more detailed description of what is happening under the covers. Specifically, there are a few behaviors that I am very confused about:
General setup
import numpy as np
from scipy.integrate import ode
#Constants in ODE
N = 30
K = 0.5
w = np.random.normal(np.pi, 0.1, N)
#Integration parameters
y0 = np.linspace(0, 2*np.pi, N, endpoint=False)
t0 = 0
#Set up the solver
solver = ode(lambda t,y: w + K/N*np.sum( np.sin( y - y.reshape(N,1) ), axis=1))
solver.set_integrator('vode', method='bdf')
solver.set_initial_value(y0, t0)
Problem 1: solver.integrate(t0) fails
Setting up the integrator and asking for the value at t0 the first time returns a successful integration. Repeating this returns the correct number, but the solver.successful() method returns False:
solver.integrate(t0)
>>> array([ 0. , 0.20943951, 0.41887902, ..., 5.65486678,
5.86430629, 6.0737458 ])
solver.successful()
>>> True
solver.integrate(t0)
>>> array([ 0. , 0.20943951, 0.41887902, ..., 5.65486678,
5.86430629, 6.0737458 ])
solver.successful()
>>> False
My question is, what is happening in the solver.integrate(t) method that causes it to succeed the first time, and fail subsequently, and what does it mean to have an “unsuccessful” integration? Furthermore, why does the integrator fail silently, and continue to produce useful-looking outputs until I ask it explicitly whether it was successful?
Related, is there a way to reset the failed integration, or do I need to re-instantiate the solver from scratch?
Problem 2: solver.integrate(t) immediately returns an answer for almost any value of t
Even though my initial value of y0 is given at t0=0, I can request the value at t=10000 and get the answer immediately. I would expect that the numerical integration over such a large time span should take at least a few seconds (e.g. in Matlab, asking to integrate over 10000 time steps would take several minutes).
For example, re-run the setup from above and execute:
solver.integrate(10000)
>>> array([ 2153.90803383, 2153.63023706, 2153.60964064, ..., 2160.00982959,
2159.90446056, 2159.82900895])
Is Python really that fast, or is this output total nonsense?
Problem 0
Don’t ignore error messages. Yes, ode’s error messages can be cryptic at times, but you still want to avoid them.
Problem 1
As you already integrated up to t0 with the first call of solver.integrate(t0), you are integrating for a time step of 0 with the second call. This throws the cryptic error:
DVODE-- ISTATE (=I1) .gt. 1 but DVODE not initialized
In above message, I1 = 2
/usr/lib/python3/dist-packages/scipy/integrate/_ode.py:869: UserWarning: vode: Illegal input detected. (See printed message.)
'Unexpected istate=%s' % istate))
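As for resetting after a failure: re-instantiating is not necessary. A sketch (my reading of the ode API; treat the exact behaviour as an assumption): calling set_initial_value again re-initializes the underlying integrator, after which successful() reports True again.
# Reset the solver in place instead of re-instantiating it:
solver.set_initial_value(y0, t0)   # re-initializes the integrator state
solver.integrate(t0 + 1.0)         # integrate over a nonzero time span
print(solver.successful())         # True again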
Problem 2.1
There is a maximum number of (internal) steps that a solver is going to take in one call without throwing an error. This can be set with the nsteps argument of set_integrator. If you integrate a large time at once, nsteps will be exceeded even if nothing is wrong, and the following error message is thrown:
/usr/lib/python3/dist-packages/scipy/integrate/_ode.py:869: UserWarning: vode: Excess work done on this call. (Perhaps wrong MF.)
'Unexpected istate=%s' % istate))
The integrator then stops at the point where this happens.
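A sketch of raising that limit (nsteps is a documented argument of set_integrator for vode):
# Allow far more internal steps, so one integrate() call can span t=10000:
solver.set_integrator('vode', method='bdf', nsteps=10**10)
solver.set_initial_value(y0, t0)
result = solver.integrate(10000)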
Problem 2.2
If you set nsteps=10**10, the integration runs without problems. It still is pretty fast though (roughly 1 s on my machine). The reason for this is as follows:
For a multi-dimensional system such as yours, there are two main runtime sinks when integrating:
Vector and matrix operations within the integrator. In scipy.ode, these are all realised with NumPy operations or ported Fortran or C code. Either way, they are realised with compiled code, without Python overhead, and are thus very efficient.
Evaluating the derivative (lambda t,y: w + K/N*np.sum( np.sin( y - y.reshape(N,1) ), axis=1) in your case). You realised this with NumPy operations, which again are realised with compiled code and very efficient. You may improve this a little bit with a purely compiled function, but that will grant you at most a small factor. If you used Python lists and loops instead, it would be horribly slow.
Therefore, for your problem, everything relevant is handled with compiled code under the hood and the integration is handled with an efficiency comparable to that of, e.g., a pure C program. I do not know how the two above aspects are handled in Matlab, but if either of the above challenges is handled with interpreted instead of compiled loops, this would explain the runtime discrepancy you observe.
To the second question, yes, the output might be nonsense. Local errors, be they from discretization or floating point operations, accumulate with a compounding factor which is about the Lipschitz constant of the ODE function. In a first estimate, the Lipschitz constant here is K=0.5. The magnification rate of early errors, that is, their coefficient as part of the global error, can thus be as large as exp(0.5*10000), which is a huge number.
On the other hand it is not surprising that the integration is fast. Most of the provided methods use step size adaptation, and with the standard error tolerances this might result in only some tens of internal steps. Reducing the error tolerances will increase the number of internal steps and may change the numerical result drastically.
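For example (a sketch; rtol and atol are standard vode arguments of set_integrator), tightening the tolerances forces more internal steps, runs slower, and will likely change the reported values noticeably:
# Tighter tolerances -> more internal steps and a different numerical result:
solver.set_integrator('vode', method='bdf', rtol=1e-12, atol=1e-12, nsteps=10**10)
solver.set_initial_value(y0, t0)
tight = solver.integrate(10000)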

ValueError: Linkage 'Z' uses the same cluster more than once in Python scipy fcluster

I'm getting ValueError: Linkage 'Z' uses the same cluster more than once. when trying to get flat clusters in Python with scipy.cluster.hierarchy.fcluster. This error happens only sometimes, usually only with really big matrices, e.g. 10000x10000.
import scipy.cluster.hierarchy as sch
Z = sch.linkage(d, method="ward")
# some computation here, returning n (usually between 5-30)
clusters = sch.fcluster(Z, t=n, criterion='maxclust')
Why does it happen? How can I avoid it? Unfortunately I couldn't find any useful info by googling...
EDIT: The error also occurs when trying to get a dendrogram.
No such error appears if method='average' is used.
It seems using fastcluster instead of scipy.cluster.hierarchy solves the problem. In addition, the fastcluster implementation is slightly faster than scipy's. For more details, have a look at the paper.
import fastcluster
import scipy.cluster.hierarchy as sch

# fastcluster provides the linkage step; flat clusters still come from
# scipy, as fastcluster does not ship its own fcluster.
Z = fastcluster.linkage(d, method="ward")
# some computation here, returning n (usually between 5-30)
clusters = sch.fcluster(Z, t=n, criterion='maxclust')
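One more thing worth checking (my addition, not part of the original answer): linkage expects either an observation matrix or a condensed distance vector as produced by pdist. Passing a square, redundant distance matrix makes it silently treat each row as an observation, which can yield malformed linkages. A sketch:
import numpy as np
import fastcluster
from scipy.spatial.distance import pdist

X = np.random.randn(1000, 16)   # 1000 observations, 16 features
d = pdist(X)                    # condensed distances, length n*(n-1)/2
Z = fastcluster.linkage(d, method="ward")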

Python NumPy vs Octave/MATLAB precision

This question is about the precision of computation using NumPy vs. Octave/MATLAB (the MATLAB code below has only been tested with Octave, however). I am aware of a similar question on Stack Overflow, namely this one, but that seems somewhat far from what I'm asking below.
Setup
Everything is running on Ubuntu 14.04.
Python version 3.4.0.
NumPy version 1.8.1 compiled against OpenBLAS.
Octave version 3.8.1 compiled against OpenBLAS.
Sample Code
Sample Python code.
import numpy as np
from scipy import linalg as la

def build_laplacian(n):
    lap = np.zeros([n, n])
    for j in range(n-1):
        lap[j+1][j] = 1
        lap[j][j+1] = 1
    lap[n-1][n-2] = 1
    lap[n-2][n-1] = 1
    return lap

def evolve(s, lap):
    wave = la.expm(-1j*s*lap).dot([1] + [0]*(lap.shape[0]-1))
    for i in range(len(wave)):
        wave[i] = np.linalg.norm(wave[i])**2
    return wave
We now run the following.
np.min(evolve(2, build_laplacian(500)))
which gives something on the order of 1e-34.
We can produce similar code in Octave/MATLAB:
function lap = build_laplacian(n)
    lap = zeros(n, n);
    for i = 1:(n-1)
        lap(i+1, i) = 1;
        lap(i, i+1) = 1;
    end
    lap(n, n-1) = 1;
    lap(n-1, n) = 1;
end

function result = evolve(s, lap)
    d = zeros(length(lap(:,1)), 1); d(1) = 1;
    result = expm(-1i*s*lap)*d;
    for i = 1:length(result)
        result(i) = norm(result(i))^2;
    end
end
We then run
min(evolve(2, build_laplacian(500)))
and get 0. In fact, evolve(2, build_laplacian(500))(60) gives something around 1e-100 or less (as expected).
The Question
Does anyone know what would be responsible for such a large discrepancy between NumPy and Octave? (Again, I haven't tested the code with MATLAB, but I'd expect to see similar results.)
Of course, one can also compute the matrix exponential by first diagonalizing the matrix. I have done this and have gotten similar or worse results (with NumPy).
EDITS
My scipy version is 0.14.0. I am aware that Octave/MATLAB use the Padé approximation scheme, and I am familiar with this algorithm. I am not sure what scipy does, but we can try the following.
Diagonalize the matrix with numpy's eig or eigh (in our case the latter works fine, since the matrix is Hermitian). As a result we get two matrices: a diagonal matrix D and a matrix U, with D consisting of the eigenvalues of the original matrix on its diagonal, and U consisting of the corresponding eigenvectors as columns, so that the original matrix is given by U.dot(D).dot(U.T).
Exponentiate D (this is now easy since D is diagonal: exponentiate its diagonal entries).
Now, if M is the original matrix and d is the original vector d=[1]+[0]*(n-1), we get scipy.linalg.expm(-1j*s*M).dot(d) = U.dot(expD).dot(U.T.dot(d)), where expD is the diagonal matrix with entries exp(-1j*s*D[k,k]).
Unfortunately, this produces the same result as before. Thus this probably has something to do either with the way numpy.linalg.eig and numpy.linalg.eigh work, or with the way numpy does arithmetic internally.
So the question is: how do we increase numpy's precision? Indeed, as mentioned above, Octave seems to do a much finer job in this case.
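For concreteness, a runnable sketch of that diagonalization route (my code, reusing build_laplacian from the sample code above):
import numpy as np

n, s = 500, 2
M = build_laplacian(n)                     # from the sample code above
d = np.zeros(n); d[0] = 1.0                # d = [1] + [0]*(n-1)

w, U = np.linalg.eigh(M)                   # M = U @ diag(w) @ U.T, eigenvectors as columns
wave = U @ (np.exp(-1j*s*w) * (U.T @ d))   # equals expm(-1j*s*M) @ d
print(np.min(np.abs(wave)**2))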
The following code
import numpy as np
from scipy import linalg as la
import scipy

print(np.__version__)
print(scipy.__version__)

def build_laplacian(n):
    lap = np.zeros([n, n])
    for j in range(n-1):
        lap[j+1][j] = 1
        lap[j][j+1] = 1
    lap[n-1][n-2] = 1
    lap[n-2][n-1] = 1
    return lap

def evolve(s, lap):
    wave = la.expm(-1j*s*lap).dot([1] + [0]*(lap.shape[0]-1))
    for i in range(len(wave)):
        wave[i] = la.norm(wave[i])**2
    return wave

r = evolve(2, build_laplacian(500))
print(np.min(abs(r)))
print(r[59])
prints
1.8.1
0.14.0
0
(2.77560227344e-101+0j)
for me, with OpenBLAS 0.2.8-6ubuntu1.
So it appears your problem is not immediately reproduced. (Your code examples as originally posted were not runnable as-is due to typos.)
As mentioned in the scipy.linalg.expm documentation, the algorithm is from Al-Mohy and Higham (2009), which is different from the simpler scaling-and-squaring Padé approach in Octave.
As a consequence, the results I get from Octave are also slightly different, although the results are eps-close in matrix norms (1, 2, inf). MATLAB uses the Padé approach from Higham (2005), which seems to give the same results as SciPy above.
