Scipy's correlate function is slow - python

I have compared different methods for convolving/correlating two signals using numpy/scipy. It turns out that there are huge differences in speed. I compared the following methods:
correlate from the numpy package (np.correlate in plot)
correlate from the scipy.signal package (sps.correlate in plot)
fftconvolve from scipy.signal (sps.fftconvolve in plot)
Now I of course understand that there is a considerable difference between fftconvolve and the other two functions. What I do not understand is why the sps.correlate is so much slower than np.correlate. Does anybody know why scipy uses an implementation that is so much slower?
For completeness, here is the code that produces the plot:
import time
import numpy as np
import scipy.signal as sps
from matplotlib import pyplot as plt

if __name__ == '__main__':
    # Integer signal lengths; np.random.rand() requires integer sizes
    a = (10**(np.arange(10)/2)).astype(int)
    print(a)
    results = {}
    results['np.correlate'] = np.zeros(len(a))
    results['sps.correlate'] = np.zeros(len(a))
    results['sps.fftconvolve'] = np.zeros(len(a))
    ii = 0
    for length in a:
        sig = np.random.rand(length)
        # time.perf_counter() replaces time.clock(), which was removed in Python 3.8
        t0 = time.perf_counter()
        for jj in range(3):
            np.correlate(sig, sig, 'full')
        t1 = time.perf_counter()
        elapsed = (t1-t0)/3
        results['np.correlate'][ii] = elapsed
        t0 = time.perf_counter()
        for jj in range(3):
            sps.correlate(sig, sig, 'full')
        t1 = time.perf_counter()
        elapsed = (t1-t0)/3
        results['sps.correlate'][ii] = elapsed
        t0 = time.perf_counter()
        for jj in range(3):
            sps.fftconvolve(sig, sig, 'full')
        t1 = time.perf_counter()
        elapsed = (t1-t0)/3
        results['sps.fftconvolve'][ii] = elapsed
        ii += 1
    ax = plt.figure()
    plt.loglog(a, results['np.correlate'], label='np.correlate')
    plt.loglog(a, results['sps.correlate'], label='sps.correlate')
    plt.loglog(a, results['sps.fftconvolve'], label='sps.fftconvolve')
    plt.xlabel('Signal length')
    plt.ylabel('Elapsed time in seconds')
    plt.legend()
    plt.grid()
    plt.show()

According to the documentation, numpy.correlate was designed for 1D arrays, while scipy.signal.correlate accepts N-dimensional arrays.
The scipy implementation is more general and therefore more complex, which seems to incur additional computational overhead. You can compare the C code of the numpy and scipy implementations.
Another difference could be, for instance, that the numpy implementation is better vectorized by the compiler on modern processors.
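As a rough sanity check, the three routes from the question can be compared for equality on 1D real input. This is a minimal sketch, assuming a SciPy version in which scipy.signal.correlate accepts the method argument ('direct' or 'fft'); the signal lengths and the seed are arbitrary choices made for illustration.

```python
import numpy as np
import scipy.signal as sps

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
a = rng.random(200)
v = rng.random(50)

# Reference result from numpy (1D only)
ref = np.correlate(a, v, 'full')

# scipy with the direct (sum-of-products) and the FFT-based method
direct = sps.correlate(a, v, mode='full', method='direct')
via_fft = sps.correlate(a, v, mode='full', method='fft')

# For real signals, correlation is convolution with the second signal reversed
via_conv = sps.fftconvolve(a, v[::-1], mode='full')

print(np.allclose(ref, direct), np.allclose(ref, via_fft), np.allclose(ref, via_conv))
```

If all three comparisons agree, any timing difference comes from the implementation path taken rather than from a different result being computed.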

Related

Fast way to generate large-scale random ndarray

I want to generate a random matrix of shape (1e7, 800). But I find numpy.random.rand() becomes very slow at this scale. Is there a quicker way?
A simple way to do that is to write a multi-threaded implementation using Numba:
import numba as nb
import numpy as np

@nb.njit('float64[:,:](int_, int_)', parallel=True)
def genRandom(n, m):
    res = np.empty((n, m))
    # Parallel loop
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = np.random.rand()
    return res
This is 6.4 times faster than np.random.rand() on my 6-core machine.
Note that using 32-bit floats may speed up the computation a bit, although the precision will be lower.
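The 32-bit suggestion can be tried without Numba. The sketch below is not the answer's code: it assumes NumPy's Generator API (np.random.default_rng), whose random method accepts a dtype argument, and it uses a deliberately small shape instead of the (1e7, 800) from the question.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility
n, m = 1000, 800  # small illustrative shape, not the full size from the question

x64 = rng.random((n, m))                    # float64 by default
x32 = rng.random((n, m), dtype=np.float32)  # half the memory per element

print(x64.dtype, x64.nbytes)
print(x32.dtype, x32.nbytes)
```

Halving the element size halves the memory traffic, which is often the limiting factor at this scale; whether it is actually faster depends on the machine.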
Numba is a good option. Another option that might work well is dask.array, which creates lazy blocks of numpy arrays and performs parallel computations on those blocks. On my machine I get a factor of 2 improvement in speed (for a 1e6 x 1e3 matrix, since I don't have enough memory for the full size).
rows = 10**6
cols = 10**3
import dask.array as da
x = da.random.random(size=(rows, cols)).compute() # takes about 5 seconds
# import numpy as np
# x = np.random.rand(rows, cols) # takes about 10 seconds
Note that .compute at the end only brings the computed array into memory. In general you can keep exploiting dask's parallel computations to get much faster calculations (which can also scale beyond a single machine); see the docs.
An attempt to compare the answers given so far:
I wrote a script compiled from the answers already given (by SultanOrazbayev and Jérôme Richard). It contains one function for each of the numba, dask and numpy approaches and measures the time spent for arrays of different sizes.
The code
import dask.array as da
import matplotlib.pyplot as plt
import numba as nb
import timeit
import numpy as np

@nb.njit('float64[:,:](int_, int_)', parallel=True)
def nmb(n, m):
    res = np.empty((n, m))
    # Parallel loop
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = np.random.rand()
    return res

def nmp(n, m):
    return np.random.random((n, m))

def dask(n, m):
    return da.random.random(size=(n, m)).compute()

if __name__ == '__main__':
    data = []
    for i in range(1, 16):
        dmm = 2 ** i
        s_nmb = timeit.default_timer()
        nmb(dmm, dmm)
        e_nmb = timeit.default_timer()
        s_nmp = timeit.default_timer()
        nmp(dmm, dmm)
        e_nmp = timeit.default_timer()
        s_dask = timeit.default_timer()
        dask(dmm, dmm)
        e_dask = timeit.default_timer()
        data.append([
            dmm,
            e_nmb - s_nmb,
            e_nmp - s_nmp,
            e_dask - s_dask
        ])
    data = np.array(data)
    plt.plot(data[:, 0], data[:, 1], "-r", label="Numba")
    plt.plot(data[:, 0], data[:, 2], "-g", label="Numpy")
    plt.plot(data[:, 0], data[:, 3], "-b", label="Dask")
    plt.xlabel("Number of elements along each axis")
    plt.ylabel("Time spent (s)")
    plt.legend()
    plt.show()
The result

slow quadratic constraint creation in Pyomo

I am trying to construct a large-scale quadratic constraint in Pyomo as follows:
import pyomo as pyo
from pyomo.environ import *
scale = 5000
pyo.n = Set(initialize=range(scale))
pyo.x = Var(pyo.n, bounds=(-1.0,1.0))
# Q is a n-by-n matrix in numpy array format, where n equals <scale>
Q_values = dict(zip(list(itertools.product(range(0,scale), range(0,scale))), Q.flatten()))
pyo.Q = Param(pyo.n, pyo.n, initialize=Q_values)
pyo.xQx = Constraint( expr=sum( pyo.x[i]*pyo.Q[i,j]*pyo.x[j] for i in pyo.n for j in pyo.n ) <= 1.0 )
It turns out the last line is unbearably slow at this problem scale. I tried several things mentioned in PyPSA, "Performance of creating Pyomo constraints" and "pyomo seems very slow to write models", but had no luck.
Any suggestions? (Once the model was constructed, Ipopt solving was also slow, but that is independent of Pyomo, I guess.)
PS: constructing the quadratic constraint directly as follows didn't help either (also unbearably slow):
pyo.xQx = Constraint( expr=sum( pyo.x[i]*Q[i,j]*pyo.x[j] for i in pyo.n for j in pyo.n ) <= 1.0 )
You can get a small speed-up by using quicksum in place of sum. To measure the performance of the last line, I modified your code a little bit, as shown:
import itertools
from pyomo.environ import *
import time
import numpy as np
scale = 5000
m = ConcreteModel()
m.n = Set(initialize=range(scale))
m.x = Var(m.n, bounds=(-1.0, 1.0))
# Q is a n-by-n matrix in numpy array format, where n equals <scale>
Q = np.ones([scale, scale])
Q_values = dict(
    zip(list(itertools.product(range(scale), range(scale))), Q.flatten()))
m.Q = Param(m.n, m.n, initialize=Q_values)
t = time.time()
m.xQx = Constraint(expr=sum(m.x[i]*m.Q[i, j]*m.x[j]
                            for i in m.n for j in m.n) <= 1.0)
print("Time to make QuadCon = {}".format(time.time() - t))
The time I measured with sum was around 174.4 s. With quicksum I got 163.3 seconds.
Not satisfied with such a modest gain, I tried to reformulate the constraint as an SOCP. If you can factorize Q as Q = F^T F, then you can easily express your constraint as a quadratic cone, as shown below:
import itertools
import time
import pyomo.kernel as pmo
from pyomo.environ import *
import numpy as np
scale = 5000
m = pmo.block()
m.n = np.arange(scale)
m.x = pmo.variable_list()
for j in m.n:
    m.x.append(pmo.variable(lb=-1.0, ub=1.0))
# Q = (F^T)F factors (eg.: Cholesky factor)
_F = np.ones([scale, scale])
t = time.time()
F = pmo.parameter_list()
for f in _F:
    _row = pmo.parameter_list(pmo.parameter(_e) for _e in f)
    F.append(_row)
print("Time taken to make parameter F = {}".format(time.time() - t))
t1 = time.time()
x_expr = pmo.expression_tuple(pmo.expression(
    expr=sum_product(f, m.x, index=m.n)) for f in F)
print("Time for evaluating Fx = {}".format(time.time() - t1))
t2 = time.time()
m.xQx = pmo.conic.quadratic.as_domain(1, x_expr)
print("Time for quad constr = {}".format(time.time() - t2))
Running on the same machine, I observed a time of around 112 seconds in the preparation of the expression that gets passed to the cone. Actually preparing the cone takes very little time (0.031 s).
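The reformulation rests on the identity x^T Q x = ||F x||^2 whenever Q = F^T F, so the constraint x^T Q x <= 1 is the same as ||F x||_2 <= 1. The following NumPy-only sketch checks that identity numerically. It assumes Q is symmetric positive definite (which a Cholesky factorization requires); the small random Q here is made up for illustration and is not the Q from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n = 5  # tiny size for illustration only

# Build a symmetric positive definite Q (assumption needed for Cholesky)
B = rng.random((n, n))
Q = B.T @ B + n * np.eye(n)

# np.linalg.cholesky returns a lower-triangular L with Q = L @ L.T, so F = L.T
L = np.linalg.cholesky(Q)
F = L.T

x = rng.uniform(-1.0, 1.0, size=n)
quad_form = x @ Q @ x             # x^T Q x
cone_form = np.sum((F @ x) ** 2)  # ||F x||^2

print(quad_form, cone_form)
```

Both printed values should agree up to floating-point rounding, which is what allows the n*n-term quadratic sum to be replaced by n linear expressions inside a cone.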
Naturally, the only solver that can handle Conic constraints in pyomo is MOSEK. A recent update to the Pyomo-MOSEK interface has also shown promising speed-ups.
You can try MOSEK for free by getting yourself a MOSEK trial license. If you want to read more about Conic reformulations, a quick and thorough guide can be found in the MOSEK modelling cookbook. Lastly, if you are affiliated with an academic institution, then we can offer you a personal/institutional academic license. Hope you find this useful.

python: get colors from ScalarMappable for entire numpy array

I have a large array of values packed in a 4D numpy array (thousands of values in x,y,z for thousands of times). For each of these values I need a 'color vector' (RGBA) from a matplotlib.cm.ScalarMappable object.
I've discovered that looping through such an array becomes rather slow, and I'm wondering if there's a way of significantly speeding it up by taking a different approach. For example, can an entire numpy array (greater than 2D) be passed to a ScalarMappable in order for this operation to take place in a more numpythonic or vectorized way?
My example code for a 3D case:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import timeit

def get_colors_1(data,x,y,z):
    colors = np.zeros( (x,y,z,4), dtype=np.float16)
    for i in range(x):
        for j in range(y):
            for k in range(z):
                colors[i,j,k,:] = m.to_rgba(data[i,j,k])
    return colors

def get_colors_2(data,x,y,z):
    colors = np.array([[[m.to_rgba(data[i,j,k]) for k in range(z)] for j in range(y)] for i in range(x)], dtype=np.float16)
    return colors

def get_colors_3(data,x,y,z):
    colors = np.zeros((x,y,z,4), dtype=np.float16)
    for i in range(x):
        colors[i,:,:,:] = m.to_rgba(data[i,:,:])
    return colors

x, y, z = 30, 20, 10
data = np.random.rand(x,y,z)
# plt.get_cmap is used because matplotlib.cm.get_cmap was removed in newer Matplotlib releases
cmap = plt.get_cmap('jet')
norm = matplotlib.colors.PowerNorm(vmin=0.0, vmax=1.0, gamma=2.5)
m = matplotlib.cm.ScalarMappable(norm=norm, cmap=cmap)
start_time = timeit.default_timer()
colors = get_colors_1(data,x,y,z)
elapsed = timeit.default_timer() - start_time
print('time elapsed: '+str(elapsed))
start_time = timeit.default_timer()
colors = get_colors_2(data,x,y,z)
elapsed = timeit.default_timer() - start_time
print('time elapsed: '+str(elapsed))
start_time = timeit.default_timer()
colors = get_colors_3(data,x,y,z)
elapsed = timeit.default_timer() - start_time
print('time elapsed: '+str(elapsed))
The third method (passing 2D arrays one at a time) shows a big performance bump, but I'm wondering if this can be pushed significantly further.
time elapsed: 0.5877857000014046
time elapsed: 0.5911024999986694
time elapsed: 0.004590500000631437
Looking at the help for ScalarMappable.to_rgba:
def to_rgba(self, x, alpha=None, bytes=False, norm=True):
    In the normal case, x is a 1D or 2D sequence of scalars, and
    the corresponding ndarray of rgba values will be returned,
    based on the norm and colormap set for this ScalarMappable.
    There is one special case, for handling images that are already
    rgb or rgba, such as might have been read from an image file.
    If x is an ndarray with 3 dimensions ...
Is
sm = ScalarMappable(norm=None, cmap=cmap)
flat = data.reshape(-1)  # a view
rgba = sm.to_rgba(flat, bytes=True, norm=False).reshape(data.shape + (4,))
much faster?
(Source: ScalarMappable.to_rgba.)
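The flatten-then-reshape idea can be checked against the slice-by-slice approach from the question. This is a minimal sketch under a few assumptions: the shape is kept small, the colormap is passed by name, and the norm is left switched on (unlike the norm=False variant above) so that the result is directly comparable with the question's get_colors_3.

```python
import numpy as np
from matplotlib.cm import ScalarMappable
from matplotlib.colors import PowerNorm

x, y, z = 6, 5, 4  # small illustrative shape
data = np.random.default_rng(0).random((x, y, z))

norm = PowerNorm(gamma=2.5, vmin=0.0, vmax=1.0)
m = ScalarMappable(norm=norm, cmap='jet')  # colormap given by name

# to_rgba treats 3D input as an image, so flatten first and reshape afterwards
colors = m.to_rgba(data.ravel()).reshape(data.shape + (4,))

# Reference: one 2D slice at a time, as in the question's third method
expected = np.stack([m.to_rgba(data[i]) for i in range(x)])

print(colors.shape, np.allclose(colors, expected))
```

Because the whole array goes through a single to_rgba call, the Python-level loop disappears entirely; the same pattern extends to the 4D case by reshaping back to data.shape + (4,).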

Python parallelisation of encapsulated for cycle with numba prange. Why not working

While experimenting with parallelising three nested for loops with numba, I realised that a naive approach does not actually improve performance.
The following code produces the following times (in seconds):
0.154625177383 # no numba
0.420143127441 # numba first time (lazy initialisation)
0.196285963058 # numba second time
0.200047016144 # numba third time
0.199403047562 # numba fourth time
Any idea what I am doing wrong?
import numpy as np
from numba import jit, prange
import time

def run_1():
    dims = [100,100,100]
    a = np.zeros(dims)
    for x in range(100):
        for y in range(100):
            for z in range(100):
                a[x,y,z] = 1
    return a

@jit
def run_2():
    dims = [100,100,100]
    a = np.zeros(dims)
    for x in prange(100):
        for y in prange(100):
            for z in prange(100):
                a[x,y,z] = 1
    return a

if __name__ == '__main__':
    t = time.time()
    run_1()
    elapsed1 = time.time() - t
    print(elapsed1)
    t = time.time()
    run_2()
    elapsed2 = time.time() - t
    print(elapsed2)
    t = time.time()
    run_2()
    elapsed3 = time.time() - t
    print(elapsed3)
    t = time.time()
    run_2()
    elapsed3 = time.time() - t
    print(elapsed3)
    t = time.time()
    run_2()
    elapsed3 = time.time() - t
    print(elapsed3)
I wonder if there's any code to JIT in these loops: there's no non-trivial Python code to compile, only thin wrappers over C code (yes, range is C code). Possibly the JIT only adds overhead trying to profile and generate (unsuccessfully) more efficient code.
If you want speed-up, think about parallelization using scipy or maybe direct access to NumPy arrays from Cython.

Is it possible to compute an inverse of sparse matrix in Python as fast as in Matlab?

It takes 0.02 seconds for Matlab to compute the inverse of a diagonal matrix using the sparse command.
P = diag(1:10000);
P = sparse(P);
tic;
A = inv(P);
toc
However, for the Python code it takes forever - several minutes.
import numpy as np
import time
startTime = time.time()
P = np.diag(range(1,10000))
A = np.linalg.inv(P)
runningTime = (time.time()-startTime)/60
print("The script was running for %f minutes" % runningTime)
I tried to use the scipy.sparse module, but it did not help much. The running time dropped, but only to 40 seconds.
import numpy as np
import time
import scipy.sparse as sps
import scipy.sparse.linalg as spsl
startTime = time.time()
P = np.diag(range(1,10000))
P_sps = sps.coo_matrix(P)
A = spsl.inv(P_sps)
runningTime = (time.time()-startTime)/60
print("The script was running for %f minutes" % runningTime)
Is it possible to run the code as fast as it runs in Matlab?
Here is the answer. When you run inv in Matlab on a sparse matrix, Matlab checks different properties of the matrix to optimize the calculation. For a sparse diagonal matrix, you can run the following code to see what Matlab is doing:
n = 10000;
a = diag(1:n);
a = sparse(a);
I = speye(n,n);
spparms('spumoni',1);
ainv = inv(a);
spparms('spumoni',0);
Matlab will print the following:
sp\: bandwidth = 0+1+0.
sp\: is A diagonal? yes.
sp\: do a diagonal solve.
So Matlab inverts only the diagonal.
How does Scipy invert the matrix?
Here is the code:
...
from scipy.sparse.linalg import spsolve
...
def inv(A):
    """
    Some comments...
    """
    I = speye(A.shape[0], A.shape[1], dtype=A.dtype, format=A.format)
    Ainv = spsolve(A, I)
    return Ainv
and spsolve
    # Cover the case where b is also a matrix
    Afactsolve = factorized(A)
    tempj = empty(M, dtype=int)
    x = A.__class__(b.shape)
    for j in range(b.shape[1]):
        xj = Afactsolve(squeeze(b[:, j].toarray()))
        w = where(xj != 0.0)[0]
        tempj.fill(j)
        x = x + A.__class__((xj[w], (w, tempj[:len(w)])),
                            shape=b.shape, dtype=A.dtype)
That is, scipy factorizes A and then solves a set of linear systems whose right-hand sides are the coordinate vectors (the columns of the identity matrix). Arranging all the solutions as the columns of a matrix gives the inverse of the initial matrix.
If Matlab takes advantage of the diagonal structure of the matrix but scipy does not (of course scipy also uses the structure of the matrix, but in a less efficient way, at least for this example), Matlab should be much faster.
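The factorize-once, solve-per-column procedure described above can be reproduced on a toy matrix. This is a small sketch of the idea, not the library's internal code: it uses scipy.sparse.linalg.factorized directly, a 6 x 6 matrix invented for illustration, and a dense result for easy comparison.

```python
import numpy as np
import scipy.sparse as sps
import scipy.sparse.linalg as spsl

n = 6  # tiny size for illustration
# Diagonal matrix plus one off-diagonal entry, converted to CSC for the factorization
A = sps.diags(np.arange(1.0, n + 1)).tolil()
A[2, 0] = 0.5
A = A.tocsc()

solve = spsl.factorized(A)  # factorize once, reuse for every right-hand side

# Solve A x_j = e_j for each coordinate vector e_j; the x_j are the columns of inv(A)
identity = np.eye(n)
A_inv = np.column_stack([solve(identity[:, j]) for j in range(n)])

print(np.allclose(A.toarray() @ A_inv, np.eye(n)))
```

For an n x n matrix this performs n separate solves, which explains why a purely diagonal input costs far more here than the element-wise reciprocal that Matlab's diagonal solve amounts to.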
EDIT
To be sure, as @P.Escondido proposed, we will try a minor modification of the matrix to trace the Matlab procedure when the matrix is not diagonal:
n = 10000; a = diag(1:n); a = sparse(a); ainv = sparse(n,n);
spparms('spumoni',1);
a(100,10) = 500; a(10,1000) = 200;
ainv = inv(a);
spparms('spumoni',0);
It prints out the following:
sp\: bandwidth = 90+1+990.
sp\: is A diagonal? no.
sp\: is band density (0.00) > bandden (0.50) to try banded solver? no.
sp\: is A triangular? no.
sp\: is A morally triangular? yes.
sp\: permute and solve.
sp\: sprealloc in sptsolve: 10000 10000 10000 15001
How about splu()? It is faster, but it needs a dense array and returns a dense array:
Create a random matrix:
import numpy as np
import time
import scipy.sparse as sps
import scipy.sparse.linalg as spsl
from numpy.random import randint
N = 1000
i = np.arange(N)
j = np.arange(N)
v = np.ones(N)
i2 = randint(0, N, N)
j2 = randint(0, N, N)
v2 = np.random.rand(N)
i = np.concatenate((i, i2))
j = np.concatenate((j, j2))
v = np.concatenate((v, v2))
A = sps.coo_matrix((v, (i, j)))
A = A.tocsc()
%time B = spsl.inv(A)
Calculate the inverse matrix with splu():
%%time
lu = spsl.splu(A)
eye = np.eye(N)
B2 = lu.solve(eye)
Check the result:
np.allclose(B.todense(), B2.T)
Here is the %time output:
inv: 2.39 s
splu: 193 ms
You are withholding crucial information from your software: the fact that the matrix is diagonal makes it super easy to invert. You simply invert each element of its diagonal:
P = np.diag(range(1,10000))
A = np.diag(1.0/np.arange(1,10000))
Of course, this is only valid for diagonal matrices...
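The same trick works without ever building the dense 10000 x 10000 array. The sketch below assumes scipy.sparse.diags for constructing the sparse diagonal matrices; it mirrors the Matlab example's diagonal of 1..10000 and checks the product against a sparse identity.

```python
import numpy as np
import scipy.sparse as sps

d = np.arange(1.0, 10001.0)           # diagonal entries 1..10000
P = sps.diags(d, format='csr')        # sparse diagonal matrix, never stored densely
A = sps.diags(1.0 / d, format='csr')  # inverse: reciprocal of each diagonal entry

# P @ A should be the identity; measure the largest deviation from it
residual = abs(P @ A - sps.identity(d.size, format='csr')).max()
print(residual)
```

The cost is linear in the number of diagonal entries, which is comparable in spirit to the "diagonal solve" branch that Matlab reports.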
If you try with that the result will be better:
import numpy as np
import time
import scipy.sparse as sps
import scipy.sparse.linalg as spsl
P = np.diag(range(1,10000))
P_sps = sps.coo_matrix(P)
startTime = time.time()
A = spsl.inv(P_sps)
runningTime = (time.time()-startTime)/60
print("The script was running for %f minutes" % runningTime)
Now you can compare with your matlab script.
