Fastest way to compute the pseudoinverse (pinv) in Python

I have a loop in which I'm computing several pseudoinverses of rather large, non-sparse matrices (e.g. 20000x800).
As my code spends most of its time on the pinv, I was trying to find a way to speed up the computation. I'm already using multiprocessing (joblib/loky) to run with several processes, but that of course also adds overhead. Using jit did not help much.
Is there a faster way / better implementation to compute the pseudoinverse using any function? Precision isn't key.
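One commonly suggested shortcut, not from the original post: when the matrix is tall and has full column rank, pinv(A) equals inv(A.T @ A) @ A.T, so solving the normal equations avoids the SVD entirely, at some cost in numerical accuracy. A minimal sketch under that full-rank assumption:

import numpy as np

def fast_pinv_tall(A):
    # Assumes A (m x n, m >= n) has full column rank; in that case
    # pinv(A) == inv(A.T @ A) @ A.T, and solving the small n x n
    # system is far cheaper than a full SVD (but less accurate for
    # ill-conditioned A).
    return np.linalg.solve(A.T @ A, A.T)

For a 20000x800 matrix, A.T @ A is only 800x800, which is why this tends to be fast.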
My current benchmark
import time
import numba
import numpy as np
from numpy.linalg import pinv as np_pinv
from scipy.linalg import pinv as scipy_pinv
from scipy.linalg import pinv2 as scipy_pinv2

@numba.njit
def np_jit_pinv(A):
    return np_pinv(A)

matrix = np.random.rand(20000, 800)
for pinv in [np_pinv, scipy_pinv, scipy_pinv2, np_jit_pinv]:
    start = time.time()
    pinv(matrix)
    print(f'{pinv.__module__ + "." + pinv.__name__} took {time.time()-start:.3f}')
numpy.linalg.pinv took 2.774
scipy.linalg.basic.pinv took 1.906
scipy.linalg.basic.pinv2 took 1.682
__main__.np_jit_pinv took 2.446
EDIT:
JAX seems to be 30% faster! Impressive! Thanks for letting me know, @yuri-brigance. On Windows it works well under WSL.
numpy.linalg.pinv took 2.774
scipy.linalg.basic.pinv took 1.906
scipy.linalg.basic.pinv2 took 1.682
__main__.np_jit_pinv took 2.446
jax._src.numpy.linalg.pinv took 0.995

Try with JAX:
import jax.numpy as jnp
jnp.linalg.pinv(A)
Seems to be slightly faster than regular numpy.linalg.pinv. On my machine your benchmark looks like this:
jax._src.numpy.linalg.pinv took 3.127
numpy.linalg.pinv took 4.284
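One caveat when timing JAX (my addition, not part of the original answer): JAX dispatches work asynchronously, so a fair benchmark should call block_until_ready() on the result and exclude the first call, which includes compilation. A minimal sketch:

import time
import numpy as np
import jax.numpy as jnp

matrix = np.random.rand(20000, 800)
jnp.linalg.pinv(matrix).block_until_ready()  # warm-up: pays the one-off compilation cost

start = time.time()
jnp.linalg.pinv(matrix).block_until_ready()  # wait for the async result before stopping the clock
print(f'jax pinv took {time.time() - start:.3f}')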

Related

Why does Numba skew the timings of a JIT-compiled function?

I'm trying to benchmark a Python function that does list operations using Numba against the CPython interpreter. To compare end-to-end time I used the Linux time utility.
time python3.10 list.py
As I understand it, the first invocation will be expensive due to JIT compilation, but that does not explain why the maximum recorded time is longer than the total time taken to run the entire script.
# list.py
import numpy as np
from time import time, perf_counter
from numba import njit

@njit
def listOperations():
    list = []
    for i in range(1000):
        list.append(i)
    list.sort(reverse=True)
    list.remove(420)
    list.reverse()

if __name__ == "__main__":
    repetitions = 1000
    timings = np.zeros(repetitions)
    for rep in range(repetitions):
        start = time()  # Similar results with perf_counter too.
        listOperations()
        timings[rep] = time() - start
    # Convert to milliseconds
    timings *= 10e3
    print("Mean {}ms, Median {}ms, Std. Dev {}ms, Min {}ms, Max {}ms".format(
        float('%.4f' % np.mean(timings)),
        float('%.4f' % np.median(timings)),
        float('%.4f' % np.std(timings)),
        float('%.4f' % np.min(timings)),
        float('%.4f' % np.max(timings))))
For Numba it shows a maximum of ~66.3s while the time utility reports ~8s. The complete results are below.
'''
Numba --->
Mean 66.8154ms, Median 0.391ms, Std. Dev 2097.7752ms, Min 0.3219ms, Max 66371.1143ms
real 0m7.982s
user 0m8.248s
sys 0m0.100s
CPython3.10 --->
Mean 1.6395ms, Median 1.6284ms, Std. Dev 0.0708ms, Min 1.5759ms, Max 2.3198ms
real 0m1.115s
user 0m1.468s
sys 0m0.080s
'''
The main issue is that the compilation time is included in the timings. Indeed, Numba compiles functions lazily, on the first call. To prevent this, you must specify the prototype or execute the first call outside the timed region (which is generally a good practice in benchmarks).
You can use @njit('()') instead of @njit. With this fix, the Numba code is about twice as fast on my machine.
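A minimal warm-up sketch (my illustration; the function body is a simplified stand-in for the one in the question):

import numpy as np
from time import time
from numba import njit

@njit
def list_operations():
    lst = []
    for i in range(1000):
        lst.append(i)
    return lst

list_operations()  # first call outside the timed region: pays the JIT compilation cost

start = time()
list_operations()
print("post-compilation call took %.4f ms" % ((time() - start) * 1e3))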
Note that your function neither returns anything nor reads any parameters, so the JIT could optimize the whole function to a no-op. To avoid such biases, you should add a parameter, use it, and return the list. This apparently does not happen on my machine, but different versions of Numba may do it.
Note also that lists are generally not where Numba shines. Lists are generally slow (both with and without Numba). It is better to use arrays when the size is known, as in the sketch below.
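A minimal array-based rewrite of the same operations (my illustration, not from the original answer; masking stands in for list.remove):

import numpy as np
from numba import njit

@njit
def array_operations():
    arr = np.arange(1000)      # size known up front: use an array, not a list
    arr = np.sort(arr)[::-1]   # sort descending
    arr = arr[arr != 420]      # drop one value by masking instead of remove()
    return arr[::-1]           # reverse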
By the way, list is a built-in function. Overwriting it can cause sneaky bugs in modules that use it (which is frequent), so it is not a good idea; I advise you to use another name.
Furthermore, note that the standard deviation in your results is very large while the median is good and the maximum is huge: the timings are unstable, and the instability is due to a single slow call. Such results generally indicate either that the benchmark is flawed or that the function itself behaves unstably (typically due to a bug or an initialization done once).

Any benefits to importing submodules directly (it seems to be slower)?

I wanted to see which is faster:
import numpy as np
np.sqrt(4)
-or-
from numpy import sqrt
sqrt(4)
Here is the code I used to find the average time to run each.
def main():
    import gen_funs as gf
    from time import perf_counter_ns
    t = 0
    N = 40
    for j in range(N):
        tic = perf_counter_ns()
        for i in range(100000):
            imp2()  # I ran the code with this, then with imp1()
        toc = perf_counter_ns()
        t += (toc - tic)
    t /= N
    time = gf.ns2hms(t)  # Converts ns to a readable object
    print("Ave. time to run: {:d}h {:d}m {:d}s {:d}ms".
          format(time.hours, time.minutes, time.seconds, time.milliseconds))

def imp1():
    import numpy as np
    np.sqrt(4)
    return

def imp2():
    from numpy import sqrt
    sqrt(4)
    return

if __name__ == "__main__":
    main()
When I import numpy as np and then call np.sqrt(4), I get an average time of about 229 ms to run the inner loop.
When I run from numpy import sqrt and then call sqrt(4), I get an average time of about 332 ms.
Since there is such a difference in run time, what is the benefit of from numpy import sqrt? Is there a memory benefit or some other reason why I would do this?
I tried timing with the time bash command: I got 215 ms for importing numpy and running sqrt(4), and 193 ms for importing sqrt from numpy and doing the same. The difference is honestly negligible.
However, importing more of a module than you need is generally discouraged.
In this particular case there is no discernible performance benefit, and there are few situations in which you would import just numpy.sqrt: math.sqrt is ~4x faster on scalars, and the extra features numpy.sqrt offers are only usable on NumPy data, which would require importing the whole module anyway.
There might be a rare scenario in which you don't need all of numpy but still need numpy.sqrt, e.g. using pandas.DataFrame.to_numpy() and manipulating the data in some way, but honestly I don't think those 20 ms are worth anything in the real world, especially since you saw worse performance when importing just numpy.sqrt.
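For what it's worth, a minimal sketch (my addition) of why the repeated imports inside imp1/imp2 are cheap: after the first import, the module object is cached in sys.modules, so a repeated import numpy is essentially a dict lookup plus a name binding:

import sys
import timeit

import numpy  # first import: actually loads and initializes the module

print('numpy' in sys.modules)                        # True: the module object is cached
print(timeit.timeit('import numpy', number=100000))  # fast: every iteration is a cache hit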

Weird behavior of Intel's MKL and Python (Ubuntu 16.04, Conda 4.3.30)

I am trying to improve the performance of NumPy in Python 3.6 using Intel's MKL. With a fresh Anaconda installation I created an MKL environment using:
conda create -n idp intelpython3_core python=3
As written in this article, it seems that MKL has internal thresholds for deciding whether to use threading or not. One of these thresholds is apparently given by the vector size used in the calculations (which is kind of obvious). On my machine this threshold is a vector size of 8192. When vectors exceed this size, I can observe my Python scripts using 4 threads (I have 2 cores with hyper-threading) for calculations like:
import numpy as np
x = np.random.rand(8193)
y = np.sin(x)
So far everything is working as intended.
Besides the threading part, MKL "Features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family" (read here). Since the problems I usually work on do not exceed the vector-size threshold, I'm not interested in the performance increase obtained by threading, but rather in MKL's optimized math functions. Unfortunately, it seems those are only used when the vector size is above the threshold.
I've written some sample code to measure the performance of the sine operation on vectors of different sizes:
from timeit import default_timer as timer
import mkl
import numpy as np

mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
np.random.seed(0)
Nop = int(1e4)

def func(x):
    return np.sin(x)

def measure(x):
    t1 = timer()
    for i in range(0, Nop):
        func(x)
    t2 = timer()
    diff = (t2 - t1) * 1000.0
    print("vec size: %i:" % len(x), end="")
    print("\t time needed: %f ms" % diff)

x0 = np.random.rand(20000)
measure(np.array(x0[:8192]))
measure(np.array(x0[:8193]))
measure(np.array(x0[:8192]))
These lines:
import mkl
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
are just there to make sure that the increase in performance is not due to threading (I also checked the CPU usage; it is indeed only using one thread).
I get these results:
vec size: 8192: time needed: 8185.900477 ms
vec size: 8193: time needed: 436.843237 ms
vec size: 8192: time needed: 1777.306942 ms
As you can see, the run on the 8193-vector is roughly 20x faster than the first run on the 8192-vector. What is even more confusing is that the second run on the 8192-vector is about 4x faster than the first, after the calculation on the bigger vector has been done.
Now my questions:
1. Am I doing anything obviously wrong that I am not aware of, which leads to these results?
2. Can anyone reproduce these results, or is it just my installation/my machine behaving like this?
3. Is the increase in performance really due to the optimized implementation of sine?
4. Is it possible to enforce always using the optimized version of sine, independent of the vector size?
PS:
I actually tried the following in the simulation I'm running for my master's thesis, which involves a lot of sine and cosine calls. I just added this line before anything else is calculated:
np.sin(np.zeros(8193))
And now everything runs 50% faster.
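A minimal sketch of that warm-up pattern (my wrapping of the poster's one-liner; the threshold value is the one observed above and may differ per machine):

import numpy as np

def warm_up_mkl(threshold=8193):
    # One sin() call on a vector above the observed threshold; on the
    # poster's machine this appears to switch MKL to its optimized code
    # path for subsequent small-vector calls as well.
    np.sin(np.zeros(threshold))

warm_up_mkl()
# ... the rest of the simulation, with many sin/cos calls ...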

pyfftw slower than MATLAB's fft

I'm trying to compare pyfftw (in Python 3.6) with MATLAB R2017a's fft.
import time
import numpy
import pyfftw
import multiprocessing

nthread = multiprocessing.cpu_count()
print(nthread)

n = 2**20
a = pyfftw.empty_aligned(n, dtype='complex128')
print("fft_object = pyfftw.builders.fft(a)")
fft_object = pyfftw.builders.fft(a)  # this instruction spends much time
print("generate numbers")
a[:] = 5*numpy.random.rand(n)
print(a)
print("start fft")
start = time.clock()
y = fft_object()
end4 = time.clock() - start
print("end, time:")
print(end4)
print("result")
print(y)
print(len(y))
whereas if I use MATLAB:
x=5*rand(2^20,1);tic;fft(x);toc
this measures just the time for the FFT computation itself, which is approximately the same as the time of the Python call to fft_object().
Thanks in advance for your kind support.
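For reference, a minimal sketch that separates pyfftw's one-off planning cost from the per-call execution time (the threads argument is a standard pyfftw.builders option; this illustration is not from the original post):

import multiprocessing
import time
import numpy as np
import pyfftw

n = 2**20
a = pyfftw.empty_aligned(n, dtype='complex128')

t0 = time.perf_counter()
# Planning happens once, here; MATLAB's tic/toc above measures only execution.
fft_object = pyfftw.builders.fft(a, threads=multiprocessing.cpu_count())
print("planning took %.3f s" % (time.perf_counter() - t0))

a[:] = 5 * np.random.rand(n)
t0 = time.perf_counter()
y = fft_object()  # execution only: comparable to MATLAB's timed fft(x)
print("execution took %.3f s" % (time.perf_counter() - t0))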
You might take a look at GPU-based codes (if you have the proper hardware):
http://pypi.python.org/pypi/pyfft
http://pypi.python.org/pypi/scikits.cuda
They are based on PyCUDA and PyOpenCL. I don't have much experience with these, so you'll have to do a little digging to find what suits you best.

Why does my GPU code run much slower than the CPU?

This is one of the standard example codes found everywhere...
import time
import numpy
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
import pycuda.autoinit

size = int(1e7)

t0 = time.time()
x = numpy.linspace(1, size, size).astype(numpy.float32)
y = numpy.sin(x)
t1 = time.time()
cpuTime = t1 - t0
print(cpuTime)

t0 = time.time()
x_gpu = gpuarray.to_gpu(x)
y_gpu = cumath.sin(x_gpu)
y = y_gpu.get()
t1 = time.time()
gpuTime = t1 - t0
print(gpuTime)
The results are: 200 ms for the CPU and 2.45 s for the GPU... more than 10x slower.
I'm running on Windows 10, using Visual Studio 2015 with PTVS.
Best regards,
Steph
It looks like pycuda introduces some additional overhead the first time you call the cumath.sin() function (~400ms on my system). I suspect this is due to the need to compile CUDA code for the function being called. More importantly, this overhead is independent of the size of the array being passed to the function. Additional calls to cumath.sin() are much faster, with CUDA code already compiled for use. On my system, the gpu code given in the question runs in about 20ms (for repeated runs), compared to roughly 130ms for the numpy code.
I don't profess to know much at all about the inner workings of pycuda, so would be interested to hear other people's opinions on this.
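A minimal sketch of the warm-up-then-time pattern described above (my illustration):

import time
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath

x = np.linspace(1, 1e7, int(1e7)).astype(np.float32)
x_gpu = gpuarray.to_gpu(x)

cumath.sin(x_gpu)  # first call: pays the one-off CUDA compilation overhead

t0 = time.time()
y = cumath.sin(x_gpu).get()  # later calls time the kernel plus the transfer back
print("gpu (post warm-up): %.3f s" % (time.time() - t0))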
