This is one of the standard example code we find every where...
import time
import numpy
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
import pycuda.autoinit
size = 1e7
t0 = time.time()
x = numpy.linspace(1, size, size).astype(numpy.float32)
y = numpy.sin(x)
t1 = time.time()
cpuTime = t1-t0
print(cpuTime)
t0 = time.time()
x_gpu = gpuarray.to_gpu(x)
y_gpu = cumath.sin(x_gpu)
y = y_gpu.get()
t1 = time.time()
gpuTime = t1-t0
print(gpuTime)
the results are: 200 msec for cpu and 2.45 sec for GPU... more then 10X
I'm running on win 10... vs 2015 with PTVS...
Best regards...
Steph
It looks like pycuda introduces some additional overhead the first time you call the cumath.sin() function (~400ms on my system). I suspect this is due to the need to compile CUDA code for the function being called. More importantly, this overhead is independent of the size of the array being passed to the function. Additional calls to cumath.sin() are much faster, with CUDA code already compiled for use. On my system, the gpu code given in the question runs in about 20ms (for repeated runs), compared to roughly 130ms for the numpy code.
I don't profess to know much at all about the inner workings of pycuda, so would be interested to hear other people's opinions on this.
Related
I have a a loop in which I'm calculating several pseudoinverses of rather large, non-sparse matrices (eg. 20000x800).
As my code spends most time on the pinv, I was trying to find a way to speed up the computation. I'm already using multiprocessing (joblib/loky) to run with several processes, but that of course increases also overhead. Using jit did not help much.
Is there a faster way / better implementation to compute pseudoinverse using any function? Precision isn't key.
My current benchmark
import time
import numba
import numpy as np
from numpy.linalg import pinv as np_pinv
from scipy.linalg import pinv as scipy_pinv
from scipy.linalg import pinv2 as scipy_pinv2
#numba.njit
def np_jit_pinv(A):
return np_pinv(A)
matrix = np.random.rand(20000, 800)
for pinv in [np_pinv, scipy_pinv, scipy_pinv2, np_jit_pinv]:
start = time.time()
pinv(matrix)
print(f'{pinv.__module__ +"."+pinv.__name__} took {time.time()-start:.3f}')
numpy.linalg.pinv took 2.774
scipy.linalg.basic.pinv took 1.906
scipy.linalg.basic.pinv2 took 1.682
__main__.np_jit_pinv took 2.446
EDIT:
JAX seems to be 30% faster! impressive! Thanks for letting me know #yuri-brigance . For Windows it works well under WSL.
numpy.linalg.pinv took 2.774
scipy.linalg.basic.pinv took 1.906
scipy.linalg.basic.pinv2 took 1.682
__main__.np_jit_pinv took 2.446
jax._src.numpy.linalg.pinv took 0.995
Try with JAX:
import jax.numpy as jnp
jnp.linalg.pinv(A)
Seems to be slightly faster than regular numpy.linalg.pinv. On my machine your benchmark looks like this:
jax._src.numpy.linalg.pinv took 3.127
numpy.linalg.pinv took 4.284
I was randomly comparing the computation times of an explicit for-loop with vectorized implementation in numpy. I ran exactly 1 million iterations and found some astounding differences. For-loop took about 646ms while the np.exp() function computed the same result in less than 20ms.
import time
import math
import numpy as np
iter = 1000000
x = np.zeros((iter,1))
v = np.random.randn(iter,1)
before = time.time()
for i in range(iter):
x[i] = math.exp(v[i])
after = time.time()
print(x)
print("Non vectorized= " + str((after-before)*1000) + "ms")
before = time.time()
x = np.exp(v)
after = time.time()
print(x)
print("Vectorized= " + str((after-before)*1000) + "ms")
The result I got:
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Non vectorized= 646.1577415466309ms
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Vectorized= 19.547224044799805ms
My questions are:
What exactly is happening in the second case? The first one is using
an explicit for-loop and thus the computation time is justified.
What is happening "behind the scenes" in the second case?
How can one implement such computations (second case) without using numpy (in plain Python)?
What is happening is that NumPy is calling high quality numerical libraries (BLAS for instance) which are very good at vector arithmetic.
I imagine you could specifically call the exact libraries used by NumPy, however, NumPy would likely know best which to use.
NumPy is a Python wrapper over libraries and code written in C. This is a large part of the efficiency of NumPy. C code compiles directly to instructions which are executed by your processor or GPU. On the other hand, Python code must be interpreted as it executes. Despite the ever increasing speed we can get from interpreted languages with advances like Just In Time Compilers, for some tasks they will never be able to approach the speed of compiled languages.
It comes down to the fact that Python does not have direct access to the hardware level.
Python can't use the SIMD (Single instruction, multiple data) assembly instructions that most modern CPU's and GPU's have. These SIMD instruction allow a single operation to execute on a vector of data all at once (within a single clock cycle) at the hardware level.
NumPy on the other hand has functions built in C, and C is a language capable of running SIMD instructions. Therefore NumPy can take advantage of the vectorization hardware in your processor.
I am trying to improve the performance of NumPy in Python 3.6 using Intel's MKL. With a fresh anaconda installation i created a MKL environment using:
conda create -n idp intelpython3_core python=3
As written in this article,
it seems that the MKL has internal thresholds to decide whether to use threading or not. It seems one of these thresholds is given by the vector size used in the calculations (kind of obvious). This threshold is set to a vector size of 8192 (at least for my machine). When vectors exceed this size, i can observe my python scripts using 4 threads (i have 2 cores with hyper threading) for calculations like:
import numpy as np
x = np.random.rand(8193)
y = np.sin(x)
So far everything is working as intended.
Beside the threading part, MKL "Features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family" (read here). Since the problems i'm usually working on do not exceed the vector size threshold, i'm not interested in the performance increase which is obtained by threading, but more in the optimized math functions of MKL. Unfortunately it seems like those are only used, when the vector size is above the threshold.
I've written a sample code to measure the performance of the sine operation on vectors with different sizes:
from timeit import default_timer as timer
import mkl
import numpy as np
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
np.random.seed(0)
Nop = int(1e4)
def func(x):
return np.sin(x)
def measure(x):
t1 = timer()
for i in range(0, Nop):
func(x)
t2 = timer()
diff = (t2 - t1)*1000.0
print("vec size: %i:" % len(x), end="")
print("\t time needed: %f ms" % diff)
x0 = np.random.rand(20000)
measure(np.array(x0[:8192]))
measure(np.array(x0[:8193]))
measure(np.array(x0[:8192]))
These lines:
import mkl
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
are just there to make sure, that the increase in performance is not due to threading (i also checked the CPU usage, it is indeed only using one thread)
I get these results:
vec size: 8192: time needed: 8185.900477 ms
vec size: 8193: time needed: 436.843237 ms
vec size: 8192: time needed: 1777.306942 ms
As you can see, the 8193-vector runs roughly 20x faster than the 8192-vector. What is even more confusing is the fact, that the second run of the 8193-vector is 4x faster then before, after doing the calculation on the bigger vector.
Now my questions:
Am i doing anything obviously wrong, which i am not aware of, which
leads to these results?
Can anyone reproduce these results or is it just my installation/my
machine behaving like this
Is the increase in performance really due to the optimized
implementation of sine?
Is it possible to enforce always using the optimized version of sine
independent of the vector size?
PS:
I actually tried the following in the simulation i'm running for my master thesis, which involve a lot of sine and cosine function calls:
Just added this line before anything else is calculated:
np.sin(np.zeros(8193))
And now everything runs 50% faster.
I'm trying to compare Pyfftw (in Python 3.6) with matlab r2017a fft.
import time
import numpy
import pyfftw
import multiprocessing
nthread = multiprocessing.cpu_count()
print(nthread)
n=2**20
a = pyfftw.empty_aligned(n, dtype='complex128')
print("fft_object = pyfftw.builders.fft(a)")
fft_object = pyfftw.builders.fft(a) #this instruction spend much time
print("generate numbers")
a[:]= 5*numpy.random.rand(n)
print(a)
print("start fft")
start = time.clock()
y=fft_object()
end4 = time.clock() - start
print(end, time:")
print(end4)
print("result")
print(y)
print(len(y))
while if i use matlab:
x=5*rand(2^20,1);tic;fft(x);toc
this request just the time for computation of fft algorithm, that is the approximatively the same time of the python's call on fft_object().
Thanks in advance for your kind support.
You might take a look at GPU-based codes (if you have the proper hardware):
http://pypi.python.org/pypi/pyfft
http://pypi.python.org/pypi/scikits.cuda
They are based on PyCuda und PyOpenCL. I don't have much experience with these so you have to do a little digging to find what suits you best.
I am running into a bizarre problem that I can't explain. I'm hoping someone out there can help please!
I'm running Python 2.7.3 and Scipy v0.14.0 and am trying to implement some very simple multiprocessor algorithms to speeds up my code using the module multiprocessing. I've managed to make a basic example work:
import multiprocessing
import numpy as np
import time
# import scipy.special
def compute_something(t):
a = 0.
for i in range(100000):
a = np.sqrt(t)
return a
if __name__ == '__main__':
pool_size = multiprocessing.cpu_count()
print "Pool size:", pool_size
pool = multiprocessing.Pool(processes=pool_size)
inputs = range(10)
tic = time.time()
builtin_outputs = map(compute_something, inputs)
print 'Built-in:', time.time() - tic
tic = time.time()
pool_outputs = pool.map(compute_something, inputs)
print 'Pool :', time.time() - tic
This runs fine, returning
Pool size: 8
Built-in: 1.56904006004
Pool : 0.447728157043
But if I uncomment the line import scipy.special, I get:
Pool size: 8
Built-in: 1.58968091011
Pool : 1.59387993813
and I can see that only one core is doing the work on my system. In fact, importing any module from the scipy package seems to have this effect (I've tried several).
Any ideas? I've never seen a case like this before, where an apparently innocuous import can have such a strange and unexpected effect.
Thanks!
Update (1)
Moving the scipy import line to the function compute_something partially improves the problem:
Pool size: 8
Built-in: 1.66807389259
Pool : 0.596321105957
Update (2)
Thanks to #larsmans for testing on a different system. Problem was not confirmed using Scipy v.0.12.0. Moving this query to the scipy mailing list and will post any answers.
After much digging around and posting an issue on the Scipy GitHub site, I've found a solution.
Before I start, this is documented very well here - I'll just give an overview.
This problem is not related to the version of Scipy, or Numpy that I was using. It originates in the system BLAS libraries that Numpy and Scipy use for various linear algebra routines. You can tell which libraries Numpy is linked to by running
python -c 'import numpy; numpy.show_config()'
If you are using OpenBLAS in Linux, you may find that the CPU affinity is set to 1, meaning that once these algorithms are imported in Python (via Numpy/Scipy), you can access at most one core of the CPU. To test this, within a Python terminal run
import os
os.system('taskset -p %s' %os.getpid())
If the CPU affinity is returned as f, of ff, you can access multiple cores. In my case it would start like that, but upon importing numpy or scipy.any_module, it would switch to 1, hence my problem.
I've found two solutions:
Change CPU affinity
You can manually set the CPU affinity of the master process at the top of the main function so that the code looks like this:
import multiprocessing
import numpy as np
import math
import time
import os
def compute_something(t):
a = 0.
for i in range(10000000):
a = math.sqrt(t)
return a
if __name__ == '__main__':
pool_size = multiprocessing.cpu_count()
os.system('taskset -cp 0-%d %s' % (pool_size, os.getpid()))
print "Pool size:", pool_size
pool = multiprocessing.Pool(processes=pool_size)
inputs = range(10)
tic = time.time()
builtin_outputs = map(compute_something, inputs)
print 'Built-in:', time.time() - tic
tic = time.time()
pool_outputs = pool.map(compute_something, inputs)
print 'Pool :', time.time() - tic
Note that selecting a value higher than the number of cores for taskset doesn't seem to matter - it just uses the maximum possible number.
Switch BLAS libraries
Solution documented at the site linked above. Basically: install libatlas and run update-alternatives to point numpy to ATLAS rather than OpenBLAS.