I'm trying to compare pyFFTW (in Python 3.6) with MATLAB R2017a's fft.
import time
import numpy
import pyfftw
import multiprocessing
nthread = multiprocessing.cpu_count()
print(nthread)
n=2**20
a = pyfftw.empty_aligned(n, dtype='complex128')
print("fft_object = pyfftw.builders.fft(a)")
fft_object = pyfftw.builders.fft(a)  # this line takes a long time (the FFTW planning step)
print("generate numbers")
a[:]= 5*numpy.random.rand(n)
print(a)
print("start fft")
start = time.clock()
y=fft_object()
end4 = time.clock() - start
print(end, time:")
print(end4)
print("result")
print(y)
print(len(y))
Whereas if I use MATLAB:
x=5*rand(2^20,1);tic;fft(x);toc
this measures only the time needed to compute the FFT itself, which is approximately the same as the time of the Python call to fft_object(); the difference is that pyFFTW additionally spends a long time in the planning call pyfftw.builders.fft(a).
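If the planning time itself is the concern, a minimal sketch (assuming pyFFTW's builders interface and wisdom functions behave as documented; the wisdom.pickle file name is just an example) would be to ask for a cheaper planner effort and/or reuse saved wisdom across runs:
import pickle
import pyfftw
n = 2**20
a = pyfftw.empty_aligned(n, dtype='complex128')
# Cheaper planning: FFTW_ESTIMATE plans almost instantly, at the cost of a
# possibly slower transform than the default FFTW_MEASURE.
fft_object = pyfftw.builders.fft(a, planner_effort='FFTW_ESTIMATE')
# Alternatively, save the accumulated wisdom once and reload it in later runs,
# so the expensive planning is only paid the first time.
with open('wisdom.pickle', 'wb') as f:
    pickle.dump(pyfftw.export_wisdom(), f)
with open('wisdom.pickle', 'rb') as f:
    pyfftw.import_wisdom(pickle.load(f))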
Thanks in advance for your kind support.
You might take a look at GPU-based codes (if you have the proper hardware):
http://pypi.python.org/pypi/pyfft
http://pypi.python.org/pypi/scikits.cuda
They are based on PyCUDA and PyOpenCL. I don't have much experience with these, so you will have to do a little digging to find what suits you best.
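As a rough illustration only (not tested against the question's setup; the module is called skcuda in recent scikit-cuda releases and scikits.cuda in older ones, and it needs a working CUDA toolkit), the GPU FFT can be driven from PyCUDA roughly like this:
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from skcuda.fft import Plan, fft  # older releases: from scikits.cuda.fft import Plan, fft
n = 2**20
a = (5 * np.random.rand(n)).astype(np.complex128)
a_gpu = gpuarray.to_gpu(a)                    # copy input to the GPU
y_gpu = gpuarray.empty(n, np.complex128)      # allocate output on the GPU
plan = Plan(n, np.complex128, np.complex128)  # cuFFT plan (analogous to FFTW planning)
fft(a_gpu, y_gpu, plan)                       # execute the transform
y = y_gpu.get()                               # copy the result back to the host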
I have a loop in which I'm calculating several pseudoinverses of rather large, non-sparse matrices (e.g. 20000x800).
As my code spends most of its time in pinv, I was trying to find a way to speed up the computation. I'm already using multiprocessing (joblib/loky) to run several processes, but that of course also adds overhead. Using jit did not help much.
Is there a faster way / a better implementation to compute the pseudoinverse using any function? Precision isn't key.
My current benchmark
import time
import numba
import numpy as np
from numpy.linalg import pinv as np_pinv
from scipy.linalg import pinv as scipy_pinv
from scipy.linalg import pinv2 as scipy_pinv2
@numba.njit
def np_jit_pinv(A):
    return np_pinv(A)

matrix = np.random.rand(20000, 800)

for pinv in [np_pinv, scipy_pinv, scipy_pinv2, np_jit_pinv]:
    start = time.time()
    pinv(matrix)
    print(f'{pinv.__module__ + "." + pinv.__name__} took {time.time() - start:.3f}')
numpy.linalg.pinv took 2.774
scipy.linalg.basic.pinv took 1.906
scipy.linalg.basic.pinv2 took 1.682
__main__.np_jit_pinv took 2.446
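Since precision isn't key and the matrices are tall (20000x800), one alternative worth benchmarking is a pseudoinverse via the normal equations rather than the SVD; this is my own sketch, not part of the original comparison, and it is only valid when the matrix has full column rank:
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def normal_eq_pinv(A):
    # pinv(A) = (A^T A)^{-1} A^T for a full-column-rank A.
    # A^T A is only 800x800 here, so the Cholesky solve is much cheaper than
    # the SVD of the full 20000x800 matrix, but it is numerically less robust
    # (the condition number gets squared).
    c, low = cho_factor(A.T @ A)
    return cho_solve((c, low), A.T)

matrix = np.random.rand(20000, 800)
approx_pinv = normal_eq_pinv(matrix)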
EDIT:
JAX seems to be about 30% faster. Impressive! Thanks for letting me know, @yuri-brigance. On Windows it works well under WSL.
numpy.linalg.pinv took 2.774
scipy.linalg.basic.pinv took 1.906
scipy.linalg.basic.pinv2 took 1.682
__main__.np_jit_pinv took 2.446
jax._src.numpy.linalg.pinv took 0.995
Try with JAX:
import jax.numpy as jnp
jnp.linalg.pinv(A)
Seems to be slightly faster than regular numpy.linalg.pinv. On my machine your benchmark looks like this:
jax._src.numpy.linalg.pinv took 3.127
numpy.linalg.pinv took 4.284
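One caveat when benchmarking this yourself (my addition, not from the original answer): JAX dispatches work asynchronously, so a fair timing should convert the input once and block on the result:
import time
import numpy as np
import jax.numpy as jnp

matrix = np.random.rand(20000, 800)
A = jnp.asarray(matrix)          # one-time host-to-device transfer

start = time.time()
result = jnp.linalg.pinv(A)
result.block_until_ready()       # wait for the asynchronous computation to finish
print(f'jax pinv took {time.time() - start:.3f}')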
I wanted to see which is faster:
import numpy as np
np.sqrt(4)
-or-
from numpy import sqrt
sqrt(4)
Here is the code I used to find the average time to run each.
def main():
    import gen_funs as gf
    from time import perf_counter_ns

    t = 0
    N = 40
    for j in range(N):
        tic = perf_counter_ns()
        for i in range(100000):
            imp2()  # I ran the code with this, then with imp1()
        toc = perf_counter_ns()
        t += (toc - tic)
    t /= N

    time = gf.ns2hms(t)  # converts ns to a readable object
    print("Ave. time to run: {:d}h {:d}m {:d}s {:d}ms".format(
        time.hours, time.minutes, time.seconds, time.milliseconds))

def imp1():
    import numpy as np
    np.sqrt(4)
    return

def imp2():
    from numpy import sqrt
    sqrt(4)
    return

if __name__ == "__main__":
    main()
When I import numpy as np and then call np.sqrt(4), I get an average time of about 229 ms (the time to run the inner loop of 10**5 calls).
When I run from numpy import sqrt and then call sqrt(4), I get an average time of about 332 ms.
Since there is such a difference in run time, what is the benefit of using from numpy import sqrt? Is there a memory benefit or some other reason why I would do this?
I tried timing with the time bash command. I got 215ms for importing numpy and running sqrt(4) and 193ms for importing sqrt from numpy with the same command. The difference is negligible, honestly.
However, as a rule you should only import what you actually need from a module.
In this particular case there is no discernible performance benefit, and there are few situations in which you would want just numpy.sqrt: math.sqrt is roughly 4x faster for scalars, and the extra features numpy.sqrt offers are only useful if you have NumPy data, which would require importing the whole module anyway.
There might be a rare scenario in which you don't need all of numpy but still need numpy.sqrt, e.g. using pandas.DataFrame.to_numpy() and manipulating the data in some way, but honestly I don't think 20 ms of speed is worth anything in the real world, especially since you saw worse performance when importing just numpy.sqrt.
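To separate the cost of the name lookup from the cost of the import itself, a small timeit comparison is enough (my own sketch, not from the original answer); modules are imported only once and then cached, so what the loops above really measure is the repeated lookup and call:
import timeit

# Attribute lookup through the module object vs. a direct name binding.
print(timeit.timeit("np.sqrt(4)", setup="import numpy as np", number=100000))
print(timeit.timeit("sqrt(4)", setup="from numpy import sqrt", number=100000))
# For scalars, math.sqrt avoids NumPy's array machinery entirely.
print(timeit.timeit("sqrt(4)", setup="from math import sqrt", number=100000))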
I am trying to improve the performance of NumPy in Python 3.6 using Intel's MKL. With a fresh Anaconda installation I created an MKL environment using:
conda create -n idp intelpython3_core python=3
As written in this article,
it seems that MKL has internal thresholds to decide whether to use threading or not. It seems one of these thresholds is given by the vector size used in the calculations (kind of obvious). This threshold is set to a vector size of 8192 (at least on my machine). When vectors exceed this size, I can observe my Python scripts using 4 threads (I have 2 cores with hyper-threading) for calculations like:
import numpy as np
x = np.random.rand(8193)
y = np.sin(x)
So far everything is working as intended.
Besides the threading part, MKL "features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family" (read here). Since the problems I'm usually working on do not exceed the vector size threshold, I'm not interested in the performance increase obtained by threading, but rather in the optimized math functions of MKL. Unfortunately, it seems those are only used when the vector size is above the threshold.
I've written a sample code to measure the performance of the sine operation on vectors with different sizes:
from timeit import default_timer as timer
import mkl
import numpy as np
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
np.random.seed(0)
Nop = int(1e4)
def func(x):
    return np.sin(x)

def measure(x):
    t1 = timer()
    for i in range(0, Nop):
        func(x)
    t2 = timer()
    diff = (t2 - t1) * 1000.0
    print("vec size: %i:" % len(x), end="")
    print("\t time needed: %f ms" % diff)
x0 = np.random.rand(20000)
measure(np.array(x0[:8192]))
measure(np.array(x0[:8193]))
measure(np.array(x0[:8192]))
These lines:
import mkl
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
are just there to make sure that the increase in performance is not due to threading (I also checked the CPU usage; it is indeed using only one thread).
I get these results:
vec size: 8192: time needed: 8185.900477 ms
vec size: 8193: time needed: 436.843237 ms
vec size: 8192: time needed: 1777.306942 ms
As you can see, the 8193-element vector runs roughly 20x faster than the 8192-element vector. What is even more confusing is that the second run on the 8192-element vector is about 4x faster than the first, after doing the calculation on the bigger vector.
Now my questions:
1. Am I doing anything obviously wrong that I am not aware of, which leads to these results?
2. Can anyone reproduce these results, or is it just my installation/my machine behaving like this?
3. Is the increase in performance really due to the optimized implementation of sine?
4. Is it possible to enforce always using the optimized version of sine, independent of the vector size?
PS:
I actually tried the following in the simulation I'm running for my master's thesis, which involves a lot of sine and cosine function calls.
I just added this line before anything else is calculated:
np.sin(np.zeros(8193))
And now everything runs 50% faster.
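Packaged as a tiny helper, that workaround looks roughly like the sketch below; the 8193 threshold is just the value observed on this machine, not a documented MKL constant:
import numpy as np

def warm_up_mkl(threshold_size=8193):
    # One throwaway call on a vector above the observed threshold appears to
    # switch MKL onto its optimized vector-math code path, which then also
    # speeds up subsequent calls on smaller vectors.
    np.sin(np.zeros(threshold_size))

warm_up_mkl()
# ... rest of the simulation with many sin/cos calls ...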
This is one of the standard example codes you find everywhere...
import time
import numpy
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
import pycuda.autoinit
size = int(1e7)
t0 = time.time()
x = numpy.linspace(1, size, size).astype(numpy.float32)
y = numpy.sin(x)
t1 = time.time()
cpuTime = t1-t0
print(cpuTime)
t0 = time.time()
x_gpu = gpuarray.to_gpu(x)
y_gpu = cumath.sin(x_gpu)
y = y_gpu.get()
t1 = time.time()
gpuTime = t1-t0
print(gpuTime)
The results are: 200 ms for the CPU and 2.45 s for the GPU... more than 10x slower.
I'm running on Windows 10, Visual Studio 2015 with PTVS...
Best regards...
Steph
It looks like pycuda introduces some additional overhead the first time you call the cumath.sin() function (~400ms on my system). I suspect this is due to the need to compile CUDA code for the function being called. More importantly, this overhead is independent of the size of the array being passed to the function. Additional calls to cumath.sin() are much faster, with CUDA code already compiled for use. On my system, the gpu code given in the question runs in about 20ms (for repeated runs), compared to roughly 130ms for the numpy code.
I don't profess to know much at all about the inner workings of pycuda, so would be interested to hear other people's opinions on this.
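A minimal way to separate that one-time cost from the transform time itself (my own sketch along the lines of the answer above) is to call cumath.sin once before starting the timer:
import time
import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath

size = int(1e7)
x = numpy.linspace(1, size, size).astype(numpy.float32)
x_gpu = gpuarray.to_gpu(x)

cumath.sin(x_gpu)  # warm-up call: pays the one-time compilation/initialization cost

t0 = time.time()
y_gpu = cumath.sin(x_gpu)
y = y_gpu.get()
print(time.time() - t0)  # now measures only the sin kernel plus the copy back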
I am running into a bizarre problem that I can't explain. I'm hoping someone out there can help please!
I'm running Python 2.7.3 and Scipy v0.14.0, and am trying to implement some very simple multiprocessing algorithms to speed up my code using the multiprocessing module. I've managed to make a basic example work:
import multiprocessing
import numpy as np
import time
# import scipy.special
def compute_something(t):
    a = 0.
    for i in range(100000):
        a = np.sqrt(t)
    return a

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    print "Pool size:", pool_size
    pool = multiprocessing.Pool(processes=pool_size)
    inputs = range(10)

    tic = time.time()
    builtin_outputs = map(compute_something, inputs)
    print 'Built-in:', time.time() - tic

    tic = time.time()
    pool_outputs = pool.map(compute_something, inputs)
    print 'Pool :', time.time() - tic
This runs fine, returning
Pool size: 8
Built-in: 1.56904006004
Pool : 0.447728157043
But if I uncomment the line import scipy.special, I get:
Pool size: 8
Built-in: 1.58968091011
Pool : 1.59387993813
and I can see that only one core is doing the work on my system. In fact, importing any module from the scipy package seems to have this effect (I've tried several).
Any ideas? I've never seen a case like this before, where an apparently innocuous import can have such a strange and unexpected effect.
Thanks!
Update (1)
Moving the scipy import line into the function compute_something partially fixes the problem:
Pool size: 8
Built-in: 1.66807389259
Pool : 0.596321105957
Update (2)
Thanks to @larsmans for testing on a different system. The problem was not confirmed using Scipy v0.12.0. I'm moving this query to the scipy mailing list and will post any answers.
After much digging around and posting an issue on the Scipy GitHub site, I've found a solution.
Before I start, this is documented very well here - I'll just give an overview.
This problem is not related to the version of Scipy, or Numpy that I was using. It originates in the system BLAS libraries that Numpy and Scipy use for various linear algebra routines. You can tell which libraries Numpy is linked to by running
python -c 'import numpy; numpy.show_config()'
If you are using OpenBLAS on Linux, you may find that the CPU affinity is set to 1, meaning that once these libraries are imported in Python (via Numpy/Scipy), the process can use at most one core of the CPU. To test this, run the following in a Python terminal:
import os
os.system('taskset -p %s' %os.getpid())
If the CPU affinity is returned as f or ff, you can access multiple cores. In my case it would start out like that, but upon importing numpy or scipy.any_module, it would switch to 1, hence my problem.
I've found two solutions:
Change CPU affinity
You can manually set the CPU affinity of the master process at the top of the main function so that the code looks like this:
import multiprocessing
import numpy as np
import math
import time
import os
def compute_something(t):
    a = 0.
    for i in range(10000000):
        a = math.sqrt(t)
    return a

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    os.system('taskset -cp 0-%d %s' % (pool_size, os.getpid()))

    print "Pool size:", pool_size
    pool = multiprocessing.Pool(processes=pool_size)
    inputs = range(10)

    tic = time.time()
    builtin_outputs = map(compute_something, inputs)
    print 'Built-in:', time.time() - tic

    tic = time.time()
    pool_outputs = pool.map(compute_something, inputs)
    print 'Pool :', time.time() - tic
Note that selecting a value higher than the number of cores for taskset doesn't seem to matter - it just uses the maximum possible number.
Switch BLAS libraries
This solution is documented at the site linked above. Basically: install libatlas and run update-alternatives to point numpy to ATLAS rather than OpenBLAS.
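As an aside (my addition, not one of the two solutions above): some OpenBLAS builds also honour the OPENBLAS_MAIN_FREE environment variable, which tells the library not to modify the process's CPU affinity. Whether it has any effect depends on how OpenBLAS was built, so treat this as something to try rather than a guaranteed fix:
import os

# Must be set before numpy/scipy (and hence OpenBLAS) are imported.
os.environ['OPENBLAS_MAIN_FREE'] = '1'

import numpy as np
import scipy.special

# The affinity mask should now still cover all cores:
os.system('taskset -p %s' % os.getpid())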