I ran a test script. It uses numpy.fft.fft(), anfft.fft() (based on FFTW) and pyfftw.interfaces.numpy_fft.fft() (also based on FFTW).
Here is the source of my test script:
import numpy as np
import anfft
import pyfftw
import time
a = pyfftw.n_byte_align_empty(128, 16, 'complex128')
a[:] = np.random.randn(128) + 1j*np.random.randn(128)
time0 = time.clock()
res1 = np.fft.fft(a)
time1 = time.clock()
res2 = anfft.fft(a)
time2 = time.clock()
res3 = pyfftw.interfaces.numpy_fft.fft(a,threads=50)
time3 = time.clock()
print 'Time numpy: %s' % (time1 - time0)
print 'Time anfft: %s' % (time2 - time1)
print 'Time pyfftw: %s' % (time3 - time2)
and I get these results:
Time numpy: 0.00154248116307
Time anfft: 0.0139805208195
Time pyfftw: 0.137729374893
The anfft library produces a much faster FFT on large data, but what about pyfftw? Why is it so slow here?
In this case, spawning more threads than you have CPU cores will not give an increase in performance, and will probably make the program slower due to the overhead of switching threads. 50 threads is complete overkill.
Try benchmarking with one thread.
The problem here is the overhead in using the numpy_fft interface. Firstly, you should enable the cache with pyfftw.interfaces.cache.enable(), and then test the result with timeit. Even using the cache there is a fixed overhead of using the interfaces that is not present if you use the raw interface.
On my machine, on a 128-length array, the overhead of the interface still slows it down more than numpy.fft. As the length increases, this overhead becomes less important, so on say a 16000-length array, the numpy_fft interface is faster.
There are tweaks you can invoke to speed things up on the interfaces end, but these are unlikely to make much difference in your case.
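As a rough sketch of that kind of comparison (the array length and repeat count here are arbitrary, and the exact numbers will vary by machine and pyfftw build):
from timeit import timeit

setup = """
import numpy as np
import pyfftw
pyfftw.interfaces.cache.enable()   # keep the interface's FFTW objects alive between calls
a = pyfftw.n_byte_align_empty(128, 16, 'complex128')
a[:] = np.random.randn(128) + 1j*np.random.randn(128)
"""
print(timeit('pyfftw.interfaces.numpy_fft.fft(a)', setup, number=1000))
print(timeit('np.fft.fft(a)', setup, number=1000))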
The best way to get the fastest possible transform in all situations is to use the FFTW object directly, and the easiest way to do that is with the builders functions. In your case:
t = pyfftw.builders.fft(a)
timeit t()
With that I get pyfftw being about 15 times faster than np.fft with a 128 length array.
It might be that pyFFTW is actually spending most of its time planning the transform. Try including for example planner_effort='FFTW_ESTIMATE' in the pyfftw fft call, and see how that affects the performance.
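For example, combining this with the single-thread suggestion above (both keyword arguments are accepted by the interfaces functions):
res3 = pyfftw.interfaces.numpy_fft.fft(a, threads=1, planner_effort='FFTW_ESTIMATE')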
I'm relatively new to Dask. I'm trying to parallelize a "custom" function that doesn't use Dask containers. I would just like to speed up the computation. But my results are that when I try parallelizing with dask.delayed, it has significantly worse performance than running the serial version. Here is a minimal implementation demonstrating the issue (the code I actually want to do this with is significantly more involved :) )
import dask, time

def mysum(rng):
    # CPU intensive
    z = 0
    for i in rng:
        z += i
    return z

# serial
b = time.time(); zz = mysum(range(1, 1_000_000_000)); t = time.time() - b
print(f'time to run in serial {t}')

# parallel
ms_parallel = dask.delayed(mysum)
ss = []
ncores = 10
m = 100_000_000
for i in range(ncores):
    lower = m*i
    upper = (i+1) * m
    r = range(lower, upper)
    s = ms_parallel(r)
    ss.append(s)

j = dask.delayed(ss)
b = time.time(); yy = j.compute(); t = time.time() - b
print(f'time to run in parallel {t}')
Typical results are:
time to run in serial 55.682398080825806
time to run in parallel 135.2043571472168
It seems I'm missing something basic here.
You are running a pure CPU-bound computation in threads by default. Because of Python's Global Interpreter Lock (GIL), only one thread actually runs at a time. In short, you are only adding overhead to your original computation, due to thread switching and task execution.
To actually get faster for this workload, you should use dask-distributed. Just adding
import dask.distributed
client = dask.distributed.Client(threads_per_worker=1)
at the start of your script may well give you a decent speed up, since this invokes a certain number of processes, each with their own GIL. This scheduler becomes the default one just by creating it.
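A minimal sketch of how that might look applied to the example above (assuming dask.distributed is installed; the number of chunks and the chunk size are illustrative):
import dask
import dask.distributed

def mysum(rng):
    z = 0
    for i in rng:
        z += i
    return z

if __name__ == '__main__':
    # one single-threaded process per worker, each with its own GIL;
    # creating the Client makes it the default scheduler for .compute()
    client = dask.distributed.Client(threads_per_worker=1)
    tasks = [dask.delayed(mysum)(range(i * 100_000_000, (i + 1) * 100_000_000))
             for i in range(10)]
    print(sum(dask.compute(*tasks)))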
EDIT: ignore the following, I see you are already doing it :). Leaving it here for others, unless people want it gone.
The second problem, for dask, is the sheer number of tasks. For any task-execution system, there is an overhead associated with each task (and it is actually higher for distributed than for the default threads scheduler). You can get around it by computing batches of function calls per task. This is, in practice, what dask.array and dask.dataframe do: they operate on largeish pieces of the overall problem, so that the overhead becomes small compared to the useful CPU execution time.
I am trying to run parallel processes under python (on ubuntu).
I started using multiprocessing and it worked fine for simple examples.
Then came the pickle error, and so I switched to pathos. I got a little confused with the different options and so wrote a very simple benchmarking code.
import multiprocessing as mp
from pathos.multiprocessing import Pool as Pool1
from pathos.pools import ParallelPool as Pool2
from pathos.parallel import ParallelPool as Pool3
import time

def square(x):
    # calculate the square of the value of x
    return x*x

if __name__ == '__main__':

    dataset = range(0, 10000)

    start_time = time.time()
    for d in dataset:
        square(d)
    print('test with no cores: %s seconds' % (time.time() - start_time))

    nCores = 3
    print('number of cores used: %s' % (nCores))

    start_time = time.time()
    p = mp.Pool(nCores)
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with multiprocessing: %s seconds' % (time.time() - start_time))

    start_time = time.time()
    p = Pool1(nCores)
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with pathos multiprocessing: %s seconds' % (time.time() - start_time))

    start_time = time.time()
    p = Pool2(nCores)
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with pathos pools: %s seconds' % (time.time() - start_time))

    start_time = time.time()
    p = Pool3()
    p.ncpus = nCores
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with pathos parallel: %s seconds' % (time.time() - start_time))
I get about
- 0.001s with plain serial code, without parallel,
- 0.100s with multiprocessing option,
- 0.100s with pathos.multiprocessing,
- 4.470s with pathos.pools,
- an AssertionError with pathos.parallel
I copied how to use these various options from http://trac.mystic.cacr.caltech.edu/project/pathos/browser/pathos/examples.html
I understand that parallel processing takes longer than plain serial code for such a simple example. What I do not understand is the relative performance of pathos.
I checked discussions, but could not understand why pathos.pools is so much slower, and why I get an error (and so I am not sure what the performance of that last option would be).
I also tried with a simple square function, and for that even pathos.multiprocessing is much slower than multiprocessing.
Could someone explain the differences between these various options?
Additionally, I ran the pathos.multiprocessing option on a remote computer running CentOS, and performance there is about 10 times worse than with multiprocessing.
According to the company renting out the computer, it should work just like a home computer. I understand that it may be difficult to say much without more details on the machine, but if you have any ideas as to where this could come from, that would help.
I'm the pathos author. Sorry for the confusion. You are dealing with a mix of the old and new programming interface.
The "new" (suggested) interface is to use pathos.pools. The old interface links to the same objects, so it's really two ways to get to the same thing.
multiprocess.Pool is a fork of multiprocessing.Pool, with the only difference being that multiprocessing uses pickle and multiprocess uses dill. So, I'd expect the speed to be the same in most simple cases.
The above pool can also be found at pathos.pools._ProcessPool. pathos provides a small wrapper around several types of pools, with different backends, giving an extended functionality. The pathos-wrapped pool is pathos.pools.ProcessPool (and the old interface provides it at pathos.multiprocessing.Pool).
The preferred interface is pathos.pools.ProcessPool.
There's also the ParallelPool, which uses a different backend -- it uses ppft instead of multiprocess. ppft is "parallel python" which spawns python processes through subprocess and passes source code (with dill.source instead of serialized objects) -- it's intended for distributed computing, or when passing by source code is a better option.
So, pathos.pools.ParallelPool is the preferred interface, and pathos.parallel.ParallelPool (and a few other similar references in pathos) are hanging around for legacy reasons -- but they are the same object underneath.
In summary:
>>> import multiprocessing as mp
>>> mp.Pool()
<multiprocessing.pool.Pool object at 0x10fa6b6d0>
>>> import multiprocess as mp
>>> mp.Pool()
<multiprocess.pool.Pool object at 0x11000c910>
>>> import pathos as pa
>>> pa.pools._ProcessPool()
<multiprocess.pool.Pool object at 0x11008b0d0>
>>> pa.multiprocessing.Pool()
<multiprocess.pool.Pool object at 0x11008bb10>
>>> pa.pools.ProcessPool()
<pool ProcessPool(ncpus=4)>
>>> pa.pools.ParallelPool()
<pool ParallelPool(ncpus=*, servers=None)>
You can see that the ParallelPool has servers... it is thus intended for distributed computing.
The only remaining question is: why the AssertionError? Well, that is because the wrapper that pathos adds keeps a pool object available for reuse. Hence, when you call the ParallelPool a second time, you are calling a closed pool. You'd need to restart the pool to enable it to be used again.
>>> f = lambda x:x
>>> p = pa.pools.ParallelPool()
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.restart() # throws AssertionError w/o this
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.clear() # destroy the saved pool
The ProcessPool has the same interface as ParallelPool, with respect to restarting and clearing saved instances.
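For completeness, a sketch of the same restart/clear pattern with the preferred ProcessPool (the session below is illustrative, not a verbatim transcript):
>>> import pathos as pa
>>> p = pa.pools.ProcessPool(ncpus=4)
>>> p.map(lambda x: x*x, range(5))
[0, 1, 4, 9, 16]
>>> p.close(); p.join()
>>> p.restart()   # same as with ParallelPool: restart a closed pool before reuse
>>> p.map(lambda x: x*x, range(5))
[0, 1, 4, 9, 16]
>>> p.close(); p.join()
>>> p.clear()     # destroy the saved pool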
Could someone explain the differences?
Let's start from some common ground.
The standard Python interpreter uses GIL-stepped code execution: within one process, only one thread executes Python bytecode at a time. This means that all thread-based pools still wait for a GIL-stepped ordering of all code-execution paths, so any attempt constructed that way will not enjoy the benefits "theoretically expected".
The Python interpreter may instead launch multiple process-based instances, each having its own GIL lock, forming a pool of multiple, truly concurrent code-execution paths.
Having settled this principal disambiguation, the performance-related questions come next. The most responsible approach is to benchmark, benchmark, benchmark. No exception here.
What takes so much time here (and where)?
The major (constant) part is primarily a [TIME]-domain cost: process instantiation. A complete replica of the calling Python interpreter, including all variables, all memory maps, indeed a full state-full copy, has to be created and placed onto the operating system's process-scheduler table before any further (useful part of the job) computing can take place "inside" such a successfully instantiated sub-process. If your payload function returns almost immediately, having produced just an x*x, your code has burnt all that fuel for a few CPU instructions and you have spent far more than you received in return. The economy of costs works against you, as the process-instantiation plus process-termination costs are far higher than a few CPU clock ticks.
How long does this actually take?
You can benchmark this (as proposed here, in a Test-Case-A). Once stopwatch-measured [us] decide, you start to rely on facts more than on any sort of wannabe-guru or marketing advice. That's fair, isn't it?
Test-Case-A benchmarks process-instantiation costs [MEASURED]. What next?
The next, most dangerous (variable in size) part is primarily a [SPACE]-domain cost, yet it also has a [TIME]-domain impact once the [SPACE]-allocation costs grow beyond small-footprint scales.
This sort of add-on overhead is related to any need to pass "large"-sized parameters from the "main" Python interpreter to each and every of the (distributed) sub-process instances.
How long does this take?
Again: benchmark, benchmark, benchmark. You can benchmark this too (as proposed here, by extending the there-proposed Test-Case-C, replacing the aNeverConsumedPAR parameter with some indeed "fat" chunk of data, be it a numpy.ndarray() or any other type bearing a huge memory footprint).
This way, the real hardware-related + O/S-related + Python-related data-flow costs become visible and measurable in such a benchmark, as additional overhead costs in [us]. This is nothing new to old hackers, yet people who have never seen HDD write times grow to block other processing for many seconds or minutes will hardly believe it without touching, through their own benchmarking, the real costs of data flow. So do not hesitate to extend the benchmark Test-Case-C to indeed large memory footprints, to smell the smoke...
Last, but not least, the re-formulated Amdahl's Law will tell...
Given that an attempt to parallelise some computation is well understood, both as per the computing part and as per all the overhead part(s), the picture starts to get complete:
The overhead-strict and resources-aware Amdahl's Law re-formulation shows:
S = 1 / ( s + pSO + max( ( 1 - s ) / N, atomicP ) + pTO )

where s, ( 1 - s ), N, pSO and pTO have been defined in the overhead-strict formulation of the Law, and

atomicP := the further-indivisible duration of an atomic block of processing
The resulting speedup S will always suffer from high overhead costs pSO + pTO, just as it stops improving once a high enough atomicP prevents an arbitrarily large N from helping any further.
In all these cases the final speedup S may easily fall below 1.0, i.e. below a pure-[SERIAL] code-execution schedule. Having benchmarked the real costs of pSO and pTO (for which Test-Case-A plus the extended Test-Case-C were schematically proposed), there is a chance to derive the minimum reasonable computing payload needed to remain above the mystic level of a speedup >= 1.0.
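A tiny sketch that simply evaluates the reconstructed formula with made-up overhead fractions, to show how easily S falls below useful levels:
def speedup(s, pSO, pTO, N, atomicP=0.0):
    # overhead-strict Amdahl speedup, as reconstructed above
    # s       : serial fraction of the original workload
    # pSO/pTO : parallel setup / teardown overheads (as fractions of the original run time)
    # atomicP : smallest further-indivisible block of the parallel part (as a fraction)
    return 1.0 / (s + pSO + max((1.0 - s) / N, atomicP) + pTO)

# made-up numbers: heavy per-process overheads erase the benefit of N = 8 ...
print(speedup(s=0.05, pSO=0.50, pTO=0.50, N=8))   # ~0.86 -> slower than pure-[SERIAL]
# ... while small overheads let the same N = 8 pay off
print(speedup(s=0.05, pSO=0.01, pTO=0.01, N=8))   # ~5.3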
I am trying to get to grips with multiprocessing in Python. I started by creating this code. It simply computes cos(i) for integers i and measures the time taken when one uses multiprocessing and when one does not. I am not observing any time difference. Here is my code:
import multiprocessing
from multiprocessing import Pool
import numpy as np
import time

def tester(num):
    return np.cos(num)

if __name__ == '__main__':

    starttime1 = time.time()
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size)
    pool_outputs = pool.map(tester, range(5000000))
    pool.close()
    pool.join()
    endtime1 = time.time()
    timetaken = endtime1 - starttime1

    starttime2 = time.time()
    for i in range(5000000):
        tester(i)
    endtime2 = time.time()
    timetaken2 = timetaken = endtime2 - starttime2

    print('The time taken with multiple processes:', timetaken)
    print('The time taken the usual way:', timetaken2)
I am observing no (or very minimal) difference between the two times measured. I am using a machine with 8 cores, so this is surprising. What have I done incorrectly in my code?
Note that I learned all of this from this: http://pymotw.com/2/multiprocessing/communication.html
I understand that "joblib" might be more convenient for an example like this, but the ultimate thing that this needs to be applied to does not work with "joblib".
Your job seems to be the computation of a single cos value per task. That is going to be basically unnoticeable compared to the time spent communicating with the worker process.
Try making 5 computations of 1,000,000 cos values each and you should see them run in parallel.
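A minimal sketch of that suggestion (the chunk boundaries and use of numpy arrays are my own choice): each task now carries a million cos evaluations, so the per-task communication cost is amortised.
import numpy as np
from multiprocessing import Pool

def tester_chunk(bounds):
    # one task = one big chunk of work, so the IPC cost is paid only 5 times
    start, stop = bounds
    return np.cos(np.arange(start, stop))

if __name__ == '__main__':
    chunks = [(i * 1000000, (i + 1) * 1000000) for i in range(5)]
    with Pool() as pool:
        results = pool.map(tester_chunk, chunks)
    print(len(results), 'chunks computed')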
First, you wrote :
timetaken2 = timetaken = endtime2 - starttime2
So it is normal that the two displayed times are the same. But this is not the important part.
I ran your code on my computer (i7, 4 cores), and I get:
('The time taken with multiple processes:', 14.95710802078247)
('The time taken the usual way:', 6.465447902679443)
The multiprocessed loop is slower than doing the for loop. Why?
The multiprocessing module side-steps the Global Interpreter Lock by using multiple processes, but processes do not share memory. So when you launch a Pool, the useful variables have to be copied into each process, the calculation carried out, and the results retrieved. This costs a little time for every process, and makes you less efficient.
But this hurts here because you are doing a very small computation: multiprocessing is only useful for larger calculations, where copying the memory and retrieving the results is cheaper (in time) than the calculation itself.
I tried with the following tester, which is much more expensive, over 2000 runs:
def expenser_tester(num):
    A = np.random.rand(10*num)     # creation of a random 1D array
    for k in range(0, len(A)-1):   # some useless but costly operation
        A[k+1] = A[k]*A[k+1]
    return A
('The time taken with multiple processes:', 4.030329942703247)
('The time taken the usual way:', 8.180987119674683)
You can see that for an expensive calculation it is more efficient with multiprocessing, even if you don't always get what you might expect (I could have hoped for a x4 speedup, but I only got x2).
Keep in mind that Pool has to duplicate every bit of memory used in the calculation, so it may be memory-expensive.
If you really want to improve a small calculation like your example, make it big by grouping and sending a list of variables to the pool instead of one variable per process.
You should also know that numpy and scipy have a lot of expensive functions written in C/Fortran that are already parallelized, so there is not much you can do to speed them up.
If the problem is CPU-bound then you should see the expected speed-up (provided the operation is long enough and the overhead is not significant). But with multiprocessing (because memory is not shared between processes) it is easier to end up with a memory-bound problem.
I'm trying to parallelize some calculations that use numpy with the help of Python's multiprocessing module. Consider this simplified example:
import time
import numpy
from multiprocessing import Pool

def test_func(i):
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)
    for i in range(2000):
        a = a + b
        b = a - b
        a = a - b
    return 1

t1 = time.time()
test_func(0)
single_time = time.time() - t1
print("Single time:", single_time)

n_par = 4
pool = Pool()

t1 = time.time()
results_async = [
    pool.apply_async(test_func, [i])
    for i in range(n_par)]
results = [r.get() for r in results_async]
multicore_time = time.time() - t1

print("Multicore time:", multicore_time)
print("Efficiency:", single_time / multicore_time)
When I execute it, the multicore_time is roughly equal to single_time * n_par, while I would expect it to be close to single_time. Indeed, if I replace numpy calculations with just time.sleep(10), this is what I get — perfect efficiency. But for some reason it does not work with numpy. Can this be solved, or is it some internal limitation of numpy?
Some additional info which may be useful:
- I'm using OSX 10.9.5, Python 3.4.2, and the CPU is a Core i7 with (as reported by the system info) 4 cores (although the above program only takes 50% of CPU time in total, so the system info may not be taking hyperthreading into account).
- When I run this I see n_par processes in top working at 100% CPU.
- If I replace the numpy array operations with a loop and per-index operations, the efficiency rises significantly (to about 75% for n_par = 4).
It looks like the test function you're using is memory-bound. That means that the run time you're seeing is limited by how fast the computer can pull the arrays from memory into cache. For example, the line a = a + b actually involves 3 arrays: a, b, and a new array that will replace a. These three arrays are about 8 MB each (1e6 floats * 8 bytes per float). I believe the various i7s have something like 3 MB - 8 MB of shared L3 cache, so you cannot fit all 3 arrays in cache at once. Your CPU adds the floats faster than the arrays can be loaded into cache, so most of the time is spent waiting for data to be read from memory. Because the cache is shared between the cores, you don't see any speedup by spreading the work onto multiple cores.
Memory bound operations are an issue for numpy in general and the only way I know to deal with them is to use something like cython or numba.
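As one hedged illustration of that route, here is a numba sketch that fuses the three updates into a single pass over the arrays (assuming numba is installed; fused_swap is a made-up name, and it relies on the observation that the three-step update simply swaps a and b):
import numpy as np
import numba

@numba.njit
def fused_swap(a, b, iters):
    # Same arithmetic as a = a+b; b = a-b; a = a-b, but fused element-wise:
    # no large temporaries are allocated and the three updates happen while
    # the values sit in registers, instead of streaming full arrays three times.
    for _ in range(iters):
        for i in range(a.shape[0]):
            s = a[i] + b[i]
            new_b = s - b[i]   # old a[i]
            a[i] = s - new_b   # old b[i]
            b[i] = new_b

a = np.random.normal(size=1000000)
b = np.random.normal(size=1000000)
fused_swap(a, b, 2000)   # net effect: a and b swapped, as in the original loop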
One easy thing that should bump efficiency up is to do in-place array operations where possible: add(a, b, a) will not create a new array, while a = a + b will. If your for loop over numpy arrays could be rewritten as vector operations, that should be more efficient as well. Another possibility would be to use numpy.ctypeslib to enable shared-memory numpy arrays (see: https://stackoverflow.com/a/5550156/2379433).
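A sketch of the in-place variant of the test function from the question (same arithmetic, but no temporary arrays are allocated):
import numpy

def test_func_inplace(i):
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)
    for _ in range(2000):
        numpy.add(a, b, out=a)        # a = a + b, written back into a
        numpy.subtract(a, b, out=b)   # b = a - b, written back into b
        numpy.subtract(a, b, out=a)   # a = a - b, written back into a
    return 1

test_func_inplace(0)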
I have been programming numerical methods for mathematics and ran into the same problem: I wasn't seeing any speed-up for a supposedly CPU-bound problem. It turns out my problem was hitting the CPU cache memory limit.
I have been using Intel PCM (Intel® Performance Counter Monitor) to see how the CPU cache was behaving (displaying it inside Linux ksysguard). I also disabled 2 of my processors to get clearer results (2 remain active).
Here is what I have found out with this code:
import time
import numpy as np
import multiprocessing as mp

def somethinglong(b):
    n = 200000
    m = 5000
    shared = np.arange(n)
    for i in np.arange(m):
        0.01*shared

pool = mp.Pool(2)
jobs = [() for i in range(8)]
for i in range(5):
    timei = time.time()
    pool.map(somethinglong, jobs, chunksize=1)
    #for job in jobs:
    #    somethinglong(job)
    print(time.time()-timei)
Example that doesn't reach the cache memory limit:
n=10000
m=100000
Sequential execution: 15s
2 processor pool no cache memory limit: 8s
It can be seen that there are no cache misses (all cache hits), therefore the speed-up is almost perfect: 15/8.
[Screenshot: memory cache hits with the 2-process pool]
Example that reaches the cache memory limit:
n=200000
m=5000
Sequential execution: 14s
2 processor pool cache memory limit: 14s
In this case, I increased the size of the vector we operate on (and decreased the loop count, to keep execution times reasonable). Here the cache gets full and the processes constantly miss it, so we get no speedup: 14/14.
[Screenshot: memory cache misses with the 2-process pool]
Observation: assigning the result of an operation to a variable (aux = 0.01*shared) also uses cache memory and can make the problem memory-bound (without increasing any vector size).
What do profile and cProfile give me versus something like this:
import time
import functools

def time_this(func):
    @functools.wraps(func)
    def what_time_is_it(*args, **kwargs):
        start_time = time.clock()
        print 'STARTING TIME: %f' % start_time
        result = func(*args, **kwargs)
        end_time = time.clock()
        print 'ENDING TIME: %f' % end_time
        print 'TOTAL TIME: %f' % (end_time - start_time)
        return result
    return what_time_is_it
I am asking because writing a decorator like this seems easier and clearer to me. I recognize that profile/cProfile attempt to estimate bytecode compilation time and such (and subtract those times from the running time), so more specifically I am wondering:
a) When does compilation time become significant enough for such differences to matter?
b) How might I go about writing my own profiler that takes compilation time into account?
The profile module is slower than cProfile, but it does support threads.
cProfile is a lot faster, but AFAIK it won't profile threads (only the main one; the others will be ignored).
Profile and cProfile have nothing to do with estimating compilation time. They estimate run time.
Compilation time isn't a performance issue. Don't want your code to be compiled every time it's run? import it, and it will be saved as a .pyc, and only recompiled if you change it. It simply doesn't matter how long code takes to compile (it's very fast) since this doesn't have to be done every time it's run.
If you want to time compilation, you can use the compiler package.
Basically:
from timeit import timeit
print timeit('compiler.compileFile(%r)' % filename, 'import compiler', number=100)
will print the time it takes to compile filename 100 times.
If inside func, you append to some lists, do some addition, look up some variables in dictionaries, profile will tell you how long each of those things takes.
Your version doesn't tell you any of those things. It's also pretty inaccurate -- the time you get depends on the time it takes to look up the clock attribute of time and then call it.
If what you want is to time a short section of code, use timeit. If you want to profile code, use profile or cProfile. If all you want to know is how long some arbitrary code took to run, but not which parts of it were the slowest, then your version is fine, as long as the code doesn't take just a few milliseconds.
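For reference, a minimal sketch of both tools on a made-up function:
import cProfile
import timeit

def work():
    return sum(i * i for i in range(100000))

# profile/cProfile: a per-function breakdown of where the time went
cProfile.run('work()')

# timeit: wall-clock time of a short snippet, repeated many times
print(timeit.timeit(work, number=100))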