pathos: parallel processing options - Could someone explain the differences?


I am trying to run parallel processes under python (on ubuntu).
I started using multiprocessing and it worked fine for simple examples.
Then came the pickle error, and so I switched to pathos. I got a little confused with the different options and so wrote a very simple benchmarking code.
import multiprocessing as mp
from pathos.multiprocessing import Pool as Pool1
from pathos.pools import ParallelPool as Pool2
from pathos.parallel import ParallelPool as Pool3
import time
def square(x):
    # calculate the square of the value of x
    return x*x
if __name__ == '__main__':
    dataset = range(0,10000)
    start_time = time.time()
    for d in dataset:
        square(d)
    print('test with no cores: %s seconds' %(time.time() - start_time))
    nCores = 3
    print('number of cores used: %s' %(nCores))
    start_time = time.time()
    p = mp.Pool(nCores)
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with multiprocessing: %s seconds' %(time.time() - start_time))
    start_time = time.time()
    p = Pool1(nCores)
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with pathos multiprocessing: %s seconds' %(time.time() - start_time))
    start_time = time.time()
    p = Pool2(nCores)
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with pathos pools: %s seconds' %(time.time() - start_time))
    start_time = time.time()
    p = Pool3()
    p.ncpus = nCores
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with pathos parallel: %s seconds' %(time.time() - start_time))
I get about
- 0.001s with plain serial code, without parallel,
- 0.100s with multiprocessing option,
- 0.100s with pathos.multiprocessing,
- 4.470s with pathos.pools,
- an AssertionError with pathos.parallel
I copied how to use these various options from http://trac.mystic.cacr.caltech.edu/project/pathos/browser/pathos/examples.html
I understand that parallel processing is longer than a plain serial code for such a simple example. What I do not understand is the relative performance of pathos.
I checked discussions, but could not understand why pathos.pools is so much longer, and why I get an error (not sure then what the performance of that last option would be).
I also tried with a simple square function, and for that even pathos.multiprocessing is much longer than multiprocessing
Could someone explain the differences between these various options?
Additionally, I ran the pathos.multiprocessing option on a remote computer running CentOS, and performance is about 10 times worse than with multiprocessing.
According to the company renting out the computer, it should work just like a home computer. I understand that it may be difficult to provide info without more details on the machine, but if you have any ideas as to where this could come from, that would help.

I'm the pathos author. Sorry for the confusion. You are dealing with a mix of the old and new programming interface.
The "new" (suggested) interface is to use pathos.pools. The old interface links to the same objects, so it's really two ways to get to the same thing.
multiprocess.Pool is a fork of multiprocessing.Pool, with the only difference being that multiprocessing uses pickle and multiprocess uses dill. So, I'd expect the speed to be the same in most simple cases.
The above pool can also be found at pathos.pools._ProcessPool. pathos provides a small wrapper around several types of pools, with different backends, giving an extended functionality. The pathos-wrapped pool is pathos.pools.ProcessPool (and the old interface provides it at pathos.multiprocessing.Pool).
The preferred interface is pathos.pools.ProcessPool.
There's also the ParallelPool, which uses a different backend -- it uses ppft instead of multiprocess. ppft is "parallel python" which spawns python processes through subprocess and passes source code (with dill.source instead of serialized objects) -- it's intended for distributed computing, or when passing by source code is a better option.
So, pathos.pools.ParallelPool is the preferred interface, and pathos.parallel.ParallelPool (and a few other similar references in pathos) are hanging around for legacy reasons -- but they are the same object underneath.
In summary:
>>> import multiprocessing as mp
>>> mp.Pool()
<multiprocessing.pool.Pool object at 0x10fa6b6d0>
>>> import multiprocess as mp
>>> mp.Pool()
<multiprocess.pool.Pool object at 0x11000c910>
>>> import pathos as pa
>>> pa.pools._ProcessPool()
<multiprocess.pool.Pool object at 0x11008b0d0>
>>> pa.multiprocessing.Pool()
<multiprocess.pool.Pool object at 0x11008bb10>
>>> pa.pools.ProcessPool()
<pool ProcessPool(ncpus=4)>
>>> pa.pools.ParallelPool()
<pool ParallelPool(ncpus=*, servers=None)>
You can see the ParallelPool has servers... thus is intended for distributed computing.
The only remaining question is why the AssertionError? Well that is because the wrapper that pathos adds keeps a pool object available for reuse. Hence, when you call the ParallelPool a second time, you are calling a closed pool. You'd need to restart the pool to enable it to be used again.
>>> f = lambda x:x
>>> p = pa.pools.ParallelPool()
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.restart() # throws AssertionError w/o this
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.clear() # destroy the saved pool
The ProcessPool has the same interface as ParallelPool, with respect to restarting and clearing saved instances.

Could someone explain the differences?
Let's start from some common ground.
The standard Python interpreter steps code execution through a Global Interpreter Lock (GIL). All thread-based pools therefore still wait for their turn on the GIL, so any such attempt will not enjoy the benefits "theoretically expected".
Python can instead launch several process-based instances, each having its own GIL, forming a pool of multiple, concurrent code-execution paths.
With that principal disambiguation out of the way, the performance-related questions come next. The most responsible approach is to benchmark, benchmark, benchmark. No exception here.
Where does all that time go?
The major (constant) part is a primarily [TIME]-domain cost of process instantiation. A complete replica of the Python interpreter, including all variables and all memory maps, indeed a complete stateful copy of the calling Python interpreter, has to be created and placed onto the operating system's process-scheduler table before any useful computing "inside" such a sub-process can take place. If your payload function returns immediately, having computed just an x*x, your code has burnt all that fuel for a few CPU instructions and you have spent far more than you received in return. The economy of costs goes against you, as the process-instantiation plus process-termination costs are far higher than a few CPU clock ticks.
How long does this actually take?
You can benchmark this (as proposed here, in a proposed Test-Case-A). Once Stopwatch()-measured microseconds [us] decide, you start to rely on facts more than on any sort of wannabe-guru or marketing type of advice. That's fair, isn't it?
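The linked Test-Case-A is not reproduced in this thread, so here is a stand-in sketch that only uses the standard library. It assumes that the time to launch a fresh, do-nothing Python interpreter through subprocess is a fair upper-end proxy for process-instantiation cost (a fork-based pool worker is usually cheaper to create), and compares it with the cost of one in-process square(x) call. The function names are made up for this illustration.

```python
import subprocess
import sys
import time
import timeit


def interpreter_start_seconds(repeats=3):
    """Best-of-N wall time to start a fresh Python interpreter that does nothing."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        best = min(best, time.perf_counter() - start)
    return best


def square_call_seconds(number=100_000):
    """Average wall time of a single in-process square(x) call."""
    def square(x):
        return x * x
    return timeit.timeit(lambda: square(12345), number=number) / number


start_cost = interpreter_start_seconds()
call_cost = square_call_seconds()
print(f"fresh interpreter start: {start_cost * 1e6:12.1f} us")
print(f"one square(x) call:      {call_cost * 1e6:12.3f} us")
print(f"calls needed to amortize one start: {start_cost / call_cost:,.0f}")
```

The last printed number is a rough lower bound on how much useful work each process needs before its creation cost stops dominating.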
Test-Case-A benchmarks the process-instantiation costs [MEASURED]. What next?
The next most dangerous (variable in size) part is a primarily [SPACE]-domain cost, which also has a [TIME]-domain impact once [SPACE]-allocation costs grow beyond small-footprint scales.
This sort of add-on overhead comes from any need to pass "large"-sized parameters from the "main" Python interpreter to each and every one of the (distributed) sub-process instances.
How long does this take?
Again, benchmark, benchmark, benchmark. Benchmark this as proposed here, by extending the Test-Case-C proposed there: replace the aNeverConsumedPAR parameter with a truly "fat" chunk of data, be it a numpy.ndarray() or another type with a huge memory footprint.
This way, the real hardware-related + O/S-related + Python-related data-flow costs become visible and get measured in such a benchmark as additional overhead costs in [us]. This is nothing new to ol' hackers, yet people who have never seen HDD write times grow to block other processing for many seconds or minutes will hardly believe it until their own benchmarking touches the real costs of data flow. So, do not hesitate to extend the benchmark Test-Case-C to truly large memory footprints to smell the smoke ...
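Test-Case-C is likewise not reproduced here. As a rough stand-in, the sketch below measures only the serialization part of the data-flow cost: since multiprocessing passes arguments via pickle, it times one pickle.dumps plus pickle.loads round trip for a small and a large payload. The helper name transfer_cost is an assumption of this example, and the cost of pushing the bytes through a pipe between processes is not included.

```python
import pickle
import time


def transfer_cost(payload):
    """Return (pickled size in bytes, seconds for one dumps + loads round trip)."""
    start = time.perf_counter()
    blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    pickle.loads(blob)
    return len(blob), time.perf_counter() - start


small = list(range(10))
large = list(range(1_000_000))

for name, payload in (("small", small), ("large", large)):
    size, seconds = transfer_cost(payload)
    print(f"{name}: {size:>10,d} bytes, {seconds * 1e6:>12.1f} us round trip")
```

This cost is paid for every task that carries the payload, so it scales with both the payload size and the number of tasks.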
Last, but not least, the re-formulated Amdahl's Law will tell ...
Once an attempt to parallelise some computation is well understood, both in its computing part and in all of its overhead part(s), the picture starts to get complete.
The overhead-strict and resources-aware re-formulation of Amdahl's Law shows:
$$
S = \frac{1}{s + \mathrm{pSO} + \max\left( \dfrac{1 - s}{N},\ \mathrm{atomicP} \right) + \mathrm{pTO}}
$$
where s, ( 1 - s ), pSO, pTO and N have been defined in just an Overhead-strict Law
and
atomicP := is a further indivisible duration of an atomic-process-block
The resulting speedup S will always suffer from high overhead costs pSO + pTO, just as it will when no N, however high, is allowed to help any further because of a high enough value of atomicP.
In all these cases the final speedup S may easily fall under << 1.0, yes, well under a pure-[SERIAL] code-execution schedule. Again, having benchmarked the real costs of pSO and pTO (for which Test-Case-A + Test-Case-C (extended) were schematically proposed), there comes a chance to derive the minimum reasonable computing payload needed so as to remain above the mystic level of a speedup >= 1.0.
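To make the formula concrete, here is a small sketch that evaluates it. It assumes all terms are expressed as fractions of the pure-serial run time (so the serial run takes 1.0), which the formula above does not state explicitly. The function and parameter names are chosen for this example only.

```python
def overhead_strict_speedup(s, n, pso=0.0, pto=0.0, atomic_p=0.0):
    """Speedup S = 1 / (s + pSO + max((1 - s) / N, atomicP) + pTO).

    s        : serial fraction of the work (0..1)
    n        : number of workers N
    pso, pto : setup / termination overheads (assumed: fractions of serial time)
    atomic_p : duration of the smallest indivisible block (same units)
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be within [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    parallel_part = max((1.0 - s) / n, atomic_p)
    return 1.0 / (s + pso + parallel_part + pto)


# Classic Amdahl case: no overheads, no atomicity limit.
print(overhead_strict_speedup(0.1, 4))
# Same job, but setup and termination overheads each cost half a serial run.
print(overhead_strict_speedup(0.1, 4, pso=0.5, pto=0.5))
# Many workers cannot help once an atomic block dominates the parallel part.
print(overhead_strict_speedup(0.0, 1000, atomic_p=0.25))
```

With the overheads of the second call the speedup drops below 1.0, which is the "slower than serial" situation seen in the question.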

Related

Python Pool Multiprocessing Poor CPU Usage

I have a bunch of independent N body sims I want to run in parallel in python. The walltime for individual sims is going to vary dramatically depending on the parameters of the bodies in the sims. It seemed like the best way to do this would be to build a pool of processes with the multiprocessing module, give them the sim jobs with the starmap() function, and have them save the results to separate files based on the process ID. However, I'm getting awful parallel performance. There is no speedup between 2 and 4 processes (I have 4 CPUs on my laptop) and the unix time utility seems to think that the CPU usage percentage is ~150% which is terrible. Below is my code:
import rebound
import numpy as np
import multiprocessing as mp
def two_orbits_one_pool(orbit1, orbit2):
    #######################################
    print('process number', mp.current_process().name)
    #######################################
    # build simulation
    sim = rebound.Simulation()
    # add sun
    sim.add(m=1.)
    # add two overlapping orbits
    sim.add(primary=sim.particles[0], m=orbit1['m'], a=orbit1['a'], e=orbit1['e'], inc=orbit1['i'], \
            pomega=orbit1['lop'], Omega=orbit1['lan'], M=orbit1['M'])
    sim.add(primary=sim.particles[0], m=orbit2['m'], a=orbit2['a'], e=orbit2['e'], inc=orbit2['i'], \
            pomega=orbit2['lop'], Omega=orbit2['lan'], M=orbit2['M'])
    sim.move_to_com()
    # integrate for 10 orbits of orbit1
    P = 2.*np.pi * np.sqrt(orbit1['a']**3)
    sim.automateSimulationArchive("archive-{}.bin".format(mp.current_process().name), interval=P)
    sim.integrate(10.*P)
if __name__ == "__main__":
    # orbit definitions
    N_M = 10
    N_lop = 10
    m = 1e-6
    a, e = 1., 0.3
    inc, lop, lan = 0., 0., 0.
    M = np.linspace(0., 2*np.pi, endpoint=False, num=N_M)
    dlop = np.linspace(0., 0.05, num=N_lop)
    # orbit dictionaries
    args = []
    for i in range(dlop.shape[0]):
        for j in range(M.shape[0]):
            for k in range(M.shape[0]):
                args.append( ( {'m':m, 'a':a, 'e':e, 'i':inc, \
                                'lop':lop, 'lan':lan, 'M':M[j]},
                               {'m':m, 'a':a, 'e':e, 'i':inc, \
                                'lop':lop+dlop[i], 'lan':lan, 'M':M[k]} ) )
    # fill the pool with orbit jobs
    with mp.Pool() as pool:
        pool.starmap(two_orbits_one_pool, args)
Could someone explain why this is performing so poorly? I'm much more used to OpenMP and MPI; I'm not that familiar with parallel programming in Python. Overall, I've been quite disappointed in the multiprocessing module. I think maybe I should try using the numba module instead?
EDIT:
In response to Roland Smith's response, I profiled the integration and save time for my code. Here is a stripplot showing the results. As you can see, both Roland Smith's and J_H's suggestions were true. There is a subset of initial conditions that result in extremely long integration times due to close encounters between the bodies. However, in general, the save time was about 5 times longer than the integration time. The job suffers from stragglers and is disk i/o bound.

If there is no discernable speedup, then probably your code is not CPU-bound.
In general, writing to a disk (even an SSD) is much slower than running code on the CPU.
If several worker processes are writing significant amounts of data to disk, that might be the bottleneck.
To diagnose the problem, you have to measure.
You should separate the calculations from the saving of the data; e.g. run sim.integrate() followed by sim.simulationarchive_snapshot() 10 times, and sandwich each of those calls between time.monotonic() calls. Then return the average time of the integration step and the snapshot steps as shown below.
import time
def two_orbits_one_pool(orbit1, orbit2):
    #######################################
    print('process number', mp.current_process().name)
    #######################################
    # build simulation
    sim = rebound.Simulation()
    # add sun
    sim.add(m=1.)
    # add two overlapping orbits
    sim.add(primary=sim.particles[0], m=orbit1['m'], a=orbit1['a'], e=orbit1['e'], inc=orbit1['i'], \
            pomega=orbit1['lop'], Omega=orbit1['lan'], M=orbit1['M'])
    sim.add(primary=sim.particles[0], m=orbit2['m'], a=orbit2['a'], e=orbit2['e'], inc=orbit2['i'], \
            pomega=orbit2['lop'], Omega=orbit2['lan'], M=orbit2['M'])
    sim.move_to_com()
    # integrate for 10 orbits of orbit1
    P = 2.*np.pi * np.sqrt(orbit1['a']**3)
    arname = "archive-{}.bin".format(mp.current_process().name)
    itime, stime = 0.0, 0.0
    for k in range(10):
        start = time.monotonic()
        sim.integrate(P)
        itime += time.monotonic() - start
        start = time.monotonic()
        sim.simulationarchive_snapshot(arname)
        stime += time.monotonic() - start
    return (mp.current_process().name, itime/10, stime/10)
# Run the calculations
with mp.Pool() as pool:
    data = pool.starmap(two_orbits_one_pool, args)
# Print the times that it took.
for name, itime, stime in data:
    print(f"worker {name}: itime {itime} s, stime {stime} s")
That should tell you what the bottleneck is.
Possible solutions if writing to disk is the bottleneck:
- Use an SSD to store the simulation results.
- Use a RAM-disk to store the simulation results. (Although compared to an SSD, this is not a huge performance boost.)
- Check if you can tune your OS for maximum write performance.
Edit1: Given your measurement result, the obvious performance improvement is to save less often.
Another option that might be worth looking at is staggering the writes. That only makes sense if there is significant overlap between the writes from different processes, and if those concurrent writes can saturate the disk I/O subsystem. So you'd have to measure that first.
If there is overlap, create a Lock object in the parent process. Then acquire the lock before (explicitly) saving, and release it after. This won't work with automateSimulationArchive.
A last option is to write your own save function using mmap. Using mmap is somewhat clunky compared to normal file handling in Python. But it can significantly improve performance. However I am unsure that the gains justify the effort in this case.
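The lock-based staggering described above could look roughly like the sketch below. It is not the rebound code from the question: compute_and_save and its byte payload are placeholders for the simulation and its explicit snapshot call, and init_worker is a name made up here. The one detail worth showing is that the Lock is handed to the pool workers through the Pool initializer rather than as a task argument.

```python
import multiprocessing as mp
import os
import tempfile

_write_lock = None  # set in every worker process by init_worker()


def init_worker(lock):
    """Pool initializer: keep the shared lock in a module-level global."""
    global _write_lock
    _write_lock = lock


def compute_and_save(task_id, out_dir):
    """Placeholder task: 'compute' outside the lock, write the file inside it."""
    payload = str(task_id * task_id).encode()  # stand-in for the simulation
    path = os.path.join(out_dir, f"result-{task_id}.bin")
    with _write_lock:  # only one worker writes to disk at a time
        with open(path, "wb") as fh:
            fh.write(payload)
    return path


if __name__ == "__main__":
    lock = mp.Lock()  # created once, in the parent process
    with tempfile.TemporaryDirectory() as out_dir:
        with mp.Pool(2, initializer=init_worker, initargs=(lock,)) as pool:
            paths = pool.starmap(compute_and_save,
                                 [(i, out_dir) for i in range(4)])
        print(sorted(os.path.basename(p) for p in paths))
```

Whether serializing the writes helps at all depends on the measurement suggested above; if the disk is not saturated, the lock only adds waiting.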

The straggler effect can have a big impact on such jobs.
straggler effect
Suppose you have N tasks for N cores,
and each task has a different duration.
Order by duration to find min_time and max_time.
All N cores will be busy up through min_time,
but then they go idle, one by one.
Just before max_time, only a single "straggler" core is being used.
predictions
If you can make a decent guess about task duration beforehand,
use that to sort them in descending order.
For T tasks > N cores, schedule the long tasks first.
Then N tasks run for a while, the shortest of those completes,
and the now-idle core picks up a task of "medium" duration.
By the time we get to the T-th task, each core has a random
amount of work still to do, and we're scheduling a "short" task.
So cores are mostly busy doing useful work, right up till near the end.
logging
If you cannot make a useful duration estimate a priori,
at least record the start times and durations.
Use that to figure out whether stragglers are causing you grief,
or if it's something else like L3 cache thrashing.
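The effect of scheduling long tasks first can be checked with a tiny simulation. This is a sketch under simplifying assumptions: task durations are known exactly, each task goes to whichever core becomes free first, and there is no dispatch overhead. The makespan function is a name chosen for this example.

```python
import heapq


def makespan(durations, n_cores):
    """Finish time when tasks are handed, in the given order, to the
    core that becomes free first (greedy list scheduling)."""
    free_at = [0.0] * n_cores  # time at which each core becomes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)  # earliest available core
        heapq.heappush(free_at, start + d)
    return max(free_at)


durations = [1, 1, 1, 1, 1, 1, 8]  # one straggler among short tasks
short_first = sorted(durations)
long_first = sorted(durations, reverse=True)

print("short tasks first:", makespan(short_first, 2))
print("long tasks first: ", makespan(long_first, 2))
```

With multiprocessing.Pool the submission order only has this effect when tasks are dispatched one at a time, for example with chunksize=1, since map otherwise hands out tasks in chunks.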

Dask delayed performance issues

I'm relatively new to Dask. I'm trying to parallelize a "custom" function that doesn't use Dask containers. I would just like to speed up the computation. But my results are that when I try parallelizing with dask.delayed, it has significantly worse performance than running the serial version. Here is a minimal implementation demonstrating the issue (the code I actually want to do this with is significantly more involved :) )
import dask,time
def mysum(rng):
    # CPU intensive
    z = 0
    for i in rng:
        z += i
    return z
# serial
b = time.time(); zz = mysum(range(1, 1_000_000_000)); t = time.time() - b
print(f'time to run in serial {t}')
# parallel
ms_parallel = dask.delayed(mysum)
ss = []
ncores = 10
m = 100_000_000
for i in range(ncores):
    lower = m*i
    upper = (i+1) * m
    r = range(lower, upper)
    s = ms_parallel(r)
    ss.append(s)
j = dask.delayed(ss)
b = time.time(); yy = j.compute(); t = time.time() - b
print(f'time to run in parallel {t}')
Typical results are:
time to run in serial 55.682398080825806
time to run in parallel 135.2043571472168
It seems I'm missing something basic here.

You are running a pure CPU-bound computation in threads by default. Because of python's Global Interpreter Lock (GIL), only one thread is actually running at a time. In short, you are only adding overhead to your original compute, due to thread switching and task executing.
To actually get faster for this workload, you should use dask-distributed. Just adding
import dask.distributed
client = dask.distributed.Client(threads_per_worker=1)
at the start of your script may well give you a decent speed up, since this invokes a certain number of processes, each with their own GIL. This scheduler becomes the default one just by creating it.
EDIT: ignore the following, I see you are already doing it :). Leaving here for others, unless people want it gone ...
The second problem, for dask, is the sheer number of tasks. For any task execution system, there is an overhead associated with each task (actually, this is higher for distributed than the default threads scheduler). You could get around it by computing batches of function calls per task. This is, in practice, what dask.array and dask.dataframe do: they operate on largeish pieces of the overall problem, such that the overhead becomes small compared to the useful CPU execution time.
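The GIL point at the start of this answer can be reproduced without dask. The sketch below runs the question's mysum on a much smaller range, once serially and once split across standard-library threads. On a regular GIL-enabled CPython build the threaded run is not expected to be faster; free-threaded builds may behave differently. No timings are claimed here, so run it to see your own numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def mysum(rng):
    # CPU intensive pure-Python loop, as in the question
    z = 0
    for i in rng:
        z += i
    return z


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


N = 2_000_000
N_THREADS = 4
step = N // N_THREADS
chunks = [range(k * step, (k + 1) * step) for k in range(N_THREADS)]


def threaded():
    with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
        return sum(pool.map(mysum, chunks))


serial_total, serial_s = timed(lambda: mysum(range(N)))
threaded_total, threaded_s = timed(threaded)

print(f"serial:   {serial_s:.3f} s")
print(f"threaded: {threaded_s:.3f} s")
```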

Why is multiprocessing slower here?

I am trying to speed up some code with multiprocessing in Python, but I cannot understand one point. Assume I have the following dumb function:
import time
from multiprocessing.pool import Pool
def foo(_):
    for _ in range(100000000):
        a = 3
When I run this code without using multiprocessing (see the code below) on my laptop (Intel - 8 cores cpu) time taken is ~2.31 seconds.
t1 = time.time()
foo(1)
print(f"Without multiprocessing {time.time() - t1}")
Instead, when I run this code by using Python multiprocessing library (see the code below) time taken is ~6.0 seconds.
pool = Pool(8)
t1 = time.time()
pool.map(foo, range(8))
print(f"Sample multiprocessing {time.time() - t1}")
To the best of my knowledge, when using multiprocessing there is some time overhead, mainly caused by the need to spawn the new processes and to copy the memory state. However, this operation should be performed just once, when the processes are initially spawned at the very beginning, and should not be that huge.
So what am I missing here? Is there something wrong in my reasoning?
Edit: I think it is better to be more explicit on my question. What I expected here was the multiprocessed code to be slightly slower than the sequential one. It is true that I don't split the whole work across the 8 cores, but I am using 8 cores in parallel to do the same job (hence in an ideal world the processing time should more or less stay the same). Considering the overhead of spawning new processes, I expected a total increase in time of some (not too big) percentage, but not of a ~2.60x increase as I got here.

Well, multiprocessing can't possibly make this faster: you're not dividing the work across 8 processes, you're asking each of 8 processes to do the entire thing. Each process will take at least as long as your code doing it just once without using multiprocessing.
So if multiprocessing weren't helping at all, you'd expect it to take about 8 times as long (it's doing 8x the work!) as your single-processor run. But you said it's not taking 2.31 * 8 ~= 18.5 seconds, but "only" about 6. So you're getting better than a factor of 3 speedup.
Why not more than that? Can't guess from here. That will depend on how many physical cores your machine has, and how much other stuff you're running at the same time. Each process will be 100% CPU-bound for this specific function, so the number of "logical" cores is pretty much irrelevant - there's scant opportunity for processor hyper-threading to help. So I'm guessing you have 4 physical cores.
On my box
Sample timing on my box, which has 8 logical cores but only 4 physical cores, and otherwise left the box pretty quiet:
Without multiprocessing 2.468580484390259
Sample multiprocessing 4.78624415397644
As above, none of that surprises me. In fact, I was a little surprised (but pleasantly) at how effectively the program used up the machine's true capacity.

@TimPeters already answered that you are actually just running the job 8 times across the 8 Pool subprocesses, so it is slower, not faster.
That answers the issue but does not really answer your real underlying question. It is clear from your surprise at this result that you were expecting the single job to somehow be automatically split up and run in parts across the 8 Pool processes. That is not the way it works. You have to build in, or tell it, how to split up the work.
Different kinds of jobs need to be subdivided in different ways, but to continue with your example you might do something like this:
import time
from multiprocessing.pool import Pool
def foo(_):
    for _ in range(100000000):
        a = 3
def foo2(job_desc):
    start, stop = job_desc
    print(f"{start}, {stop}")
    for _ in range(start, stop):
        a = 3
def main():
    t1 = time.time()
    foo(1)
    print(f"Without multiprocessing {time.time() - t1}")
    pool_size = 8
    pool = Pool(pool_size)
    t1 = time.time()
    top_num = 100000000
    size = top_num // pool_size
    job_desc_list = [[size * j, size * (j+1)] for j in range(pool_size)]
    # this is in case the upper bound is not a multiple of pool_size
    job_desc_list[-1][-1] = top_num
    pool.map(foo2, job_desc_list)
    print(f"Sample multiprocessing {time.time() - t1}")
if __name__ == "__main__":
    main()
Which results in:
Without multiprocessing 3.080709171295166
0, 12500000
12500000, 25000000
25000000, 37500000
37500000, 50000000
50000000, 62500000
62500000, 75000000
75000000, 87500000
87500000, 100000000
Sample multiprocessing 1.5312283039093018
As this shows, splitting the job up does allow it to take less time. The speedup will depend on the number of CPUs. In a CPU-bound job you should try to limit the pool size to the number of CPUs. My laptop has plenty more CPUs, but some of the benefit is lost to the overhead. If the jobs were longer, this should look more useful.

Using pool for multiprocessing in Python (Windows)

I have to do my study in a parallel way to run it much faster. I am new to multiprocessing library in python, and could not yet make it run successfully.
Here, I am investigating if each pair of (origin, target) remains at certain locations between various frames of my study. Several points:
It is one function, which I want to run faster (It is not several processes).
The process is performed sequentially; that is, each frame is compared with the previous one.
This code is a much simpler form of the original code. The code outputs a residence_list.
I am using Windows OS.
Can someone check the code (the multiprocessing section) and help me improve it to make it work. Thanks.
import numpy as np
from multiprocessing import Pool, freeze_support
def Main_Residence(total_frames, origin_list, target_list):
    Previous_List = {}
    residence_list = []
    for frame in range(total_frames): #Each frame
        Current_List = {} #Dict of pair and their residence for frames
        for origin in range(origin_list):
            for target in range(target_list):
                Pair = (origin, target) #Each pair
                if Pair in Current_List.keys(): #If already considered, continue
                    continue
                else:
                    if origin == target:
                        if (Pair in Previous_List.keys()): #If remained from the previous frame, add residence
                            print "Origin_Target remained: ", Pair
                            Current_List[Pair] = (Previous_List[Pair] + 1)
                        else: #If new, add it to the current
                            Current_List[Pair] = 1
        for pair in Previous_List.keys(): #Add those that exited from residence to the list
            if pair not in Current_List.keys():
                residence_list.append(Previous_List[pair])
        Previous_List = Current_List
    return residence_list
if __name__ == '__main__':
    pool = Pool(processes=5)
    Residence_List = pool.apply_async(Main_Residence, args=(20, 50, 50))
    print Residence_List.get(timeout=1)
    pool.close()
    pool.join()
    freeze_support()
    Residence_List = np.array(Residence_List) * 5

Multiprocessing does not make sense in the context you are presenting here.
You are creating five subprocesses (and three threads belonging to the pool, managing workers, tasks and results) to execute one function once. All of this is coming at a cost, both in system resources and execution time, while four of your worker processes don't do anything at all. Multiprocessing does not speed up the execution of a function. The code in your specific example will always be slower than plainly executing Main_Residence(20, 50, 50) in the main process.
For multiprocessing to make sense in such a context, your work at hand would need to be broken down to a set of homogenous tasks that can be processed in parallel with their results potentially being merged later.
As an example (not necessarily a good one), if you want to calculate the largest prime factors for a sequence of numbers, you can delegate the task of calculating that factor for any specific number to a worker in a pool. Several workers would then do these individual calculations in parallel:
from datetime import datetime
from multiprocessing import Pool
def largest_prime_factor(n):
    p = n
    i = 2
    while i * i <= n:
        if n % i:
            i += 1
        else:
            n //= i
    return p, n
if __name__ == '__main__':
    pool = Pool(processes=3)
    start = datetime.now()
    # this delegates half a million individual tasks to the pool, i.e.
    # largest_prime_factor(0), largest_prime_factor(1), ..., largest_prime_factor(499999)
    pool.map(largest_prime_factor, range(500000))
    pool.close()
    pool.join()
    print("pool elapsed", datetime.now() - start)
    start = datetime.now()
    # same work just in the main process
    [largest_prime_factor(i) for i in range(500000)]
    print("single elapsed", datetime.now() - start)
Output:
pool elapsed 0:00:04.664000
single elapsed 0:00:08.939000
(the largest_prime_factor function is taken from @Stefan in this answer)
As you can see, the pool is only roughly twice as fast as single process execution of the same amount of work, all while running in three processes in parallel. That's due to the overhead introduced by multiprocessing/the pool.
So, you stated that the code in your example has been simplified. You'll have to analyse your original code to see if it can be broken down to homogenous tasks that can be passed down to your pool for processing. If that is possible, using multiprocessing might help you speed up your program. If not, multiprocessing will likely cost you time, rather than save it.
Edit:
Since you asked for suggestions on the code. I can hardly say anything about your function. You said yourself that it is just a simplified example to provide an MCVE (much appreciated by the way! Most people don't take the time to strip down their code to its bare minimum). Requests for a code review are anyway better suited over at Codereview.
Play around a bit with the available methods of task delegation. In my prime factor example, using apply_async came with a massive penalty. Execution time increased ninefold, compared to using map. But my example is using just a simple iterable, yours needs three arguments per task. This could be a case for starmap, but that is only available as of Python 3.3. Anyway, the structure/nature of your task data basically determines the correct method to use.
I did some q&d testing with multiprocessing your example function.
The input was defined like this:
inp = [(20, 50, 50)] * 5000 # that makes 5000 tasks against your Main_Residence
I ran that in Python 3.6 in three subprocesses with your function unaltered, except for the removal of the print statement (I/O is costly). I used starmap, apply, starmap_async and apply_async, and also iterated through the results each time to account for the blocking get() on the async results.
Here's the output:
starmap elapsed 0:01:14.506600
apply elapsed 0:02:11.290600
starmap async elapsed 0:01:27.718800
apply async elapsed 0:01:12.571200
# btw: 5k calls to Main_Residence in the main process looks as bad
# as using apply for delegation
single elapsed 0:02:12.476800
As you can see, the execution times differ, although all four methods do the same amount of work; the apply_async you picked appears to be the fastest method.
Coding Style. Your code looks quite ... unconventional :) You use Capitalized_Words_With_Underscore for your names (both, function and variable names), that's pretty much a no-no in Python. Also, assigning the name Previous_List to a dictionary is ... questionable. Have a look at PEP 8, especially the section Naming Conventions to see the commonly accepted coding style for Python.
Judging by the way your print looks, you are still using Python 2. I know that in corporate or institutional environments that's sometimes all you have available. Still, keep in mind that the clock for Python 2 is ticking.

Multiprocessing in Python. Why is there no speed-up?

I am trying to get to grips with multiprocessing in Python. I started by creating this code. It simply computes cos(i) for integers i and measures the time taken when one uses multiprocessing and when one does not. I am not observing any time difference. Here is my code:
import multiprocessing
from multiprocessing import Pool
import numpy as np
import time
def tester(num):
    return np.cos(num)
if __name__ == '__main__':
    starttime1 = time.time()
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size,
                                )
    pool_outputs = pool.map(tester, range(5000000))
    pool.close()
    pool.join()
    endtime1 = time.time()
    timetaken = endtime1 - starttime1
    starttime2 = time.time()
    for i in range(5000000):
        tester(i)
    endtime2 = time.time()
    timetaken2 = timetaken = endtime2 - starttime2
    print( 'The time taken with multiple processes:', timetaken)
    print( 'The time taken the usual way:', timetaken2)
I am observing no (or very minimal) difference between the two times measured. I am using a machine with 8 cores, so this is surprising. What have I done incorrectly in my code?
Note that I learned all of this from this.
http://pymotw.com/2/multiprocessing/communication.html
I understand that "joblib" might be more convenient for an example like this, but the ultimate thing that this needs to be applied to does not work with "joblib".

Each job seems to be the computation of a single cos value. That is going to be basically unnoticeable compared to the time spent communicating with the worker process.
Try making 5 computations of 1000000 cos values and you should see them going in parallel.

First, you wrote :
timetaken2 = timetaken = endtime2 - starttime2
So it is normal if you have the same times displayed. But this is not the important part.
I ran your code on my computer (i7, 4 cores), and I get :
('The time taken with multiple processes:', 14.95710802078247)
('The time taken the usual way:', 6.465447902679443)
The multiprocessed loop is slower than doing the for loop. Why?
The multiprocessing module sidesteps the Python Global Interpreter Lock by using multiple processes, but separate processes do not share memory. So when you launch a Pool, the useful variables have to be copied to the workers, the calculation is processed there, and the results have to be retrieved. This costs you a little time for every task, and makes you less effective.
But this happens because you do a very small computation : multiprocessing is only useful for larger calculation, when the memory copying and results retrieving is cheaper (in time) than the calculation.
I tried with following tester, which is much more expensive, on 2000 runs:
def expenser_tester(num):
    A=np.random.rand(10*num) # creation of a random Array 1D
    for k in range(0,len(A)-1): # some useless but costly operation
        A[k+1]=A[k]*A[k+1]
    return A
('The time taken with multiple processes:', 4.030329942703247)
('The time taken the usual way:', 8.180987119674683)
You can see that on an expensive calculation, it is more efficient with the multiprocessing, even if you don't always have what you could expect (I could have a x4 speedup, but I only got x2)
Keep in mind that Pool has to duplicate every bit of memory used in calculation, so it may be memory expensive.
If you really want to improve a small calculation like your example, make it big by grouping the inputs and sending a list of variables to the pool per task, instead of one variable per task.
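A minimal sketch of that grouping idea, using math.cos instead of np.cos so that it only needs the standard library. The helpers cos_batch and make_batches are names invented for this example, and the batch and pool sizes are arbitrary.

```python
import math
from multiprocessing import Pool


def cos_batch(numbers):
    """Worker: handle a whole batch per task, so the communication
    cost is paid once per batch instead of once per number."""
    return [math.cos(x) for x in numbers]


def make_batches(n_items, n_batches):
    """Split range(n_items) into n_batches contiguous ranges of near-equal size."""
    bounds = [n_items * k // n_batches for k in range(n_batches + 1)]
    return [range(lo, hi) for lo, hi in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    batches = make_batches(200_000, 8)
    with Pool(processes=4) as pool:
        nested = pool.map(cos_batch, batches)
    # flatten the per-batch lists back into one result list
    results = [value for batch in nested for value in batch]
    print(len(results))
```

Pool.map also accepts a chunksize argument that groups items per task in a similar way, which may be enough when the worker function itself cannot be changed.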
You should also know that numpy and scipy have a lot of expensive functions written in C/Fortran, some of them already parallelized, so there is not much you can do to speed them up.

If the problem is CPU-bound then you should see the expected speed-up (if the operation is long enough and the overhead is not significant). But with multiprocessing (because memory is not shared between processes) it is easier to end up with a memory-bound problem.
