python - multiprocessing - static tree traversal - performance gain?

I have a node tree where every node has an id (node number), a list of children and a depth indicator. I am then given a list of nodes whose depth I am to find. To do this I use a recursive function.
This is all fine and dandy, but I want to speed the process up. I've been looking into multiprocessing, but every time I try it, the calculation time goes up (the higher the process count, the longer the runtime) compared to using no extra processes at all.
My code looks like junk from trying to understand a lot of different examples, so I'll post this pseudocode instead.
class Node:
    id = int
    children = int[]
    depth = int

function makeNodeTree() ...

function find(x, node):
    for c in node.children:
        if c.id == x: return c
        else:
            result = find(x, c)
            if result != None: return result
    return None
function main():
    search = [nodeid, nodeid, nodeid...]

    timerstart
    for x in search: find(x, rootNode)
    timerstop

    timerstart
    <split list over number of processes>
    <do some multiprocess magic>
    <get results>
    timerstop

    compare the two
I've tried all kinds of tree sizes to see if there is any gain at all, but I have yet to find such a case, which leads me to think I'm doing something wrong. I guess what I'm asking for is an example/way of doing this traversal with a performance gain, using multiprocessing.
I know there are plenty of ways to organize nodes to make this task easy, but I want to check the possible(?) performance boost, if it is possible at all.

Multiprocessing has overhead, because every time you add a process it takes time to set it up. Also, if you are using standard Python threads you are unlikely to get any speedup, because the GIL means all the threads will still run on one processor. So, three thoughts: (1) is your tree really so big that you need to speed this up? (2) spawn subprocesses rather than threads; (3) don't use parallelism at each node, just at the top few levels, to minimize overhead.
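To illustrate point (3), here is a minimal sketch of splitting the search only at the top level: one worker per child of the root, each running the plain sequential recursion on its own subtree. The `Node` layout and the `search_subtree`/`parallel_find` helpers are my own illustrative names, not code from the question.

```python
from multiprocessing import Pool

class Node:
    def __init__(self, id, depth):
        self.id = id
        self.depth = depth
        self.children = []

def find_depth(x, node):
    # plain sequential recursion: return the depth of node id x, or None
    if node.id == x:
        return node.depth
    for c in node.children:
        r = find_depth(x, c)
        if r is not None:
            return r
    return None

def search_subtree(args):
    # worker task: search one whole subtree for every target id
    targets, subtree = args
    return [find_depth(x, subtree) for x in targets]

def parallel_find(targets, root, processes=4):
    # split only at the top level, so process start-up and pickling
    # costs are paid once per subtree instead of once per node
    tasks = [(targets, child) for child in root.children]
    with Pool(processes) as pool:
        per_subtree = pool.map(search_subtree, tasks)
    # merge: take the first depth any subtree reported for each target
    merged = []
    for i, x in enumerate(targets):
        if x == root.id:
            merged.append(root.depth)
        else:
            merged.append(next((r[i] for r in per_subtree
                                if r[i] is not None), None))
    return merged
```

Even structured this way, pickling whole subtrees to the workers often dominates a pure in-memory traversal, which is consistent with the slowdown observed in the question.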

Related

Use multiprocessing for each recursion step

My requirement is to generate a list of permissible combinations. The code below is a simplified version which meets my need.
def getChild(tupF):
    if len(tupF) <= 60:
        for val in range(1, 10):  # in the actual requirement it is not a fixed range, but some complex processing to determine the list which needs to be appended
            t = list(tupF)  # I am converting the tuple to a list and, after appending, back to a tuple; when I just handled it as a list, somehow it didn't work
            t.append(val)
            getChild(tuple(t))
            t = []
    else:
        print(tupF)

tup = ()
getChild(tup)
But, as the number of levels is high (60) and each of my combinations is completely independent of the others, I would like to make this code a multiprocess one.
I tried adding
            t.append(val)
            tmpLst.append(tuple(t))
            t = []

if __name__ == '__main__':
    pool = Pool(processes=3)
    pool.map(getChild, tmpLst)
But this didn't work, as my worker process tries to sub-divide further. In my case, I don't think the sub-processes would explode: once the parent process has called a set of child processes, I am OK with terminating the parent, as all the desired information is in the tuple I am passing to each child process.
Please let me know whether this problem is a right candidate for multiprocessing; if yes, provide some guidance on how to make it multiprocess so that I can reduce the computation time. I have no prior experience in writing multiprocessing code, so if you can point to a relevant example, that would be great. Thanks.
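One hedged sketch of the fan-out this question is reaching for: generate only the first level of tuples in the parent, then let pool.map run the unchanged sequential recursion on each prefix in a worker. The depth limit is lowered from 60 to 4 here purely so the example terminates quickly, and the recursion collects results instead of printing; the overall structure is my assumption, not the questioner's actual code.

```python
from multiprocessing import Pool

MAX_LEN = 4  # the question uses 60; lowered so the sketch runs quickly

def getChild(tupF):
    # sequential recursion, collecting the finished combinations
    # into a list instead of printing them
    out = []
    if len(tupF) < MAX_LEN:
        for val in range(1, 10):
            out.extend(getChild(tupF + (val,)))
    else:
        out.append(tupF)
    return out

def run_parallel(processes=3):
    # fan out only at the first level: nine independent prefixes,
    # each worker runs the plain recursion on its own subtree
    prefixes = [(val,) for val in range(1, 10)]
    with Pool(processes) as pool:
        chunks = pool.map(getChild, prefixes)
    return [t for chunk in chunks for t in chunk]
```

Because each prefix is independent, the workers never need to sub-divide across process boundaries, which avoids the problem described above.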

Using pool for multiprocessing in Python (Windows)

I have to run my study in a parallel way to make it much faster. I am new to the multiprocessing library in Python and could not yet make it run successfully.
Here, I am investigating whether each pair of (origin, target) remains at certain locations between various frames of my study. Several points:
It is one function, which I want to run faster (it is not several processes).
The process is performed sequentially; that is, each frame is compared with the previous one.
This code is a much simplified form of the original code. The code outputs a residence_list.
I am using Windows OS.
Can someone check the code (the multiprocessing section) and help me improve it to make it work? Thanks.
import numpy as np
from multiprocessing import Pool, freeze_support

def Main_Residence(total_frames, origin_list, target_list):
    Previous_List = {}
    residence_list = []

    for frame in range(total_frames):                       # each frame
        Current_List = {}                                   # dict of pairs and their residence for frames
        for origin in range(origin_list):
            for target in range(target_list):
                Pair = (origin, target)                     # each pair
                if Pair in Current_List.keys():             # if already considered, continue
                    continue
                else:
                    if origin == target:
                        if (Pair in Previous_List.keys()):  # if remained from the previous frame, add residence
                            print "Origin_Target remained: ", Pair
                            Current_List[Pair] = (Previous_List[Pair] + 1)
                        else:                               # if new, add it to the current
                            Current_List[Pair] = 1
        for pair in Previous_List.keys():                   # add those that exited from residence to the list
            if pair not in Current_List.keys():
                residence_list.append(Previous_List[pair])
        Previous_List = Current_List
    return residence_list

if __name__ == '__main__':
    pool = Pool(processes=5)
    Residence_List = pool.apply_async(Main_Residence, args=(20, 50, 50))
    print Residence_List.get(timeout=1)
    pool.close()
    pool.join()
    freeze_support()
    Residence_List = np.array(Residence_List) * 5
Multiprocessing does not make sense in the context you are presenting here.
You are creating five subprocesses (and three threads belonging to the pool, managing workers, tasks and results) to execute one function once. All of this is coming at a cost, both in system resources and execution time, while four of your worker processes don't do anything at all. Multiprocessing does not speed up the execution of a function. The code in your specific example will always be slower than plainly executing Main_Residence(20, 50, 50) in the main process.
For multiprocessing to make sense in such a context, your work at hand would need to be broken down into a set of homogeneous tasks that can be processed in parallel, with their results potentially being merged later.
As an example (not necessarily a good one), if you want to calculate the largest prime factors for a sequence of numbers, you can delegate the task of calculating that factor for any specific number to a worker in a pool. Several workers would then do these individual calculations in parallel:
from datetime import datetime
from multiprocessing import Pool

def largest_prime_factor(n):
    p = n
    i = 2
    while i * i <= n:
        if n % i:
            i += 1
        else:
            n //= i
    return p, n

if __name__ == '__main__':
    pool = Pool(processes=3)
    start = datetime.now()
    # this delegates half a million individual tasks to the pool, i.e.
    # largest_prime_factor(0), largest_prime_factor(1), ..., largest_prime_factor(499999)
    pool.map(largest_prime_factor, range(500000))
    pool.close()
    pool.join()
    print "pool elapsed", datetime.now() - start

    start = datetime.now()
    # same work, just in the main process
    [largest_prime_factor(i) for i in range(500000)]
    print "single elapsed", datetime.now() - start
Output:
pool elapsed 0:00:04.664000
single elapsed 0:00:08.939000
(the largest_prime_factor function is taken from @Stefan in this answer)
As you can see, the pool is only roughly twice as fast as single-process execution of the same amount of work, even though it runs in three processes in parallel. That's due to the overhead introduced by multiprocessing/the pool.
So, you stated that the code in your example has been simplified. You'll have to analyse your original code to see if it can be broken down to homogenous tasks that can be passed down to your pool for processing. If that is possible, using multiprocessing might help you speed up your program. If not, multiprocessing will likely cost you time, rather than save it.
Edit:
Since you asked for suggestions on the code: I can hardly say anything about your function. You said yourself that it is just a simplified example to provide an MCVE (much appreciated, by the way! Most people don't take the time to strip their code down to its bare minimum). Requests for a code review are in any case better suited over at Code Review.
Play around a bit with the available methods of task delegation. In my prime factor example, using apply_async came with a massive penalty: execution time increased ninefold compared to using map. But my example uses just a simple iterable, while yours needs three arguments per task. This could be a case for starmap, but that is only available as of Python 3.3. Anyway, the structure/nature of your task data basically determines the correct method to use.
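For completeness, here is a minimal sketch of what the starmap variant looks like on Python 3.3+, with a trivial stand-in for Main_Residence that merely takes the same three arguments (the real function's body is irrelevant to the delegation mechanics):

```python
from multiprocessing import Pool

def task(total_frames, origin_list, target_list):
    # trivial stand-in for Main_Residence with the same signature
    return total_frames * origin_list * target_list

if __name__ == '__main__':
    inp = [(20, 50, 50)] * 10  # one 3-argument tuple per task
    with Pool(processes=3) as pool:
        # starmap unpacks each tuple into the three parameters,
        # so no wrapper function is needed (Python 3.3+)
        results = pool.starmap(task, inp)
    print(results[:3])
```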
I did some q&d testing with multiprocessing your example function.
The input was defined like this:
inp = [(20, 50, 50)] * 5000 # that makes 5000 tasks against your Main_Residence
I ran that in Python 3.6 in three subprocesses with your function unaltered, except for the removal of the print statement (I/O is costly). I used starmap, apply, starmap_async and apply_async, and also iterated through the results each time to account for the blocking get() on the async results.
Here's the output:
starmap elapsed 0:01:14.506600
apply elapsed 0:02:11.290600
starmap async elapsed 0:01:27.718800
apply async elapsed 0:01:12.571200
# btw: 5k calls to Main_Residence in the main process looks as bad
# as using apply for delegation
single elapsed 0:02:12.476800
As you can see, the execution times differ, although all four methods do the same amount of work; the apply_async you picked appears to be the fastest method.
Coding style. Your code looks quite ... unconventional :) You use Capitalized_Words_With_Underscores for your names (both function and variable names); that's pretty much a no-no in Python. Also, assigning the name Previous_List to a dictionary is ... questionable. Have a look at PEP 8, especially the section Naming Conventions, to see the commonly accepted coding style for Python.
Judging by the way your print looks, you are still using Python 2. I know that in corporate or institutional environments that's sometimes all you have available. Still, keep in mind that the clock for Python 2 is ticking.

Algorithm Complexity Analysis for Variable Length Queue BFS

I have developed an algorithm that is a variation of BFS on a tree, but it includes a probabilistic factor. To check whether a node is the one I am looking for, a statistical test is performed (I won't go into too much detail about this). If the test result is positive, the node is added to another queue (called tested). But when a node fails the test, the nodes in tested need to be tested again, so that queue is appended to the one with the nodes yet to be tested.
In Python, considering that the queue q starts with the root node:
...
tested = []
while q:
    curr = q.pop(0)
    p = statistical_test(curr)
    if p:
        tested.append(curr)
    else:
        q.extend(curr.children())
        q.extend(tested)
        tested = []
return tested
As the algorithm is probabilistic, more than one node might be in tested after the search, but that is expected. The problem I am facing is trying to estimate this algorithm's complexity because I can't simply use BFS's complexity as q and tested will have a variable length.
I don't need a closed and definitive answer for this. What I need are some insights on how to deal with this situation.
The worst-case scenario is the following process:
All elements 1 .. n-1 pass the test and are appended to the tested queue.
Element n fails the test, is removed from q, and the n-1 elements from tested are pushed back into q.
Go back to step 1 with n := n-1.
This is a classic O(n²) process.
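That sum, n + (n-1) + ... + 1 = n(n+1)/2 test invocations, can be checked with a small simulation of the worst case. The stand-in test below (the largest element still in play is the one that fails) is my own construction of the adversarial scenario, not part of the original algorithm:

```python
def worst_case_tests(n):
    # simulate the worst case sketched above: every element passes the
    # stand-in test except the largest one still in play, whose failure
    # flushes the whole `tested` queue back into q
    q = list(range(1, n + 1))
    tested = []
    tests = 0
    while q:
        curr = q.pop(0)
        tests += 1
        if curr != max([curr] + q + tested):
            tested.append(curr)   # test passed
        else:
            q.extend(tested)      # test failed: everything must be re-tested
            tested = []
    return tests
```

For this scenario, worst_case_tests(n) comes out to exactly n(n+1)/2, confirming the quadratic bound.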

How to use multiprocessing in python

I'm new to Python and I want to do parallel programming in the following code using the multiprocessing module. So how should I modify the code? I've been searching for a method using Pool, but found only limited examples that I could follow. Can anyone help me? Thank you.
Note that setinner and setouter are two independent functions and that's where I want to use parallel programming to reduce the running time.
def solve(Q, G, n):
    i = 0
    tol = 10**-4
    while i < 1000:
        inneropt, partition, x = setinner(Q, G, n)
        outeropt = setouter(Q, G, n)
        if (outeropt - inneropt) / (1 + abs(outeropt) + abs(inneropt)) < tol:
            break
        node1 = partition[0]
        node2 = partition[1]
        G = updateGraph(G, node1, node2)
        i += 1  # missing from the original snippet; without it the loop never advances
    if i == 999:
        print "Maximum iteration reached"
    print inneropt
It's hard to parallelize code that needs to mutate the same shared data from different tasks. So, I'm going to assume that setinner and setouter are non-mutating functions; if that's not true, things will be more complicated.
The first step is to decide what you want to do in parallel.
One obvious thing is to do the setinner and setouter at the same time. They're completely independent of each other, and always need to both get done. So, that's what I'll do. Instead of doing this:
inneropt,partition,x = setinner(Q,G,n)
outeropt = setouter(Q,G,n)
… we want to submit the two functions as tasks to the pool, then wait for both to be done, then get the results of both.
The concurrent.futures module (which requires a third-party backport in Python 2.x) makes it easier to do things like "wait for both to be done" than the multiprocessing module (which is in the stdlib in 2.6+), but in this case, we don't need anything fancy; if one of them finishes early, we don't have anything to do until the other finishes anyway. So, let's stick with multiprocessing.apply_async:
pool = multiprocessing.Pool(2) # we never have more than 2 tasks to run
while i < 1000:
    # start both tasks in parallel
    inner_result = pool.apply_async(setinner, (Q, G, n))
    outer_result = pool.apply_async(setouter, (Q, G, n))
    # sequentially wait for both tasks to finish and get their results
    inneropt, partition, x = inner_result.get()
    outeropt = outer_result.get()
    # the rest of your loop is unchanged
You may want to move the pool outside the function so it lives forever and can be used by other parts of your code. And if not, you almost certainly want to shut the pool down at the end of the function. (Later versions of multiprocessing let you just use the pool in a with statement, but I think that requires Python 3.2+, so you have to do it explicitly.)
What if you want to do more work in parallel? Well, there's nothing else obvious to do here without restructuring the loop. You can't do updateGraph until you get the results back from setinner and setouter, and nothing else is slow here.
But if you could reorganize things so that each loop's setinner were independent of everything that came before (which may or may not be possible with your algorithm—without knowing what you're doing, I can't guess), you could push 2000 tasks onto the queue up front, then loop by just grabbing results as needed. For example:
pool = multiprocessing.Pool() # let it default to the number of cores
inner_results = []
outer_results = []
for i in range(1000):
    inner_results.append(pool.apply_async(setinner, (Q, G, n, i)))
    outer_results.append(pool.apply_async(setouter, (Q, G, n, i)))
for i in range(1000):
    inneropt, partition, x = inner_results.pop(0).get()
    outeropt = outer_results.pop(0).get()
    # the rest of your loop is unchanged
Of course you can make this fancier.
For example, let's say you rarely need more than a couple hundred iterations, so it's wasteful to always compute 1000 of them. You can just push the first N at startup, and push one more every time through the loop (or N more every N times) so you never do more than N wasted iterations—you can't get an ideal tradeoff between perfect parallelism and minimal waste, but you can usually tune it pretty nicely.
Also, if the tasks don't actually take that long, but you have a lot of them, you may want to batch them up. One really easy way to do this is to use one of the map variants instead of apply_async; this can make your fetching code a tiny bit more complicated, but it makes the queuing and batching code completely trivial (e.g., to map each func over a list of 100 parameters with a chunksize of 10 is just two simple lines of code).
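As a sketch of that last point, with a toy stand-in for the work function (an assumption, since the real tasks aren't shown), mapping over a list of 100 parameter tuples with a chunksize of 10 really is just a couple of lines:

```python
from multiprocessing import Pool

def work(params):
    # toy stand-in task: params is one tuple from the parameter list
    a, b = params
    return a * b

if __name__ == '__main__':
    param_list = [(i, i + 1) for i in range(100)]
    with Pool() as pool:
        # chunksize=10 hands each worker batches of 10 tasks at a time,
        # cutting the per-task queuing overhead roughly tenfold
        results = pool.map(work, param_list, chunksize=10)
    print(results[:5])
```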

Implementing a dynamic multiple timeline queue

Introduction
I would like to implement a dynamic multiple timeline queue. The context here is scheduling in general.
What is a timeline queue?
This is still simple: it is a timeline of tasks, where each task has its start and end time. Tasks are grouped as jobs. Such a group of tasks needs to preserve its order, but can be moved around in time as a whole. For example, it could be represented as:
--t1-- ---t2.1-----------t2.2-------
' ' ' ' '
20 30 40 70 120
I would implement this as a heap queue with some additional constraints. The Python sched module has some basic approaches in this direction.
Definition multiple timeline queue
One queue stands for a resource and a resource is needed by a task. Graphical example:
R1 --t1.1----- --t2.2----- -----t1.3--
/ \ /
R2 --t2.1-- ------t1.2-----
Explaining "dynamic"
It becomes interesting when a task can use one of multiple resources. An additional constraint is that consecutive tasks, which can run on the same resource, must use the same resource.
Example: If (from above) task t1.3 can run on R1 or R2, the queue should look like:
R1 --t1.1----- --t2.2-----
/ \
R2 --t2.1-- ------t1.2----------t1.3--
Functionality (in priority order)
FirstFreeSlot(duration, start): Find the first free time slot beginning from start where there is free time for duration (see detailed explanation at the end).
Enqueue a job as early as possible on the multiple resources, respecting the constraints (mainly: correct order of tasks, consecutive tasks on the same resource) and using FirstFreeSlot.
Put a job at a specific time and move the tail backwards
Delete a job
Recalculate: After delete, test if some tasks can be executed earlier.
Key Question
The point is: How can I represent this information to provide the functionality efficiently? Implementation is up to me ;-)
Update: A further point to consider: typical interval structures focus on "What is at point X?", but in this case the enqueue operation, and therefore the question "Where is the first empty slot for duration D?", is much more important. So a segment/interval tree or something else in this direction is probably not the right choice.
To elaborate the point with the free slots further: Due to the fact that we have multiple resources and the constraint of grouped tasks there can be free time slots on some resources. Simple example: t1.1 run on R1 for 40 and then t1.2 run on R2. So there is an empty interval of [0, 40] on R2 which can be filled by the next job.
Update 2: There is an interesting proposal in another SO question. If someone can port it to my problem and show that it is working for this case (especially elaborated to multiple resources), this would be probably a valid answer.
Let's restrict ourselves to the simplest case first: Find a suitable data structure that allows for a fast implementation of FirstFreeSlot().
The free time slots live in a two-dimensional space: One dimension is the start time s, the other is the length d. FirstFreeSlot(D) effectively answers the following query:
min s: d >= D
If we think of s and d as a cartesian space (d=x, s=y), this means finding the lowest point in a subplane bounded by a vertical line. A quad-tree, perhaps with some auxiliary information in each node (namely, min s over all leafs), will help answering this query efficiently.
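To pin down the query semantics before committing to a quad-tree, here is a deliberately naive linear-scan baseline over (start, length) free slots; it gives the same answers, just in O(n) per query instead of the logarithmic time the tree structure would provide:

```python
def first_free_slot(slots, D):
    # naive O(n) baseline for the query "min s : d >= D";
    # slots is a list of (start, length) free intervals
    starts = [s for s, d in slots if d >= D]
    return min(starts) if starts else None

def first_free_slot_after(slots, D, S):
    # constrained variant "min s : s >= S and d >= D", i.e. the
    # rectangle query the quad-tree would answer
    starts = [s for s, d in slots if d >= D and s >= S]
    return min(starts) if starts else None
```

Any candidate data structure can be validated against this baseline on random slot sets.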
For Enqueue() in the face of resource constraints, consider maintaining a separate quad-tree for each resource. The quad tree can also answer queries like
min s: s >= S & d >= D
(required for restricting the start data) in a similar fashion: Now a rectangle (open at the top left) is cut off, and we look for min s in that rectangle.
Put() and Delete() are simple update operations for the quad-tree.
Recalculate() can be implemented by Delete() + Put(). In order to save time for unnecessary operations, define sufficient (or, ideally, sufficient + necessary) conditions for triggering a recalculation. The Observer pattern might help here, but remember putting the tasks for rescheduling into a FIFO queue or a priority queue sorted by start time. (You want to finish rescheduling the current task before taking over to the next.)
On a more general note, I'm sure you are aware that most kind of scheduling problems, especially those with resource constraints, are NP-complete at least. So don't expect an algorithm with a decent runtime in the general case.
class Task:
    name = ''
    duration = 0
    resources = list()

class Job:
    name = ''
    tasks = list()

class Assignment:
    task = None
    resource = None
    time = None

class MultipleTimeline:
    assignments = list()
    def enqueue(self, job):
        pass
    def put(self, job):
        pass
    def delete(self, job):
        pass
    def recalculate(self):
        pass
Is this a first step in the direction you are looking for, i.e. a data model written out in Python?
Update:
Here is my more efficient model:
It basically puts all Tasks in a linked list ordered by end time.
class Task:
    name = ''
    duration = 0    # the amount of work to be done
    resources = 0   # bitmap that tells which resources this task uses
    # the following variables are only used when the task is scheduled
    next = None     # the next scheduled task by end time
    resource = None # the resource this task is scheduled on
    gap = None      # the amount of time before the next scheduled task starts on this resource

class Job:
    id = 0
    tasks = list()  # the Task instances of this job, in order

class Resource:
    bitflag = 0       # a bit flag which operates bitwise with Task.resources
    firsttask = None  # the first Task instance that is scheduled on this resource
    gap = None        # the amount of time before the first Task starts

class MultipleTimeline:
    resources = list()
    def FirstFreeSlot(self):
        pass
    def enqueue(self, job):
        pass
    def put(self, job):
        pass
    def delete(self, job):
        pass
    def recalculate(self):
        pass
Because of the updates by enqueue and put, I decided not to use trees.
Because put moves tasks in time, I decided not to use absolute times.
FirstFreeSlot not only returns the task with the free slot, but also the other running tasks with their end times.
enqueue works as follows:
We look for a free slot with FirstFreeSlot and schedule the task there.
If there is enough space for the next task, we can schedule it in too.
If not: look at the other running tasks to see whether they have free space.
If not: run FirstFreeSlot with the parameters of this time and the running tasks.
improvements:
if put is not used very often and enqueue is done from time zero, we could keep track of the overlapping tasks by including a dict() per task that contains the other running tasks. Then it is also easy to keep a list() per Resource containing the tasks scheduled on that Resource with absolute times, ordered by end time. Only those tasks are included that have bigger time gaps than before. Now we can find a free slot more easily.
Questions:
Do Tasks scheduled by put need to be executed at that time?
If yes: What if another task to be scheduled by put overlaps?
Do all resources execute a task equally fast?
After spending some time thinking this through, I think a segment tree might be more appropriate to model this timeline queue. The job concept is like a LIST data structure.
I assume the Task can be modeled like this (pseudocode). The sequence of the tasks in the job can be assured by the start_time.
class Task:
    name = ''
    _seg_starttime = -1
    # this is the earliest time the Task can start in the segment tree;
    # in many cases this can be set to -1, which indicates it starts after its
    # predecessor, as determined by its predecessor in the segment tree.
    # if this is not equal to -1, then this task is specified to start at that time.
    # whenever the predecessor changes, this info needs to be taken care of
    _job_starttime = 0
    # this is the earliest time the Task can start in the job sequence, constrained by the job definition
    _duration = 0
    # this is the time the Task costs to run

    def get_segstarttime():
        if _seg_starttime == -1:
            return PREDECESSOR_NODE.get_segstarttime() + _duration
        return _seg_starttime + _duration

    def get_jobstarttime():
        return PREVIOUS_JOB.get_endtime()

    def get_starttime():
        return max(get_segstarttime(), get_jobstarttime())
Enqueue: merely append a task node to the segment tree; note the _seg_starttime set to -1 to indicate it starts right after its predecessor.
Put: insert a segment into the tree; the segment is given by start_time and duration.
Delete: remove the segment from the tree, updating its successor if necessary (say, if the deleted node does have a _seg_starttime present).
Recalculate: calling get_starttime() again will directly get the earliest start time.
Examples( without considering the job constraint )
t1.1( _segst = 10, du = 10 )
\
t2.2( _segst = -1, du = 10 ) meaning the st=10+10=20
\
t1.3 (_segst = -1, du = 10 ) meaning the st = 20+10 = 30
if we do a Put:
t1.1( _segst = 10, du = 10 )
\
t2.2( _segst = -1, du = 10 ) meaning the st=20+10=30
/ \
t2.3(_segst = 20, du = 10) t1.3 (_segst = -1, du = 10 ) meaning the st = 30+10 = 40
if we do a Delete of t1.1 on the original scenario:
t2.2( _segst = 20, du = 10 )
\
t1.3 (_segst = -1, du = 10 ) meaning the st = 20+10 = 30
Each resource could be represented using 1 instance of this interval tree
e.g.
from the segment tree (timeline) perspective:
t1.1 t3.1
\ / \
t2.2 t2.1 t1.2
from the job perspective:
t1.1 <- t1.2
t2.1 <- t2.2
t3.1
t2.1 and t2.2 are connected using a linked list. As stated, t2.2 gets its _seg_starttime from the segment tree and its _job_starttime from the linked list; comparing the two times then yields the actual earliest time it could run.
I finally used just a simple list for my queue items and an in-memory SQLite database for storing the empty slots, because multidimensional querying and updating is very efficient with SQL. I only need to store the fields start, duration and index in a table.
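A hedged sketch of what such an SQLite-backed slot store can look like. The table and column names are my guesses, not the author's actual schema, and the `index` field is renamed `idx` here because INDEX is an SQL keyword:

```python
import sqlite3

# in-memory database holding one row per empty slot
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE free_slots (start INTEGER, duration INTEGER, idx INTEGER)')

def add_slot(start, duration, idx):
    db.execute('INSERT INTO free_slots VALUES (?, ?, ?)', (start, duration, idx))

def first_free_slot(min_duration, min_start=0):
    # the multidimensional query: earliest slot that is long enough
    # and begins at or after min_start
    return db.execute(
        'SELECT start, duration, idx FROM free_slots'
        ' WHERE duration >= ? AND start >= ?'
        ' ORDER BY start LIMIT 1', (min_duration, min_start)).fetchone()
```

Updating a slot after scheduling into it is then a plain UPDATE or DELETE on the same table.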
