Implementing a dynamic multiple timeline queue - python

Introduction
I would like to implement a dynamic multiple timeline queue. The context here is scheduling in general.
What is a timeline queue?
This is still simple: it is a timeline of tasks, where each task has its start and end time. Tasks are grouped as jobs. This group of tasks needs to preserve its order, but can be moved around in time as a whole. For example it could be represented as:
--t1-- ---t2.1-----------t2.2-------
'    ' '                '          '
20   30 40              70        120
I would implement this as a heap queue with some additional constraints. The Python sched module has some basic approaches in this direction.
Definition of a multiple timeline queue
One queue stands for a resource and a resource is needed by a task. Graphical example:
R1 --t1.1-----   --t2.2-----     -----t1.3--
              / \               /
R2 --t2.1----    -----t1.2-----
Explaining "dynamic"
It becomes interesting when a task can use one of multiple resources. An additional constraint is that consecutive tasks, which can run on the same resource, must use the same resource.
Example: If (from above) task t1.3 can run on R1 or R2, the queue should look like:
R1 --t1.1-----   --t2.2-----
              / \
R2 --t2.1----    -----t1.2----------t1.3--
Functionality (in priority order)
FirstFreeSlot(duration, start): Find the first free time slot beginning from start where there is free time for duration (see detailed explanation at the end).
Enqueue a job as early as possible on the multiple resources, regarding the constraints (mainly: correct order of tasks, consecutive tasks on the same resource) and using FirstFreeSlot.
Put a job at a specific time and move the tail backwards
Delete a job
Recalculate: After delete, test if some tasks can be executed earlier.
Key Question
The point is: How can I represent this information to provide the functionality efficiently? Implementation is up to me ;-)
Update: A further point to consider: The typical interval structures have the focus on "What is at point X?" But in this case the enqueue and therefore the question "Where is the first empty slot for duration D?" is much more important. So a segment/interval tree or something else in this direction is probably not the right choice.
To elaborate the point with the free slots further: due to the fact that we have multiple resources and the constraint of grouped tasks, there can be free time slots on some resources. Simple example: t1.1 runs on R1 for 40 and then t1.2 runs on R2. So there is an empty interval of [0, 40] on R2 which can be filled by the next job.
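For illustration, a minimal sketch of this situation; only t1.1's duration of 40 comes from the example, the other numbers are made up:

schedule = {
    "R1": [("t1.1", 0, 40)],        # (task, start, end)
    "R2": [("t1.2", 40, 70)],       # t1.2 must wait for t1.1, so R2 idles at first
}
free_slots = {
    "R1": [(40, None)],             # open-ended slot after t1.1
    "R2": [(0, 40), (70, None)],    # the [0, 40] gap a later job could fill
}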
Update 2: There is an interesting proposal in another SO question. If someone can port it to my problem and show that it is working for this case (especially elaborated to multiple resources), this would be probably a valid answer.

Let's restrict ourselves to the simplest case first: Find a suitable data structure that allows for a fast implementation of FirstFreeSlot().
The free time slots live in a two-dimensional space: One dimension is the start time s, the other is the length d. FirstFreeSlot(D) effectively answers the following query:
min s: d >= D
If we think of s and d as a Cartesian space (d=x, s=y), this means finding the lowest point in a subplane bounded by a vertical line. A quad-tree, perhaps with some auxiliary information in each node (namely, min s over all leaves), will help answer this query efficiently.
For Enqueue() in the face of resource constraints, consider maintaining a separate quad-tree for each resource. The quad tree can also answer queries like
min s: s >= S & d >= D
(required for restricting the start time) in a similar fashion: now a rectangle (open at the top right) is cut off, and we look for min s in that rectangle.
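To make the query concrete, here is a naive reference version of it over a flat list of free slots (names illustrative); the quad-tree's job is to answer the same question without scanning every slot:

# Naive reference implementation of the query the quad-tree should speed up:
# among all free slots (s, d), find the minimal start s with s >= S and d >= D.
def first_free_slot(free_slots, D, S=0):
    """free_slots: iterable of (start, duration) pairs."""
    candidates = [s for s, d in free_slots if s >= S and d >= D]
    return min(candidates) if candidates else None

# Example: slots (0, 40), (70, 10) and an open-ended slot at 120 on some resource
print(first_free_slot([(0, 40), (70, 10), (120, float("inf"))], D=30))  # -> 0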
Put() and Delete() are simple update operations for the quad-tree.
Recalculate() can be implemented by Delete() + Put(). In order to save time for unnecessary operations, define sufficient (or, ideally, sufficient and necessary) conditions for triggering a recalculation. The Observer pattern might help here, but remember to put the tasks for rescheduling into a FIFO queue or a priority queue sorted by start time. (You want to finish rescheduling the current task before moving on to the next.)
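A sketch of that rescheduling queue with Python's heapq (illustrative names, not tied to any particular task model):

import heapq

# Tasks waiting to be recalculated, always popped in order of their current start time.
reschedule_heap = []  # entries are (start_time, task_id)

def mark_for_reschedule(start_time, task_id):
    heapq.heappush(reschedule_heap, (start_time, task_id))

def next_task_to_reschedule():
    return heapq.heappop(reschedule_heap) if reschedule_heap else None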
On a more general note, I'm sure you are aware that most kinds of scheduling problems, especially those with resource constraints, are NP-complete at least. So don't expect an algorithm with a decent runtime in the general case.

class Task:
    name = ''
    duration = 0
    resources = list()

class Job:
    name = ''
    tasks = list()

class Assignment:
    task = None
    resource = None
    time = None

class MultipleTimeline:
    assignments = list()

    def enqueue(self, job):
        pass

    def put(self, job):
        pass

    def delete(self, job):
        pass

    def recalculate(self):
        pass
Is this a first step in the direction you are looking for, i.e. a data model written out in Python?
Update:
Here is my more efficient model:
It basically puts all Tasks in a linked list ordered by endtime.
class Task:
    name = ''
    duration = 0     # the amount of work to be done
    resources = 0    # bitmap that tells what resources this task uses
    # the following variables are only used when the task is scheduled
    next = None      # the next scheduled task by endtime
    resource = None  # the resource this task is scheduled on
    gap = None       # the amount of time before the next scheduled task starts on this resource

class Job:
    id = 0
    tasks = list()   # the Task instances of this job in order

class Resource:
    bitflag = 0      # a bit flag which operates bitwisely with Task.resources
    firsttask = None # the first Task instance that is scheduled on this resource
    gap = None       # the amount of time before the first Task starts

class MultipleTimeline:
    resources = list()

    def FirstFreeSlot(self):
        pass

    def enqueue(self, job):
        pass

    def put(self, job):
        pass

    def delete(self, job):
        pass

    def recalculate(self):
        pass
Because of the updates by enqueue and put I decided not to use trees.
Because put moves tasks in time, I decided not to use absolute times.
FirstFreeSlot not only returns the task with the free slot but also the other running tasks with their endtimes.
enqueue works as follows (a rough sketch of this scan follows below):
We look for a free slot with FirstFreeSlot and schedule the task there.
If there is enough space for the next task, we can schedule it in too.
If not: look at the other running tasks to see whether they have free space.
If not: run FirstFreeSlot with the parameters of this time and the running tasks.
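Here is a rough sketch of that FirstFreeSlot scan, assuming the next/gap/firsttask semantics from the model above; the logic is illustrative only, not a finished implementation:

def first_free_slot(resource, duration):
    # gap before the first task on this resource big enough?
    if resource.firsttask is None or (resource.gap is not None and resource.gap >= duration):
        return None                      # None = slot starts at the very beginning of this resource
    last_on_resource = None
    task = resource.firsttask
    while task is not None:              # walk the global endtime-ordered list
        if task.resource is resource:
            last_on_resource = task
            if task.gap is not None and task.gap >= duration:
                return task              # slot starts right after this task ends
        task = task.next
    return last_on_resource              # no gap found: append after the last task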
improvements:
if put is not used very often and enqueue is done from time zero, we could keep track of the overlapping tasks by including a dict() per task that contains the other running tasks. Then it is also easy to keep a list() per Resource which contains the tasks scheduled on this Resource with absolute times, ordered by endtime. Only those tasks are included that have bigger time gaps than before. Now we can find a free slot more easily.
Questions:
Do Tasks scheduled by put need to be executed at that time?
If yes: What if another task to be scheduled by put overlaps?
Do all resources execute a task equally fast?

After spending some time thinking this through, I think a segment tree might be more appropriate to model this timeline queue. The job concept is like a LIST data structure.
I assume the Task can be modeled like this (PSEUDO CODE). The sequence of the tasks in the job can be ensured by the start_time.
class Task:
    name = ''
    _seg_starttime = -1
    # this is the earliest time the Task can start in the segment tree;
    # in a lot of cases this can be set to -1, which indicates it starts after its predecessor
    # and is then determined by its predecessor in the segment tree.
    # if this is not equal to -1, then this task is specified to start at that time.
    # whenever the predecessor changes, this info needs to be taken care of
    _job_starttime = 0
    # this is the earliest time the Task can start in the job sequence, constrained by the job definition
    _duration = 0
    # this is the time the Task costs to run

    def get_segstarttime():
        if _seg_starttime == -1:
            return PREDECESSOR_NODE.get_segstarttime() + _duration
        return _seg_starttime + _duration

    def get_jobstarttime():
        return PREVIOUS_JOB.get_endtime()

    def get_starttime():
        return max(get_segstarttime(), get_jobstarttime())
Enqueue merely appends a task node to the segment tree; notice that _seg_starttime is set to -1 to indicate it starts right after its predecessor.
Put inserts a segment into the tree; the segment is indicated by start_time and duration.
Delete removes the segment from the tree and updates its successor if necessary (say, if the deleted node does have a _seg_starttime present).
Recalculate: calling get_starttime() again will directly get the earliest start time.
Examples (without considering the job constraint)
t1.1( _segst = 10, du = 10 )
  \
   t2.2( _segst = -1, du = 10 )   meaning the st = 10 + 10 = 20
     \
      t1.3( _segst = -1, du = 10 )   meaning the st = 20 + 10 = 30
if we do a Put:
t1.1( _segst = 10, du = 10 )
  \
   t2.2( _segst = -1, du = 10 )   meaning the st = 20 + 10 = 30
   /  \
t2.3( _segst = 20, du = 10 )   t1.3( _segst = -1, du = 10 )   meaning the st = 30 + 10 = 40
if we do a Delete of t1.1 from the original scenario:
t2.2( _segst = 20, du = 10 )
  \
   t1.3( _segst = -1, du = 10 )   meaning the st = 20 + 10 = 30
Each resource could be represented using one instance of this interval tree, e.g.
from the segment tree (timeline) perspective:
t1.1        t3.1
    \       /   \
    t2.2  t2.1   t1.2
from the job perspective:
t1.1 <- t1.2
t2.1 <- t2.2
t3.1
t2.1 and t2.2 are connected using a linked list; as stated, t2.2 gets its _seg_starttime from the segment tree and its _job_starttime from the linked list. Comparing the two times, the actual earliest time it could run can be derived.
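A tiny illustration of that rule with made-up numbers:

seg_start = 20   # e.g. the previous task on the same resource ends at 20
job_start = 35   # e.g. the previous task of the same job ends at 35 on another resource
actual_start = max(seg_start, job_start)
print(actual_start)   # -> 35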

I finally used just a simple list for my queue items and an in-memory SQLite database for storing the empty slots, because multidimensional querying and updating is very efficient with SQL. I only need to store the fields start, duration and index in a table.
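A sketch of what that could look like (column names from above, everything else illustrative; "index" is quoted because it is an SQL keyword):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE slots (start INTEGER, duration INTEGER, "index" INTEGER)')
con.executemany('INSERT INTO slots VALUES (?, ?, ?)',
                [(0, 40, 2), (70, 10, 1), (120, 10**9, 1)])

def first_free_slot(con, needed, earliest=0):
    # first empty slot of at least `needed`, starting no earlier than `earliest`
    return con.execute(
        'SELECT start, duration, "index" FROM slots '
        'WHERE duration >= ? AND start >= ? ORDER BY start LIMIT 1',
        (needed, earliest)).fetchone()

print(first_free_slot(con, needed=30))   # -> (0, 40, 2)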

Related

Avoid increased runtime when opening threads in consecutive runs

I'm doing my final thesis and my topic is the creation of software that will run and control an on-satellite experiment.
For that reason, I had to implement the reading of multiple sensors while the experiment is running. To do that, I wrote the code so that it will create a new thread for each sensor (multiprocessing might not work because I don't yet know which system the software will run on and therefore I can't say if there will be multiple processors available) and these threads run as daemons all the while the software does its thing. It works well, but now I need to test the whole thing and this is where it gets problematic:
To properly test each and every route the software could take, I have multiple variables that need to be set and so there will be a lot of test runs (I calculated around 17,000 but could be wrong). While the first few test runs go over quickly, each run takes longer and longer. I have fiddled around with my code a little bit and it turns out that without threading, each test takes about the same time. Unfortunately, I do not know why and my knowledge of the matter is very limited. The code concerning the threading is as follows:
This sets up the creation of each thread (sensor_list will be populated with multiple sensors in non-test conditions)
sensor_list = [<a single sensor>]
for sensor in sensor_list:
    thread = threading.Thread(
        target=self.store_sensor_data,
        args=[sensor, query_frequency],
        daemon=True,
        name=f"Thread_{sensor}",
    )
    self.threads.append(thread)
    thread.start()
The function which actually deals with getting and writing the sensor data, self.store_sensor_data, looks like this:
def store_sensor_data(self, sensor, frequency):
    """Get the current reading and result from 'sensor' and store them.

    sensor (Sensor) - the sensor whose data shall be stored
    frequency (int) - the frequency (in 1/s) at which data shall be stored
    """
    value_id = 0
    while not self.HALT:
        value_id += 1
        sensor_reading = sensor.get_reading()
        sensor_result = sensor.get_result()
        try:
            # if there already is a list for that sensor, append the data to it
            self.experiment_report.sensor_data_raw[str(sensor)].append(
                (value_id, sensor_reading)
            )
        except KeyError:
            # if there is no list, create one containing the current sensor value
            self.experiment_report.sensor_data_raw[str(sensor)] = [
                (value_id, sensor_reading)
            ]
        # repeat the same for the 'result'
        try:
            self.experiment_report.sensor_data[str(sensor)].append(
                (value_id, sensor_result)
            )
        except KeyError:
            self.experiment_report.sensor_data[str(sensor)] = [
                (value_id, sensor_result)
            ]
        time.sleep(1 / frequency)
after the experiment is done, I stop the threads by calling
def interrupt_sensor_data_recording(self):
    """Interrupt the storing of sensor data by ending all daemon threads.

    threads (list) - a list of currently running threads
    """
    if len(self.threads) > 0:
        self.HALT = True
        for thread in self.threads:
            if thread.is_alive():
                logger.debug(f"Stopping thread '{thread.getName()}'")
                thread.join()
            else:
                thread.join()
                logger.debug(f"Thread '{thread.getName()}' was already stopped")
Now I am unsure whether the way I stop the daemon threads is appropriate, and this might be the source of my problems. But there might also be some implication that I don't know about yet; in both cases, it would be nice if someone with more knowledge than me could help me out here.
Thanks in advance!

Using pool for multiprocessing in Python (Windows)

I have to do my study in a parallel way to run it much faster. I am new to the multiprocessing library in Python and could not yet make it run successfully.
Here, I am investigating if each pair of (origin, target) remains at certain locations between various frames of my study. Several points:
It is one function, which I want to run faster (It is not several processes).
The process is performed subsequently; it means that each frame is compared with the previous one.
This code is a much simpler form of the original code. The code outputs a residence_list.
I am using Windows OS.
Can someone check the code (the multiprocessing section) and help me improve it to make it work? Thanks.
import numpy as np
from multiprocessing import Pool, freeze_support


def Main_Residence(total_frames, origin_list, target_list):
    Previous_List = {}
    residence_list = []

    for frame in range(total_frames):     # Each frame
        Current_List = {}                 # Dict of pairs and their residence for frames
        for origin in range(origin_list):
            for target in range(target_list):
                Pair = (origin, target)   # Each pair
                if Pair in Current_List.keys():   # If already considered, continue
                    continue
                else:
                    if origin == target:
                        if (Pair in Previous_List.keys()):  # If remained from the previous frame, add residence
                            print "Origin_Target remained: ", Pair
                            Current_List[Pair] = (Previous_List[Pair] + 1)
                        else:             # If new, add it to the current
                            Current_List[Pair] = 1
        for pair in Previous_List.keys():  # Add those that exited from residence to the list
            if pair not in Current_List.keys():
                residence_list.append(Previous_List[pair])
        Previous_List = Current_List
    return residence_list


if __name__ == '__main__':
    pool = Pool(processes=5)
    Residence_List = pool.apply_async(Main_Residence, args=(20, 50, 50))
    print Residence_List.get(timeout=1)
    pool.close()
    pool.join()
    freeze_support()
    Residence_List = np.array(Residence_List) * 5
Multiprocessing does not make sense in the context you are presenting here.
You are creating five subprocesses (and three threads belonging to the pool, managing workers, tasks and results) to execute one function once. All of this is coming at a cost, both in system resources and execution time, while four of your worker processes don't do anything at all. Multiprocessing does not speed up the execution of a function. The code in your specific example will always be slower than plainly executing Main_Residence(20, 50, 50) in the main process.
For multiprocessing to make sense in such a context, your work at hand would need to be broken down into a set of homogeneous tasks that can be processed in parallel with their results potentially being merged later.
As an example (not necessarily a good one), if you want to calculate the largest prime factors for a sequence of numbers, you can delegate the task of calculating that factor for any specific number to a worker in a pool. Several workers would then do these individual calculations in parallel:
from datetime import datetime
from multiprocessing import Pool


def largest_prime_factor(n):
    p = n
    i = 2
    while i * i <= n:
        if n % i:
            i += 1
        else:
            n //= i
    return p, n


if __name__ == '__main__':
    pool = Pool(processes=3)
    start = datetime.now()
    # this delegates half a million individual tasks to the pool, i.e.
    # largest_prime_factor(0), largest_prime_factor(1), ..., largest_prime_factor(499999)
    pool.map(largest_prime_factor, range(500000))
    pool.close()
    pool.join()
    print "pool elapsed", datetime.now() - start

    start = datetime.now()
    # same work just in the main process
    [largest_prime_factor(i) for i in range(500000)]
    print "single elapsed", datetime.now() - start
Output:
pool elapsed 0:00:04.664000
single elapsed 0:00:08.939000
(the largest_prime_factor function is taken from @Stefan in this answer)
As you can see, the pool is only roughly twice as fast as single process execution of the same amount of work, all while running in three processes in parallel. That's due to the overhead introduced by multiprocessing/the pool.
So, you stated that the code in your example has been simplified. You'll have to analyse your original code to see if it can be broken down into homogeneous tasks that can be passed down to your pool for processing. If that is possible, using multiprocessing might help you speed up your program. If not, multiprocessing will likely cost you time, rather than save it.
Edit:
Since you asked for suggestions on the code. I can hardly say anything about your function. You said yourself that it is just a simplified example to provide an MCVE (much appreciated by the way! Most people don't take the time to strip down their code to its bare minimum). Requests for a code review are anyway better suited over at Codereview.
Play around a bit with the available methods of task delegation. In my prime factor example, using apply_async came with a massive penalty. Execution time increased ninefold, compared to using map. But my example is using just a simple iterable; yours needs three arguments per task. This could be a case for starmap, but that is only available as of Python 3.3. Anyway, the structure/nature of your task data basically determines the correct method to use.
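For completeness, a sketch of what starmap-based delegation could look like for a three-argument function (Python 3.3+; the function body and the number of tasks are illustrative):

from multiprocessing import Pool

def main_residence(total_frames, origin_list, target_list):
    # stand-in body; your real function goes here
    return total_frames * origin_list * target_list

if __name__ == '__main__':
    tasks = [(20, 50, 50)] * 10          # one tuple of arguments per call
    with Pool(processes=3) as pool:
        # starmap unpacks each tuple: main_residence(20, 50, 50), ...
        results = pool.starmap(main_residence, tasks)
    print(results[:3])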
I did some q&d testing with multiprocessing your example function.
The input was defined like this:
inp = [(20, 50, 50)] * 5000 # that makes 5000 tasks against your Main_Residence
I ran that in Python 3.6 in three subprocesses with your function unaltered, except for the removal of the print statement (I/O is costly). I used starmap, apply, starmap_async and apply_async and also iterated through the results each time to account for the blocking get() on the async results.
Here's the output:
starmap elapsed 0:01:14.506600
apply elapsed 0:02:11.290600
starmap async elapsed 0:01:27.718800
apply async elapsed 0:01:12.571200
# btw: 5k calls to Main_Residence in the main process looks as bad
# as using apply for delegation
single elapsed 0:02:12.476800
As you can see, the execution times differ, although all four methods do the same amount of work; the apply_async you picked appears to be the fastest method.
Coding Style. Your code looks quite ... unconventional :) You use Capitalized_Words_With_Underscore for your names (both function and variable names), which is pretty much a no-no in Python. Also, assigning the name Previous_List to a dictionary is ... questionable. Have a look at PEP 8, especially the section Naming Conventions, to see the commonly accepted coding style for Python.
Judging by the way your print looks, you are still using Python 2. I know that in corporate or institutional environments that's sometimes all you have available. Still, keep in mind that the clock for Python 2 is ticking.

Interact with celery ongoing task

We have a distributed architecture based on rabbitMQ and Celery.
We can launch in parallel multiple tasks without any issue. The scalability is good.
Now we need to control the task remotely: PAUSE, RESUME, CANCEL.
The only solution we found is to make in the Celery task a RPC call to another task that replies the command after a DB request. The Celery task and RPC task are not on the same machine and only the RPC task has access to the DB.
Do you have any advice how to improve it and easily communicate with an ongoing task?
Thank you
EDIT:
In fact we would like to do something like in the picture below. It's easy to do the Blue configuration or the Orange, but we don't know how to do both simultaneously.
Workers are subscribing to a common Jobs queue and each worker has its own Admin queue declared on an exchange.
EDIT:
If this is not possible with Celery, I'm open to a solution with other frameworks like python-rq.
It looks like the Control Bus pattern.
For better scalability and in order to reduce the number of RPC calls, I recommend reversing the logic: the PAUSE, RESUME, CANCEL commands are pushed to the Celery tasks through a control bus when the state change occurs. The Celery app stores its current state in a store (could be in memory, on the filesystem...). If task states must be kept even after a stop/start of the app, it will involve more work to keep both apps synchronized (e.g. synchronization at startup).
I'd like to demonstrate a general approach to implementing pause-able (and resume-able) ongoing celery tasks through the workflow pattern. Note: Original answer written here. Re-writing here due to this post being very relevant.
Concept
With celery workflows - you can design your entire operation to be divided into a chain of tasks. It doesn't necessarily have to be purely a chain, but it should follow the general concept of one task (or task group) starting after the previous one finishes.
Once you have a workflow like that, you can finally define points to pause at throughout your workflow. At each of these points, you can check whether or not the frontend user has requested the operation to pause and continue accordingly. The concept is this:-
A complex and time consuming operation O is split into 5 celery tasks - T1, T2, T3, T4, and T5 - each of these tasks (except the first one) depend on the return value of the previous task.
Let's assume we define points to pause after every single task, so the workflow looks like-
T1 executes
T1 completes, check if user has requested pause
If user has not requested pause - continue
If user has requested pause, serialize the remaining workflow chain and store it somewhere to continue later
... and so on. Since there's a pause point after each task, that check is performed after every one of them (except the last one of course).
But this is only theory, I struggled to find an implementation of this anywhere online so here's what I came up with-
Implementation
from typing import Any, Optional

from celery import shared_task
from celery.canvas import Signature, chain, signature


@shared_task(bind=True)
def pause_or_continue(
    self, retval: Optional[Any] = None, clause: dict = None, callback: dict = None
):
    # Task to use for deciding whether to pause the operation chain
    if signature(clause)(retval):
        # Pause requested, call given callback with retval and remaining chain
        # chain should be reversed as the order of execution follows from end to start
        signature(callback)(retval, self.request.chain[::-1])
        self.request.chain = None
    else:
        # Continue to the next task in chain
        return retval


def tappable(ch: chain, clause: Signature, callback: Signature, nth: Optional[int] = 1):
    '''
    Make an operation workflow chain pause-able/resume-able by inserting
    the pause_or_continue task for every nth task in the given chain

    ch: chain
        The workflow chain

    clause: Signature
        Signature of a task that takes one argument - the return value of the
        last executed task in the workflow (if any - otherwise `None` is passed)
        - and returns a boolean, indicating whether or not the operation should pause

        Should return True if the operation should pause, False if it should continue normally

    callback: Signature
        Signature of a task that takes 2 arguments - the return value of the
        last executed task in the workflow (if any - otherwise `None` is passed) and
        the remaining chain of the operation workflow as a json dict object

        No return value is expected

        This task will be called when `clause` returns `True` (i.e. the task is pausing)

        The return value and the remaining chain can be handled accordingly by this task

    nth: Int
        Check `clause` after every nth task in the chain

        Default value is 1, i.e. check `clause` after every task

        Hence, by default, the user given `clause` is called and checked
        after every task

    NOTE: The passed in chain is mutated in place
    Returns the mutated chain
    '''
    newch = []
    for n, sig in enumerate(ch.tasks):
        if n != 0 and n % nth == nth - 1:
            newch.append(pause_or_continue.s(clause=clause, callback=callback))
        newch.append(sig)
    ch.tasks = tuple(newch)
    return ch
Explanation - pause_or_continue
Here pause_or_continue is the aforementioned pause point. It's a task that will be called at specific intervals (intervals as in task intervals, not as in time intervals). This task then calls a user provided function (actually a task) - clause - to check whether or not the task should continue.
If the clause function (actually a task) returns True, the user provided callback function is called, the latest return value (if any - None otherwise) is passed onto this callback, as well as the remaining chain of tasks. The callback does what it needs to do and pause_or_continue sets self.request.chain to None, which tells celery "The task chain is now empty - everything is finished".
If the clause function (actually a task) returns False, the return value from the previous task (if any - None otherwise) is returned back for the next task to receive - and the chain goes on. Hence the workflow continues.
Why are clause and callback task signatures and not regular functions?
Both clause and callback are being called directly - without delay or apply_async. It is executed in the current process, in the current context. So it behaves exactly like a normal function, then why use signatures?
The answer is serialization. You can't conveniently pass a regular function object to a celery task. But you can pass a task signature. That's exactly what I'm doing here. Both clause and callback should be a regular signature object of a celery task.
What is self.request.chain?
self.request.chain stores a list of dicts (representing jsons as the celery task serializer is json by default) - each of them representing a task signature. Each task from this list is executed in reverse order. Which is why, the list is reversed before passing to the user provided callback function (actually a task) - the user probably expects the order of tasks to be left to right.
Quick note: Irrelevant to this discussion, but if you're using the link parameter from apply_async to construct a chain instead of the chain primitive itself. self.request.callback is the property to be modified (i.e set to None to remove callback and stop chain) instead of self.request.chain
Explanation - tappable
tappable is just a basic function that takes a chain (which is the only workflow primitive covered here, for brevity) and inserts pause_or_continue after every nth task. You can insert them wherever you want really; it is up to you to define pause points in your operation. This is just an example!
For each chain object, the actual signatures of tasks (in order, going from left to right) are stored in the .tasks property. It's a tuple of task signatures. So all we have to do is take this tuple, convert it into a list, insert the pause points and convert it back to a tuple to assign to the chain. Then return the modified chain object.
The clause and callback are also attached to the pause_or_continue signature. Normal celery stuff.
That covers the primary concept, but to showcase a real project using this pattern (and also to showcase the resuming part of a paused task), here's a small demo of all the necessary resources
Usage
This example usage assumes the concept of a basic web server with a database. Whenever an operation (i.e workflow chain) is started, it's assigned an id and stored into the database. The schema of that table looks like-
-- Create operations table
-- Keeps track of operations and the users that started them
CREATE TABLE operations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    requester_id INTEGER NOT NULL,
    completion TEXT NOT NULL,
    workflow_store TEXT,
    result TEXT,
    FOREIGN KEY (requester_id) REFERENCES user (id)
);
The only field that needs to be known about right now, is completion. It just stores the status of the operation-
When the operation starts and a db entry is created, this is set to IN PROGRESS
When a user requests pause, the route controller (i.e view) modifies this to REQUESTING PAUSE
When the operation actually gets paused and callback (from tappable, inside pause_or_continue) is called, the callback should modify this to PAUSED
When the task is completed, this should be modified to COMPLETED
An example of clause
@celery.task()
def should_pause(_, operation_id: int):
    # This is the `clause` to be used for `tappable`
    # i.e. it lets celery know whether to pause or continue
    db = get_db()
    # Check the database to see if the user has requested pause on the operation
    operation = db.execute(
        "SELECT * FROM operations WHERE id = ?", (operation_id,)
    ).fetchone()
    return operation["completion"] == "REQUESTING PAUSE"
This is the task to call at the pause points, to determine whether or not to pause. It's a function that takes 2 parameters.....well sort of. The first one is mandatory, tappable requires the clause to have one (and exactly one) argument - so it can pass the previous task's return value to it (even if that return value is None). In this example, the return value isn't required to be used - so we can just ignore it.
The second parameter is an operation id. See, all this clause does - is check a database for the operation (the workflow) entry and see if it has the status REQUESTING PAUSE. To do that, it needs to know the operation id. But clause should be a task with one argument, what gives?
Well, good thing signatures can be partial. When the task is first started and a tappable chain is created, the operation id is known, and hence we can do should_pause.s(operation_id) to get the signature of a task that takes one parameter, that being the return value of the previous task. That qualifies as a clause!
An example of callback
import os
import json
from typing import Any, List


@celery.task()
def save_state(retval: Any, chains: dict, operation_id: int):
    # This is the `callback` to be used for `tappable`
    # i.e. this is called when an operation is pausing
    db = get_db()
    # Prepare directories to store the workflow
    operation_dir = os.path.join(app.config["OPERATIONS"], f"{operation_id}")
    workflow_file = os.path.join(operation_dir, "workflow.json")
    if not os.path.isdir(operation_dir):
        os.makedirs(operation_dir, exist_ok=True)
    # Store the remaining workflow chain, serialized into json
    with open(workflow_file, "w") as f:
        json.dump(chains, f)
    # Store the result from the last task and the workflow json path
    db.execute(
        """
        UPDATE operations
        SET completion = ?,
            workflow_store = ?,
            result = ?
        WHERE id = ?
        """,
        ("PAUSED", workflow_file, f"{retval}", operation_id),
    )
    db.commit()
And here's the task to be called when the task is being paused. Remember, this should take the last executed task's return value and the remaining list of signatures (in order, from left to right). There's an extra param - operation_id - once again. The explanation for this is the same as the one for clause.
This function stores the remaining chain in a json file (since it's a list of dicts). Remember, you can use a different serializer - I'm using json since it's the default task serializer used by celery.
After storing the remaining chain, it updates the completion status to PAUSED and also logs the path to the json file into the db.
Now, let's see these in action-
An example of starting the workflow
def start_operation(user_id, *operation_args, **operation_kwargs):
    db = get_db()
    operation_id: int = db.execute(
        "INSERT INTO operations (requester_id, completion) VALUES (?, ?)",
        (user_id, "IN PROGRESS"),
    ).lastrowid
    # Convert a regular workflow chain to a tappable one
    tappable_workflow = tappable(
        (T1.s() | T2.s() | T3.s() | T4.s() | T5.s(operation_id)),
        should_pause.s(operation_id),
        save_state.s(operation_id),
    )
    # Start the chain (i.e. send task to celery to run asynchronously)
    tappable_workflow(*operation_args, **operation_kwargs)
    db.commit()
    return operation_id
A function that takes in a user id and starts an operation workflow. This is more or less an impractical dummy function modeled around a view/route controller. But I think it gets the general idea through.
Assume T[1-4] are all unit tasks of the operation, each taking the previous task's return as an argument. Just an example of a regular celery chain, feel free to go wild with your chains.
T5 is a task that saves the final result (result from T4) to the database. So along with the return value from T4 it needs the operation_id. Which is passed into the signature.
An example of pausing the workflow
def pause(operation_id):
    db = get_db()
    operation = db.execute(
        "SELECT * FROM operations WHERE id = ?", (operation_id,)
    ).fetchone()
    if operation and operation["completion"] == "IN PROGRESS":
        # Pause only if the operation is in progress
        db.execute(
            """
            UPDATE operations
            SET completion = ?
            WHERE id = ?
            """,
            ("REQUESTING PAUSE", operation_id),
        )
        db.commit()
        return 'success'
    return 'invalid id'
This employs the previously mentioned concept of modifying the db entry to change completion to REQUESTING PAUSE. Once this is committed, the next time pause_or_continue calls should_pause, it'll know that the user has requested the operation to pause and it'll do so accordingly.
An example of resuming the workflow
def resume(operation_id):
    db = get_db()
    operation = db.execute(
        "SELECT * FROM operations WHERE id = ?", (operation_id,)
    ).fetchone()
    if operation and operation["completion"] == "PAUSED":
        # Resume only if the operation is paused
        with open(operation["workflow_store"]) as f:
            # Load the remaining workflow from the json
            workflow_json = json.load(f)
        # Load the chain from the json (i.e. deserialize)
        workflow_chain = chain(signature(x) for x in workflow_json)
        # Start the chain and feed in the last executed task result
        workflow_chain(operation["result"])
        db.execute(
            """
            UPDATE operations
            SET completion = ?
            WHERE id = ?
            """,
            ("IN PROGRESS", operation_id),
        )
        db.commit()
        return 'success'
    return 'invalid id'
Recall that when the operation is paused, the remaining workflow is stored as json. Since we are currently restricting the workflow to a chain object, we know this json is a list of signatures that should be turned into a chain. So we deserialize it accordingly and send it to the celery worker.
Note that, this remaining workflow still has the pause_or_continue tasks as they were originally - so this workflow itself, is once again pause-able/resume-able. When it pauses, the workflow.json will simply be updated.

python - multiprocessing - static tree traversal - performance gain?

I have a node tree where every node has an id (node number), a list of children and a depth indicator. I am then given a list of nodes whose depth I am to find. To do this I use a recursive function.
This is all fine and dandy, but I want to speed the process up. I've been looking into multiprocessing, but every time I try it, the calculation time goes up (the higher the process count, the longer the runtime) compared to using no other processes at all.
My code looks like junk from trying to understand a lot of different examples, so I'll post this pseudocode instead.
class Node:
    id = int
    children = int[]
    depth = int

function makeNodeTree() ...

function find(x, node):
    for c in node.children:
        if c.id == x: return c
        else:
            if find(x, c) != None: return result
    return None

function main():
    search = [nodeid, nodeid, nodeid...]

    timerstart
    for x in search: find(x, rootNode)
    timerstop

    timerstart
    <split list over number of processes>
    <do some multiprocess magic>
    <get results>
    timerstop

    compare the two
I've tried all kinds of tree sizes to see if there is any gain at all, but I have yet to find such a case, which leads me to think I'm doing something wrong. I guess what I'm asking for is an example/way of doing this traversal with a performance gain, using multiprocessing.
I know there are plenty of ways to organize nodes to make this task easy, but I want to check the possible(?) performance boost, if it is possible at all.
Multiprocessing has overhead because every time you add a process it takes time to set it up. Also, if you are using standard Python threads you are unlikely to get any speedup because all the threads will still run on one processor. So three thoughts: (1) is your tree really so big that you need to speed it up? (2) spawn subprocesses; (3) don't use parallelism at each node, just at the top few levels, to minimize overhead.
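To illustrate point (3), here is a minimal sketch (not the asker's code; the tree and names are made up) that parallelizes only over the root's subtrees and searches each subtree sequentially:

from multiprocessing import Pool

class Node:
    def __init__(self, id, children=None):
        self.id = id
        self.children = children or []

def find(x, node):
    # plain sequential search below this node
    for c in node.children:
        if c.id == x:
            return c
        hit = find(x, c)
        if hit is not None:
            return hit
    return None

def find_in_subtree(args):
    x, subtree = args
    if subtree.id == x:
        return subtree.id
    hit = find(x, subtree)
    return hit.id if hit else None

if __name__ == '__main__':
    root = Node(0, [Node(1, [Node(3), Node(4)]), Node(2, [Node(5)])])
    x = 5
    with Pool(processes=2) as pool:
        # one task per top-level subtree; everything below runs sequentially
        results = pool.map(find_in_subtree, [(x, child) for child in root.children])
    print([r for r in results if r is not None])   # -> [5]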

Avoiding race conditions in Python 3's multiprocessing Queues

I'm trying to find the maximum weight of about 6.1 billion (custom) items and I would like to do this with parallel processing. For my particular application there are better algorithms that don't require my iterating over 6.1 billion items, but the textbook that explains them is over my head and my boss wants this done in 4 days. I figured I have a better shot with my company's fancy server and parallel processing. However, everything I know about parallel processing comes from reading the Python documentation. Which is to say I'm pretty lost...
My current theory is to set up a feeder process, an input queue, a whole bunch (say, 30) of worker processes, and an output queue (finding the maximum element in the output queue will be trivial). What I don't understand is how the feeder process can tell the worker processes when to stop waiting for items to come through the input queue.
I had thought about using multiprocessing.Pool.map_async on my iterable of 6.1E9 items, but it takes nearly 10 minutes just to iterate through the items without doing anything to them. Unless I'm misunderstanding something..., having map_async iterate through them to assign them to processes could be done while the processes begin their work. (Pool also provides imap but the documentation says it's similar to map, which doesn't appear to work asynchronously. I want asynchronous, right?)
Related questions: Do I want to use concurrent.futures instead of multiprocessing? I couldn't be the first person to implement a two-queue system (that's exactly how the lines at every deli in America work...) so is there a more Pythonic/built-in way to do this?
Here's a skeleton of what I'm trying to do. See the comment block in the middle.
import multiprocessing as mp
import queue


def faucet(items, bathtub):
    """Fill bathtub, a process-safe queue, with 6.1e9 items"""
    for item in items:
        bathtub.put(item)
    bathtub.close()


def drain_filter(bathtub, drain):
    """Put maximal item from bathtub into drain.

    Bathtub and drain are process-safe queues.
    """
    max_weight = 0
    max_item = None
    while True:
        try:
            current_item = bathtub.get()
            # The following three lines are the ones that I can't
            # quite figure out how to trigger without a race condition.
            # What I would love is to trigger them AFTER faucet calls
            # bathtub.close and the bathtub queue is empty.
        except queue.Empty:
            drain.put((max_weight, max_item))
            return
        else:
            bathtub.task_done()
            if not current_item.is_relevant():
                continue
            current_weight = current_item.weight
            if current_weight > max_weight:
                max_weight = current_weight
                max_item = current_item


def parallel_max(items, nprocs=30):
    """The elements of items should have a method `is_relevant`
    and an attribute `weight`. `items` itself is an immutable
    iterator object.
    """
    bathtub_q = mp.JoinableQueue()
    drain_q = mp.Queue()
    faucet_proc = mp.Process(target=faucet, args=(items, bathtub_q))
    worker_procs = mp.Pool(processes=nprocs)
    faucet_proc.start()
    worker_procs.apply_async(drain_filter, bathtub_q, drain_q)
    finalists = []
    for i in range(nprocs):
        finalists.append(drain_q.get())
    return max(finalists)
HERE'S THE ANSWER
I found a very thorough answer to my question, and a gentle introduction to multitasking, from Python Foundation communications director Doug Hellmann. What I wanted was the "poison pill" pattern. Check it out here: http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html
Props to @MRAB for posting the kernel of that concept.
You could put a special terminating item, such as None, into the queue. When a worker sees it, it can put it back for the other workers to see, and then terminate. Alternatively, you could put one special terminating item per worker into the queue.
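A minimal sketch of that idea with one terminating None per worker (all names and numbers are illustrative; the real weight comparison would replace the max() here):

import multiprocessing as mp

def worker(in_q, out_q):
    best = 0
    while True:
        item = in_q.get()
        if item is None:          # poison pill: no more work will arrive
            out_q.put(best)
            return
        best = max(best, item)    # stand-in for the real weight comparison

if __name__ == '__main__':
    nprocs = 4
    in_q, out_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(nprocs)]
    for p in procs:
        p.start()
    for item in range(1000):      # stand-in for the 6.1e9 items
        in_q.put(item)
    for _ in range(nprocs):       # one pill per worker
        in_q.put(None)
    print(max(out_q.get() for _ in range(nprocs)))   # -> 999
    for p in procs:
        p.join()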
