I'm using Celery 3.1.9 with a Redis backend. The job that I'm running is made of several subtasks which run in chords and chains. The structure looks like this:
1. prepare
2. download data (a chord of 2 workers)
3. parse and store downloaded data
4. long-running chord of 4 workers
5. finalize
6. generate report
Each item in the list is a subtask; they are all chained together. Steps 2 and 4 are chords. The whole thing is wired up by creating a chord for step 4 whose callback is a chain of 5 -> 6, then a chord is created for step 2, whose callback is 3 -> the first chord. Finally, a chain is created: 1 -> the second chord. This chain is then started with delay() and its ID is stored in the database.
The problem is two-fold. First, I want to be able to revoke the whole thing, and second, I want to have a custom on_failure on my Task class that does some cleanup and reports the failure to the user.
Currently I store the chain's task ID. I thought I could use this to revoke the chain. Also, in case of an error I wanted to walk the chain to its root (in the on_failure handler) to retrieve the relevant record from the database. This doesn't work, because when you re-create an instance of AsyncResult with just the ID of the task, its parent attribute is None.
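For reference, the kind of base class I have in mind is roughly this (just a sketch; find_job_for_task and mark_failed stand in for my own database code):

from celery import Task

class JobTask(Task):
    abstract = True  # base class, not registered as a task itself (Celery 3.x style)

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        job = find_job_for_task(task_id)  # placeholder DB lookup
        if job is not None:
            job.mark_failed(str(exc))     # placeholder cleanup + user notification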
The second thing I tried was to store the result of serializable() called on the outer chain's result. This, however, does not return the entire tree of AsyncResult objects; it just returns the IDs of the first level in the chain (so not the IDs of the children in the chords).
The third thing I tried was to implement serializable() myself, but as it turns out, the reason why the original method doesn't go further than 2 levels is because the chain's children are celery.canvas.chord objects, instead of AsyncResult instances.
An illustration of the problem:
c = chord([
    foo.si(),
    foo.si(),
    foo.si(),
], bar.si() | bar.si())

res = c.apply_async()
pprint(res.serializable())
Prints the following:
(('50c9eb94-7a63-49dc-b491-6fce5fed3713',
  ('d95a82b7-c107-4a2c-81eb-296dc3fb88c3',
   [(('7c72310b-afc7-4010-9de4-e64cd9d30281', None), None),
    (('2cb80041-ff29-45fe-b40c-2781b17e59dd', None), None),
    (('e85ab83d-dd44-44b5-b79a-2bbf83c4332f', None), None)])),
 None)
The first ID is the ID of the callback chain, the second ID is from the chord task itself, and the last three are the actual tasks inside the chord. But I can't get at the results from the tasks inside the callback chain (i.e. the IDs of the two bar.si() calls).
Is there any way to get at the actual task IDs?
One hacky way is to call the tasks with apply_async, save the task IDs, and wait for them manually. This gives you complete control over what happens, but you should only wait for async tasks as a last resort. You can then access each task's ID, return value, etc. For example, something like this:
task1 = a_task.apply_async()
task2 = b_task.apply_async()
task3 = c_task.apply_async()

tasks = [task1, task2, task3]
for task in tasks:
    task.wait()
I have a nested dag which is a mixture of groups and chains. The following recursive method works well to get the task_ids and their result:
import celery
def get_task_id_result_tuple_list(run_dag, with_result=True):
    task_id_result_list = []
    # for groups, parents are first task, then iterate over the children
    if isinstance(run_dag, celery.result.GroupResult):
        entry = (run_dag.parent, run_dag.parent.result) if with_result else run_dag.parent
        task_id_result_list.append(entry)
        children = run_dag.children
        for child in children:
            task_id_result_list.extend(get_task_id_result_tuple_list(child, with_result))
    # for AsyncResults, append parents in reverse
    elif isinstance(run_dag, celery.result.AsyncResult):
        ch = run_dag
        ch_list = [(ch, ch.result)] if with_result else [ch]
        while ch.parent is not None:
            ch = ch.parent
            entry = (ch, ch.result) if with_result else ch
            ch_list.append(entry)
        # remember to reverse the list to get the calling order
        task_id_result_list.extend(reversed(ch_list))
    return task_id_result_list
# dag is the nested celery structure of chains and groups
run_dag = dag.apply_async()
task_id_result_tuples = get_task_id_result_tuple_list(run_dag)
task_id_only = get_task_id_result_tuple_list(run_dag, False)
NOTE: I have not tested this with chords yet but I imagine it would either work as is or would maybe need another conditional branch to handle that.
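For what it's worth, here is a rough, untested guess at what such a branch could look like: while walking an AsyncResult's parents, a chord callback's parent shows up as a GroupResult (at least in recent Celery versions), so recursing into it instead of reading .result on it might be enough.

import celery

def walk_with_chords(res, with_result=True):
    # untested sketch, not part of the tested code above
    collected = []
    if isinstance(res, celery.result.GroupResult):
        # a chord header appears as a GroupResult; recurse into its children
        for child in res.children or []:
            collected.extend(walk_with_chords(child, with_result))
    elif isinstance(res, celery.result.AsyncResult):
        node = res
        branch = [(node, node.result)] if with_result else [node]
        while node.parent is not None:
            node = node.parent
            if isinstance(node, celery.result.GroupResult):
                # chord callback: its parent is the group of header tasks
                branch.extend(walk_with_chords(node, with_result))
                break
            branch.append((node, node.result) if with_result else node)
        collected.extend(reversed(branch))
    return collected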
Related
We have a distributed architecture based on RabbitMQ and Celery.
We can launch multiple tasks in parallel without any issue, and the scalability is good.
Now we need to control the tasks remotely: PAUSE, RESUME, CANCEL.
The only solution we found is to have the Celery task make an RPC call to another task, which replies with the command after a DB request. The Celery task and the RPC task are not on the same machine, and only the RPC task has access to the DB.
Do you have any advice on how to improve this and easily communicate with an ongoing task?
Thank you
EDIT:
In fact we would like to do something like in the picture below. It's easy to do the Blue configuration or the Orange, but we don't know how to do both simultaneously.
Workers are subscribing to a common Jobs queue and each worker has its own Admin queue declared on an exchange.
EDIT:
If this is not possible with Celery, I'm open to a solution with other frameworks like python-rq.
It looks like the Control Bus pattern.
For better scalability, and in order to reduce the number of RPC calls, I recommend reversing the logic. The PAUSE, RESUME and CANCEL commands are pushed to the Celery tasks through a control bus when the state change occurs. The Celery app stores the current state in a store (which could be in memory, on the filesystem, ...). If task states must be kept even after a stop/start of the app, it will involve more work to keep both apps synchronized (e.g. synchronization at startup).
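A minimal sketch of that reversed logic, assuming Redis as the shared store (the key names, polling interval and checkpoint placement here are arbitrary choices, not a prescribed API):

import time
import redis
from celery import shared_task

store = redis.Redis(host="localhost", port=6379)

def set_command(job_id, command):
    # called from the controller side: PAUSE / RESUME / CANCEL
    store.set("job-command-%s" % job_id, command)

@shared_task(bind=True)
def long_running(self, job_id):
    for step in range(1000):
        command = (store.get("job-command-%s" % job_id) or b"").decode()
        if command == "CANCEL":
            return "cancelled"
        while command == "PAUSE":
            # busy-wait at a checkpoint; a real app might re-enqueue instead
            time.sleep(5)
            command = (store.get("job-command-%s" % job_id) or b"").decode()
            if command == "CANCEL":
                return "cancelled"
        do_one_step(step)  # placeholder for the actual work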
I'd like to demonstrate a general approach to implementing pause-able (and resume-able) ongoing celery tasks through the workflow pattern. Note: Original answer written here. Re-writing here due to this post being very relevant.
Concept
With celery workflows - you can design your entire operation to be divided into a chain of tasks. It doesn't necessarily have to be purely a chain, but it should follow the general concept of one task after another task (or task group) finishes.
Once you have a workflow like that, you can finally define points to pause at throughout your workflow. At each of these points, you can check whether or not the frontend user has requested the operation to pause, and continue accordingly. The concept is this:
A complex and time consuming operation O is split into 5 celery tasks - T1, T2, T3, T4, and T5 - each of these tasks (except the first one) depends on the return value of the previous task.
Let's assume we define points to pause after every single task, so the workflow looks like-
T1 executes
T1 completes, check if user has requested pause
If user has not requested pause - continue
If user has requested pause, serialize the remaining workflow chain and store it somewhere to continue later
... and so on. Since there's a pause point after each task, that check is performed after every one of them (except the last one of course).
But this is only theory, I struggled to find an implementation of this anywhere online so here's what I came up with-
Implementation
from typing import Any, Optional
from celery import shared_task
from celery.canvas import Signature, chain, signature
@shared_task(bind=True)
def pause_or_continue(
    self, retval: Optional[Any] = None, clause: dict = None, callback: dict = None
):
    # Task to use for deciding whether to pause the operation chain
    if signature(clause)(retval):
        # Pause requested, call given callback with retval and remaining chain
        # chain should be reversed as the order of execution follows from end to start
        signature(callback)(retval, self.request.chain[::-1])
        self.request.chain = None
    else:
        # Continue to the next task in chain
        return retval

def tappable(ch: chain, clause: Signature, callback: Signature, nth: Optional[int] = 1):
    '''
    Make an operation workflow chain pause-able/resume-able by inserting
    the pause_or_continue task after every nth task in the given chain

    ch: chain
        The workflow chain

    clause: Signature
        Signature of a task that takes one argument - the return value of the
        last executed task in the workflow (if any - otherwise `None` is passed)
        - and returns a boolean, indicating whether or not the operation should pause
        Should return True if the operation should pause, False to continue normally

    callback: Signature
        Signature of a task that takes 2 arguments - the return value of the
        last executed task in the workflow (if any - otherwise `None` is passed) and
        the remaining chain of the operation workflow as a json dict object
        No return value is expected
        This task will be called when `clause` returns `True` (i.e. the task is pausing)
        The return value and the remaining chain can be handled accordingly by this task

    nth: Int
        Check `clause` after every nth task in the chain
        Default value is 1, i.e. check `clause` after every task
        Hence, by default, the user given `clause` is called and checked
        after every task

    NOTE: The passed in chain is mutated in place
    Returns the mutated chain
    '''
    newch = []
    for n, sig in enumerate(ch.tasks):
        if n != 0 and n % nth == nth - 1:
            newch.append(pause_or_continue.s(clause=clause, callback=callback))
        newch.append(sig)
    ch.tasks = tuple(newch)
    return ch
Explanation - pause_or_continue
Here pause_or_continue is the aforementioned pause point. It's a task that will be called at specific intervals (intervals as in task intervals, not as in time intervals). This task then calls a user provided function (actually a task) - clause - to check whether or not the task should continue.
If the clause function (actually a task) returns True, the user provided callback function is called, the latest return value (if any - None otherwise) is passed onto this callback, as well as the remaining chain of tasks. The callback does what it needs to do and pause_or_continue sets self.request.chain to None, which tells celery "The task chain is now empty - everything is finished".
If the clause function (actually a task) returns False, the return value from the previous task (if any - None otherwise) is returned back for the next task to receive - and the chain goes on. Hence the workflow continues.
Why are clause and callback task signatures and not regular functions?
Both clause and callback are called directly - without delay or apply_async. They are executed in the current process, in the current context. So they behave exactly like normal functions - then why use signatures?
The answer is serialization. You can't conveniently pass a regular function object to a celery task. But you can pass a task signature. That's exactly what I'm doing here. Both clause and callback should be a regular signature object of a celery task.
What is self.request.chain?
self.request.chain stores a list of dicts (representing jsons as the celery task serializer is json by default) - each of them representing a task signature. Each task from this list is executed in reverse order. Which is why, the list is reversed before passing to the user provided callback function (actually a task) - the user probably expects the order of tasks to be left to right.
Quick note: Irrelevant to this discussion, but if you're using the link parameter of apply_async to construct a chain instead of the chain primitive itself, self.request.callback is the property to be modified (i.e. set to None to remove the callback and stop the chain) instead of self.request.chain.
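For illustration only (this snippet is not part of the implementation above; saved_chain_list and last_return_value are hypothetical names): since each element is a signature dict, the list handed to the callback can later be turned back into a runnable chain, which is exactly what the resume example further below does.

from celery.canvas import chain, signature

# `saved_chain_list` is whatever the callback received (already left-to-right)
remaining = chain(signature(d) for d in saved_chain_list)
remaining(last_return_value)   # feed in the last executed task's result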
Explanation - tappable
tappable is just a basic function that takes a chain (which is the only workflow primitive covered here, for brevity) and inserts pause_or_continue after every nth task. You can insert them wherever you want really; it is up to you to define the pause points in your operation. This is just an example!
For each chain object, the actual signatures of the tasks (in order, going from left to right) are stored in the .tasks property. It's a tuple of task signatures. So all we have to do is take this tuple, convert it into a list, insert the pause points, and convert it back to a tuple to assign to the chain. Then return the modified chain object.
The clause and callback are also attached to the pause_or_continue signature. Normal celery stuff.
That covers the primary concept, but to showcase a real project using this pattern (and also to showcase the resuming part of a paused task), here's a small demo of all the necessary resources
Usage
This example usage assumes the concept of a basic web server with a database. Whenever an operation (i.e workflow chain) is started, it's assigned an id and stored into the database. The schema of that table looks like-
-- Create operations table
-- Keeps track of operations and the users that started them
CREATE TABLE operations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    requester_id INTEGER NOT NULL,
    completion TEXT NOT NULL,
    workflow_store TEXT,
    result TEXT,
    FOREIGN KEY (requester_id) REFERENCES user (id)
);
The only field that needs to be known about right now, is completion. It just stores the status of the operation-
When the operation starts and a db entry is created, this is set to IN PROGRESS
When a user requests pause, the route controller (i.e view) modifies this to REQUESTING PAUSE
When the operation actually gets paused and callback (from tappable, inside pause_or_continue) is called, the callback should modify this to PAUSED
When the task is completed, this should be modified to COMPLETED
An example of clause
@celery.task()
def should_pause(_, operation_id: int):
    # This is the `clause` to be used for `tappable`
    # i.e it lets celery know whether to pause or continue
    db = get_db()

    # Check the database to see if the user has requested pause on the operation
    operation = db.execute(
        "SELECT * FROM operations WHERE id = ?", (operation_id,)
    ).fetchone()
    return operation["completion"] == "REQUESTING PAUSE"
This is the task to call at the pause points, to determine whether or not to pause. It's a function that takes 2 parameters... well, sort of. The first one is mandatory: tappable requires the clause to have one (and exactly one) argument, so it can pass the previous task's return value to it (even if that return value is None). In this example, the return value isn't required to be used, so we can just ignore it.
The second parameter is an operation id. See, all this clause does is check the database for the operation (the workflow) entry and see if it has the status REQUESTING PAUSE. To do that, it needs to know the operation id. But clause should be a task with one argument, what gives?
Well, good thing signatures can be partial. When the task is first started and a tappable chain is created, the operation id is known, and hence we can do should_pause.s(operation_id) to get the signature of a task that takes one parameter, that being the return value of the previous task. That qualifies as a clause!
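In other words (a tiny sketch):

# partially bind the known operation_id; when pause_or_continue later does
# signature(clause)(retval), retval is prepended, so this ends up being
# invoked as should_pause(retval, operation_id)
clause = should_pause.s(operation_id)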
An example of callback
import os
import json
from typing import Any, List
@celery.task()
def save_state(retval: Any, chains: dict, operation_id: int):
    # This is the `callback` to be used for `tappable`
    # i.e this is called when an operation is pausing
    db = get_db()

    # Prepare directories to store the workflow
    operation_dir = os.path.join(app.config["OPERATIONS"], f"{operation_id}")
    workflow_file = os.path.join(operation_dir, "workflow.json")
    if not os.path.isdir(operation_dir):
        os.makedirs(operation_dir, exist_ok=True)

    # Store the remaining workflow chain, serialized into json
    with open(workflow_file, "w") as f:
        json.dump(chains, f)

    # Store the result from the last task and the workflow json path
    db.execute(
        """
        UPDATE operations
        SET completion = ?,
            workflow_store = ?,
            result = ?
        WHERE id = ?
        """,
        ("PAUSED", workflow_file, f"{retval}", operation_id),
    )
    db.commit()
And here's the task to be called when the task is being paused. Remember, this should take the last executed task's return value and the remaining list of signatures (in order, from left to right). There's an extra param - operation_id - once again. The explanation for this is the same as the one for clause.
This function stores the remaining chain in a json file (since it's a list of dicts). Remember, you can use a different serializer - I'm using json since it's the default task serializer used by celery.
After storing the remaining chain, it updates the completion status to PAUSED and also logs the path to the json file into the db.
Now, let's see these in action-
An example of starting the workflow
def start_operation(user_id, *operation_args, **operation_kwargs):
    db = get_db()
    operation_id: int = db.execute(
        "INSERT INTO operations (requester_id, completion) VALUES (?, ?)",
        (user_id, "IN PROGRESS"),
    ).lastrowid

    # Convert a regular workflow chain to a tappable one
    tappable_workflow = tappable(
        (T1.s() | T2.s() | T3.s() | T4.s() | T5.s(operation_id)),
        should_pause.s(operation_id),
        save_state.s(operation_id),
    )

    # Start the chain (i.e send task to celery to run asynchronously)
    tappable_workflow(*operation_args, **operation_kwargs)
    db.commit()
    return operation_id
A function that takes in a user id and starts an operation workflow. This is more or less an impractical dummy function modeled around a view/route controller. But I think it gets the general idea through.
Assume T[1-4] are all unit tasks of the operation, each taking the previous task's return as an argument. Just an example of a regular celery chain, feel free to go wild with your chains.
T5 is a task that saves the final result (result from T4) to the database. So along with the return value from T4 it needs the operation_id. Which is passed into the signature.
An example of pausing the workflow
def pause(operation_id):
    db = get_db()
    operation = db.execute(
        "SELECT * FROM operations WHERE id = ?", (operation_id,)
    ).fetchone()
    if operation and operation["completion"] == "IN PROGRESS":
        # Pause only if the operation is in progress
        db.execute(
            """
            UPDATE operations
            SET completion = ?
            WHERE id = ?
            """,
            ("REQUESTING PAUSE", operation_id),
        )
        db.commit()
        return 'success'
    return 'invalid id'
This employs the previously mentioned concept of modifying the db entry to change completion to REQUESTING PAUSE. Once this is committed, the next time pause_or_continue calls should_pause, it'll know that the user has requested the operation to pause and it'll do so accordingly.
An example of resuming the workflow
def resume(operation_id):
    db = get_db()
    operation = db.execute(
        "SELECT * FROM operations WHERE id = ?", (operation_id,)
    ).fetchone()
    if operation and operation["completion"] == "PAUSED":
        # Resume only if the operation is paused
        with open(operation["workflow_store"]) as f:
            # Load the remaining workflow from the json
            workflow_json = json.load(f)

        # Load the chain from the json (i.e deserialize)
        workflow_chain = chain(signature(x) for x in workflow_json)

        # Start the chain and feed in the last executed task result
        workflow_chain(operation["result"])

        db.execute(
            """
            UPDATE operations
            SET completion = ?
            WHERE id = ?
            """,
            ("IN PROGRESS", operation_id),
        )
        db.commit()
        return 'success'
    return 'invalid id'
Recall that when the operation is paused, the remaining workflow is stored as json. Since we are currently restricting the workflow to a chain object, we know this json is a list of signatures that should be turned back into a chain. So we deserialize it accordingly and send it to the celery worker.
Note that, this remaining workflow still has the pause_or_continue tasks as they were originally - so this workflow itself, is once again pause-able/resume-able. When it pauses, the workflow.json will simply be updated.
I want to pass the result of one task to another task. I am using chain
som = chain (task_async_get_me_friends.s((userId), parse_friends.s()))()
q = som.get()
print q
My intention is to create 2 tasks. The first task gets the friends of the user and then passes these friends as a JSON object to the parse_friends task. I am getting the result from task_async_get_me_friends but cannot get parse_friends to be called.
@celery.task
def task_async_get_me_friends(userId, *args):
    logger.info('First do something')
    users_friends = fb_get_friends(userId)
    # Till here everything is all good, I did see the celery logger. Getting result from fb
    return {'result': 'success', 'data': users_friends}

@celery.task
def parse_friends(users_friends, *args, **kwargs):
    # This log line i cannot see in the celery
    logger.info('Second do something' + str(users_friends))
    # Do something with users_friends
EDIT: realized I had misunderstood which function did which
I'm still getting up to speed on celery, but I don't think your chain is doing what you want. Specifically, chain takes a sequence of tasks; you're only providing 1 task (which happens to take a 2nd task signature as one of its arguments). I think what you want is:
som = chain(task_async_get_me_friends.s(userId), parse_friends.s())
This should call task_async_get_me_friends (with userId) first and, when its result is returned, pass that result to parse_friends (whose signature is "waiting" for chain to provide that first argument, the json result).
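A quick way to verify (sketch):

from celery import chain

# run the corrected chain; res is the AsyncResult of the last task in the chain
res = chain(task_async_get_me_friends.s(userId), parse_friends.s()).apply_async()
print(res.get())   # parse_friends has no return statement, so this prints None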
I need to realize the following scenario:
Execute task A
Execute multiple task B in parallel with different arguments
Wait for all tasks to finish
Execute multiple task B in parallel with different arguments
Wait for all tasks to finish
Execute task C
I have achieved this by implementing a chain of chords; here is the simplified code:
# inside run() method of ATask
chord_chain = []
for taskB_group in taskB.groups.all():
    tasks = [BTask().si(id=taskB_model.id) for taskB_model in taskB_group.children.all()]
    if len(tasks):
        chord_chain.append(chord(tasks, _dummy_callback.s()))
chord_chain.append(CTask().si(execution_id))
chain(chord_chain)()
The problem is that I need the ability to call revoke(terminate=True) on all BTasks at any point in time. The lower-level problem is that I can't get at the BTasks' celery ids.
I tried to get the BTask ids via the chain: result = chain(chord_chain)(). But I couldn't find that information in the returned AsyncResult object. Is it possible to get the chain children ids from this object? (result.children is None)
I tried to get the BTask ids via ATask's AsyncResult, but it seems that the children property only contains the results of the first chord and not the rest of the tasks.
>>> r=AsyncResult(#ATask.id#)
>>> r.children
[<GroupResult: 5599ae69-4de0-45c0-afbe-b0e573631abc [#BTask.id#, #BTask.id#]>,
<AsyncResult: #chord_unlock.id#>]
Solved by flagging the ATask-related model with an aborted status flag and adding a check at the start of each BTask.
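Roughly what that check looks like (a sketch with made-up model and field names, written as a plain decorated task for brevity):

@app.task
def b_task(id=None):
    execution = Execution.objects.get(pk=id)   # hypothetical Django model behind BTask
    if execution.parent.aborted:               # flag set on the ATask-related model
        return None                            # skip the work: effectively "revoked"
    # ... the real BTask work would go here ...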
Problem
I've segmented a long-running task into logical subtasks, so I can report the results of each subtask as it completes. However, I'm trying to report the results of a task that will effectively never complete (instead yielding values as it goes), and am struggling to do so with my existing solution.
Background
I'm building a web interface to some Python programs I've written. Users can submit jobs through web forms, then check back to see the job's progress.
Let's say I have two functions, each accessed via separate forms:
med_func: Takes ~1 minute to execute, results are passed off to render(), which produces additional data.
long_func: Returns a generator. Each yield takes on the order of 30 minutes, and should be reported to the user. There are so many yields, we can consider this iterator as infinite (terminating only when revoked).
Code, current implementation
With med_func, I report results as follows:
On form submission, I save an AsyncResult to a Django session:
task_result = med_func.apply_async([form], link=render.s())
request.session["task_result"] = task_result
The Django view for the results page accesses this AsyncResult. When a task has completed, results are saved into an object that is passed as context to a Django template.
def results(request):
    """ Serve (possibly incomplete) results of a session's latest run. """
    session = request.session

    try:  # Load most recent task
        task_result = session["task_result"]
    except KeyError:  # Already cleared, or doesn't exist
        if "results" not in session:
            session["status"] = "No job submitted"
    else:  # Extract data from Asynchronous Tasks
        session["status"] = task_result.status
        if task_result.ready():
            session["results"] = task_result.get()
            render_task = task_result.children[0]
            # Decorate with rendering results
            session["render_status"] = render_task.status
            if render_task.ready():
                session["results"].render_output = render_task.get()
                del(request.session["task_result"])  # Don't need any more

    return render_to_response('results.html', request.session)
This solution only works when the function actually terminates. I can't chain together logical subtasks of long_func, because there are an unknown number of yields (each iteration of long_func's loop may not produce a result).
Question
Is there any sensible way to access yielded objects from an extremely long-running Celery task, so that they can be displayed before the generator is exhausted?
In order for Celery to know what the current state of the task is, it sets some metadata in whatever result backend you have. You can piggy-back on that to store other kinds of metadata.
def yielder():
    for i in range(2**100):
        yield i

@task
def report_progress():
    for progress in yielder():
        # set current progress on the task
        report_progress.backend.mark_as_started(
            report_progress.request.id,
            progress=progress)

def view_function(request):
    task_id = request.session['task_id']
    task = AsyncResult(task_id)
    progress = task.info['progress']
    # do something with your current progress
I wouldn't throw a ton of data in there, but it works well for tracking the progress of a long-running task.
Paul's answer is great. As an alternative to using mark_as_started you can use Task's update_state method. They ultimately do the same thing, but the name "update_state" is a little more appropriate for what you're trying to do. You can optionally define a custom state that indicates your task is in progress (I've named my custom state 'PROGRESS'):
def yielder():
    for i in range(2**100):
        yield i

@task
def report_progress():
    for progress in yielder():
        # set current progress on the task
        report_progress.update_state(state='PROGRESS', meta={'progress': progress})

def view_function(request):
    task_id = request.session['task_id']
    task = AsyncResult(task_id)
    progress = task.info['progress']
    # do something with your current progress
Celery part:
def long_func(*args, **kwargs):
    i = 0
    while True:
        yield i
        do_something_here(*args, **kwargs)
        i += 1

@task()
def test_yield_task(task_id=None, **kwargs):
    the_progress = 0
    for the_progress in long_func(**kwargs):
        cache.set('celery-task-%s' % task_id, the_progress)
Webclient side, starting task:
r = test_yield_task.apply_async()
request.session['task_id'] = r.task_id
Testing last yielded value:
v = cache.get('celery-task-%s' % session.get('task_id'))
if v:
    do_something()
If you do not want to use the cache, or it's impossible, you can use a db, a file, or any other place to which both the celery worker and the server side have access. The cache is the simplest solution, but the workers and the server have to use the same cache.
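For example (a sketch of the Django side; the backend class and location are placeholders and depend on your Django version and setup):

# settings.py - both the Django process and the celery workers must load this,
# so cache.set()/cache.get() talk to the same memcached instance
CACHES = {
    "default": {
        # use MemcachedCache instead of PyMemcacheCache on older Django versions
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
        "LOCATION": "127.0.0.1:11211",
    }
}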
A couple options to consider:
1 -- task groups. If you can enumerate all the sub tasks from the time of invocation, you can apply the group as a whole -- that returns a TaskSetResult object you can use to monitor the results of the group as a whole, or of individual tasks in the group -- query this as-needed when you need to check status.
2 -- callbacks. If you can't enumerate all sub tasks (or even if you can!) you can define a web hook / callback that's the last step in the task -- called when the rest of the task completes. The hook would be against a URI in your app that ingests the result and makes it available via DB or app-internal API (see the sketch after this list).
Some combination of these could solve your challenge.
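A sketch of option 2 (the endpoint URL, job_id wiring and task names are made up, not from the original setup):

import requests
from celery import shared_task

@shared_task
def report_back(result, job_id):
    # final linked task: posts the parent task's result to the app
    requests.post(
        "https://myapp.example/api/jobs/%s/result" % job_id,
        json={"result": result},
    )

# wiring: the linked callback receives the parent's return value as its first argument
# long_func.apply_async(args=[...], link=report_back.s(job_id))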
See also this great PyCon preso from one of the Instagram engineers.
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
At video mark 16:00, he discusses how they structure long lists of sub-tasks.
Introduction
I would like to implement a dynamic multiple timeline queue. The context here is scheduling in general.
What is a timeline queue?
This is still simple: It is a timeline of tasks, where each event has its start and end time. Tasks are grouped as jobs. This group of tasks needs to preserve its order, but can be moved around in time as a whole. For example it could be represented as:
--t1--    ---t2.1-----------t2.2-------
'    '    '                '          '
20   30   40               70        120
I would implement this as a heap queue with some additional constraints. The Python sched module has some basic approaches in this direction.
Definition multiple timeline queue
One queue stands for a resource and a resource is needed by a task. Graphical example:
R1  --t1.1-----            --t2.2-----              -----t1.3--
               \          /           \            /
R2              --t2.1---               --t1.2-----
Explaining "dynamic"
It becomes interesting when a task can use one of multiple resources. An additional constraint is that consecutive tasks that can run on the same resource must use the same resource.
Example: If (from above) task t1.3 can run on R1 or R2, the queue should look like:
R1  --t1.1-----            --t2.2-----
               \          /
R2              --t2.1---   ------t1.2----------t1.3--
Functionality (in priority order)
FirstFreeSlot(duration, start): Find the first free time slot beginning from start where there is free time for duration (see detailed explanation at the end).
Enqueue a job as early as possible on the multiple resources, respecting the constraints (mainly: correct order of tasks, consecutive tasks on the same resource) and using FirstFreeSlot.
Put a job at a specific time and move the tail backwards
Delete a job
Recalculate: After delete, test if some tasks can be executed earlier.
Key Question
The point is: How can I represent this information to provide the functionality efficiently? Implementation is up to me ;-)
Update: A further point to consider: The typical interval structures have the focus on "What is at point X?" But in this case the enqueue and therefore the question "Where is the first empty slot for duration D?" is much more important. So a segment/interval tree or something else in this direction is probably not the right choice.
To elaborate the point with the free slots further: because we have multiple resources and the constraint of grouped tasks, there can be free time slots on some resources. Simple example: t1.1 runs on R1 for 40 and then t1.2 runs on R2. So there is an empty interval of [0, 40] on R2 which can be filled by the next job.
Update 2: There is an interesting proposal in another SO question. If someone can port it to my problem and show that it is working for this case (especially elaborated to multiple resources), this would be probably a valid answer.
Let's restrict ourselves to the simplest case first: Find a suitable data structure that allows for a fast implementation of FirstFreeSlot().
The free time slots live in a two-dimensional space: One dimension is the start time s, the other is the length d. FirstFreeSlot(D) effectively answers the following query:
min s: d >= D
If we think of s and d as a cartesian space (d=x, s=y), this means finding the lowest point in a subplane bounded by a vertical line. A quad-tree, perhaps with some auxiliary information in each node (namely, min s over all leaves), will help answer this query efficiently.
For Enqueue() in the face of resource constraints, consider maintaining a separate quad-tree for each resource. The quad tree can also answer queries like
min s: s >= S & d >= D
(required for restricting the start time) in a similar fashion: Now a rectangle (open at the top left) is cut off, and we look for min s in that rectangle.
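As a plain reference for what those two queries mean (a naive linear scan over an explicit list of free slots, not the quad-tree itself):

# slots is a list of (start, duration) pairs describing the free gaps on one resource
def first_free_slot(slots, D, S=0):
    # min s such that d >= D and s >= S, or None if nothing fits
    candidates = [s for (s, d) in slots if d >= D and s >= S]
    return min(candidates) if candidates else None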
Put() and Delete() are simple update operations for the quad-tree.
Recalculate() can be implemented by Delete() + Put(). In order to save time for unnecessary operations, define sufficient (or, ideally, sufficient + necessary) conditions for triggering a recalculation. The Observer pattern might help here, but remember to put the tasks for rescheduling into a FIFO queue or a priority queue sorted by start time. (You want to finish rescheduling the current task before moving on to the next.)
On a more general note, I'm sure you are aware that most kind of scheduling problems, especially those with resource constraints, are NP-complete at least. So don't expect an algorithm with a decent runtime in the general case.
class Task:
    name = ''
    duration = 0
    resources = list()

class Job:
    name = ''
    tasks = list()

class Assignment:
    task = None
    resource = None
    time = None

class MultipleTimeline:
    assignments = list()

    def enqueue(self, job):
        pass

    def put(self, job):
        pass

    def delete(self, job):
        pass

    def recalculate(self):
        pass
Is this a first step in the direction you are looking for, i.e. a data model written out in Python?
Update:
Here is my more efficient model:
It basically puts all Tasks in a linked list ordered by end time.
class Task:
    name = ''
    duration = 0      # the amount of work to be done
    resources = 0     # bitmap that tells what resources this task uses
    # the following variables are only used when the task is scheduled
    next = None       # the next scheduled task by end time
    resource = None   # the resource this task is scheduled on
    gap = None        # the amount of time before the next scheduled task starts on this resource

class Job:
    id = 0
    tasks = list()    # the Task instances of this job in order

class Resource:
    bitflag = 0       # a bit flag which operates bitwisely with Task.resources
    firsttask = None  # the first Task instance that is scheduled on this resource
    gap = None        # the amount of time before the first Task starts

class MultipleTimeline:
    resources = list()

    def FirstFreeSlot(self):
        pass

    def enqueue(self, job):
        pass

    def put(self, job):
        pass

    def delete(self, job):
        pass

    def recalculate(self):
        pass
Because of the updates by enqueue and put I decided not to use trees.
Because put moves tasks in time, I decided not to use absolute times.
FirstFreeSlot not only returns the task with the free slot but also the other running tasks with their end times.
enqueue works as follows:
We look for a free slot with FirstFreeSlot and schedule the task there.
If there is enough space for the next task, we can schedule it in too.
If not: look at the other running tasks to see if they have free space.
If not: run FirstFreeSlot with the parameters of this time and the running tasks.
improvements:
If put is not used very often and enqueue is done from time zero, we could keep track of the overlapping tasks by including a dict() per task that contains the other running tasks. Then it is also easy to keep a list() per Resource which contains the scheduled tasks with absolute times for this Resource, ordered by end time. Only those tasks are included that have bigger time gaps than before. Now we can find a free slot more easily.
Questions:
Do Tasks scheduled by put need to be executed at exactly that time?
If yes: what if another task to be scheduled by put overlaps?
Do all resources execute a task equally fast?
After spending some time thinking this through, I think a segment tree might be more appropriate to model this timeline queue. The job concept is like a LIST data structure.
I assume the Task can be modeled like this (PSEUDO CODE). The sequence of the tasks in the job can be ensured by the start_time.
class Task:
    name = ''
    _seg_starttime = -1
    # this is the earliest time the Task can start in the segment tree;
    # in many cases this can be set to -1, which indicates it starts after its predecessor
    # (this is then determined by its predecessor in the segment tree).
    # if this is not equal to -1, it means this task is specified to start at that time.
    # whenever the predecessor changes, this info needs to be taken care of
    _job_starttime = 0
    # this is the earliest time the Task can start in the job sequence, constrained by the job definition
    _duration = 0
    # this is the time the Task costs to run

    def get_segstarttime():
        if _seg_starttime == -1:
            return PREDECESSOR_NODE.get_segstarttime() + _duration
        return _seg_starttime + _duration

    def get_jobstarttime():
        return PREVIOUS_JOB.get_endtime()

    def get_starttime():
        return max(get_segstarttime(), get_jobstarttime())
Enqueue: merely append a task node to the segment tree; notice that _seg_starttime is set to -1 to indicate it starts right after its predecessor.
Put: insert a segment into the tree; the segment is indicated by start_time and duration.
Delete: remove the segment from the tree; update its successor if necessary (say, if the deleted node did have a _seg_starttime present).
Recalculate: calling get_starttime() again will directly give the earliest start time.
Examples (without considering the job constraint):

t1.1( _segst = 10, du = 10 )
      \
       t2.2( _segst = -1, du = 10 )   meaning the st = 10+10 = 20
             \
              t1.3( _segst = -1, du = 10 )   meaning the st = 20+10 = 30

if we do a Put:

t1.1( _segst = 10, du = 10 )
      \
       t2.2( _segst = -1, du = 10 )   meaning the st = 20+10 = 30
            /          \
  t2.3( _segst = 20, du = 10 )    t1.3( _segst = -1, du = 10 )   meaning the st = 30+10 = 40

if we do a Delete of t1.1 on the original scenario:

t2.2( _segst = 20, du = 10 )
      \
       t1.3( _segst = -1, du = 10 )   meaning the st = 20+10 = 30
Each resource could be represented using one instance of this interval tree, e.g.:
from the segment tree (timeline) perspective:

  t1.1            t3.1
      \          /    \
       t2.2   t2.1     t1.2

from the job perspective:

t1.1 <- t1.2
t2.1 <- t2.2
t3.1
t2.1 and t2.2 are connected using a linked list; as stated, t2.2 gets its _seg_starttime from the segment tree and its _job_starttime from the linked list. Comparing the two times, the actual earliest time it could run can be derived.
I finally used just a simple list for my queue items and an in-memory SQLite database for storing the empty slots, because multidimensional querying and updating is very efficient with SQL. I only need to store the fields start, duration and index in a table.
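A minimal sketch of that idea (the column "index" is spelled idx here because INDEX is an SQL keyword; the query shape is an assumption about how FirstFreeSlot maps to SQL):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slots (start INTEGER, duration INTEGER, idx INTEGER)")

def first_free_slot(conn, min_duration, not_before=0):
    # first free slot of at least min_duration starting no earlier than not_before
    return conn.execute(
        "SELECT start, duration, idx FROM slots"
        " WHERE duration >= ? AND start >= ? ORDER BY start LIMIT 1",
        (min_duration, not_before),
    ).fetchone()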