Dask task-to-worker assignment - python

I run a computation graph with long-running functions (several hours) and large results (several hundred megabytes). This type of load may be atypical for Dask.
I am trying to run this graph on 4 workers, and I observe the task-to-worker assignment depicted in the attached screenshot.
In the first row, the "green" task depends only on a "blue" one, not the "violet" one. Why is the green task not moved to another worker?
Is it possible to give the scheduler a hint to always move a task to a free worker? What information would help debug this further, and how can I obtain it?
This assignment is suboptimal, and the graph computation takes more time than necessary.
A bit more information:
The computation graph is composed using dask.delayed.
The computation is invoked with the following code:
to_compute = [result_of_dask_delayed_1, ... , result_of_dask_delayed_n]
running = client.persist(to_compute)
results = as_completed(
    futures_of(running),
    with_results=True,
    raise_errors=not bool(cmdline_options.partial_fails),
)

First, we should state that scheduling tasks to workers is complicated - please read this description. Roughly, when a batch of tasks comes in, they are distributed to workers, accounting for dependencies.
Dask has no idea, before running, how long any given task will take or how big the data result it generates might be. Once a task is done, this information is available for the completed task (but not for the ones still waiting to run), and Dask then has to decide whether to steal a task already allotted to a busy worker, move it to an idle worker, and copy across the data it needs to run. Please read the link to see how this is done.
Finally, you can indeed have more fine-grained control over where things are run; see, e.g., this section.
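For reference, a minimal sketch of those placement hints, assuming a running distributed Client and reusing the question's delayed results; the worker addresses and some_long_function are placeholders, not part of the question:

import dask
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

# Restrict one delayed result to a specific worker (the address is illustrative).
pinned = client.persist(result_of_dask_delayed_1, workers="tcp://10.0.0.5:40331")

# Or annotate part of the graph construction; allow_other_workers=True keeps this
# a preference rather than a hard requirement.
with dask.annotate(workers=["tcp://10.0.0.6:40331"], allow_other_workers=True):
    relaxed = dask.delayed(some_long_function)(some_argument)  # hypothetical function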

Related

Dask jobqueue - Is there a way to start all workers at the same time?

Say if I have the following deployment on SLURM:
cluster = SLURMCluster(processes=1, cores=25, walltime="1-00:00:00")
cluster.scale(20)
client = Client(cluster)
So I will have 20 nodes each with 25 cores.
Is there a way to tell the slurm scheduler to start all nodes at the same time, instead of starting each one individually when they become available?
A specific example: when nodes are started individually, those that started earliest might wait several hours, say 2, until all 20 nodes are ready. This is not only a waste of resources, but it also makes my effective job time less than 24 hours (e.g. 22 hours).
This is something one can do easily with dask_mpi, where a single batch job is allocated. I am wondering if it's possible to do this with dask_jobqueue specifically.
dask-jobqueue itself does not provide such functionality.
It is designed to submit independent jobs, so to achieve this you would have to look at what the job queuing system, Slurm in your case, offers and see whether it is possible without dask-jobqueue. If it is, you should then try to pass the relevant options through dask-jobqueue, for example through the job_extra_directives kwarg.
I'm not aware of such functionality within Slurm, but there are so many knobs that it is hard to tell. I do know it is not possible with PBS.
A good option to achieve what you want is, as you said, to use dask-mpi.
A final thought: you could also start your computation with the first few nodes, without waiting for the others to be ready. This should be doable in most cases.
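If Slurm does turn out to expose a useful directive, a minimal sketch of how it could be passed through the job_extra_directives kwarg mentioned above; the flags shown are illustrative, not a known way to force simultaneous start-up:

from dask_jobqueue import SLURMCluster

# Sketch only: pass extra Slurm directives through dask-jobqueue. Whether any
# directive can make Slurm start all jobs together depends on your site's setup.
cluster = SLURMCluster(
    processes=1,
    cores=25,
    walltime="1-00:00:00",
    job_extra_directives=["--exclusive", "--begin=now"],  # illustrative flags only
)
cluster.scale(20)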

How to recover a critical python job from system failure

Is there any python library that would provide a (generic) job state journaling and recovery functionality?
Here's my use case:
1. data received to start a job
2. job starts processing
3. job finishes processing
I then want to be able to restart a job from step 1 if the process aborts or the power fails. Jobs would write to a journal file when they start, and mark themselves done when they complete. So when the process starts, it checks the journal file for uncompleted jobs and uses the journal data to restart any job that did not complete. What python tools exist to solve this? (Or other python solutions for providing fault tolerance and recovery for critical jobs that must complete.) I know a job queue like RabbitMQ would work quite well for this case, but I want a solution that doesn't need an external service. I searched PyPI for "journaling" and didn't get much. So, any solutions? It seems like a library for this would be useful, since there are multiple concerns in using a journal that are hard to get right but that a library could handle (such as multiple async writes, file splitting and truncating, etc.).
I think you can do this using either crontab or APScheduler; I think the latter has all the features you need, but even with cron you can do something like:
1: schedule A process to run after a specific interval
2: A process checks if there is a running job or not
3: if no job is running, start one
4: job continues working, and saves state into drive/db
5: if it fails or finishes, step 3 will continue
APScheduler is likely what you're looking for; its feature list is extensive, and it's also extensible if it doesn't fulfill your requirements.
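For the journal itself, here is a minimal standard-library sketch of the write-ahead pattern the question describes; the file name and record format are arbitrary choices, and it glosses over the concerns (concurrent writers, file splitting and truncation) that a dedicated library would handle:

import json
import os

JOURNAL = "jobs.journal"  # hypothetical journal file path

def record(event, job_id, payload=None):
    """Append one journal line per state change and flush it to disk."""
    with open(JOURNAL, "a") as f:
        f.write(json.dumps({"event": event, "job": job_id, "data": payload}) + "\n")
        f.flush()
        os.fsync(f.fileno())

def pending_jobs():
    """Replay the journal on startup and return jobs that started but never finished."""
    started, done = {}, set()
    if not os.path.exists(JOURNAL):
        return []
    with open(JOURNAL) as f:
        for line in f:
            entry = json.loads(line)
            if entry["event"] == "start":
                started[entry["job"]] = entry["data"]
            elif entry["event"] == "done":
                done.add(entry["job"])
    return [(job, data) for job, data in started.items() if job not in done]

# Usage: record("start", job_id, input_data) before processing,
# record("done", job_id) afterwards; on boot, re-run everything from pending_jobs().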

(Unix/Python) Activity Monitor % CPU

I have a project I am working on for developing a distributed computing model for one of my other projects. One of my scripts is multiprocessed, monitoring for changes in directories, decoding pickles, creating authentication keys for them based on the node to which they are headed, and repickling them. (Edit: These processes all operate in loops.)
My question is: when looking at Activity Monitor in OS X, the %CPU column displays 100% for the primary processes that run the scripts. The three showing 100% are the manager script and the two nodes (I am simulating the model on one machine; the intent is to move it to a live cluster network in the future). Is this bad? My usage shows 27.50% system, 12.50% user, and 65% idle.
I've attempted some research myself, and my only thought is that these numbers indicate the process is utilizing the CPU for the entire time it's alive and is never idle.
Can I please get some clarification?
Update based on comments:
My processes run in an endless loop, monitoring for changes to files in their respective directories in order to receive new 'jobs' from the manager process/script (a separate computer in the cluster in the project's final implementation). Maybe there is a better way to wait for I/O that doesn't require so much processor time?
Solution (if somewhat sub-optimal):
I implemented a time.sleep(n) period of 0.1 seconds at the end of each loop. This brought the CPU usage down to no more than 0.4%.
Still, I am looking for a better way of reducing CPU time without using modules outside the Python Standard Library. I'd like to avoid a time.sleep(n) period, since I want the system to be able to respond at any moment, and if the load of input files gets very high, I don't want it wasting time sleeping when it could be processing files.
My processes operate by busy waiting, which was causing them to use excessive CPU time:
import os

while True:
    files = os.listdir('./sub_directory/')
    if files:
        do_something()
This is the basis for each script/process I am executing.
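One standard-library compromise between busy waiting and a fixed sleep is to back off only while the directory stays empty, so the loop stays responsive under load but idles cheaply; a sketch, with function and parameter names of my own choosing:

import os
import time

def poll_directory(path, handler, min_sleep=0.01, max_sleep=0.5):
    """Process files as fast as they arrive, but back off exponentially while
    the directory stays empty so an idle loop costs almost no CPU."""
    sleep = min_sleep
    while True:
        files = os.listdir(path)
        if files:
            for name in files:
                handler(os.path.join(path, name))
            sleep = min_sleep              # work arrived: stay responsive
        else:
            time.sleep(sleep)              # idle: sleep a little longer each round
            sleep = min(sleep * 2, max_sleep)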

Task queue for deferred tasks in GAE with python

I'm sorry if this question has in fact been asked before. I've searched around quite a bit and found pieces of information here and there but nothing that completely helps me.
I am building an app on Google App Engine in python that lets a user upload a file, which is then processed by a piece of python code, and the resulting processed file gets sent back to the user in an email.
At first I used a single deferred task for this, which worked great. Over time I've come to realize that, since the processing can take more than the 10 minutes I have before hitting the DeadlineExceededError, I need to be more clever.
I therefore started to look into task queues, wanting to make a queue that processes the file in chunks, and then piece everything together at the end.
My present code for creating the single deferred task looks like this:
_ = deferred.defer(transform_function, filename, from_, to, email)
so that the transform_function code gets the values of filename, from_, to and email and sets off to do the processing.
Could someone please enlighten me as to how I can turn this into a linear chain of tasks that are acted on one after the other? I have read all the Google App Engine documentation I can think of, but unfortunately it isn't written in enough detail in terms of actual pieces of code.
I see references to things like:
taskqueue.add(url='/worker', params={'key': key})
but since I don't have a url for my task, but rather a transform_function() implemented elsewhere, I don't see how this applies to me…
Many thanks!
You can just keep calling deferred to run your task when you get to the end of each phase.
Other queues just allow you to control the scheduling and rate, but work the same.
I track the elapsed time in the task, and when I get near the end of the processing window the code stops what it is doing and calls defer for the next task in the chain, which either continues where the previous one left off or starts the next step, depending on whether it is a discrete set of steps or one continuous chunk of work. This was all written back when tasks could only run for 60 seconds.
However, the problem you will face (it doesn't matter whether it's a normal task queue or deferred) is that each stage could fail for some reason and then be re-run, so each phase must be idempotent.
For long-running chained tasks, I construct an entity in the datastore that holds a description of the work to be done and tracks the processing state for the job; then you can just keep re-running the same task until completion, at which point it marks the job as complete.
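A sketch of that chaining pattern applied to the question's transform; process_one_chunk, send_result_email and the time budget are assumptions, not part of the original code:

import time
from google.appengine.ext import deferred

TIME_BUDGET = 9 * 60  # assumption: stop shortly before the 10-minute limit

def transform_in_chunks(filename, from_, to, email, offset=0):
    started = time.time()
    while offset < to:
        offset = process_one_chunk(filename, offset)  # hypothetical, must be idempotent
        if time.time() - started > TIME_BUDGET:
            # Out of time: chain the next task and let this request finish cleanly.
            deferred.defer(transform_in_chunks, filename, from_, to, email, offset=offset)
            return
    send_result_email(email, filename)  # hypothetical final step

# Kicked off exactly like the question's single task:
# deferred.defer(transform_in_chunks, filename, from_, to, email)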
To avoid the 10-minute timeout you can direct the request to a backend or a B-type module using the "_target" param.
BTW, is there any reason you need to process the chunks sequentially? If all you need is a notification upon completion of all chunks (so you can "piece everything together at the end"), you can implement it in various ways. For example, each deferred task for a chunk can decrement a shared datastore counter (read the state, decrement and update, all in the same transaction) that was initialized with the number of chunks; if the datastore update succeeded and the counter has reached zero, you can proceed with combining all the pieces together. An alternative to deferred that would simplify the suggested workflow is pipelines (https://code.google.com/p/appengine-pipeline/wiki/GettingStarted).
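A minimal sketch of that shared-counter idea using the legacy ndb API; the model and function names are mine:

from google.appengine.ext import ndb

class ChunkCounter(ndb.Model):
    remaining = ndb.IntegerProperty()

@ndb.transactional
def chunk_finished(counter_key):
    """Atomically decrement the shared counter; returns True for the last chunk."""
    counter = counter_key.get()
    counter.remaining -= 1
    counter.put()
    return counter.remaining == 0

# Each deferred chunk task calls chunk_finished(counter_key) when its work is done;
# whichever call returns True goes on to piece the results together.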

How to schedule hundreds of thousands of tasks?

We have hundreds of thousands of tasks that need to be run at a variety of arbitrary intervals, some every hour, some every day, and so on. The tasks are resource intensive and need to be distributed across many machines.
Right now tasks are stored in a database with an "execute at this time" timestamp. To find tasks that need to be executed, we query the database for jobs that are due to be executed, then update the timestamps when the task is complete. Naturally this leads to a substantial write load on the database.
As far as I can tell, we are looking for something to release tasks into a queue at a set interval. (Workers could then request tasks from that queue.)
What is the best way to schedule recurring tasks at scale?
For what it's worth we're largely using Python, although we have no problems using components (RabbitMQ?) written in other languages.
UPDATE: Right now we have about 350,000 tasks that run every half hour or so, with some variation. 350,000 tasks * 48 times per day is 16,800,000 tasks executed per day.
UPDATE 2: There are no dependencies. The tasks do not have to be executed in order and do not rely on previous results.
Since ACID isn't needed and you're okay with tasks potentially running twice, I wouldn't keep the timestamps in the database at all. For each task, create a list of [timestamp_of_next_run, task_id] and use a min-heap to store all of the lists. Python's heapq module can maintain the heap for you. You'll be able to very efficiently pop off the task with the soonest timestamp. When you need to run a task, use its task_id to look up in the database what the task needs to do. When a task completes, update the timestamp and put it back into the heap. (Just be careful not to change an item that's currently in the heap, as that will break the heap properties).
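A sketch of that heap loop; load_schedule and dispatch stand in for the database read and the hand-off to workers:

import heapq
import time

# Each entry is [next_run_timestamp, task_id]; heapq keeps the soonest at index 0.
heap = [[next_run, task_id] for task_id, next_run in load_schedule()]  # hypothetical loader
heapq.heapify(heap)

while heap:
    next_run, task_id = heap[0]
    now = time.time()
    if next_run > now:
        time.sleep(min(next_run - now, 1.0))  # nothing due yet
        continue
    heapq.heappop(heap)
    interval = dispatch(task_id)  # hypothetical: look up the task in the DB, hand it to a worker
    heapq.heappush(heap, [time.time() + interval, task_id])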
Use the database only to store information that you will still care about after a crash and reboot. If you won't need the information after a reboot, don't spend the time writing to disk. You will still have a lot of database read operations to load the information about a task that needs to run, but a read is much cheaper than a write.
If you don't have enough RAM to store all of the tasks in memory at the same time, you could go with a hybrid setup where you keep the tasks for the next 24 hours (for example) in RAM and everything else stays in the database. Alternately, you could rewrite the code in C or C++, which are less memory hungry.
If you don't want a database, you could store just the next-run timestamp and task id in memory, and keep the properties for each task in a file named [task_id].txt. You would need a data structure to store all the tasks in memory, sorted by timestamp; an AVL tree seems like it would work, and here's a simple one for python: http://bjourne.blogspot.com/2006/11/avl-tree-in-python.html. Hopefully Linux (I assume that's what you are running on) can handle millions of files in a directory; otherwise you might need to hash the task id to get a sub-folder.
Your master server would just run a loop, popping tasks off the AVL tree until the next task's timestamp is in the future, then sleep for a few seconds and start checking again. Whenever a task runs, you would update the next-run timestamp in the task file and re-insert the task into the AVL tree.
When the master server reboots, there would be the overhead of reloading all task ids and next-run timestamps back into memory, which might be painful with millions of files. Maybe instead you could have one giant file, give each task 1K of space in it for its properties and next-run timestamp, and use [task_id] * 1K as the offset to the task's properties.
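A sketch of that fixed-size-record layout, assuming 1K binary slots; the names are mine:

RECORD_SIZE = 1024  # 1K per task, as suggested above

def read_record(f, task_id):
    f.seek(task_id * RECORD_SIZE)
    return f.read(RECORD_SIZE).rstrip(b"\x00")  # strip the zero padding

def write_record(f, task_id, data):
    assert len(data) <= RECORD_SIZE
    f.seek(task_id * RECORD_SIZE)
    f.write(data.ljust(RECORD_SIZE, b"\x00"))  # pad the slot to a fixed size

# Usage: open("tasks.dat", "r+b") and pass the file object in; task ids index
# directly into the file, so a reboot only has to scan one file sequentially.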
If you are willing to use a database, I am confident MySQL could handle whatever you throw at it given the conditions you describe, assuming you have 4GB+ RAM and several hard drives in RAID 0+1 on your master server.
Finally, if you really want to get complicated, Hadoop might work too: http://hadoop.apache.org/
If you're worried about writes, you can have a set of servers that dispatch the tasks (maybe stripe the servers to equalize load) and have each server write bulk checkpoints to the DB; that way you will not have so many write queries. You still have to write enough to be able to recover if a scheduling server dies, of course.
In addition, if you don't have a clustered index on timestamp, you will avoid having a hot-spot at the end of the table.
350,000 tasks * 48 times per day is 16,800,000 tasks executed per day.
To schedule the jobs, you don't need a database.
Databases are for things that are updated. The only update visible here is a change to the schedule to add, remove or reschedule a job.
Cron does this in a totally scalable fashion with a single flat file.
Read the entire flat file into memory and start spawning jobs. Periodically, stat the file to see if it has changed. Or, even better, wait for a HUP signal and use that to reread the file: use kill -HUP to signal the scheduler to reread it.
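A sketch of the HUP-driven reread; the schedule file name and run_due_jobs are placeholders:

import signal
import time

reload_requested = True  # force an initial load
schedule = []

def _on_hup(signum, frame):
    global reload_requested
    reload_requested = True

signal.signal(signal.SIGHUP, _on_hup)  # `kill -HUP <pid>` asks for a reread

while True:
    if reload_requested:
        with open("schedule.txt") as f:  # hypothetical flat schedule file
            schedule = f.read().splitlines()
        reload_requested = False
    run_due_jobs(schedule)  # hypothetical: spawn whatever is due right now
    time.sleep(1)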
It's unclear what you're updating the database for.
If the database is used to determine the future schedule based on job completion, then a single database is a Very Bad Idea.
If you're using the database to do some analysis of job history, then you have a simple data warehouse.
Record completion information (start time, end time, exit status, all that stuff) in a simple flat log file.
Process the flat log files to create a fact table and dimension updates.
When someone has the urge to do some analysis, load relevant portions of the flat log files into a datamart so they can do queries and counts and averages and the like.
Do not directly record 17,000,000 rows per day into a relational database. No one wants all that data. They want summaries: counts and averages.
Why hundreds of thousands and not hundreds of millions ? :evil:
I think you need Stackless Python, http://www.stackless.com/, created by the genius Christian Tismer.
Quoting:
Stackless Python is an enhanced version of the Python programming language. It allows programmers to reap the benefits of thread-based programming without the performance and complexity problems associated with conventional threads. The microthreads that Stackless adds to Python are a cheap and lightweight convenience which can, if used properly, give the following benefits: improved program structure, more readable code, and increased programmer productivity.
It is used for massively multiplayer games.
