We are trying to solve a problem related to cluster job scheduling.
The problem is the following: we have a set of Python scripts that are executed on a cluster. The launching process is currently done through human interaction; to start a test we run a bash script that interacts with the cluster to request the resources needed for the execution. What we intend to do is build an automatic launching process, one that is sound in the sense that it tracks the job status and, based on that, waits for the job to end, restarts the execution, and so on. Basically, we have to implement a layer between the user workstation and the cluster.
An additional difficulty is that our layer must be clever enough to interact with the different cluster job schedulers. We wonder whether there is a tool or framework that would help us interact with the cluster without having to deal with the details of each scheduler. We have searched the web but have not found anything suitable for our needs.
By the way, the programming language we use is Python.
Thanks in advance!
Br.-
Use supervisor (http://supervisord.org/) and celery (http://www.celeryproject.org/) together.
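As a rough, hedged sketch of how the two could fit together: Celery runs the job-executing workers and supervisord keeps the worker process alive. The broker URL and the script path below are placeholders, not anything prescribed by either project:

```python
# tasks.py -- minimal Celery app; broker URL and script path are placeholders
import subprocess

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def run_script(script_path, *args):
    """Run one of the cluster Python scripts and return its exit code."""
    result = subprocess.run(["python", script_path, *args])
    return result.returncode
```

The worker process itself (celery -A tasks worker) would then be declared as a supervisord program so it is restarted if it dies.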
Take a look at ipcluster_tools. The documentation is sparse, but it is easy to use.
Hi all, I am planning to build a system for my team where we can spin up AWS Batch infrastructure, run a task, and, once the job is done, destroy the infrastructure.
I am thinking of a Makefile with these steps:
1. Terraform apply the AWS Batch infra
2. Run the task
3. Check at a regular interval whether the task is complete
4. If the task is complete, destroy the infra
What is the most efficient way of doing this? Given that our team would need to run a wide variety of tasks on AWS Batch, we want to automate this so that a single command does it all.
Should we explore Airflow for this? Or is there a better way to do this? Your thoughts would be highly appreciated. Thank you
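For concreteness, a rough Python sketch of steps 2 to 4 might look like the following; the Terraform calls are plain subprocess invocations, and the Batch queue/definition names are placeholders:

```python
# one_command.py -- hypothetical glue: apply infra, run job, poll, destroy infra
import subprocess
import time

import boto3

def wait_for_batch_job(job_id, poll_seconds=60):
    """Poll AWS Batch until the job reaches a terminal state."""
    batch = boto3.client("batch")
    while True:
        job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
        if job["status"] in ("SUCCEEDED", "FAILED"):
            return job["status"]
        time.sleep(poll_seconds)

if __name__ == "__main__":
    # 1. bring up the AWS Batch infra
    subprocess.run(["terraform", "apply", "-auto-approve"], check=True)
    try:
        # 2. submit the task (queue and job definition names are placeholders)
        batch = boto3.client("batch")
        job = batch.submit_job(jobName="team-task",
                               jobQueue="team-queue",
                               jobDefinition="team-task-def")
        # 3. check at a regular interval whether the task is complete
        status = wait_for_batch_job(job["jobId"])
        print("job finished with status:", status)
    finally:
        # 4. tear the infra down whether the job succeeded or not
        subprocess.run(["terraform", "destroy", "-auto-approve"], check=True)
```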
Many companies have chosen to build their own Terraform Automation & Collaboration Software (TACOS), but it's a lot less work to use an existing service such as the open-source Atlantis or an enterprise SaaS platform such as Spacelift or Terraform Cloud.
However, if you were to create your own, you would need to confirm plans safely. The tools above can use Rego from Open Policy Agent to do so.
From your workflow, it sounds like you simply need a tool to auto-apply your changes. I have seen a cron job running in Jenkins do the trick. You could also run a scheduled ECS or ECS Fargate task. Airflow seems like overkill.
If I were you, I'd look closely at all the options and list the pros and cons of each before considering rolling your own. I'm interested to know whether the above services have shortcomings that warrant your team building a new service.
I'm launching jobs on a server.
The server can process only one job at a time.
So I use the trick of having several user accounts on the server: userA, userB, userC, userD.
For the moment I launch a job with the function
run_job_on_server(some_args, user_name)
My question is quite simple: using multiprocessing (or another module), how can I launch many jobs using the different users available, and, when a job finishes, make that user available again and immediately launch a new job with it?
Thanks for your help !
I think your question jumps to library selection (multiprocessing) too quickly. The first thing to do is to establish the design pattern. As a start, you could look at the dispatcher or mailbox pattern, and at the active object pattern.
As for libraries, you're not stuck with the Python standard library; pip has many nice options too. I personally love ZeroMQ for distributed systems, but that's step two. Maybe standard-library modules like Queue and multiprocessing will do.
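To make that concrete, here is a minimal sketch using only the standard library: a queue holds the available user names, each worker process takes a user, runs the job, and puts the user back so it becomes available again. The dummy run_job_on_server stands in for your real function, and the job list is a placeholder:

```python
import multiprocessing
import time

def run_job_on_server(some_args, user_name):
    # stand-in for your real function; replace with the actual call
    print("running", some_args, "as", user_name)
    time.sleep(1)

def worker(user_queue, job_args):
    user = user_queue.get()            # block until a user account is free
    try:
        run_job_on_server(job_args, user)
    finally:
        user_queue.put(user)           # hand the account back immediately

if __name__ == "__main__":
    user_queue = multiprocessing.Queue()
    for user in ("userA", "userB", "userC", "userD"):
        user_queue.put(user)

    jobs = ["job-%d" % i for i in range(10)]          # placeholder job list
    processes = [multiprocessing.Process(target=worker, args=(user_queue, job))
                 for job in jobs]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```

Only four jobs run at any time because only four user names exist in the queue; as soon as a worker returns its user name, the next blocked worker picks it up.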
I need to manage a large workflow of ETL tasks whose execution depends on time, data availability, or an external event. Some jobs may fail during execution of the workflow, and the system should be able to restart a failed workflow branch without waiting for the whole workflow to finish.
Are there any frameworks in Python that can handle this?
I see several core functions:
DAG building
Execution of nodes (run shell commands with wait, logging, etc.)
Ability to rebuild a sub-graph in the parent DAG during execution
Ability to manually execute nodes or a sub-graph while the parent graph is running
Suspend graph execution while waiting for an external event
List the job queue and job details
Something like Oozie, but more general-purpose and in Python.
1) You can give Dagobah a try. As described on its GitHub page: "Dagobah is a simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using Cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface." It is the most lightweight of these scheduler projects compared with the three that follow.
2) In terms of ETL tasks, Luigi, open-sourced by Spotify, focuses more on Hadoop jobs. As described: "Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in."
Both of these modules are mainly written in Python, and web interfaces are included for convenient management.
As far as I know, Luigi doesn't provide a scheduler module for recurring job runs, which I think is necessary for ETL tasks. But with Luigi it is easier to write map-reduce code in Python, and thousands of tasks at Spotify run on it every day (a minimal task sketch follows after this list).
3) Like Luigi, Pinterest open-sourced their workflow manager, named Pinball. Pinball's architecture follows a master-worker (or master-client, to avoid naming confusion with a special type of client introduced below) paradigm, where the stateful central master acts as the source of truth about the current system state for stateless clients. It also integrates Hadoop/Hive/Spark jobs smoothly.
4) Airflow, yet another DAG job-scheduling project, open-sourced by Airbnb, is quite like Luigi and Pinball. The backend is built on Flask, Celery, and so on. Judging from the example job code, Airflow is both powerful and easy to use.
Last but not least, Luigi, Airflow, and Pinball may be the most widely used. And there is a great comparison of the three: http://bytepawn.com/luigi-airflow-pinball.html
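To make the Luigi point above concrete, here is a minimal, assumed two-task pipeline; the file names and the transformation are placeholders. Luigi derives the dependency graph from requires() and skips any task whose output() already exists:

```python
import luigi

class Extract(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("raw_{}.csv".format(self.date))

    def run(self):
        with self.output().open("w") as f:
            f.write("some,raw,data\n")       # placeholder extraction step

class Transform(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return Extract(date=self.date)       # this edge defines the DAG

    def output(self):
        return luigi.LocalTarget("clean_{}.csv".format(self.date))

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read().upper())    # placeholder transformation

if __name__ == "__main__":
    # e.g. python pipeline.py Transform --date 2015-01-01 --local-scheduler
    luigi.run()
```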
There are a ton of these; everyone seems to write their own. There is a good list at https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems, which includes systems that originate in both industry and academia.
Have you looked at Ruffus?
I have no experience with it but it appears to do some of the items on your list. It also looks quite hackable so you might be able to implement your other requirements yourself.
Python seems to have many different packages available to assist with parallel processing on an SMP-based system or across a cluster. I'm interested in building a client-server system in which a server maintains a queue of jobs and clients (local or remote) connect and run jobs until the queue is empty. Of the packages listed above, which is recommended, and why?
Edit: In particular, I have written a simulator that takes a few inputs and processes things for a while. I need to collect enough samples from the simulation to estimate a mean within a user-specified confidence interval. To speed things up, I want to be able to run simulations on many different systems, each of which reports back to the server at some interval with the samples it has collected. The server then calculates the confidence interval and determines whether the client processes need to continue. After enough samples have been gathered, the server terminates all client simulations, reconfigures the simulation based on past results, and repeats the process.
With this need for intercommunication between the client and server processes, I question whether batch scheduling is a viable solution. Sorry, I should have been clearer to begin with.
Have a go with ParallelPython. It seems easy to use and should provide the jobs-and-queues interface that you want.
There are also now two different Python wrappers around the map/reduce framework Hadoop:
http://code.google.com/p/happy/
http://wiki.github.com/klbostee/dumbo
Map/Reduce is a nice development pattern with lots of recipes for solving common classes of problems.
If you don't already have a cluster, Hadoop itself is nice because it has full job scheduling, automatic distribution of data across the cluster (i.e. HDFS), etc.
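To illustrate the pattern itself without depending on either wrapper, here is a tiny, library-free word count in plain Python; Happy- or Dumbo-style jobs follow the same mapper/reducer shape, just distributed over Hadoop:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # emit (word, 1) for every word in the input line
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # sum all the counts emitted for one word
    yield word, sum(counts)

def run(lines):
    pairs = sorted(kv for line in lines for kv in mapper(line))
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield from reducer(word, (count for _, count in group))

if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog"]
    print(dict(run(text)))   # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```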
Given that you tagged your question "scientific-computing" and mention a cluster, some kind of MPI wrapper seems the obvious choice, if the goal is to develop parallel applications as one might guess from the title. Then again, the text of your question suggests you want to develop a batch scheduler, so I don't really know which question you're asking.
The simplest way to do this would probably be to output the intermediate samples to separate files (or a database) as they finish, and have a process occasionally poll those output files to see whether they're sufficient or whether more jobs need to be submitted.
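A hedged sketch of such a poller, assuming each client appends one sample per line to its own file in a shared results directory (the directory name, the threshold, and the normal-approximation confidence width are placeholders):

```python
import glob
import math
import statistics
import time

TARGET_HALF_WIDTH = 0.05          # user-specified precision (placeholder)

def collect_samples(results_dir="results"):
    samples = []
    for path in glob.glob(results_dir + "/*.txt"):
        with open(path) as f:
            samples.extend(float(line) for line in f if line.strip())
    return samples

def half_width_95(samples):
    # rough 95% confidence half-width using a normal approximation
    return 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))

while True:
    samples = collect_samples()
    if len(samples) > 30 and half_width_95(samples) <= TARGET_HALF_WIDTH:
        print("enough samples, mean =", statistics.mean(samples))
        break                     # here the server would stop or reconfigure the clients
    time.sleep(60)                # poll again in a minute
```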
I want to write a long-running process (Linux daemon) that serves two purposes:
responds to REST web requests
executes jobs which can be scheduled
I originally had it working as a simple program that would run through the runs and do the updates, which I then cron'd, but now I have the added REST requirement, and I would also like to change the frequency of some jobs but not others (let's say all jobs have different frequencies).
I have zero experience writing long-running processes, especially ones that do things on their own rather than responding to requests.
My basic plan is to run the REST part in a separate thread/process, and I figured I'd run the jobs part separately.
I'm wondering whether there are any patterns, specifically in Python (I've looked and haven't really found any examples of what I want to do), or whether anyone has suggestions on where to begin with transitioning my project to meet these new requirements.
I've seen a few projects that touch on scheduling, but I'm really looking for real-world user experience / suggestions here. What works / doesn't work for you?
If the REST server and the scheduled jobs have nothing in common, do two separate implementations, the REST server and the jobs stuff, and run them as separate processes.
As mentioned previously, look into existing schedulers for the jobs stuff. I don't know if Twisted would be an alternative, but you might want to check this platform.
If, OTOH, the REST interface invokes the same functionality as the scheduled jobs do, you should try to look at them as two interfaces to the same functionality, e.g. like this:
Write the actual jobs as programs the REST server can fork and run.
Have a separate scheduler that handles the timing of the jobs.
If a job is due to run, let the scheduler issue a corresponding REST request to the local server.
This way the scheduler only handles job descriptions but has no knowledge of how they are implemented.
It's a common trait for long-running, high-availability processes to have an additional "supervisor" process that just checks that the necessary daemons are up and running, and restarts them as necessary.
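A minimal sketch of the scheduler side of that design, assuming a hypothetical POST /jobs/<name>/run endpoint on the local REST server and job descriptions that carry only a name and an interval:

```python
import time
import urllib.request

# hypothetical job descriptions: name -> run interval in seconds
JOBS = {"refresh-cache": 300, "send-report": 3600}

def trigger(job_name):
    # the scheduler knows nothing about the job's implementation;
    # it only asks the local REST server to run it
    url = "http://localhost:8000/jobs/{}/run".format(job_name)
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status

next_run = {name: time.time() for name in JOBS}
while True:
    now = time.time()
    for name, interval in JOBS.items():
        if now >= next_run[name]:
            trigger(name)
            next_run[name] = now + interval
    time.sleep(1)
```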
One option is to simply choose a lightweight WSGI server from this list:
http://wsgi.org/wsgi/Servers
and let it do the work of being a long-running process that serves requests (I would recommend Spawning). Your code can then concentrate on the REST API, handling requests through the well-defined WSGI interface, and scheduling jobs.
There are at least a couple of scheduling libraries you could use, but I don't know much about them:
http://sourceforge.net/projects/pycron/
http://code.google.com/p/scheduler-py/
Here's what we did.
Wrote a simple, pure-WSGI web application to respond to REST requests:
Start jobs
Report status of jobs
Extended the built-in wsgiref server to use the select module to check for incoming requests.
Activity on the socket is an ordinary REST request; we let wsgiref handle it. It will, eventually, call our WSGI applications to respond to status and submit requests.
A timeout means that we have to do two things:
Check all running children to see if they're done, and update their status, etc.
Check a crontab-like schedule to see if there's any scheduled work to do. This is kept in a SQLite database that the server maintains.
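A stripped-down sketch of that loop; the request handling and the two check functions are stubs, since the real application routes the status/submit requests and reads the SQLite schedule:

```python
import select
from wsgiref.simple_server import make_server

def application(environ, start_response):
    # stub WSGI app: the real one dispatches "submit job" and "report status"
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]

def check_children():
    pass    # poll running subprocesses and update their recorded status

def check_schedule():
    pass    # consult the crontab-like SQLite table for work that is now due

httpd = make_server("", 8000, application)
while True:
    # wait a few seconds for an incoming REST request
    readable, _, _ = select.select([httpd], [], [], 5.0)
    if readable:
        httpd.handle_request()      # ordinary REST request: let wsgiref handle it
    else:
        check_children()            # timeout: do the housekeeping work
        check_schedule()
```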
I usually use cron for scheduling. As for REST you can use one of the many, many web frameworks out there. But just running SimpleHTTPServer should be enough.
You can schedule the REST service startup with cron's @reboot directive:
@reboot (cd /path/to/my/app && nohup python myserver.py &)
The usual design pattern for a scheduler would be:
Maintain a list of scheduled jobs, sorted by next run time (as a date-time value);
When woken up, compare the first job in the list with the current time. If it's due or overdue, remove it from the list and run it. Continue working through the list this way until the first job is not yet due, then go to sleep for (next_job_due_date - current_time);
When a job finishes running, re-schedule it if appropriate;
After adding a job to the schedule, wake up the scheduler process.
Tweak as appropriate for your situation (e.g. sometimes you might want to re-schedule jobs relative to the time they start running rather than the time they finish).
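A minimal sketch of that loop, using a heap for the sorted job list and a condition variable for the wake-ups (here recurring jobs are re-scheduled from their start time, one of the variations mentioned above):

```python
import heapq
import itertools
import threading
import time

class Scheduler:
    def __init__(self):
        self._jobs = []                    # heap of (due_time, seq, func, interval)
        self._seq = itertools.count()      # tie-breaker for equal due times
        self._cv = threading.Condition()

    def add(self, func, delay, interval=None):
        with self._cv:
            heapq.heappush(self._jobs, (time.time() + delay, next(self._seq), func, interval))
            self._cv.notify()              # wake the scheduler: the head may have changed

    def run(self):
        while True:
            with self._cv:
                now = time.time()
                while self._jobs and self._jobs[0][0] <= now:
                    due, _, func, interval = heapq.heappop(self._jobs)
                    threading.Thread(target=func).start()    # run the due job
                    if interval is not None:                 # re-schedule from its start time
                        heapq.heappush(self._jobs, (due + interval, next(self._seq), func, interval))
                timeout = self._jobs[0][0] - now if self._jobs else None
                self._cv.wait(timeout)     # sleep until the next job is due or one is added

if __name__ == "__main__":
    s = Scheduler()
    s.add(lambda: print("tick"), delay=1, interval=2)   # hypothetical example job
    s.run()
```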