I need to manage a large workflow of ETL tasks whose execution depends on time, data availability, or an external event. Some jobs may fail while the workflow is executing, and the system should be able to restart a failed workflow branch without waiting for the whole workflow to finish.
Are there any frameworks in python that can handle this?
I see several core functions:
DAG building
Execution of nodes (run a shell command with wait, logging, etc.)
Ability to rebuild a sub-graph in the parent DAG during execution
Ability to manually execute nodes or a sub-graph while the parent graph is running
Suspend graph execution while waiting for external event
List job queue and job details
Something like Oozie, but more general purpose and in python.
1) You can give dagobah a try. As described on its GitHub page: Dagobah is a simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using Cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface. This is the most lightweight scheduler project compared with the three that follow.
2) In terms of ETL tasks, luigi, which was open sourced by Spotify, focuses more on Hadoop jobs. As described: Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Both modules are written mainly in Python, and both include a web interface for convenient management.
As far as I know, luigi doesn't provide a scheduler module for job tasks, which I think is necessary for ETL tasks. But luigi makes it easier to write map-reduce code in Python, and thousands of tasks at Spotify depend on it every day.
3) Like luigi, Pinball is a workflow manager, open sourced by Pinterest. Pinball’s architecture follows a master-worker (or master-client to avoid naming confusion with a special type of client that we introduce below) paradigm where the stateful central master acts as a source of truth about the current system state to stateless clients. It also integrates Hadoop/Hive/Spark jobs smoothly.
4) Airflow, yet another DAG job scheduling project, open sourced by Airbnb, is quite like Luigi and Pinball. The backend is built on Flask, Celery and so on. Judging from the example job code, Airflow looks both powerful and easy to use to me.
Last but not least, Luigi, Airflow and Pinball may be more widely used. And there is a great comparison among these three: http://bytepawn.com/luigi-airflow-pinball.html
There are a ton of these; everyone seems to write their own. There is a good list at https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems, which includes systems that originate in both industry and academia.
Have you looked at Ruffus?
I have no experience with it but it appears to do some of the items on your list. It also looks quite hackable so you might be able to implement your other requirements yourself.
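To give a feel for what the core of such a system involves (DAG building, node execution, and re-running only a failed branch), here is a minimal sketch that uses only the standard library. It is not how Ruffus or any of the frameworks above work internally; run_dag and its arguments are hypothetical names, nodes run one after another rather than in parallel, and rebuilding sub-graphs during execution or waiting for external events is not covered.

```python
import subprocess
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(graph, commands, done=()):
    """Run each node's command in dependency order.

    graph:    {node: set of nodes it depends on}
    commands: {node: argv list to execute}
    done:     nodes that already succeeded and must not run again
    Returns (done, failed) as two sets of node names.
    """
    done, failed = set(done), set()
    for node in TopologicalSorter(graph).static_order():
        if node in done:
            continue  # succeeded in an earlier run, so skip it
        if not set(graph.get(node, ())) <= done:
            failed.add(node)  # an upstream node did not succeed; do not run this one
            continue
        result = subprocess.run(commands[node])
        (done if result.returncode == 0 else failed).add(node)
    return done, failed
```

Passing the done set from a first call into a second call re-runs only the nodes that failed or were blocked by a failure, which is the "restart a failed branch" requirement in its simplest form.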
I need your opinion on a challenge that I'm facing. I'm building a website that uses Django as the backend, PostgreSQL as the database, GraphQL as the API layer, and React as the frontend framework. The website is hosted on Heroku. I wrote a Python script that logs in to my Gmail account, parses a few emails based on predefined conditions, and stores the parsed data in a Google Sheet. Now I want the script to be part of my website: the user will specify what exactly needs to be parsed (i.e. filters), and the parsed data will then be displayed in a table so the accuracy of the parsing task can be reviewed.
The part that I need some help with is how to architect such a workflow. Below are a few ideas that I managed to come up with after some googling:
Generate a GraphQL mutation that stores a 'task' in a task model. Once a new task entry is stored, a Django signal will trigger the script. I'm not sure yet whether a signal can run custom Python functions, but from what I have read so far, it seems doable.
Use Celery to run this task asynchronously. But I'm not sure asynchronous tasks are what I'm after here, as I need this task to run immediately after the user triggers the feature from the frontend. I might be wrong here. I'm also not sure whether I need Redis to store the task details or whether I can do that in PostgreSQL.
What is the best practice for implementing this feature? The task can be anything, not necessarily parsing emails; it could also be importing data from Excel. It is any task that is user generated rather than a scheduled or repeated task.
I'm sorry in advance if this question seems trivial to some of you. I'm not a professional developer and the above project is a way for me to sharpen my technical skills and learn new techniques.
Looking forward to learning from your experiences.
You can dissect your problem into the following steps:
User specifies task parameters
System executes task
System displays result to the User
You can either do all of these:
Sequentially and synchronously in one swoop; or
Step by step asynchronously.
Synchronously
You can run your script when generating a response, but it will come with the following downsides:
The process in the server processing your request will block until the script is finished. This may or may not affect the processing of other requests by that same server (this will depend on the number of simultaneous requests being processed, workload of the script, etc.)
The client (e.g. your browser) and even the server might time out if the script takes too long. You can fix this to some extent by configuring your server appropriately.
The beauty of this approach, however, is its simplicity. To do this, you just pass the parameters through the request; the server parses them, runs the script, and returns the result.
There is no message queue, task scheduler, or anything else to set up.
Asynchronously
Ideally though, for long-running tasks, it is best to have this executed outside of the usual request-response loop for the following advantages:
The server responding to the requests can actually serve other requests.
Some scripts take a while, and for some you don't even know whether they will finish.
The script is no longer dependent on the reliability of the network (imagine running an expensive task when your internet connection drops or is just plain intermittent; you won't be able to do anything)
The downside of this is now you have to set more things up, which increases the project's complexity and points of failure.
Producer-Consumer
Whatever you choose, it's usually best to follow the producer-consumer pattern:
Producer creates tasks and puts them in a queue
Consumer takes a task from the queue and executes it
The producer is basically you, the user. You specify the task and the parameters involved in that task.
This queue could be any datastore: an in-memory datastore like Redis, a message queue like RabbitMQ, or a relational database management system like PostgreSQL.
The consumer is your script executing these tasks. There are multiple ways of running the consumer/script: via Celery, like you mentioned, which runs multiple workers to execute the tasks passed through the queue; via a simple time-based job scheduler like crontab; or even by triggering the script manually.
The question is actually not trivial, as the solution depends on what task you are actually trying to do. It is best to evaluate the constraints, parameters, and actual tasks to decide which approach you will choose.
But just to give you a more relevant guideline:
Just keep it simple. Unless you have a compelling reason to do otherwise (e.g. the server is being bogged down, or the internet connection is not reliable in practice), there's really no reason to be fancy.
The more blocking the task is, the longer it takes, or the more it depends on third-party APIs over the network, the more it makes sense to push it to a background process to add reliability and resiliency.
For your email import script, I would most likely push it to the background:
Have a page where you can add a task to the database
In the task details page, display the task details, and the result below if it exists or "Processing..." otherwise
Have a script that executes tasks (import emails from gmail given the task parameters) and save the results to the database
Schedule this script to run every few minutes via crontab
Yes, the above has side effects, like crontab running the script multiple times at the same time, but I won't go into detail without knowing more about the specifics of the task.
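The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: sqlite3 stands in for PostgreSQL so the example is self-contained, the task table and its columns are made up, and handler is a hypothetical function that does the actual work (e.g. the Gmail import). The claim step shows one way to keep two overlapping cron runs from picking up the same task.

```python
import json
import sqlite3


def init_db(conn):
    # Hypothetical schema: one row per user-submitted task.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task ("
        "id INTEGER PRIMARY KEY, params TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending', result TEXT)"
    )
    conn.commit()


def add_task(conn, params):
    """Producer: what the 'add a task' page would do."""
    cur = conn.execute("INSERT INTO task (params) VALUES (?)", (json.dumps(params),))
    conn.commit()
    return cur.lastrowid


def claim_task(conn):
    """Mark one pending task as running; return (id, params) or None."""
    while True:
        row = conn.execute(
            "SELECT id, params FROM task WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in WHERE means only one worker can win this update.
        cur = conn.execute(
            "UPDATE task SET status = 'running' WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0], json.loads(row[1])


def run_pending(conn, handler):
    """Consumer: what the cron-scheduled script would do. Returns tasks processed."""
    processed = 0
    while (claimed := claim_task(conn)) is not None:
        task_id, params = claimed
        try:
            status, result = "done", handler(params)
        except Exception as exc:  # record the failure instead of crashing the run
            status, result = "failed", str(exc)
        conn.execute(
            "UPDATE task SET status = ?, result = ? WHERE id = ?",
            (status, json.dumps(result), task_id),
        )
        conn.commit()
        processed += 1
    return processed
```

The task details page then only needs to read the status and result columns: show the result when status is 'done', and "Processing..." otherwise.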
The technology I would like to use in this example is Celery for queueing and python for component implementation.
Imagine a simple project that consists of two components. One is a web app that connects to an API and gathers data. The second is a processor that can then process the data. When the web app has received a piece of data from the API, it is supposed to send a task, including the just-crawled data, into a task queue; the processor then consumes the task and processes the data.
Whether or not this is a sensible way to go about a task like this is debatable and not the point of my question.
My question is this: the processing tasks are defined within the processor, since they state which processing function shall be executed, and the definition of that function is obviously within the processor. Given that the web app doesn't have access to the task definition, how does it communicate the task to the processor?
Do you have to hold a copy of the source code of the processor within the web app?
Do you make the processor a dependency of the web app?
What is the best practice approach to handle such a scenario?
What you are describing is probably one of the most common use-cases for Celery. Just look how many people are asking Django/Flask + Celery questions here on StackOverflow... If you are a Django user, there is an entire section in the Celery documentation describing how to do exactly what you want. Things should be similar with other frameworks.
Do you have to hold a copy of the source code of the processor within the web app?
As far as I know you do not have to (I do not use any web framework), but it could be that you need to because of some deeper integration with Celery. If your web application knows the Celery task name and its parameters, it can schedule the task to run without actually having access to the Python code. This is accomplished using send_task(task_name, ...).
Do you make the processor a dependency of the web app?
As I wrote above, there are several ways to use it. If you want tighter integration, then yes. If you just want to run a task and get the result using send_task(), then your web application only needs to depend on Celery.
What is the best practice approach to handle such a scenario?
Follow the Django guide. I also advise you to run Celery independently and run some tasks, just so you learn the basic principles of how it distributes the work, etc.
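The idea behind dispatching a task by name can be sketched without Celery at all. The code below is not Celery's API; task, send_by_name and consume_one are hypothetical helpers, and a queue.Queue plays the role of the broker. The point is that the sending side only needs a string and some serializable arguments, while the function itself lives only on the processor side. With Celery, the sending side would call send_task() with the task name instead.

```python
import json
import queue

# --- processor side: the only place where the task code lives ---
TASKS = {}


def task(name):
    """Register a function under a task name."""
    def register(func):
        TASKS[name] = func
        return func
    return register


@task("processor.clean")
def clean(record):
    return {key: value.strip() for key, value in record.items()}


def consume_one(broker):
    """Take one message off the queue and run the task it names."""
    message = json.loads(broker.get())
    return TASKS[message["task"]](*message["args"])


# --- web app side: knows only the task name and its arguments ---
def send_by_name(broker, name, args):
    broker.put(json.dumps({"task": name, "args": args}))
```

For example, after broker = queue.Queue() the web app would call send_by_name(broker, "processor.clean", [{"title": "  hello "}]) and the processor would call consume_one(broker). In a real deployment the two sides run in separate processes and the broker is something like Redis or RabbitMQ.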
I am trying to do some tasks in Django that consume a lot of time. For that, I will be running background tasks.
After some R&D, I have found two solutions:
Celery with RabbitMQ.
Django Background tasks.
Both options seem to fulfill the criteria, but setting up Celery will require some work. As for the second option, setup is fairly simple, and I can start writing background tasks fairly quickly. My questions, if I adopt the second option, are these:
How well does Django Background Tasks perform? (Scalability-wise, in a production environment.)
Can I poll the tasks (after some time) in the DB to check the task's status?
Architecture of Django-Background-Tasks? Couldn't find any clear explanation about its architecture (Or have I missed some resource?)
Again coming to the first point, how well does Django Background tasks perform in production. (Talking about prior experience of using this in prod.)
Setting up Celery takes work (although less when using Redis). It's also a serious tool with almost a decade of investment and widespread industry adoption.
As for performance, the scaling behaviors of task systems backed by queues versus those backed by RDBMSs are well understood, but may not be relevant to you, as "scalability" is a very subjective term. This thread provides some good framing on the subject and questions.
Comparing stars on GitHub (bg tasks' 3XX vs Celery's 13XXX), you should realize Django-Background-tasks has a smaller user base, and you're probably going to need to get into the internals to understand the architecture and precise mechanics. That shouldn't stop you – just be prepared to DIY when answers aren't forthcoming.
How well does Django Background Tasks perform? - This will depend on how and what you implement. One thing to note is that Django-Background-Tasks is based on the database, whereas Celery can have Redis/RabbitMQ as its backend, so most probably we'll see a considerable performance difference here.
Can I poll the tasks (after some time) in DB to check the task's status? - It's possible in Celery, and maybe you can find a solution by inspecting the internal code of Django-Background-Tasks. One more thing: we can abort a Celery task, which may not be possible in Django-Background-Tasks.
Architecture of Django-Background-Tasks? Couldn't find any clear explanation about its architecture (Or have I missed some resource?) - It's a simple Django-based project. You can have a look at the code. It seems to be pretty straightforward.
Again coming to the first point, how well does Django Background Tasks perform in production. - I haven't used it in production. But since Django-Background-Tasks is database-based and Celery can be configured to use Redis/RabbitMQ, I think Celery has a plus point here.
To me this comparison seems like comparing a pistol with a high-end automatic machine gun. Both do the same job, but one is simple and straightforward; the other is a little complicated but has lots of options and scope.
Choose based on your use case.
I have decided to use Django-Background-Tasks. Let me clarify my motivations.
The tasks that will be processed by Django-Background-Tasks don't need to be processed quickly. As the name states, they are background tasks. I accept delays.
The architecture of Django-Background-Tasks is very simple. When your code calls a method that is to be processed in the background, a task record is inserted into the Django-Background-Tasks tables in your database. The method you called is not actually executed; it is proxied. You then have to trigger another process to execute the jobs, and your method is executed in that process.
The process that executes the jobs can be started by a cron entry on your server.
Since this setup is so easy and works for me, I decided to use Django-Background-Tasks. But if I needed something more responsive and fast, I would use Celery, since it uses memory and has an active process that processes the jobs, which isn't the case with Django-Background-Tasks.
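The "proxied call" idea described above can be illustrated with a small standalone sketch. This is not the code or the API of Django-Background-Tasks; deferred, run_recorded and the bg_task table are hypothetical, and sqlite3 is used only to keep the example self-contained. Calling the decorated function just writes a row; a separate run (for example from cron) executes whatever has been recorded.

```python
import json
import sqlite3

REGISTRY = {}  # task name -> real function, used by the process that runs the jobs


def create_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bg_task "
        "(id INTEGER PRIMARY KEY, name TEXT NOT NULL, args TEXT NOT NULL)"
    )
    conn.commit()


def deferred(conn):
    """Decorator factory: calling the decorated function only records a task row."""
    def decorator(func):
        REGISTRY[func.__name__] = func

        def proxy(*args):
            conn.execute(
                "INSERT INTO bg_task (name, args) VALUES (?, ?)",
                (func.__name__, json.dumps(args)),
            )
            conn.commit()

        return proxy
    return decorator


def run_recorded(conn):
    """Execute every recorded task, then delete its row. Returns the number run."""
    rows = conn.execute("SELECT id, name, args FROM bg_task ORDER BY id").fetchall()
    for task_id, name, args in rows:
        REGISTRY[name](*json.loads(args))
        conn.execute("DELETE FROM bg_task WHERE id = ?", (task_id,))
        conn.commit()
    return len(rows)
```

Because nothing runs until run_recorded is called, the delay between the call and its execution is bounded by how often that second process is triggered, which matches the "I accept delays" trade-off above.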
We are trying to solve a problem related to a cluster job scheduler.
The problem is the following: we have a set of Python scripts that are executed on a cluster, and the launching process currently relies on human interaction; that is, to start a test we have a bash script that interacts with the cluster to request the resources needed for the execution. What we intend to do is build an automatic launching process (which should be sound in the sense that it tracks the job status and, based on that, waits for the job to end, restarts the execution, etc.). Basically, we have to implement a layer between the user workstation and the cluster.
An additional difficulty is that our layer must be clever enough to interact with the different cluster job schedulers. We wonder whether there is a tool or framework that helps us interact with the cluster without having to deal with the details of each cluster scheduler. We have searched the web but did not find anything suitable for our needs.
By the way the programming language we use is Python.
Thanks in advance!
Br.-
Use supervisor: http://supervisord.org/
and celery http://www.celeryproject.org/
together
Take a look at the ipcluster_tools. The documentation is sparse but it is easy to use.
I want to write a long running process (linux daemon) that serves two purposes:
responds to REST web requests
executes jobs which can be scheduled
I originally had it working as a simple program that would run through runs and do the updates which I then cron’d, but now I have the added REST requirement, and would also like to change the frequency of some jobs, but not others (let’s say all jobs have different frequencies).
I have 0 experience writing long running processes, especially ones that do things on their own, rather than responding to requests.
My basic plan is to run the REST part in a separate thread/process, and figured I’d run the jobs part separately.
I’m wondering whether there exist any patterns, specifically in Python (I’ve looked and haven’t really found any examples of what I want to do), or whether anyone has suggestions on where to begin with transitioning my project to meet these new requirements.
I’ve seen a few projects that touch on scheduling, but I’m really looking for real world user experience / suggestions here. What works / doesn’t work for you?
If the REST server and the scheduled jobs have nothing in common, do two separate implementations, the REST server and the jobs stuff, and run them as separate processes.
As mentioned previously, look into existing schedulers for the jobs stuff. I don't know if Twisted would be an alternative, but you might want to check this platform.
If, OTOH, the REST interface invokes the same functionality as the scheduled jobs do, you should try to look at them as two interfaces to the same functionality, e.g. like this:
Write the actual jobs as programs the REST server can fork and run.
Have a separate scheduler that handles the timing of the jobs.
If a job is due to run, let the scheduler issue a corresponding REST request to the local server.
This way the scheduler only handles job descriptions, but has no knowledge of its own about how they are implemented.
It's a common trait for long-running, high-availability processes to have an additional "supervisor" process that just checks the necessary daemons are up and running, and restarts them as necessary.
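A supervisor of that kind can be very small. The following is only a sketch under simplifying assumptions: supervise is a hypothetical function, it polls at a fixed interval, it stops after a fixed number of rounds so that it can be tried out safely, and it has no back-off for a daemon that keeps crashing.

```python
import subprocess
import time


def supervise(commands, rounds, poll_interval=1.0):
    """Start every command, then restart any that has exited.

    commands: {name: argv list}; rounds: number of checks before shutting down.
    Returns {name: number of restarts}.
    """
    procs = {name: subprocess.Popen(argv) for name, argv in commands.items()}
    restarts = dict.fromkeys(commands, 0)
    for _ in range(rounds):
        time.sleep(poll_interval)
        for name, proc in list(procs.items()):
            if proc.poll() is not None:  # the process has exited
                procs[name] = subprocess.Popen(commands[name])
                restarts[name] += 1
    # Shut the children down when the supervisor itself stops.
    for proc in procs.values():
        proc.terminate()
        proc.wait()
    return restarts
```

In practice the commands would be the REST server and the scheduler, and the loop would run indefinitely rather than for a fixed number of rounds.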
One option is to simply choose a lightweight WSGI server from this list:
http://wsgi.org/wsgi/Servers
and let it do the work of a long-running process that serves requests. (I would recommend Spawning.) Your code can concentrate on the REST API and handling requests through the well defined WSGI interface, and scheduling jobs.
There are at least a couple of scheduling libraries you could use, but I don't know much about them:
http://sourceforge.net/projects/pycron/
http://code.google.com/p/scheduler-py/
Here's what we did.
Wrote a simple, pure-wsgi web application to respond to REST requests.
Start jobs
Report status of jobs
Extended the built-in wsgiref server to use the select module to check for incoming requests.
Activity on the socket is an ordinary REST request; we let wsgiref handle this. It will, eventually, call our WSGI applications to respond to status and submit requests.
Timeout means that we have to do two things:
Check all children that are running to see if they're done. Update their status, etc.
Check a crontab-like schedule to see if there's any scheduled work to do. This is a SQLite database that this server maintains.
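A rough sketch of that loop, as I read the description, looks like the following. It is not the original code: status_app, serve and on_timeout are hypothetical names, and the checks of the child processes and of the SQLite schedule are reduced to a single callback.

```python
import select
from wsgiref.simple_server import make_server


def status_app(environ, start_response):
    """Placeholder WSGI application; a real one would start jobs and report status."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


def serve(httpd, on_timeout, timeout=1.0, max_loops=None):
    """Handle requests when the socket is readable, housekeeping otherwise."""
    loops = 0
    while max_loops is None or loops < max_loops:
        # The server object exposes fileno(), so select can wait on it directly.
        readable, _, _ = select.select([httpd], [], [], timeout)
        if readable:
            httpd.handle_request()  # ordinary REST request, handled by wsgiref
        else:
            on_timeout()  # check running children, check the schedule
        loops += 1
```

A server would be created with make_server("127.0.0.1", 8000, status_app) (host and port are placeholders) and passed to serve() together with the housekeeping function. The max_loops argument exists only so the loop can be exercised in a test; leaving it as None runs forever.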
I usually use cron for scheduling. As for REST you can use one of the many, many web frameworks out there. But just running SimpleHTTPServer should be enough.
You can schedule the REST service startup with cron's @reboot:
@reboot (cd /path/to/my/app && nohup python myserver.py &)
The usual design pattern for a scheduler would be:
Maintain a list of scheduled jobs, sorted by next-run-time (as Date-Time value);
When woken up, compare the first job in the list with the current time. If it's due or overdue, remove it from the list and run it. Continue working your way through the list this way until the first job is not due yet, then go to sleep for (next_job_due_date - current_time);
When a job finishes running, re-schedule it if appropriate;
After adding a job to the schedule, wake up the scheduler process.
Tweak as appropriate for your situation (eg. sometimes you might want to re-schedule jobs to run again at the point that they start running rather than finish).
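The pattern above maps naturally onto a heap ordered by next-run time. Here is a minimal sketch; the Scheduler class and its method names are made up for illustration, the clock is injectable so the behaviour can be tested without sleeping, and repeating jobs are re-scheduled when they finish (the first of the two variants mentioned above).

```python
import heapq
import itertools
import time


class Scheduler:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._heap = []  # entries: (next_run_time, tie_breaker, func, interval)
        self._counter = itertools.count()  # keeps ordering stable for equal times

    def add(self, run_at, func, interval=None):
        """Schedule func at run_at; repeat every `interval` seconds (> 0) if given."""
        heapq.heappush(self._heap, (run_at, next(self._counter), func, interval))

    def run_due(self):
        """Run every job that is due or overdue.

        Returns the number of seconds to sleep until the next job,
        or None when no jobs are scheduled.
        """
        while self._heap and self._heap[0][0] <= self._clock():
            _, _, func, interval = heapq.heappop(self._heap)
            func()
            if interval is not None:
                # Re-schedule relative to the time the job finished.
                self.add(self._clock() + interval, func, interval)
        if not self._heap:
            return None
        return max(0.0, self._heap[0][0] - self._clock())
```

The main loop then alternates between run_due() and sleeping for the returned delay. To cover the last step of the pattern (waking the scheduler when a job is added), the sleep could be a threading.Event.wait(timeout) that add() sets, instead of a plain time.sleep().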