I currently use Apache Airflow to run data aggregation and ETL workflows. My workflows are fairly complex, with a single workflow having 15-20 tasks and branches. I could combine tasks, but doing so would negate the features I rely on, such as retries and execution timers. Airflow works well except that it is quite slow with this many tasks: there is a lot of dead time between tasks.
Is there an alternative that can execute the tasks faster, without gaps between them? I would also like to minimize the effort needed to switch over, if possible.
I would recommend Temporal Workflow. It has a more developer-friendly programming model and scales to orders-of-magnitude larger use cases. It is also already used for latency-sensitive applications at many companies.
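To give a feel for the programming model, here is a rough, minimal sketch of an ETL-style chain using Temporal's Python SDK (the temporalio package); the activity names and timeouts are placeholders, not a drop-in port of your DAG:

    from datetime import timedelta
    from temporalio import activity, workflow

    @activity.defn
    async def extract(source: str) -> list:
        # Placeholder: pull raw rows from the source system.
        return [source]

    @activity.defn
    async def transform(rows: list) -> list:
        # Placeholder: apply the aggregation logic.
        return rows

    @workflow.defn
    class EtlWorkflow:
        @workflow.run
        async def run(self, source: str) -> list:
            # Each activity keeps its own retries and timeout, so the
            # retry/timer features you use in Airflow are preserved, but
            # there is no scheduler polling interval between the steps.
            rows = await workflow.execute_activity(
                extract, source, start_to_close_timeout=timedelta(minutes=10)
            )
            return await workflow.execute_activity(
                transform, rows, start_to_close_timeout=timedelta(minutes=10)
            )

A Temporal worker process and client are still needed to run this, but the gap between tasks is then essentially the activity dispatch latency rather than a scheduler poll.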
Disclaimer: I'm the tech lead of the Temporal project and the Co-founder/CEO of the associated company.
I would recommend that you try out Dataplane. It is an Airflow alternative written in Go for very fast performance, and it scales with far fewer resources. It has a built-in Python code editor with a drag-and-drop data pipeline builder. It also has segregated environments, so you can build out your route to live or separate data domains to construct a data mesh. It is completely free to use.
Here is the link: https://github.com/dataplane-app/dataplane
Disclaimer: I am part of the community that actively contributes towards Dataplane.
I am interested in working with persistent distributed dataflows, with features similar to those of the Pegasus project: https://pegasus.isi.edu/ for example.
Do you think there is a way to do that with dask?
I tried to implement something which works with a SLURM cluster and dask.
Below I describe my solution in broad strokes in order to better specify my use case.
The idea is to execute medium-sized tasks (that run from a few minutes to a few hours), specified as a graph that supports persistence and can easily be extended.
I implemented something based on dask's scheduler and its graph API.
In order to have persistence, I wrote two kinds of decorators:
a "memoize" decorator that serializes, in a customizable way, the complex arguments and the results of functions (a little like dask does with cachey or chest, or like Spark does with its RDD objects), and
a "delayed" decorator that executes functions on a cluster (SLURM). In practice, the API of the functions is modified so that they take the job ids of their dependencies as arguments and return the job id of the job created on the cluster. The functions are also serialized into a text file "launch.py", which is launched through the cluster's command-line API.
The taskname-jobid association is saved in a JSON file, which makes it possible to manage persistence using the task status returned by the cluster. A simplified sketch of the two decorators follows.
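To make the design concrete, here is a heavily simplified, illustrative sketch of what such a pair of decorators could look like (not my actual code: the cache keying, the sbatch invocation and the file layout are reduced to the bare minimum, and the real version serializes the functions themselves into launch.py):

    import json
    import os
    import pickle
    import subprocess
    from functools import wraps
    from pathlib import Path

    CACHE_DIR = Path("cache")     # serialized results, one file per task
    JOB_DB = Path("tasks.json")   # taskname -> jobid mapping

    def memoize(func):
        """Serialize the result of func so later runs can reuse it."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            CACHE_DIR.mkdir(exist_ok=True)
            key = CACHE_DIR / f"{func.__name__}.pkl"   # real version keys on arguments too
            if key.exists():
                return pickle.loads(key.read_bytes())
            result = func(*args, **kwargs)
            key.write_bytes(pickle.dumps(result))
            return result
        return wrapper

    def delayed(func):
        """Submit func to SLURM; takes dependency job ids, returns the new job id."""
        @wraps(func)
        def wrapper(*dep_jobids):
            if os.environ.get("ON_COMPUTE_NODE"):
                return func()                 # we are the submitted job: do the work
            # Write a tiny launcher; "workflow" is the (assumed) name of this
            # module, which must be importable on the cluster nodes.
            launcher = Path(f"launch_{func.__name__}.py")
            launcher.write_text(
                "import os\n"
                "os.environ['ON_COMPUTE_NODE'] = '1'\n"
                "import workflow\n"
                f"workflow.{func.__name__}()\n"
            )
            cmd = ["sbatch", "--parsable"]
            if dep_jobids:
                cmd.append("--dependency=afterok:" + ":".join(dep_jobids))
            cmd += ["--wrap", f"python {launcher}"]
            jobid = subprocess.check_output(cmd, text=True).strip()
            jobs = json.loads(JOB_DB.read_text()) if JOB_DB.exists() else {}
            jobs[func.__name__] = jobid
            JOB_DB.write_text(json.dumps(jobs))
            return jobid
        return wrapper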
Working this way gives a kind of persistence of the graph.
It offers the possibility to easily debug tasks that failed.
Using a serialization mechanism makes it easy to access all intermediate results, even without the whole workflow and/or the functions that generated them.
It is also easy, this way, to interact with legacy applications that do not use this kind of dataflow mechanism.
This solution is certainly a little naive compared to other, more modern ways to execute distributed workflows with dask and distributed, but it seems to me to have some advantages thanks to its persistence (of tasks and data) capabilities.
I'm interested to know whether this solution seems pertinent, and whether it describes an interesting use case not yet addressed by dask.
If someone can recommend other ways of doing this, I am also interested!
Problem:
I'm currently using python-recsys and the SVD algorithm to compute recommendations for my users. Computation is rather quick (for now), but I'm wondering how this will behave when we go live. We have around 1 million products stored in MongoDB and are expecting around 100 users for a start. I've simulated situations like that, but randomly generated data does not really reflect real cases.
We use Redis for recommendation storage; recommendations are computed every 2 hours in Celery tasks and are currently really memory-heavy, although I've done my best to optimize them.
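Roughly, the kind of Celery/Redis setup described above boils down to the following minimal sketch (compute_recommendations is a placeholder for the python-recsys SVD call; the broker URL, key names and expiry are assumptions):

    import json

    import redis
    from celery import Celery

    app = Celery("recs", broker="redis://localhost:6379/0")  # assumed broker URL
    r = redis.Redis()

    def compute_recommendations(user_id, n=20):
        # Hypothetical stand-in for the python-recsys call,
        # e.g. svd.recommend(user_id, n=n, only_unknowns=True).
        return []

    @app.task
    def refresh_recommendations(user_id):
        # Store only the top-N (product_id, score) pairs per user so the
        # Redis memory footprint stays bounded; expire after 3 hours.
        recs = compute_recommendations(user_id)
        r.set(f"recs:{user_id}", json.dumps(recs), ex=3 * 60 * 60)

    @app.on_after_configure.connect
    def setup_periodic_tasks(sender, **kwargs):
        # Re-run every 2 hours; in practice this would fan out one task per user.
        sender.add_periodic_task(2 * 60 * 60, refresh_recommendations.s(0))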
Worrying about the future, I'm planning to use Neo4j for that task, although it's pretty hard to find any real-life stories of developers using this database for recommendations.
Generally, what I'd like to achieve is a reasonably well-working recommendation engine (Mahout would be overkill in this case, I guess) that is not really memory-hungry, because we cannot afford many servers.
How would Neo4j play with that problem? Are there any good Python drivers for that database? Maybe it would be better to keep the current MongoDB/Redis solution and tune it a little rather than add another database to the stack? I was also considering using a separate machine just for the pure computation of recommendations - is that a good choice?
Worrying about the future, I'm planning to use Neo4j for that task, although it's pretty hard to find any real-life stories of developers using this database for recommendations.
http://seenickcode.com/switching-from-mongodb-to-neo4j/
How would Neo4j play with that problem? Are there any good Python drivers for that database?
http://neo4j.com/developer/python/
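As a rough illustration of what the query side could look like with the official Python driver (the bolt URL, credentials and the :User/:BOUGHT/:Product graph model are assumptions about how the data might be stored):

    from neo4j import GraphDatabase

    # Connection details are placeholders.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    RECOMMEND = """
    MATCH (me:User {id: $user_id})-[:BOUGHT]->(p:Product)<-[:BOUGHT]-(other:User),
          (other)-[:BOUGHT]->(rec:Product)
    WHERE NOT (me)-[:BOUGHT]->(rec)
    RETURN rec.id AS product, count(*) AS score
    ORDER BY score DESC
    LIMIT 10
    """

    def recommendations_for(user_id):
        # "People who bought what you bought also bought ..." in one traversal,
        # which is the kind of query a graph database is built for.
        with driver.session() as session:
            return [dict(record) for record in session.run(RECOMMEND, user_id=user_id)]

Whether this beats the MongoDB/Redis combination depends mostly on whether your recommendations are graph traversals like this or precomputed SVD scores; for the latter, Redis alone is already a good fit.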
I run into the following problem when writing scientific code with Python:
Usually you write the code iteratively, as a script which performs some computation.
Finally, it works; now you wish to run it with multiple inputs and parameters, and you find it takes too much time.
Recalling that you work for a fine academic institute and have access to ~100 CPU machines, you are puzzled about how to harness this power. You start by preparing small shell scripts that run the original code with different inputs, and run them manually.
Being an engineer, I know all about the right architecture for this (work items in a queue, worker threads or processes, and results queued and written to a persistent store), but I don't want to implement this myself. The most problematic issue is the need for reruns due to code changes or temporary system issues (e.g. out-of-memory).
I would like to find a framework to which I provide the desired inputs (e.g. a file with one line per run) and then simply start multiple instances of some framework-provided agent that runs my code. If something goes wrong with a run (a temporary system issue, or an exception thrown due to a bug), I can delete the results and start more agents. If I am taking too many resources, I can kill some agents without fear of data inconsistency, and the remaining agents will pick up the work items when they get to them.
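To illustrate the agent behaviour I have in mind, here is a minimal sketch (file names and the claim mechanism are only placeholders): each agent atomically claims a work item, skips items that already have results, and can be killed at any time; before a rerun you just delete the bad results and any stale claim files and start more agents.

    import os
    from pathlib import Path

    INPUTS = Path("inputs.txt")   # one work item per line (placeholder name)
    RESULTS = Path("results")     # one output file per finished item

    def run_agent(compute):
        RESULTS.mkdir(exist_ok=True)
        for i, line in enumerate(INPUTS.read_text().splitlines()):
            out = RESULTS / f"{i}.out"
            claim = RESULTS / f"{i}.claim"
            if out.exists():
                continue                      # already done: reruns are cheap
            try:
                # O_EXCL makes the claim atomic, so two agents never take the same item.
                os.close(os.open(claim, os.O_CREAT | os.O_EXCL))
            except FileExistsError:
                continue                      # another agent is working on it
            try:
                out.write_text(str(compute(line)))
            finally:
                claim.unlink()                # on failure the item becomes available again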
Are there any existing solutions? Is anyone willing to share code that does just that? Thanks!
I might be wrong, but simply using GNU command-line utilities, like parallel, or even xargs, seems appropriate to me for this case. Usage might look like this:
cat inputs | parallel ./job.py > results 2> exceptions
This will execute job.py for every line of inputs in parallel, writing successful results to results and the errors of failed runs to exceptions. Many examples of usage (including for scientific Python scripts) can be found in this Biostars thread.
And, for completeness, Parallel documentation.
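For the above to work, job.py only needs to take its input as a command-line argument, print its result to stdout, and let failures go to stderr; a hypothetical skeleton:

    #!/usr/bin/env python
    # Hypothetical skeleton of job.py: GNU parallel appends each input line
    # as the last command-line argument.
    import sys

    def compute(value):
        # Placeholder for the actual scientific computation.
        return value.upper()

    if __name__ == "__main__":
        # Anything printed here ends up in `results`; an uncaught exception's
        # traceback goes to stderr and therefore into `exceptions`.
        print(compute(sys.argv[1]))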
First of all, I would like to stress that the problem Uri describes in his question is indeed faced by many people doing scientific computing. It may not be easy to see if you work on an established code base with a well-defined scope - things do not change as fast as in scientific computing or data analysis. This page has an excellent description of why one would want a simple solution for parallelizing pieces of code.
So, this project is a very interesting attempt to solve the problem. I have not tried using it myself yet, but it looks very promising!
If you with "have access to a ~100 CPUs machines" mean that you have access to 100 machines each having multiple CPUs and in case you want a system that is generic enough for different kinds of applications, then the best possible (and in my opinion only) solution is to have a management layer between your resources and your job input. This is nothing Python-specific at all, it is applicable in a much more general sense. Such a layer manages the computing resources, assigns tasks to single resource units and monitors the entire system for you. You make use of the resources via a well-defined interface as provided by the management system. Such as management system is usually called "batch system" or "job queueing system". Popular examples are SLURM, Sun Grid Engine, Torque, .... Setting each of them up is not trivial at all, but also your request is not trivial.
Python-based "parallel" solutions usually stop at the single-machine level via multiprocessing. Performing parallelism beyond a single machine in an automatic fashion requires a well-configured resource cluster. It usually involves higher level mechanisms such as the message passing interface (MPI), which relies on a properly configured resource system. The corresponding configuration is done on the operating system and even hardware level on every single machine involved in the resource pool. Eventually, a parallel computing environment involving many single machines of homogeneous or heterogeneous nature requires setting up such a "batch system" in order to be used in a general fashion.
You realize that you don't get around the effort in properly implementing such a resource pool. But what you can do is totally isolate this effort form your application layer. You once implement such a managed resource pool in a generic fashion, ready to be used by any application from a common interface. This interface is usually implemented at the command line level by providing job submission, monitoring, deletion, ... commands. It is up to you to define what a job is and which resources it should consume. It is up to the job queueing system to assign your job to specific machines and it is up to the (properly configured) operating system and MPI library to make sure that the communication between machines is working.
In case you need to use hardware distributed among multiple machines for one single application and assuming that the machines can talk to each other via TCP/IP, there are Python-based solutions implementing so to say less general job queueing systems. You might want to have a look at http://python-rq.org/ or http://itybits.com/pyres/intro.html (there are many other comparable systems out there, all based on an independent messaging / queueing instance such as Redis or ZeroMQ).
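As an indication of how little code the submission side needs, here is a sketch with python-rq (module, function and queue names are invented); workers are then started on each machine with the rq worker command and simply pull jobs from Redis:

    from redis import Redis
    from rq import Queue

    from mytasks import run_simulation   # hypothetical module containing your function

    q = Queue("experiments", connection=Redis())   # queue name is arbitrary

    # One job per input line; a failed job (bug, out-of-memory kill, ...) can
    # simply be re-enqueued after the fix, without touching the finished ones.
    with open("inputs.txt") as fh:
        jobs = [q.enqueue(run_simulation, line.strip()) for line in fh]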
Usually you write the code iteratively, as a script which performs some computation.
This makes me think you'd really like IPython notebooks.
A notebook is a file whose structure is a mix between a document and an interactive Python interpreter. As you edit the Python parts of the document they can be executed and the output embedded in the document. It's really good for programming where you're exploring the problem space and want to take notes as you go.
It's also heavily integrated with matplotlib, so you can display graphs inline. You can embed LaTeX math inline, and many media object types such as pictures and video.
Here's a basic example, and a flashier one
Finally, it works; now you wish to run it with multiple inputs and parameters, and you find it takes too much time.
Recalling that you work for a fine academic institute and have access to ~100 CPU machines, you are puzzled about how to harness this power. You start by preparing small shell scripts that run the original code with different inputs, and run them manually.
This makes me think you'd really like IPython clusters.
IPython clusters allow you to run parallel programs across multiple machines. Programs can be either SIMD (which sounds like your case) or MIMD style. Programs can be edited and debugged interactively.
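A minimal sketch of the parallel part, assuming a cluster has already been started (e.g. with ipcluster start) and that run_one wraps the body of the original script:

    import ipyparallel as ipp   # formerly IPython.parallel

    def run_one(params):
        # Placeholder for the body of your original script.
        return params

    rc = ipp.Client()                  # connects to the running cluster
    view = rc.load_balanced_view()     # hands work to whichever engine is free

    inputs = [line.strip() for line in open("inputs.txt")]   # file name assumed
    async_result = view.map_async(run_one, inputs)
    results = async_result.get()       # gather results from all engines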
There were several talks about iPython at the recent SciPy event. Going onto PyVideo.org and searching gives numerous videos, including:
Using IPython Notebook with IPython Cluster
IPython in-depth: high-productivity interactive and parallel python
IPython in Depth, SciPy2013 Tutorial Part 1 / 2 / 3
I have not watched all of these, but they're probably a good starting point.
I am interested in implementing and running some heavy graph-theory algorithms in the hope of finding counterexamples to a conjecture.
What are the most efficient libraries and server setups you would recommend?
I am thinking of using Python's Graph API.
For running the algorithms I was thinking of using Hadoop, but researching Hadoop gives me the feeling that it is more appropriate for analysing databases than for enumeration problems.
If my thinking about Hadoop is correct, what is the best server setup you would recommend for running such a process?
Any leads on how to run an algorithm in a remote distributed environment that won't require a lot of code rewriting or cost a lot of money would be helpful.
many thanks!
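For reference, if the conjecture can be checked graph by graph, the kind of brute-force search I have in mind is embarrassingly parallel even on one many-core machine; a rough sketch with networkx and multiprocessing (the check itself is a placeholder):

    import itertools
    from multiprocessing import Pool

    import networkx as nx

    N = 5   # number of labelled vertices; the search space explodes quickly

    def candidate_graphs(n):
        """Yield every graph on n labelled vertices as a tuple of edges."""
        all_edges = list(itertools.combinations(range(n), 2))
        for k in range(len(all_edges) + 1):
            yield from itertools.combinations(all_edges, k)

    def check(edge_subset):
        g = nx.Graph()
        g.add_nodes_from(range(N))
        g.add_edges_from(edge_subset)
        is_counterexample = False   # replace with the property being tested
        return edge_subset, is_counterexample

    if __name__ == "__main__":
        with Pool() as pool:
            for edges, bad in pool.imap_unordered(check, candidate_graphs(N), chunksize=64):
                if bad:
                    print("counterexample:", edges)
                    break

Since each check is independent, a plain worker pool (or the same script fanned out over a batch queue of machines) seems a more natural fit than Hadoop.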
You can look at CUDA as another option if it is a highly computational task.
You could have a look at Neo4j, which is a NoSQL graph database. If your scalability constraints are strong, it could be a good choice.
The interface is REST-based, but some Python bindings exist too (see here).
You can have a look here for a blog with some graph-theory applications (a small study on scalability can be found here).
I'm currently about to start designing a new application.
The application will allow a user to insert some data and will provide data analysis (with reports as well). I know it's not much to go on, but the data processing will be done in post-processing, so it isn't really relevant to the front end.
I'd like to start down the right path so that it's easier to scale when I need to handle more users.
I'm thinking about PostgreSQL to store the data, because I've already used it and I like it. (A NoSQL store could also be a good choice, since not all the data needs to be relational, but I like the Postgres support and community, and I feel better knowing there's a big community out there to help me.) MySQL (InnoDB) is also a good choice; to be honest I don't have a real reason to choose it over PostgreSQL or vice versa (is MySQL maybe easier to shard?).
I know several programming languages but my strengths are Python, C/C++, Javascript.
I'm not sure whether I should choose a sync or an async approach for this task (I could scale out by running more sync application instances behind a load balancer).
I've already developed another big project that taught me a lot about concurrency, but there every choice was influenced by the skills of the (whole rest of the team, but mostly the) sysadmins, so we used Python (Django) + uWSGI + nginx.
For this project (since it's totally different from the other one - that was an e-commerce site, this is a SaaS) I was also considering node.js; it would be a good opportunity to try it out in a serious project.
The heaviest data processing will be done by post-processes, so the front end (the user-facing website) will be mostly I/O (+1 for using an async environment).
What would you suggest?
P.S. I must also keep in mind that first of all the project has to start, so I cannot just think about every possible design; I should start writing code ASAP :-)
My current thoughts are:
- start with something you know
- keep it as simple as possible
- track everything to find bottlenecks
- scale out
So it wouldn't really matter whether I deploy sync or async, but I know async performs much better, and anything that helps me get better results (ergo lower costs) is worth evaluating.
I'm curious to know what are your experiences (also with other technologies)...
I'm becoming paranoid about this scalability question and I fear it could lead me to a wrong design (it's also the first time I'm designing alone for a commercial purpose = FUD).
If you need some more info please let me know and I'll try to give you an answer.
Thanks.
A good resource for all of this is http://highscalability.com/. Lots of interesting case studies about handling big web loads.
You didn't mention it but you might want to think about hosting it in the cloud (Azure, Amazon, etc). Makes scaling the hardware a little easier and it's especially nice if your demand fluctuates.
Here are some basic guidelines:
Use as many async processes as possible, or at least design things so that they can be converted to async later.
Design processes so that they can be segregated onto different servers. This also follows from the point above: say you have a web app with some intensive processing. If that processing is async, the main web server can queue the job and be done with it; a separate server then picks up the job and processes it (see the sketch after this list). That way your main web servers are not affected. If you are resource-constrained, you can still run the background process on the same server until you have enough clients, and then move it to a different server.
Design for load balancing. If your app uses sessions, you should factor in how (or whether) you will replicate them. You don't have to - you could send a user to a particular server and then forward all subsequent requests to that server - but you still have to design for it.
Have the ability to route load to different servers based on some predefined criteria. For example, since your app is a SaaS app, you could decide that certain clients go to Environment1 and other clients go to Environment2. A lot of the SaaS players do this, for example Salesforce.
You don't necessarily have to do this from the get-go, but having this ability will go a long way towards scaling your app when the time comes.
Also, remember that these approaches are not exclusive. You should design your app for all of them, but only implement each one when required.
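As a minimal sketch of the queue-and-hand-off point above (Celery with a Redis broker is just one possible choice; all names are placeholders): the web process only enqueues and returns, and a worker on the same or a different server does the heavy lifting.

    # tasks.py - runs wherever you start the Celery worker(s).
    from celery import Celery

    app = Celery("reports", broker="redis://localhost:6379/1")   # broker URL assumed

    @app.task
    def build_report(dataset_id):
        # The intensive post-processing lives here, off the web server.
        ...

    # views.py - the web process only enqueues and returns immediately.
    from tasks import build_report

    def report_requested(request, dataset_id):
        build_report.delay(dataset_id)   # queued; a worker picks it up
        return "accepted"                # e.g. an HTTP 202 in your framework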
Take a look at the book The Art of Scalability
This book was written by guys that worked with eBay & Paypal.
Take a look at this excellent presentation on scalability patterns and approaches.