Persistent dataflows with dask - python

I am interested in working with persistent distributed dataflows, with features similar to those of the Pegasus project (https://pegasus.isi.edu/), for example.
Do you think there is a way to do that with dask?
I tried to implement something that works with a SLURM cluster and dask.
Below I describe my solution in broad strokes, in order to better specify my use case.
The idea is to execute medium-sized tasks (running from a few minutes to a few hours) that are specified by a graph, which can be persisted and easily extended.
I implemented something based on dask's scheduler and its graph API.
In order to have persistency, I wrote two kinds of decorators:
a "memoize" decorator that serializes, in a customizable way, the complex arguments of the functions as well as their results (a little like dask does with cachey or chest, or like Spark does with its RDD objects), and
a "delayed" decorator that executes functions on a cluster (SLURM). In practice the API of the functions is modified so that they take the job ids of their dependencies as arguments and return the job id of the job created on the cluster. The functions are also serialized into a text file "launch.py", which is launched with the cluster's command-line API.
The taskname-jobid association is saved in a JSON file, which makes it possible to manage persistency using the task status returned by the cluster.
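A minimal sketch of the "memoize" decorator described above, assuming results are pickled to a cache directory keyed by a hash of the call; the names (memoize, cache_dir) are illustrative, and the SLURM-submitting "delayed" decorator is omitted since it depends on the cluster's command-line API:

```python
import functools
import hashlib
import os
import pickle
import tempfile

cache_dir = tempfile.mkdtemp()  # illustrative; a real setup would use a fixed path

def memoize(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # key the cache on function name and arguments
        key = hashlib.sha256(
            pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        ).hexdigest()
        path = os.path.join(cache_dir, key + ".pkl")
        if os.path.exists(path):
            # intermediate result already on disk; reusable across runs,
            # and inspectable without the workflow that produced it
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(*args, **kwargs)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper

@memoize
def add(a, b):
    return a + b

print(add(2, 3))  # computed, then written to the cache
print(add(2, 3))  # loaded back from the pickle file
```

Because the cached files are plain pickles on disk, any later process can load an intermediate result directly, which is the persistency property the question emphasizes.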
Working this way gives a kind of persistency of the graph.
It offers the possibility to easily debug tasks that failed.
Using a serialization mechanism makes it easy to access all intermediate results, even without the whole workflow and/or the functions that generated them.
It is also easy, this way, to interact with legacy applications that do not use this kind of dataflow mechanism.
This solution is certainly a little naive compared to other, more modern ways to execute distributed workflows with dask and distributed, but it seems to me to have some advantages due to its persistency (of tasks and data) capabilities.
I'm interested to know whether this solution seems pertinent, and whether it describes an interesting use case not yet addressed by dask.
If someone can recommend other ways to do this, I am also interested!

Related

Faster alternative to apache airflow for workflows with many tasks

I currently use Apache Airflow for running data aggregation and ETL workflows. My workflows are fairly complex, with one workflow having 15-20 tasks and branches. I could combine them, but doing so would negate the features, like retries and execution timers, that I use. Airflow works well, except that it is quite slow with so many tasks: it takes a lot of time between tasks.
Is there an alternative which can execute the tasks faster without gaps in between tasks? I also would like to minimize the effort needed to switch over if possible.
I would recommend Temporal Workflow. It has a more developer-friendly programming model and scales to use cases orders of magnitude larger. It is also already used for multiple latency-sensitive applications at many companies.
Disclaimer: I'm the tech lead of the Temporal project and the Co-founder/CEO of the associated company.
I would recommend that you try out Dataplane. It is an Airflow alternative written in Go for very fast performance, and it can scale with far fewer resources. It has a built-in Python code editor with a drag-and-drop data pipeline builder. It also has segregated environments, so you can build your route to live or different data domains to construct a data mesh. It is totally free to use.
Here is the link: https://github.com/dataplane-app/dataplane
Disclaimer: I am part of the community that actively contributes towards Dataplane.

Parallelize Python code on a cluster [duplicate]

I'm interested in running a Python program using a computer cluster. I have in the past been using Python MPI interfaces, but due to difficulties in compiling/installing these, I would prefer solutions which use built-in modules, such as Python's multiprocessing module.
What I would really like to do is just set up a multiprocessing.Pool instance that would span across the whole computer cluster, and run a Pool.map(...). Is this something that is possible/easy to do?
If this is impossible, I'd like to at least be able to start Process instances on any of the nodes from a central script with different parameters for each node.
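For reference, the single-node version of what the question describes is straightforward with the standard library; it is only the cluster-wide version that multiprocessing cannot provide:

```python
# Single-node Pool.map with the built-in multiprocessing module. This is the
# pattern the question wants to extend across a cluster; multiprocessing
# itself keeps all worker processes on one machine.
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]
```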
If by cluster computing you mean distributed-memory systems (multiple nodes rather than SMP), then Python's multiprocessing may not be a suitable choice. It can spawn multiple processes, but they will still be bound within a single node.
What you will need is a framework that handles the spawning of processes across multiple nodes and provides a mechanism for communication between the processes (pretty much what MPI does).
See the page on Parallel Processing on the Python wiki for a list of frameworks which will help with cluster computing.
From the list, pp, jug, pyro and celery look like sensible options although I can't personally vouch for any since I have no experience with any of them (I use mainly MPI).
If ease of installation/use is important, I would start by exploring jug. It's easy to install, supports common batch cluster systems, and looks well documented.
In the past I've used Pyro to do this quite successfully. If you turn on mobile code, it will automatically send required modules over the wire to nodes that don't already have them. Pretty nifty.
I have had luck using SCOOP as an alternative to multiprocessing, for single- or multi-computer use, gaining the benefit of job submission for clusters as well as many other features such as nested maps, with minimal code changes to get working with map().
The source is available on Github. A quick example shows just how simple implementation can be!
If you are willing to pip install an open source package, you should consider Ray, which out of the Python cluster frameworks is probably the option that comes closest to the single threaded Python experience. It allows you to parallelize both functions (as tasks) and also stateful classes (as actors) and does all of the data shipping and serialization as well as exception message propagation automatically. It also allows similar flexibility to normal Python (actors can be passed around, tasks can call other tasks, there can be arbitrary data dependencies, etc.). More about that in the documentation.
As an example, this is how you would do your multiprocessing map example in Ray:
import ray

ray.init()

@ray.remote
def mapping_function(input):
    return input + 1

results = ray.get([mapping_function.remote(i) for i in range(100)])
The API is a little different from Python's multiprocessing API, but should be easier to use. There is a walk-through tutorial that describes how to handle data dependencies, actors, etc.
You can install Ray with "pip install ray" and then execute the above code on a single node; it is also easy to set up a cluster (see Cloud support and Cluster support).
Disclaimer: I'm one of the Ray developers.

How should I configure Amazon EC2 to perform parallelizable data-intensive calculations?

I have a computational intensive project that is highly parallelizable: basically, I have a function that I need to run on each observation in a large table (Postgresql). The function itself is a stored python procedure.
Amazon EC2 seems like an excellent fit for the project.
My question is this: Should I make a custom image (AMI) that already contains the database? This would seem to have the advantage of minimizing data transfers and making parallelization simple: each image could get some assigned block of indices to compute, e.g., image 1 gets 1:100, image 2 101:200 etc. Splitting up the data and the instances (which most how-to guides suggest) doesn't seem to make sense for my application, but I'm very new to this so I'm not confident my intuition is right.
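The block-assignment idea from the question can be sketched as follows, assuming rows are identified by a contiguous integer index; index_block is a hypothetical helper, not part of any AWS tooling:

```python
def index_block(total_rows, n_workers, worker_id):
    """Return the contiguous range of row indices assigned to one instance,
    e.g. 'image 1 gets 1:100, image 2 gets 101:200'. Hypothetical helper."""
    per_worker = -(-total_rows // n_workers)  # ceiling division
    start = worker_id * per_worker
    return range(start, min(start + per_worker, total_rows))

# Each EC2 instance would call this with its own worker_id:
print(list(index_block(10, 3, 0)))  # [0, 1, 2, 3]
print(list(index_block(10, 3, 2)))  # [8, 9]
```

Together the blocks cover every row exactly once, so instances can run independently with no coordination beyond knowing their own id.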
You will definitely want to keep the data and the server instance separate, so that changes in your data persist after you are done with the instance. Your best bet will be to start with a basic image that has the OS and database platform you want to use, customize it to suit your needs, and then mount one or more EBS volumes containing your data. You may also want to create your own server image once you are finished with your customization, unless what you are doing is fairly straightforward.
some helpful links:
http://docs.amazonwebservices.com/AmazonEC2/gsg/2006-10-01/creating-an-image.html
http://developer.amazonwebservices.com/connect/entry.jspa?categoryID=100&externalID=1663
(you said postgres but this mysql tutorial covers the same basic concepts you'll want to keep in mind)
If you've already got the function implemented in Python, the simplest route might be to look at PiCloud, which just gives you a really easy interface for running a Python function on EC2, handling pretty much everything else for you. Whether it's economically sensible will depend on how much data has to get sent per function call vs how long computations take to run.

How would one make Python objects persistent in a web-app?

I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which does not change often, plus user specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state from the (semi-normalized) database data on every request quickly becomes infeasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there are several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared memory solutions but the only relevant thing I've found is POSH. However POSH doesn't seem to be widely used and I don't feel easy integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above-mentioned issue is the actual bottleneck. I'm pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suits your solution.
Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage it to learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, use session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether MySQL socket or named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) measurements that are fine-grained enough to be useful and stable enough to provide a meaningful comparison of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.
I think that the multiprocessing framework has what might be applicable here - namely the shared ctypes module.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.
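A sketch of the shared-ctypes idea, assuming the cooperating processes are spawned by multiprocessing itself (as noted above, it is unclear whether it helps with unrelated processes):

```python
from multiprocessing import Process, Value

def bump(counter, times):
    # increment an integer that lives in shared memory, visible to all
    # processes holding a handle to it
    for _ in range(times):
        with counter.get_lock():
            counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)  # 'i' = C int, zero-initialized
    workers = [Process(target=bump, args=(counter, 100)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)  # 400
```

Unlike the pickle/memcached route, reads and writes here touch shared memory directly, with no serialization step; the trade-off is that only ctypes-representable data fits this model, not arbitrary Python objects.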
I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initially it was an integral part of Zope, but a standalone package is also available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read it but haven't given it a shot myself though.
Another possibility could be an in-memory SQLite db; being in-memory, it may speed things up a bit, but you would still have to do the serialization work and all.
Note: an in-memory db is expensive on resources.
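A sketch of the in-memory SQLite idea; note that a ":memory:" database is private to one process, so it sidesteps neither the serialization cost nor the multi-process sharing problem (the table and function names here are illustrative):

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")  # private to this process
conn.execute("CREATE TABLE state (user_id TEXT PRIMARY KEY, blob BLOB)")

def save_state(user_id, obj):
    # still pays the pickle cost the question complains about
    conn.execute("REPLACE INTO state VALUES (?, ?)",
                 (user_id, pickle.dumps(obj)))

def load_state(user_id):
    row = conn.execute("SELECT blob FROM state WHERE user_id = ?",
                       (user_id,)).fetchone()
    return pickle.loads(row[0]) if row else None

save_state("alice", {"score": 42})
print(load_state("alice"))  # {'score': 42}
```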
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1
First of all, your approach is not common web development practice. Even when multithreading is used, web applications are designed to run in multi-process environments, for both scalability and easier deployment.
If you just need to initialize a large object and do not need to change it later, you can do so easily with a global variable that is initialized while your WSGI application is being created, or when the module containing the object is loaded; multi-processing will do fine for you.
If you need to change the object and access it from every thread, you need to be sure your object is thread-safe; use locks to ensure that. And use a single server context, i.e. one process. Any multithreaded Python server will serve you well; FCGI is also a good choice for this kind of design.
But if multiple threads are accessing and changing your object, the locks may have a really bad effect on your performance gain, which is likely to make all the benefits go away.
"This is Durus, a persistent object system for applications written in the Python programming language. Durus offers an easy way to use and maintain a consistent collection of object instances used by one or more processes. Access and change of persistent instances is managed through a cached Connection instance, which includes commit() and abort() methods so that changes are transactional."
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.
Another option is to review the requirement for state: it sounds like, if serialization is the bottleneck, the object is very large. Do you really need an object that large?
I know that in Stack Overflow podcast 27 the reddit guys discuss what they use for state, so that may be useful to listen to.
