Dataflow computing in python

Dataflow computing in python - python

I have n (typically n < 10 but it should scale) processes running on different machines and communicating through amqp using RabbitMQ. Processes are typically long running and may be implemented in any language (though most are java/python).
Each process requires a number of inputs (numbers/strings) and produces a number of outputs (also just numbers or strings). Executing a process happens asynchronously: sending a message on its input queue and waiting for a callback to be triggered by the output queue.
Ideally the user specifies some inputs and desired outputs and the system should:
detect which processes are needed and generate the dependency graph
topologically sort the graph and execute it, node transitions will need to be event driven
A node should fire if its input is ready, allowing parallelism per branch. I can assume no cycles for now, but eventually there will be cycles (e.g., two processes may need to iterate until the output no longer changes).
This should be a known problem from (data)flow programming (discussed here before) and I want to avoid re-inventing the wheel. I would prefer a python solution and a search leads to Trellis and Pypes. Trellis is no longer developed but seems to support cycles, while pypes does not. Also not sure how actively developed pypes is.
Further searches reveal a whole list of event based programming frameworks, none of which I am particularly knowledgeable about. There are of course workflow environments like Taverna and KNIME, but that seems overkill.
Does anybody have any experience tackling this type of problem or with the libraries mentioned?
Edit: Other libraries I found are:
Stream
zflow
pyf
javafbp (Java)

python.org has a Wiki page on "Flow Based Programming" -- http://wiki.python.org/moin/FlowBasedProgramming

The bottom line is that if you can reinvent the wheel in a small number of lines of code ( a few hundred) which you completely understand and can document, then do it.
This is an area where the abstractions used are not that hard to implement given some basic foundation tools. RabbitMQ is such a tool. Node.js is another. There are lots of libraries around that implement useful ways to manages dataflows, workflows, finite state machines, etc., but they have a lot of overlap and they tend to be incomplete. Probably the original developer just built enough to get over his initial problem, and since this type of programming was not that popular, there was not the critical mass to keep development going.
There is a lot to be said for ranking all the possible solutions by popularity, picking the most popular one, and putting your effort into making it work (while sharing your work, of course).

Related

Cleanest method of inter-program communication on separate devices

I need to settle on a method of communication between multiple programs spread out over multiple machines. The data itself is fairly straight forwards, comprising of variable length vectors with some meta-data descriptors.
Each program "type" will send data to 1 other program type, and expects to see a reply from it.
The number of connections a given program has is variable over time, and programs can be added or removed at any moment.
Programs may be distributed across several processors, which may use different operating systems, or could be microprocessors.
I have not had to really contend with such a inter-communication puzzle before and am unsure of what the cleanest approach would be. I've heard that ROS might be useful, but that it may not be suitable for Windows environments and has a steep learning curve. I can imagine creating a database between the nodes which would act as a kind of blackboard, but this feels like it may be very inefficient. If using sockets would resolve efficiencies, then how would the connections be managed / maintained?
My main development environment at the moment is Julia, but I can port the data over to C++ or Python without too much discomfort.
I am open to ideas and learning new things, but I can't figure out which path to take. Any nuggets of wisdom would be greatly appreciated.
As requested in a comment, here is some additional information:
Data: The data is a vector of floating point numbers, which can vary in size anywhere from 1 value to 1000's of values (maybe more depending on other aspects of the project, but I'm capping it for prototyping purposes). The meta-data is in json format and provides information such as the total length of the array, and a few identifiers (all integers).
Reliability: I can handle a few missed messages here and there. I'd need the data in the order it was sent though. So if a packet of data arrives after a more recent one it will be ignored. (In case you were instead referring to data integrity, then the data must not be corrupted in transit).
Latency: As little as possible as the program could start to introduce errors otherwise. That being said, for prototyping purposes I can persevere with 10's of milliseconds average, and irregular 100's of milliseconds. I would very much like to keep this down though.
Standard IP Facilities: You may assume all devices have these, yes.

Perhaps taking a look at something well established (and therefore should be able to hook up to whichever language/s you end up using) like Redis or RabbitMQ would be wise seeing as you'll need to be able to communicate with a number of different devices over a network. Either way having some form of central hub that handles communication among the other devices would be advisable.

Harvesting the power of highly-parallel computers with python scientific code [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
Improve this question
I run into the following problem when writing scientific code with Python:
Usually you write the code iteratively, as a script which perform some computation.
Finally, it works; now you wish to run it with multiple inputs and parameters and find it takes too much time.
Recalling you work for a fine academic institute and have access to a ~100 CPUs machines, you are puzzled how to harvest this power. You start by preparing small shell scripts which run the original code with different inputs and run them manually.
Being an engineer, I know all about the right architecture for this (with work items queued, and worker threads or processes, and work results queued and written to persistent store); but I don't want to implement this myself. The most problematic issue is the need for reruns due to code changes or temporary system issues (e.g. out-of-memory).
I would like to find some framework to which I will provide the wanted inputs (e.g. with a file with one line per run) and then I will be able to just initiate multiple instances of some framework-provided agent which will run my code. If something went bad with the run (e.g. temporary system issue or thrown exception due to bug) I will be able to delete results and run some more agents. If I take too many resources, I will be able to kill some agents without a fear of data-inconsistency, and other agents will pick-up the work-items when they find the time.
Any existing solution? Anyone wishes to share his code which do just that? Thanks!

I might be wrong, but simply using GNU command line utilities, like parallel, or even xargs, seems appropriate to me for this case. Usage might look like this:
cat inputs | parallel ./job.py --pipe > results 2> exceptions
This will execute job.py for every line of inputs in parallel, output successful results into results, and failed ones to exceptions. A lot of examples of usage (also for scientific Python scripts) can be found in this Biostars thread.
And, for completeness, Parallel documentation.

First of all, I would like to stress that the problem that Uri described in his question is indeed faced by many people doing scientific computing. It may be not easy to see if you work with a developed code base that has a well defined scope - things do not change as fast as in scientific computing or data analysis. This page has an excellent description why one would like to have a simple solution for parallelizing pieces of code.
So, this project is a very interesting attempt to solve the problem. I have not tried using it myself yet, but it looks very promising!

If you with "have access to a ~100 CPUs machines" mean that you have access to 100 machines each having multiple CPUs and in case you want a system that is generic enough for different kinds of applications, then the best possible (and in my opinion only) solution is to have a management layer between your resources and your job input. This is nothing Python-specific at all, it is applicable in a much more general sense. Such a layer manages the computing resources, assigns tasks to single resource units and monitors the entire system for you. You make use of the resources via a well-defined interface as provided by the management system. Such as management system is usually called "batch system" or "job queueing system". Popular examples are SLURM, Sun Grid Engine, Torque, .... Setting each of them up is not trivial at all, but also your request is not trivial.
Python-based "parallel" solutions usually stop at the single-machine level via multiprocessing. Performing parallelism beyond a single machine in an automatic fashion requires a well-configured resource cluster. It usually involves higher level mechanisms such as the message passing interface (MPI), which relies on a properly configured resource system. The corresponding configuration is done on the operating system and even hardware level on every single machine involved in the resource pool. Eventually, a parallel computing environment involving many single machines of homogeneous or heterogeneous nature requires setting up such a "batch system" in order to be used in a general fashion.
You realize that you don't get around the effort in properly implementing such a resource pool. But what you can do is totally isolate this effort form your application layer. You once implement such a managed resource pool in a generic fashion, ready to be used by any application from a common interface. This interface is usually implemented at the command line level by providing job submission, monitoring, deletion, ... commands. It is up to you to define what a job is and which resources it should consume. It is up to the job queueing system to assign your job to specific machines and it is up to the (properly configured) operating system and MPI library to make sure that the communication between machines is working.
In case you need to use hardware distributed among multiple machines for one single application and assuming that the machines can talk to each other via TCP/IP, there are Python-based solutions implementing so to say less general job queueing systems. You might want to have a look at http://python-rq.org/ or http://itybits.com/pyres/intro.html (there are many other comparable systems out there, all based on an independent messaging / queueing instance such as Redis or ZeroMQ).

Usually you write the code iteratively, as a script which perform some computation.
This makes me think you'd really like ipython notebooks
A notebook is a file that has a structure which is a mix between a document and an interactive python interpreter. As you edit the python parts of the document they can be executed and the output embedded in the document. It's really good programming where you're exploring the problem space, and want to make notes as you go.
It's also heavily integrated with matplotlib, so you can display graphs inline. You can embed Latex math inline, and many media objects types such as pictures and video.
Here's a basic example, and a flashier one
Finally, it works; now you wish to run it with multiple inputs and parameters and find it takes too much time.
Recalling you work for a fine academic institute and have access to a ~100 CPUs machines, you are puzzled how to harvest this power. You start by preparing small shell scripts which run the original code with different inputs and run them manually.
This makes me think you'd really like ipython clusters
iPython clusters allow you to run parallel programs across multiple machines. Programs can either be SIMD (which sound like your case) or MIMD style. Programs can be edited and debugged interactively.
There were several talks about iPython at the recent SciPy event. Going onto PyVideo.org and searching gives numerous videos, including:
Using IPython Notebook with IPython Cluster
IPython in-depth: high-productivity interactive and parallel python
IPython in Depth, SciPy2013 Tutorial Part 1 / 2 / 3
I not watched all of these, but they're probably a good starting point.

Would python be an appropriate choice for a video library for home use software

I am thinking of creating a video library software which keep track of all my videos and keep track of videos that I already haven't watched and stats like this. The stats will be specific to each user using the software.
My question is, is python appropriate to create this software or do I need something like c++.

Python is perfectly appropriate for such tasks - indeed the most popular video site, YouTube, is essentially programmed in Python (using, of course, lower-level components called from Python for such tasks as web serving, relational db, video transcoding -- there are plenty of such reusable opensource components for all these kinds of tasks, but your application's logic flow and all application-level logic can perfectly well be in Python).
Just yesterday evening, at the local Python interest group meeting in Mountain View, we had new members who just moved to Silicon Valley exactly to take Python-based jobs in the video industry, and they were saying that professional level video handing in the industry is also veering more and more towards Python -- stalwarts like Pixar and ILM had been using Python forever, but in the last year or two it's been a flood of Python adoption in the industry.

If you want your code to be REAL FAST, use C++ (or parallel fortran).
However in your application, 99% of the runtime isn't going to be in YOUR code, it's going to be in GUI libraries, OS calls, waiting for user interaction, calling libraries (written in C) to open video files and make thumbnails, that kind of stuff.
So using C++ will make your code 100 times faster, and your application will, as a result, be 1% faster, which is utterly useless. And if you write it in C++ you'll need months, whereas using Python you'll be finished much faster and have lots more fun.
Using C++ could even make it a lot slower, because in Python you can very easily build more scalable algorithms by using super powerful primitives like hashes, sets, generators, etc, try several algorithms in 5 minutes to see which one is the best, import a library which already does 90% of the work, etc.
Write it in Python.

Yes. Python is much easier to use than c++ for something like this. You may want to use it as a front-end to a DB such as sqlite3

Maybe you should take a look at this project:
Moovida
It's a complete media center, open source, written in python that is easy to extend. I don't know if it will do exactly what you want out of the box but you can probably easily add the features you want.

Of course you can use almost any programming language for almost any task. But after noting that, it's also obvious that different languages are also differently well adapted for different tasks.
C/C++ are languages that are very "hardware friendly". Basically, the languages are just one abstraction level above assembler, with C's use of pointers etc. C++ is almost like a (semi-)portable object oriented assembler, if one wants to be funny. :) This makes C/C++ fast and good at talking to hardware.
But those same features become mis-features in other cases. The pointers make it possible to walk all over the memory and unless you are careful you will leak memory all over the place. So I would say (and now C people will get angry) that C/C++ in fact are directly inappropriate for what you want to do.
You want a language that are higher, does more things like memory management automatically and invisibly. There are many to choose from there, but without a doubt Python is eminently suited for this. Python has the last couple of years emerged as The New Cool Language to write these kind of softwares in, and much multimedia software such as Freevo and the already mentioned Moovida are written in Python.

Python on an Real-Time Operation System (RTOS)

I am planning to implement a small-scale data acquisition system on an RTOS platform. (Either on a QNX or an RT-Linux system.)
As far as I know, these jobs are performed using C / C++ to get the most out of the system. However I am curious to know and want to learn some experienced people's opinions before I blindly jump into the coding action whether it would be feasible and wiser to write everything in Python (from low-level instrument interfacing through a shiny graphical user interface). If not, mixing with timing-critical parts of the design with "C", or writing everything in C and not even putting a line of Python code.
Or at least wrapping the C code using Python to provide an easier access to the system.
Which way would you advise me to work on? I would be glad if you point some similar design cases and further readings as well.
Thank you
NOTE1: The reason of emphasizing on QNX is due to we already have a QNX 4.25 based data acquisition system (M300) for our atmospheric measurement experiments. This is a proprietary system and we can't access the internals of it. Looking further on QNX might be advantageous to us since 6.4 has a free academic licensing option, comes with Python 2.5, and a recent GCC version. I have never tested a RT-Linux system, don't know how comparable it to QNX in terms of stability and efficiency, but I know that all the members of Python habitat and non-Python tools (like Google Earth) that the new system could be developed on works most of the time out-of-the-box.

I've built several all-Python soft real-time (RT) systems, with primary cycle times from 1 ms to 1 second. There are some basic strategies and tactics I've learned along the way:
Use threading/multiprocessing only to offload non-RT work from the primary thread, where queues between threads are acceptable and cooperative threading is possible (no preemptive threads!).
Avoid the GIL. Which basically means not only avoiding threading, but also avoiding system calls to the greatest extent possible, especially during time-critical operations, unless they are non-blocking.
Use C modules when practical. Things (usually) go faster with C! But mainly if you don't have to write your own: Stay in Python unless there really is no alternative. Optimizing C module performance is a PITA, especially when translating across the Python-C interface becomes the most expensive part of the code.
Use Python accelerators to speed up your code. My first RT Python project greatly benefited from Psyco (yeah, I've been doing this a while). One reason I'm staying with Python 2.x today is PyPy: Things always go faster with LLVM!
Don't be afraid to busy-wait when critical timing is needed. Use time.sleep() to 'sneak up' on the desired time, then busy-wait during the last 1-10 ms. I've been able to get repeatable performance with self-timing on the order of 10 microseconds. Be sure your Python task is run at max OS priority.
Numpy ROCKS! If you are doing 'live' analytics or tons of statistics, there is NO way to get more work done faster and with less work (less code, fewer bugs) than by using Numpy. Not in any other language I know of, including C/C++. If the majority of your code consists of Numpy calls, you will be very, very fast. I can't wait for the Numpy port to PyPy to be completed!
Be aware of how and when Python does garbage collection. Monitor your memory use, and force GC when needed. Be sure to explicitly disable GC during time-critical operations. All of my RT Python systems run continuously, and Python loves to hog memory. Careful coding can eliminate almost all dynamic memory allocation, in which case you can completely disable GC!
Try to perform processing in batches to the greatest extent possible. Instead of processing data at the input rate, try to process data at the output rate, which is often much slower. Processing in batches also makes it more convenient to gather higher-level statistics.
Did I mention using PyPy? Well, it's worth mentioning twice.
There are many other benefits to using Python for RT development. For example, even if you are fairly certain your target language can't be Python, it can pay huge benefits to develop and debug a prototype in Python, then use it as both a template and test tool for the final system. I had been using Python to create quick prototypes of the "hard parts" of a system for years, and to create quick'n'dirty test GUIs. That's how my first RT Python system came into existence: The prototype (+Psyco) was fast enough, even with the test GUI running!
-BobC
Edit: Forgot to mention the cross-platform benefits: My code runs pretty much everywhere with a) no recompilation (or compiler dependencies, or need for cross-compilers), and b) almost no platform-dependent code (mainly for misc stuff like file handling and serial I/O). I can develop on Win7-x86 and deploy on Linux-ARM (or any other POSIX OS, all of which have Python these days). PyPy is primarily x86 for now, but the ARM port is evolving at an incredible pace.

I can't speak for every data acquisition setup out there, but most of them spend most of their "real-time operations" waiting for data to come in -- at least the ones I've worked on.
Then when the data does come in, you need to immediately record the event or respond to it, and then it's back to the waiting game. That's typically the most time-critical part of a data acquisition system. For that reason, I would generally say stick with C for the I/O parts of the data acquisition, but there aren't any particularly compelling reasons not to use Python on the non-time-critical portions.
If you have fairly loose requirements -- only needs millisecond precision, perhaps -- that adds some more weight to doing everything in Python. As far as development time goes, if you're already comfortable with Python, you would probably have a finished product significantly sooner if you were to do everything in Python and refactor only as bottlenecks appear. Doing the bulk of your work in Python will also make it easier to thoroughly test your code, and as a general rule of thumb, there will be fewer lines of code and thus less room for bugs.
If you need to specifically multi-task (not multi-thread), Stackless Python might be beneficial as well. It's like multi-threading, but the threads (or tasklets, in Stackless lingo) are not OS-level threads, but Python/application-level, so the overhead of switching between tasklets is significantly reduced. You can configure Stackless to multitask cooperatively or preemptively. The biggest downside is that blocking IO will generally block your entire set of tasklets. Anyway, considering that QNX is already a real-time system, it's hard to speculate whether Stackless would be worth using.
My vote would be to take the as-much-Python-as-possible route -- I see it as low cost and high benefit. If and when you do need to rewrite in C, you'll already have working code to start from.

Generally the reason advanced against using a high-level language in a real-time context is uncertainty -- when you run a routine one time it might take 100us; the next time you run the same routine it might decide to extend a hash table, calling malloc, then malloc asks the kernel for more memory, which could do anything from returning instantly to returning milliseconds later to returning seconds later to erroring, none of which is immediately apparent (or controllable) from the code. Whereas theoretically if you write in C (or even lower) you can prove that your critical paths will "always" (barring meteor strike) run in X time.

Our team have done some work combining multiple languages on QNX and had quite a lot of success with the approach. Using python can have a big impact on productivity, and tools like SWIG and ctypes make it really easy to optimize code and combine features from the different languages.
However, if you're writing anything time critical, it should almost certainly be written in c. Doing this means you avoid the implicit costs of an interpreted langauge like the GIL (Global Interpreter Lock), and contention on many small memory allocations. Both of these things can have a big impact on how your application performs.
Also python on QNX tends not to be 100% compatible with other distributions (ie/ there are sometimes modules missing).

One important note: Python for QNX is generally available only for x86.
I'm sure you can compile it for ppc and other archs, but that's not going to work out of the box.

Easier concurrency building blocks for Python?

It seems that Python standard library lacks various useful concurrency-related concepts such as atomic counter, executor and others that can be found in e.g. java.util.concurrent. Are there any external libraries that would provide easier building blocks for concurrent Python applications?

Kamaelia, as already mentioned, is aimed at making concurrency easier to work with in python.
Its original use case was network systems (which are a naturally concurrent) and developed with the viewpoint "How can we make these systems easier to develop and maintain".
Since then life has moved on and it is being used in a much wider variety of problem domains from desktop systems (like whiteboarding applications, database modelling, tools for teaching children to read and write) through to back end systems for websites (like stuff for transcoding & converting user contributed images and video for web playback in a variety of scenarios and SMS / text messaging applications.
The core concept is essentially the same idea as Unix pipelines - except instead of processes you can have python generators, threads, or processes - which are termed components. These communicate over inboxes and outboxes - as many as you like of each, rather than just stdin/stdout/stderr. Also rather than requiring serialised file interfaces, you pass between components fully fledged python objects. Also rather than being limited to pipelines, you can have arbitrary shapes - called graphlines.
You can find a full tutorial (video, slides, downloadable PDF booklet) here:
http://www.kamaelia.org/PragmaticConcurrency
Or the 5 minute version here (O'Reilly ignite talk):
http://yeoldeclue.com/cgi-bin/blog/blog.cgi?rm=viewpost&nodeid=1235690128
The focus on the library is pragmatic development, system safety and ease of maintenance though some effort has gone in recently towards adding some syntactic sugar. Like anything the developers (me and others :-) welcome feedback on improving it.
You can also find more information here:
- http://www.slideshare.net/kamaelian
Primarily, Kamaelia's core (Axon) was written to make my day job easier, and to wrap up best practice (message passing, software transactional memory) in a reusable fashion. I hope it makes your life easier too :-)

Although it may not be immediately obvious, itertools.count is indeed an atomic counter (the only operation on an instance x thereof, spelled next(x), is equivalent to an "atomic ++x" if C had such a concept;-). Edit: at least, this surely holds in CPython; I thought it was part of the Python standard definition but apparently IronPython and Jython disagree (not ensuring thread-safety of count.next in their current implementations) so I may well be wrong!
That is, suppose you currently have a data structure such as:
counters = dict.fromkeys(words_of_interest, 0)
...
if w in counters: counters[w] += 1
and your problem is that the latter increment is not atomic, so if two threads are at the same time dealing with the same word of interest the two increments might interfere (only one would "take", so the counter would be incremented only by one, not by two). Then:
counters = dict((w, itertools.count()) for w in words_of_interest)
...
if w in counters: next(counters[w])
will perform the same operations, but in an atomic way.
(There is unfortunately no obvious, documented way to "extract the current value of the counter", though in fact str(x) does return a string such as 'count(3)' from which the current value can be parsed out again;-).

Concurrency in Python (at least CPython) and Java are wildly different, at least in part because of the Global Interpreter Lock (GIL). In general, concurrency in Python is achieved not with threads, but processes. See multiprocessing for the "standard" concurrency module.
Also, check out "A Curious Course on Coroutines and Concurrency" for some concurrency techniques that were pretty new to me coming from Java. David Beazley (the author) is a Smart Guy™ when it comes to Python in general, and concurrency in particular.

kamaelia provides tools for abstracting concurrency to threads or process etc.

P-workers creates a "Job-Worker" abstraction over the python multiprocessing library. It simplifies concurrency with multiprocessing by starting "Workers" that have specific skills/attributes (defined functions), and providing a queue where they receive "Jobs" from. Its somewhat analogous to a thread pool, only with processes instead of threads. Therefore its better suited for a high number of CPU instructions. You can also use it to spawn multiple instances of a single application or even spawn "Workers" that have multiple threads.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.