I am planning to implement a small-scale data acquisition system on an RTOS platform. (Either on a QNX or an RT-Linux system.)
As far as I know, these jobs are usually done in C/C++ to get the most out of the system. However, before I blindly jump into coding, I am curious to hear some experienced people's opinions on whether it would be feasible and wiser to write everything in Python (from the low-level instrument interfacing up through a shiny graphical user interface), to mix Python with C for the timing-critical parts of the design, or to write everything in C without a single line of Python.
Or, at the least, to wrap the C code in Python to provide easier access to the system.
Which way would you advise me to go? I would be glad if you could point me to some similar design cases and further reading as well.
Thank you
NOTE1: The reason for emphasizing QNX is that we already have a QNX 4.25-based data acquisition system (M300) for our atmospheric measurement experiments. It is a proprietary system and we can't access its internals. Looking further at QNX might be advantageous for us, since 6.4 has a free academic licensing option, comes with Python 2.5, and ships a recent GCC version. I have never tested an RT-Linux system and don't know how it compares to QNX in terms of stability and efficiency, but I do know that the whole Python ecosystem, and the non-Python tools (like Google Earth) that the new system could be built on, work out of the box most of the time.
I've built several all-Python soft real-time (RT) systems, with primary cycle times from 1 ms to 1 second. There are some basic strategies and tactics I've learned along the way:
Use threading/multiprocessing only to offload non-RT work from the primary thread, where queues between threads are acceptable and cooperative threading is possible (no preemptive threads!).
Avoid the GIL, which basically means not only avoiding threading, but also avoiding system calls to the greatest extent possible, especially during time-critical operations, unless they are non-blocking.
Use C modules when practical. Things (usually) go faster with C! But mainly if you don't have to write your own: Stay in Python unless there really is no alternative. Optimizing C module performance is a PITA, especially when translating across the Python-C interface becomes the most expensive part of the code.
Use Python accelerators to speed up your code. My first RT Python project greatly benefited from Psyco (yeah, I've been doing this a while). One reason I'm staying with Python 2.x today is PyPy: things always go faster with a tracing JIT!
Don't be afraid to busy-wait when critical timing is needed. Use time.sleep() to 'sneak up' on the desired time, then busy-wait during the last 1-10 ms. I've been able to get repeatable performance with self-timing on the order of 10 microseconds. Be sure your Python task is run at max OS priority.
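Something like this minimal sketch, which assumes Python 3.3+ for time.monotonic() (on the 2.x versions discussed here you'd use time.time() or time.clock() instead):

    import time

    def wait_until(deadline, spin_margin=0.002):
        """Sleep until close to 'deadline' (a time.monotonic() value), then busy-wait."""
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return
            if remaining > spin_margin:
                # time.sleep() can overshoot by a scheduler tick, so stop short of the deadline
                time.sleep(remaining - spin_margin)
            # else: spin for the last couple of milliseconds -- repeatable, at the cost of CPU

    period = 0.010                              # 10 ms primary cycle
    next_tick = time.monotonic() + period
    for _ in range(100):                        # 100 cycles, just for the sketch
        wait_until(next_tick)
        # ... time-critical work goes here ...
        next_tick += period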
Numpy ROCKS! If you are doing 'live' analytics or tons of statistics, there is NO way to get more work done faster and with less work (less code, fewer bugs) than by using Numpy. Not in any other language I know of, including C/C++. If the majority of your code consists of Numpy calls, you will be very, very fast. I can't wait for the Numpy port to PyPy to be completed!
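A tiny sketch of what I mean, with made-up numbers: acquire a batch of raw readings, then reduce the whole batch with vectorized Numpy calls instead of looping in Python:

    import numpy as np

    samples = np.empty(1000)                          # one batch worth of readings
    samples[:] = np.random.normal(5.0, 0.1, 1000)     # stand-in for acquired data

    mean = samples.mean()                             # each reduction runs in compiled code
    std = samples.std()
    spread = np.ptp(samples)                          # max - min in a single call
    print(mean, std, spread)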
Be aware of how and when Python does garbage collection. Monitor your memory use, and force GC when needed. Be sure to explicitly disable GC during time-critical operations. All of my RT Python systems run continuously, and Python loves to hog memory. Careful coding can eliminate almost all dynamic memory allocation, in which case you can completely disable GC!
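A minimal sketch of the pattern (acquire_one_cycle() is just a hypothetical stand-in for your time-critical work):

    import gc

    def acquire_one_cycle():
        pass                          # hypothetical placeholder for the time-critical work

    gc.disable()                      # no automatic collection pauses from here on
    try:
        for cycle in range(1000):     # stands in for the continuous acquisition loop
            acquire_one_cycle()       # avoid allocating inside this call
            if cycle % 100 == 0:
                gc.collect()          # collect only where a pause is affordable
    finally:
        gc.enable()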
Try to perform processing in batches to the greatest extent possible. Instead of processing data at the input rate, try to process data at the output rate, which is often much slower. Processing in batches also makes it more convenient to gather higher-level statistics.
Did I mention using PyPy? Well, it's worth mentioning twice.
There are many other benefits to using Python for RT development. For example, even if you are fairly certain your target language can't be Python, it can pay huge benefits to develop and debug a prototype in Python, then use it as both a template and test tool for the final system. I had been using Python to create quick prototypes of the "hard parts" of a system for years, and to create quick'n'dirty test GUIs. That's how my first RT Python system came into existence: The prototype (+Psyco) was fast enough, even with the test GUI running!
-BobC
Edit: Forgot to mention the cross-platform benefits: My code runs pretty much everywhere with a) no recompilation (or compiler dependencies, or need for cross-compilers), and b) almost no platform-dependent code (mainly for misc stuff like file handling and serial I/O). I can develop on Win7-x86 and deploy on Linux-ARM (or any other POSIX OS, all of which have Python these days). PyPy is primarily x86 for now, but the ARM port is evolving at an incredible pace.
I can't speak for every data acquisition setup out there, but most of them spend most of their "real-time operations" waiting for data to come in -- at least the ones I've worked on.
Then when the data does come in, you need to immediately record the event or respond to it, and then it's back to the waiting game. That's typically the most time-critical part of a data acquisition system. For that reason, I would generally say stick with C for the I/O parts of the data acquisition, but there aren't any particularly compelling reasons not to use Python on the non-time-critical portions.
If you have fairly loose requirements -- only needs millisecond precision, perhaps -- that adds some more weight to doing everything in Python. As far as development time goes, if you're already comfortable with Python, you would probably have a finished product significantly sooner if you were to do everything in Python and refactor only as bottlenecks appear. Doing the bulk of your work in Python will also make it easier to thoroughly test your code, and as a general rule of thumb, there will be fewer lines of code and thus less room for bugs.
If you need to specifically multi-task (not multi-thread), Stackless Python might be beneficial as well. It's like multi-threading, but the threads (or tasklets, in Stackless lingo) are not OS-level threads, but Python/application-level, so the overhead of switching between tasklets is significantly reduced. You can configure Stackless to multitask cooperatively or preemptively. The biggest downside is that blocking IO will generally block your entire set of tasklets. Anyway, considering that QNX is already a real-time system, it's hard to speculate whether Stackless would be worth using.
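A rough sketch of what tasklets look like, assuming the stackless module from the Stackless Python distribution (this won't run on stock CPython):

    import stackless

    def producer(chan):
        for i in range(5):
            chan.send(i)                  # blocks this tasklet only, not the OS thread

    def consumer(chan):
        for _ in range(5):
            print("got", chan.receive())

    chan = stackless.channel()
    stackless.tasklet(producer)(chan)
    stackless.tasklet(consumer)(chan)
    stackless.run()                       # cooperative scheduler drains the tasklets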
My vote would be to take the as-much-Python-as-possible route -- I see it as low cost and high benefit. If and when you do need to rewrite in C, you'll already have working code to start from.
Generally the reason advanced against using a high-level language in a real-time context is uncertainty -- when you run a routine one time it might take 100us; the next time you run the same routine it might decide to extend a hash table, calling malloc, then malloc asks the kernel for more memory, which could do anything from returning instantly to returning milliseconds later to returning seconds later to erroring, none of which is immediately apparent (or controllable) from the code. Whereas theoretically if you write in C (or even lower) you can prove that your critical paths will "always" (barring meteor strike) run in X time.
Our team have done some work combining multiple languages on QNX and had quite a lot of success with the approach. Using python can have a big impact on productivity, and tools like SWIG and ctypes make it really easy to optimize code and combine features from the different languages.
However, if you're writing anything time critical, it should almost certainly be written in C. Doing this means you avoid the implicit costs of an interpreted language, like the GIL (Global Interpreter Lock) and contention on many small memory allocations. Both of these things can have a big impact on how your application performs.
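As a rough sketch of the mixed approach with ctypes (the library name libdaq.so and the function read_samples are hypothetical; the C side is compiled separately into a shared library):

    import ctypes

    lib = ctypes.CDLL("./libdaq.so")                       # hypothetical shared library
    lib.read_samples.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int]
    lib.read_samples.restype = ctypes.c_int

    buf = (ctypes.c_double * 1024)()                       # pre-allocated, no per-call malloc
    count = lib.read_samples(buf, len(buf))                # the tight loop stays in C
    samples = buf[:count]                                  # back in Python for the easy parts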
Also, Python on QNX tends not to be 100% compatible with other distributions (i.e. there are sometimes modules missing).
One important note: Python for QNX is generally available only for x86.
I'm sure you can compile it for ppc and other archs, but that's not going to work out of the box.
Related
I'm hoping someone can provide some insight as to what's fundamentally different about the Java Virtual Machine that allows it to implement threads nicely without the need for a Global Interpreter Lock (GIL), while Python necessitates such an evil.
Python (the language) doesn't need a GIL (which is why it can perfectly be implemented on JVM [Jython] and .NET [IronPython], and those implementations multithread freely). CPython (the popular implementation) has always used a GIL for ease of coding (esp. the coding of the garbage collection mechanisms) and of integration of non-thread-safe C-coded libraries (there used to be a ton of those around;-).
The Unladen Swallow project, among other ambitious goals, does plan a GIL-free virtual machine for Python -- to quote that site, "In addition, we intend to remove the GIL and fix the state of multithreading in Python. We believe this is possible through the implementation of a more sophisticated GC system, something like IBM's Recycler (Bacon et al, 2001)."
The JVM (at least HotSpot) does have a concept similar to the GIL; it's just much finer in its lock granularity. Most of this comes from HotSpot's garbage collectors, which are more advanced.
In CPython it's one big lock (probably not entirely true, but good enough for argument's sake); in the JVM it's more spread about, with different concepts depending on where it is used.
Take a look at, for example, vm/runtime/safepoint.hpp in the HotSpot code, which is effectively a barrier. Once at a safepoint, the entire VM has stopped with regard to Java code, much like the Python VM stops at the GIL.
In the Java world such VM-pausing events are known as "stop-the-world"; at these points only native code that is bound to certain criteria keeps running, and the rest of the VM has been stopped.
Also, the lack of a coarse lock in Java makes JNI much more difficult to write, as the JVM makes fewer guarantees about its environment for FFI calls, which is one of the things that CPython makes fairly easy (although not as easy as using ctypes).
There is a comment further down in this blog post http://www.grouplens.org/node/244 that hints at why it was so easy to dispense with a GIL for IronPython and Jython: CPython uses reference counting, whereas the other two VMs have garbage collectors.
I don't fully understand the exact mechanics of why this is so, but it does sound like a plausible reason.
In this link they have the following explanation:
... "Parts of the Interpreter aren't threadsafe, though mostly because making them all threadsafe by massive lock usage would slow single-threaded extremely (source). This seems to be related to the CPython garbage collector using reference counting (the JVM and CLR don't, and therefore don't need to lock/release a reference count every time). But even if someone thought of an acceptable solution and implemented it, third party libraries would still have the same problems."
CPython lacks JIT/AOT compilation, and multithreaded processors didn't exist in the time frame it was written. Alternatively, you could rewrite everything in the Julia language, which has no GIL, and gain some speed over your Python code. Jython is not a great option either: it's slower than CPython and Java. If you want to stick with Python, consider using parallel-processing libraries; you won't get an instant speed boost, but you can do parallel programming with the right one.
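For example, a minimal sketch using the standard multiprocessing module, which sidesteps the GIL by giving each worker its own interpreter process:

    from multiprocessing import Pool

    def crunch(chunk):
        return sum(x * x for x in chunk)          # stand-in for real CPU-bound work

    if __name__ == "__main__":
        chunks = [range(i, i + 100000) for i in range(0, 400000, 100000)]
        with Pool() as pool:                      # one worker per core by default
            print(sum(pool.map(crunch, chunks)))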
When handling large amounts of data, minimizing the number of database requests and doing bulk operations instead helps a lot in terms of performance.
Question: is it possible to retrieve the mtimes of 10,000 files at once/in bulk on Linux?
I hope to minimize system calls.
Couldn't find something here: http://www.gnu.org/software/libc/manual/html_node/index.html
PS: currently, I retrieve those filenames using Python's os.walk.
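Roughly like this, as a minimal sketch ('/data' is just a placeholder root), with one os.stat() call per file found by os.walk:

    import os

    mtimes = {}
    for dirpath, dirnames, filenames in os.walk("/data"):    # placeholder root
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtimes[path] = os.stat(path).st_mtime             # one syscall per file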
Kinda hard to find sources claiming non-existence of something;)
Linux man page for fstat doesn't link to anything about bulk stat requests, so I assume it's nonexistent.
Besides, if you're accepting the large performance penalty of using Python, you shouldn't worry about optimising one of the most heavily optimised parts of the OS. Python is inherently slow due to its dynamic, high-level nature, so more effective optimisations include:
using faster algorithms in python
splitting the computational work via multiprocessing and network work via threads/coroutines
running your code under pypy (which offers jit compilation)
rewriting parts of your program in cython (statically typed python that is compiled into C)
rewriting parts of your program in C and connecting them back to Python via extensions
just writing your program in C/C++/go/rust/any other compiled language
Quick rule of thumb: until your program consumes 100% of all CPU cores (or 100% of one core, for inherently unparallelizable tasks), you shouldn't consider optimising anything except your current code at the algorithm, networking and concurrency levels.
Part of a language's design is the balance between speed of programming and speed of execution. Achieving faster execution speed requires more input from the programmer and slows down development. If the slowest part of your program really is those 10,000 syscalls, then you should definitely be writing it in C, but I'm sure you can find plenty of other things in your program to optimise first.
I recommend using a profiler (for example, the built-in profile/cProfile modules) to see the real hotspots in your code.
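Something along these lines, as a sketch (scan_tree is a hypothetical stand-in for whatever function does the os.walk/stat work):

    import cProfile
    import pstats

    def scan_tree(root):
        """Hypothetical stand-in for the directory-scanning work."""
        return sum(len(str(i)) for i in range(100000))        # dummy workload

    cProfile.run("scan_tree('/data')", "scan.prof")           # profile and save the stats
    pstats.Stats("scan.prof").sort_stats("cumulative").print_stats(10)   # top 10 hotspots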
Is there a way of doing parallel processing in Python using concepts similar to those of Apple's Grand Central Dispatch? Grand Central Dispatch looks, from the outset, like a nice way of handling parallel processing.
If Python does not have a mostly equivalent module, what are the fundamental concepts behind Grand Central Dispatch that could usefully be implemented in Python?
I don't know much about Grand Central Dispatch, hence this question: I would love to know whether Grand Central Dispatch uses paradigms that (1) are not yet available in Python, and/or (2) could be implemented in Python.
The main problem here is the compiler and OS part of GCD. To have GCD running, you need the compiler to understand Blocks. You could create something that works similarly at the programming level, but it wouldn't have anywhere near the same performance. With GCD you can create and enqueue thousands of Blocks, and still only 2 or 4 threads will be executing them. If you implement the high-level functionality of Blocks without the compiler supporting them, the only way I can see is to use threads to "simulate" Blocks. Then, using thousands of threads on a system with 2 to 4 CPU cores would be a spectacular performance mess, due to context switching and memory use.
Not only do you need the proper compiler extension to support GCD, you also need a proper OS extension to manage the GCD queues where Blocks are enqueued. You need the OS to control how many threads are executing, and when and how many of them to activate when CPU cores are available, for the program using GCD. With GCD, threads and queues are independent: the threads just grab Blocks from the queues (light data structures), from any of them. So it doesn't matter how many Blocks there are, because they are only pieces of code and pointers stored somewhere in main memory.
You simply can't implement all these low-level features from Python. And by implementing only the high-level "GCD way of programming", you are going to end up with very slow programs, perhaps even ones that are impossible to run on a personal computer.
So first, Cython would have to support GCD, and so would the OS you want to use. Linux has an implementation called libdispatch, available for Debian. But it only implements the compiler portion, so the program starts as many threads as there are cores on the system, minus one. So I think it is still not a good option. Someone should add Linux OS support for GCD, maybe as a kernel module.
Nothing to say about Windows. I really don't know.
So the first natural step would be to add and test support for GCD in Cython, for Mac OS. From there, you could write a native Python library that internally uses the Cython GCD library to offer Blocks and queues to normal Python programmers.
Another option would be for the CPython project to embrace this and add Blocks and queues as a native feature of Python. That would be amazing XD
Python does not have an equivalent module, though Twisted uses many of the same basic concepts (async APIs, callback-based). The Python multiprocessing module actually uses sub-processes rather than threads and is not particularly equivalent either. The best approach would probably be one similar to that taken by MacRuby, which is to create wrappers for the GCD APIs and use those. Unlike Python, of course, MacRuby was designed not to have a GIL (Global Interpreter Lock); the GIL reduces the effectiveness of multi-threading in Python as the various interpreter threads hit it at different times. Not much to do about that other than redesign the language, I'm afraid.
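Still, for a rough feel of the queue-of-blocks style in stock Python, here is a sketch with concurrent.futures (Python 3.2+, or the futures backport); a process pool is used to sidestep the GIL:

    from concurrent.futures import ProcessPoolExecutor

    def block(n):                                  # stands in for a GCD Block
        return n * n

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:        # worker pool sized to the machine
            futures = [pool.submit(block, i) for i in range(1000)]   # enqueue the "blocks"
            results = [f.result() for f in futures]
        print(len(results))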
I'm looking at implementing a fuzzy logic controller based on either PyFuzzy (Python) or FFLL (C++) libraries.
I'd prefer to work with Python but am unsure whether the performance will be acceptable in the embedded environment it will run in (either an ARM or an embedded x86 processor, both with ~64 MB of RAM).
The main concern is that response times are as fast as possible (an update rate of 5 Hz or more would be ideal; more than 2 Hz is required). The system would be reading from multiple (probably 5) sensors on an RS232 port and provide 2/3 outputs based on the results of the fuzzy evaluation.
Should I be concerned that Python will be too slow for this task?
In general, you shouldn't obsess over performance until you've actually seen it become a problem. Since we don't know the details of your app, we can't say how it'd perform if implemented in Python. And since you haven't implemented it yet, neither can you.
Implement the version you're most comfortable with, and can implement fastest, first. Then benchmark it. And if it is too slow, you have three options which should be done in order:
First, optimize your Python code
If that's not enough, write the most performance-critical functions in C/C++, and call that from your Python code
And finally, if you really need top performance, you might have to rewrite the whole thing in C++. But then at least you'll have a working prototype in Python, and you'll have a much clearer idea of how it should be implemented. You'll know what pitfalls to avoid, and you'll have an already correct implementation to test against and compare results to.
Python is very slow at handling large amounts of non-string data. For some operations, you may see that it is 1000 times slower than C/C++, so yes, you should investigate into this and do necessary benchmarks before you make time-critical algorithms in Python.
However, you can extend python with modules in C/C++ code, so that time-critical things are fast, while still being able to use python for the main code.
Make it work, then make it work fast.
If most of your runtime is spent in C libraries, the language you use to call these libraries isn't important. What language are your time-eating libraries written in?
From your description, speed should not be much of a concern (and you can use C, Cython, whatever you want to make it faster), but memory would be. For environments with 64 MB max (where the OS and everything else should fit as well, right?), I think there is a good chance that Python may not be the right tool for the target deployment.
If you have non trivial logic to handle, I would still prototype in python, though.
I never really measured the performance of pyfuzzy's examples, but the new version 0.1.0 can read FCL files just as FFLL does. Just describe your fuzzy system in this format, write some wrappers, and check the performance of both variants.
For reading FCL with pyfuzzy you need the ANTLR Python runtime, but after reading you should be able to pickle the resulting object, so you don't need the ANTLR overhead on the target.
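A minimal sketch of the pickle step (parsed_system here is just a placeholder for whatever object the pyfuzzy FCL reader returns):

    import pickle

    parsed_system = {"rules": [], "variables": []}     # placeholder for the parsed fuzzy system

    with open("controller.pickle", "wb") as f:         # done once, where ANTLR is installed
        pickle.dump(parsed_system, f)

    with open("controller.pickle", "rb") as f:         # on the target, no ANTLR needed
        system = pickle.load(f)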
What are the technical reasons why languages like Python and Ruby are interpreted (out of the box) instead of compiled? It seems to me like it should not be too hard for people knowledgeable in this domain to make these languages not be interpreted like they are today, and we would see significant performance gains. So certainly I am missing something.
Several reasons:
faster development loop, write-test vs write-compile-link-test
easier to arrange for dynamic behavior (reflection, metaprogramming)
makes the whole system portable (just recompile the underlying C code and you are good to go on a new platform)
Think of what would happen if the system was not interpreted. Say you used translation-to-C as the mechanism. The compiled code would periodically have to check if it had been superseded by metaprogramming. A similar situation arises with eval()-type functions. In those cases, it would have to run the compiler again, an outrageously slow process, or it would have to also have the interpreter around at run-time anyway.
The only alternative here is a JIT compiler. These systems are highly complex and sophisticated and have even bigger run-time footprints than all the other alternatives. They start up very slowly, making them impractical for scripting. Ever seen a Java script? I haven't.
So, you have two choices:
all the disadvantages of both a compiler and an interpreter
just the disadvantages of an interpreter
It's not surprising that the primary implementation generally just goes with the second choice. It's quite possible that some day we may see secondary implementations like compilers appearing. Ruby 1.9 and Python have bytecode VMs; those are halfway there. A compiler might target just non-dynamic code, or it might have various levels of language support declarable as options. But since such a thing can't be the primary implementation, it represents a lot of work for a very marginal benefit. Ruby already has 200,000 lines of C in it...
I suppose I should add that one can always add a compiled C (or, with some effort, any other language) extension. So, say you have a slow numerical operation. If you add, say Array#newOp with a C implementation then you get the speedup, the program stays in Ruby (or whatever) and your environment gets a new instance method. Everybody wins! So this reduces the need for a problematic secondary implementation.
Exactly like (in the typical implementation of) Java or C#, Python gets first compiled into some form of bytecode, depending on the implementation (CPython uses a specialized form of its own, Jython uses JVM just like a typical Java, IronPython uses CLR just like a typical C#, and so forth) -- that bytecode then gets further processed for execution by a virtual machine (AKA interpreter), which may also generate machine code "just in time" -- known as JIT -- if and when warranted (CLR and JVM implementations often do, CPython's own virtual machine typically doesn't but can be made to do so e.g. with psyco or Unladen Swallow).
JIT may pay for itself for sufficiently long-running programs (if memory's way cheaper than CPU cycles), but it may not (due to slower startup times and larger memory footprint), especially when the types also have to be inferred or specialized as part of the code generation. Generating machine code without type inference or specialization is easy if that's what you want, e.g. freeze does it for you, but it really doesn't present the advantages that "machine code fetishists" attribute to it. E.g., you get an executable binary of 1.5 to 2 MB in lieu of a tiny "hello world" .pyc -- not much point!-). That executable is stand-alone and distributable as such, but it will only work on a very specific narrow range of operating systems and CPU architectures, so the tradeoffs are quite iffy in most cases. And, the time it takes to prepare the executable is quite long indeed, so it would be a crazy choice to make that mode of operation the default one.
Merely replacing an interpreter with a compiler won't give you as big a performance boost as you might think for a language like Python. When most of the time is actually spent doing symbolic lookups of object members in dictionaries, it doesn't really matter whether the call to the function performing such a lookup is interpreted or native machine code -- the difference, while not quite negligible, will be dwarfed by the lookup overhead.
To really improve performance, you need optimizing compilers. And optimization techniques here are very different from what you have with C++, or even Java JIT - an optimizing compiler for a dynamically typed / duck typed language such as Python needs to do some very creative type inference (including probabilistic - i.e. "90% chance of it being T" and then generating efficient machine code for that case with a check/branch before it) and escape analysis. This is hard.
I think the biggest reason for the languages being interpreted is portability. As a programmer you can write code that will run in an interpreter not a specific OS. So your programs behave more uniformly across platforms (more so than compiled languages). Another advantage I can think of is it's easier to have a dynamic type system in an interpreted language. I think the creators of the language were thinking having a language where programmers can be more productive due to automatic memory management, dynamic type system and meta programming wins over any performance loss due to the language being interpreted. If you are concerned about performance you can always compile the language to native machine code employing a technique like JIT compilation.
Today, there is no longer a strong distinction between "compiled" and "interpreted" languages. Python is in fact compiled just as much as Java is, the only differences are:
The Python compiler is much faster than the Java compiler
Python automatically compiles source code as it is executed, there is no separate "compile" step required
Python bytecode is different from JVM bytecode
Python even has a function called compile() which is an interface to the compiler.
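For instance:

    source = "x * 2 + 1"
    code_obj = compile(source, "<string>", "eval")    # filename and mode arguments
    print(eval(code_obj, {"x": 20}))                  # prints 41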
It sounds like the distinction you are making is between "dynamically typed" and "statically typed" languages. In dynamic languages such as Python, you can write code like:
    def fn(x, y):
        return x.foo(y)
Notice that the types of x and y are not specified. At runtime, this function will look at x to see whether it has a member function named foo, and if so will call it with y. If not, it will throw a runtime error that indicates no such function was found. This sort of runtime lookup is much easier to represent using an intermediate representation like bytecode, where a runtime VM does the lookup instead of having to generate machine code to do the lookup itself (or, call a function to do the lookup which is what the bytecode will do anyway).
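You can see this directly with the dis module; the exact opcode names vary between CPython versions, but there is always a separate attribute/method lookup step before the call:

    import dis

    def fn(x, y):
        return x.foo(y)

    dis.dis(fn)    # shows a LOAD_ATTR/LOAD_METHOD-style lookup followed by a call opcode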
Python has projects such as Psyco, PyPy, and Unladen Swallow that take various approaches to compiling Python object code into something closer to native code. There is active research in this area but there is not (as yet) a simple answer.
The effort required to create a good compiler to generate native code for a new language is staggering. Small research groups typically take 5 to 10 years (examples: SML/NJ, Haskell, Clean, Cecil, lcc, Objective Caml, MLton, and many others). And when the language in question requires type checking and other decisions to be made at run time, a compiler writer has to work much harder to get good native-code performance (for an excellent example, see work by Craig Chambers and later Urs Hoelzle on Self). The performance gains you might hope for are harder to realize than you might think. This phenomenon partly explains why so many dynamically typed languages are interpreted.
As noted, a decent interpreter is also instantly portable, while porting compilers to new machine architectures takes substantial effort (and is a problem I personally have been working on for over 20 years, with some time off for good behavior). So an interpreter is a way to reach a wide audience quickly.
Finally, although fast compilers and slow interpreters exist, it's usually easier to make the edit-translate-go cycle faster by using an interpreter. (For some nice examples of fast compilers, see the aforementioned lcc as well as Ken Thompson's Go compiler. For an example of a relatively slow interpreter, see GHCi.)
Well, isn't one of the strengths of these languages that they are so easily scriptable? They wouldn't be if they were compiled. And on the other hand, dynamic languages are easier to interpret than to compile.
In a compiled language, the loop you get into when making software is
Make a change
Compile changes
Test changes
goto 1
Interpreted languages tend to be faster to make stuff in because you get to cut out step two of that process (and when you're dealing with a large system where compile times can be upwards of two minutes, step two can add a significant amount of time).
This isn't necessarily the reason python|ruby designers thought of, but keep in mind that "How efficiently does the machine run this?" is only half the software development problem.
It also seems like it would be easier to compile code in a language that's interpreted naturally than it would be to add an interpreter to a language that's compiled by default.
REPL. Don't knock it 'till you've tried it. :)
By design.
The authors wanted something they could write scripts in.
Python gets compiled the first time it is executed though
Compiling Ruby at least is notoriously hard. I'm working on one, and as part of that I wrote a blog post enumerating some of the issues here.
Specifically, Ruby is suffering from a very unclear (i.e. non-existent) boundary between the "read" and "execute" phase of the program that makes it hard to compile efficiently. You could just emulate what the interpreter does, but then you're not going to see much speed up, so it wouldn't be worth the effort. If you want to compile it efficiently you then face a lot of additional complications to handle the extreme level of dynamism in Ruby.
The good news is that there are techniques for overcoming this. Self, Smalltalk and Lisp/Scheme implementations have dealt quite successfully with most of the same issues. But it takes time to sift through it all and figure out how to make it work with Ruby. It also doesn't help that Ruby has a very convoluted grammar.
Raw compute performance is probably not a goal of most interpreted languages. Interpreted languages are typically more concerned about programmer productivity than raw speed. In most cases these languages are plenty fast enough for the tasks the languages were designed to tackle.
Given that, and that just about the only advantages of a compiler are type checking (difficult to do in a dynamic language) and speed, there's not much incentive to write compilers for most interpreted languages.